<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Federico</forename><surname>D'asaro</surname></persName>
							<email>federico.dasaro@polito.it</email>
							<affiliation key="aff0">
								<orgName type="department">Data &amp; Space</orgName>
								<orgName type="institution">LINKS Foundation -AI</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Politecnico di Torino</orgName>
								<orgName type="department" key="dep2">Dipartimento di Automatica e Informatica (DAUIN)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Juan</forename><forename type="middle">José Márquez</forename><surname>Villacís</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Data &amp; Space</orgName>
								<orgName type="institution">LINKS Foundation -AI</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Rizzo</surname></persName>
							<email>giuseppe.rizzo@linksfoundation.com</email>
							<affiliation key="aff0">
								<orgName type="department">Data &amp; Space</orgName>
								<orgName type="institution">LINKS Foundation -AI</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Politecnico di Torino</orgName>
								<orgName type="department" key="dep2">Dipartimento di Automatica e Informatica (DAUIN)</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrea</forename><surname>Bottino</surname></persName>
							<email>andrea.bottino@polito.it</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Politecnico di Torino</orgName>
								<orgName type="department" key="dep2">Dipartimento di Automatica e Informatica (DAUIN)</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Large Speech Models for Feature Extraction in Cross-Lingual Speech Emotion Recognition</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">60A12A0A7BDC475E882925E064E6DEDC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Cross-lingual Speech Emotion Recognition</term>
					<term>Large Speech models</term>
					<term>Transfer Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Speech Models (LSMs), pre-trained on extensive unlabeled data using Self-Supervised Learning (SSL) or Weakly-Supervised Learning (WSL), are increasingly employed for tasks like Speech Emotion Recognition (SER). Their capability to extract general-purpose features makes them a strong alternative to low-level descriptors. Most studies focus on English, with limited research on other languages. We evaluate English-Only and Multilingual LSMs from the Wav2Vec 2.0 and Whisper families as feature extractors for SER in eight languages. We have stacked three alternative downstream classifiers of increasing complexity, named Linear, Non-Linear, and Multi-Layer, on top of the LSMs. Results indicate that Whisper models perform best with a simple linear classifier using features from the last transformer layer, while Wav2Vec 2.0 models benefit from features from the middle and early transformer layers. When comparing English-Only and Multilingual LSMs, we find that Whisper models benefit from multilingual pre-training, excelling in Italian, Canadian French, French, Spanish, German and competitively on Greek, Egyptian Arabic, Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving the highest performance in Greek, Egyptian Arabic.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Speech Emotion Recognition (SER) aims to identify emotions from speech audio, enhancing Human-AI interaction in fields such as healthcare, education, and security <ref type="bibr" target="#b0">[1]</ref>. Traditional methods rely on Low-Level Descriptors (LLD) like spectral, prosodic, and voice quality features <ref type="bibr" target="#b1">[2]</ref>, using classifiers such as KNN, SVM, or Naïve Bayes <ref type="bibr" target="#b2">[3]</ref>. Deep learning has introduced advanced techniques, including Convolutional Neural Networks (CNNs) <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>, eventually followed by Recurrent Neural Networks (RNNs) <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>, and Transformers <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. Transformers' ability to learn from extensive datasets has led to Large Speech Models (LSMs), which generalize across various speech tasks. Common training approaches for these models include Self-Supervised Learning (SSL), which uses data itself to learn generalpurpose features <ref type="bibr" target="#b11">[12]</ref>, and Weakly-Supervised Learning (WSL), which pairs audio with text for tasks like transcription and translation <ref type="bibr" target="#b12">[13]</ref>. The general-purpose knowl-edge of LSMs makes them effective feature extractors for SER. Research has adapted LSMs for SER in English <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b16">17]</ref>, but efforts for other languages are limited, focusing on Wav2Vec 2.0 <ref type="bibr" target="#b17">[18]</ref> for cross-lingual SER <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b21">21]</ref>.</p><p>This study examines how effective LSMs are as feature extractors for cross-lingual SER, using nine datasets across eight languages: Italian, German, French, Canadian French, Spanish, Greek, Persian, and Egyptian Arabic. Specifically, we utilize LSMs from the Wav2Vec 2.0 and Whisper <ref type="bibr" target="#b12">[13]</ref> model families, pre-trained with SSL and WSL approaches, respectively. We introduce Whisper due to its underexplored use in cross-lingual SER. To assess the effectiveness of LSMs as feature extractors, we test three classifiers of increasing complexity-Linear, Non-Linear, and Multi-Layer-across nine datasets. This evaluation determines which classifier best suits each LSM across different languages. Moreover, our study includes both English-Only and Multilingual models from the Wav2Vec 2.0 and Whisper families, aiming to evaluate the effectiveness of multilingual pre-training for cross-lingual SER.</p><p>The main contributions of this work are:</p><p>• We evaluate LSMs from the Wav2Vec 2.0 and Whisper models as feature extractors for crosslingual SER across eight languages. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Large Speech Models</head><p>Recent developments in natural language processing and computer vision have harnessed large volumes of unlabeled data through Self-Supervised Learning <ref type="bibr" target="#b22">[22,</ref><ref type="bibr" target="#b23">23,</ref><ref type="bibr" target="#b24">24]</ref>.</p><p>Building on techniques such as masked language and image modeling, Wav2Vec 2.0 <ref type="bibr" target="#b17">[18]</ref> introduced a LSM trained on extensive audio datasets using masked speech modeling. Wav2Vec 2.0 features seven 1D convolutional blocks for initial feature extraction, followed by 12 or 24 transformer blocks (depending on the model variant) for contextual processing. The model masks part of the latent features and reconstructs them using the surrounding context. To further refine LSMs for tasks like emotion recognition, methods such as WavLM <ref type="bibr" target="#b25">[25]</ref> have been developed. WavLM incorporates speech denoising alongside masked modeling, demonstrating broad effectiveness across various tasks in the SUPERB benchmark <ref type="bibr" target="#b26">[26]</ref>. Moreover, XLSR-53 <ref type="bibr" target="#b27">[27]</ref> extends the Wav2Vec 2.0 framework to cover 53 languages, sharing the latent space across these languages. This approach has shown superior performance over monolingual pretraining for automatic speech recognition. XLS-R <ref type="bibr" target="#b28">[28]</ref> further advances this by scaling to 128 languages, excelling in speech translation and language identification. In comparison, Whisper <ref type="bibr" target="#b12">[13]</ref> leverages large-scale weak supervision from audio-transcription pairs to train an encoder-decoder transformer. Using log-mel spectrograms, Whisper is trained in a multitask framework that includes multilingual transcription and translation, establishing itself as an effective zero-shot model for multilingual tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Cross-Language Speech Emotion Recognition</head><p>Emotion recognition in languages beyond English, like Italian <ref type="bibr" target="#b29">[29]</ref>, French <ref type="bibr" target="#b30">[30]</ref>, Persian <ref type="bibr" target="#b31">[31,</ref><ref type="bibr" target="#b32">32]</ref>, and Spanish <ref type="bibr" target="#b33">[33]</ref>, is crucial but often limited by data availability. Recent efforts have focused on improving cross-lingual and cross-modal knowledge transfer. Techniques like dual attention <ref type="bibr" target="#b21">[21]</ref> and tensor fusion <ref type="bibr" target="#b34">[34]</ref> enhance audio and text interaction in languages such as Italian, German, and Urdu. Self-supervised pre-training methods, including variational autoencoders, have also been effective in transferring knowledge across languages like German <ref type="bibr" target="#b35">[35,</ref><ref type="bibr" target="#b36">36]</ref>. The advent of LSMs pre-trained with self-supervision has further increased the potential for transfer learning due to their high generalization capabilities <ref type="bibr" target="#b14">[15]</ref>. However, most research primarily focuses on adapting multilingual Wav2Vec 2.0 models (XLSR-53) <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b37">37,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b21">21]</ref>. This work expands the scope of analyzed LSMs including WSL models as Whisper. Additionally, we evaluate the ability of English-only models to transfer knowledge to other languages, beyond just multilingual models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>In this section, we describe the methodology for evaluating the effectiveness of LSMs as feature extractors for downstream SER in various languages. We stack a classification model on top of the LSM backbone, with its parameters frozen. All LSMs used in this work share the same overall architecture, which we describe below along with the stacked classification model. Formally, the input audio 𝐴 (raw waveform or logmel spectrogram) passes through a convolutional encoder 𝓏 ∶ 𝐴 → 𝑍, mapping the audio to latent features 𝑍 = {𝑧 1 , … , 𝑧 𝑇 }, where 𝑇 is the sequence length and each frame 𝑧 𝑖 typically corresponds to 25 ms with 𝑧 𝑖 ∈ ℝ 𝑑 . Then, 𝑍 passes through a Transformer encoder consisting of 𝑙 layers 𝒽 𝑙 ∶ 𝑍 → 𝐻, enriching the latent features with contextual information, resulting in {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } for each of the 𝑙 = 1, … , 𝐿 Transformer layers. Here, 𝑙 = 𝐿 corresponds to the output features of the last layer, with ℎ 𝑙 𝑖 ∈ ℝ 𝑑 . The features {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 are considered the extracted features from the LSM and are fed into a downstream classifier 𝓎 ∶ 𝐻 → 𝑌, which maps these features to the output class logits {𝑦 1 , … , 𝑦 𝑘 }. The output class label 𝑦 * for audio 𝐴 is given by:</p><formula xml:id="formula_0">𝑦 * = arg max 𝑘 softmax (𝓎 (𝒽 (𝓏(𝐴))))<label>(1)</label></formula><p>Inspired by previous work that uses probing to evaluate the quality of features extracted from backbone models <ref type="bibr" target="#b38">[38,</ref><ref type="bibr" target="#b39">39]</ref>, we evaluate three different downstream classifiers of increasing complexity: Linear Classifier (ℊ 𝑙 ), Non-Linear Classifier (ℊ 𝑛𝑙 ), and Multi-layer Classifier (ℊ 𝑚𝑙 ). Figure <ref type="figure" target="#fig_0">1</ref> illustrates their architecture, which is detailed below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Linear Classifier</head><p>For the linear classifier, we use a simple feed-forward neural network that consists solely of linear projections. The snowflake icon represents frozen weights, while the fire icon denotes trainable weights.</p><p>Specifically, given the features from the last Transformer layer {ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 }, they are first projected by a linear layer 𝓁 1 ∶ ℝ 𝑑 → ℝ 𝑚 that is shared across all frames, then aggregated by average pooling 𝓅, and finally pass through the classification layer ℴ ∶ ℝ 𝑚 → ℝ 𝑘 to obtain the output class logits. The function ℊ 𝑙 is compactly defined as:</p><formula xml:id="formula_1">ℊ 𝑙 (ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 ) = ℴ (𝓅 (𝓁 1 (ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 )))<label>(2)</label></formula><p>The absence of non-linear activations allows us to evaluate the quality of the features extracted from the LSM based on the linear classifier model's ability to handle the SER task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Non-Linear Classifier</head><p>To increase the complexity of the classification model, we utilize a series of linear layers interleaved with ReLU activations both before and after feature pooling. We follow the same architecture as in <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, but unlike them, we only feed the features from the last Transformer layer 𝐿 to the model. Each {ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 } passes through two shared linear layers, ReLU, and dropout blocks (𝒷), followed by a linear layer (𝓁 1 ). Linear layers are functions 𝓁 ∶ ℝ 𝑑 → ℝ 𝑚 . Projected features are averaged, pass through 𝓁 2 and ReLU, and are classified by ℴ. Thus, ℊ 𝑛𝑙 is:</p><formula xml:id="formula_2">ℊ 𝑛𝑙 (𝑥 = ℎ 𝐿 1 , … , ℎ 𝐿 𝑇 ) = ℴ (ReLU (𝓁 2 (𝓅 (𝓁 1 (𝒷 (𝑥))))))<label>(3)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Multi-Layer Classifier</head><p>As a third option, we adopt the approach from <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, which utilizes all hidden states of the Transformer encoder. </p><formula xml:id="formula_3">ℎ * 𝑡 = 𝐿 ∑ 𝑙=1 𝑤 𝑙 ⋅ ℎ 𝑙 𝑡 for 𝑡 = 1, … , 𝑇<label>(4)</label></formula><p>where 𝑤 1 , … , 𝑤 𝐿 are the weights assigned to each Transformer layer, ensuring 𝑤 𝑙 ∈ [0, 1] and ∑ 𝐿 𝑙=1 𝑤 𝑙 = 1. The resulting sequence {ℎ * 1 , … , ℎ * 𝑇 } is then processed by the same pipeline as the Non-Linear Classifier, resulting in:</p><formula xml:id="formula_4">ℊ 𝑚𝑙 (𝑥 = {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 ) = ℊ 𝑛𝑙 (𝓈(𝑥))<label>(5)</label></formula><p>This classifier leverages internal layer information, which has proven beneficial for paralinguistic and linguistic downstream tasks <ref type="bibr" target="#b39">[39,</ref><ref type="bibr" target="#b40">40,</ref><ref type="bibr" target="#b41">41,</ref><ref type="bibr" target="#b42">42]</ref>. By investigating the contribution of internal LSM layers for SER across various languages, we corroborates previous findings for Wav2Vec 2.0 models and provide new insights for Whisper models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets and Metrics</head><p>In this study, we conduct experiments using 9 distinct datasets spanning 8 different languages: Greek, French, Italian, German, Spanish, Egyptian Arabic, and Persian.</p><p>The datasets vary in their collection methodologies, such as acted emotions and elicitation methods. The participant demographics may be balanced by gender (e.g., CaFE, EYASE), by emotion (e.g., EMOVO), or may not be balanced at all. For all datasets, we conduct our experiments in a speaker-independent setting to prevent evaluation on speaker-dependent features. Table <ref type="table" target="#tab_2">1</ref> provides an overview of the dataset statistics, with a more detailed description given below. AESDD <ref type="bibr" target="#b43">[43]</ref>: The Acted Emotional Speech Dynamic Database comprises 500 recorded samples from 5 actors (3 females, 2 males) expressing 5 distinct emotions in Greek. Each actor performed 20 utterances per emotion, with some utterances recorded multiple times. In later versions, additional actors were included, bringing the total to 604 recordings from 6 actors.</p><p>CaFE <ref type="bibr" target="#b44">[44]</ref> DEMoS <ref type="bibr" target="#b45">[45]</ref>: DEMoS contains 9697 audio samples from 68 volunteer students (299 females, 131 males) expressing the Big Six emotions plus the neutral state in Italian. Instead of acted emotions, samples were generated using an elicitation approach. The recordings, with a mean duration of 2.9 seconds (std: 1.1s), are provided in 48 kHz, 16-bit, mono format.</p><p>EmoDB <ref type="bibr" target="#b46">[46]</ref>: This collection includes 535 utterances across 7 emotional states, spoken in German by 5 female and 5 male actors. Each actor performed a set of 10 sentences, which were down-sampled from the original 48 kHz to 16 kHz.</p><p>EmoMatch <ref type="bibr" target="#b33">[33]</ref>: Consisting of 2005 recordings, Emo-Match features samples from 50 non-actor Spanish speakers (20 females, 30 males) expressing the Big Six emotions and a neutral state. The dataset is a subset of the larger EmoSpanishDB and contains recordings sampled at 48 kHz with a 16-bit mono format.</p><p>EMOVO <ref type="bibr" target="#b47">[47]</ref>: EMOVO presents 588 Italian audio recordings from 3 male and 3 female actors simulating the Big Six emotions plus a neutral state. Each actor voiced 14 utterances, and the recordings are provided in 48 kHz, 16-bit stereo WAV format.</p><p>EYASE <ref type="bibr" target="#b48">[48]</ref>: EYASE contains 579 utterances in Egyptian Arabic, recorded by 3 male and 3 female professional actors. The recordings, ranging from 1 to 6 seconds in duration, were labeled as angry, happy, neutral, or sad and sampled at 44.1 kHz.</p><p>Oréau <ref type="bibr" target="#b49">[49]</ref>: The Oréau dataset features 502 audio samples from 32 non-professional actors (25 male, 7 female) who voiced 10 to 13 utterances in French for the Big Six emotions plus a neutral state.</p><p>ShEMO <ref type="bibr" target="#b50">[50]</ref>: ShEMO comprises 3000 semi-natural recordings from 87 native Persian speakers (31 female, 56 male). The dataset captures 5 of the Big Six emotions-sadness, anger, happiness, surprise, and fear-plus a neutral state. 
The samples were up-sampled to a frequency of 44.1 kHz in mono-channel format, with an average length of 4.11 seconds (std: 3.41s).</p><p>The audio is resampled to 16 kHz, and a stratified train/validation/test split is performed with ratios of 80/10/10. All results are reported using the macro F1 score, expressed as a percentage. We conduct 3 runs and report the mean ± standard deviation.</p></div>
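A hedged sketch of this evaluation protocol, assuming torchaudio for the resampling and scikit-learn for the split and the metric, is given below.

```python
# Resample to 16 kHz, make a stratified 80/10/10 split, and report macro F1 (%).
import torchaudio
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def resample_to_16k(waveform, orig_sr):
    return torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)

def stratified_split(files, labels, seed=0):
    # 80% train, then split the remaining 20% evenly into validation and test.
    train_f, rest_f, train_y, rest_y = train_test_split(
        files, labels, test_size=0.2, stratify=labels, random_state=seed)
    val_f, test_f, val_y, test_y = train_test_split(
        rest_f, rest_y, test_size=0.5, stratify=rest_y, random_state=seed)
    return (train_f, train_y), (val_f, val_y), (test_f, test_y)

def macro_f1(y_true, y_pred):
    return 100.0 * f1_score(y_true, y_pred, average="macro")  # reported as a percentage
```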
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Details</head><p>Baseline As a baseline to evaluate LSM transfer learning capabilities, we adopt the Audio Spectrogram Transformer (AST) <ref type="bibr" target="#b51">[51]</ref>, a fully transformer-based architecture recently proposed as a substitute for CNNs <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>.</p><p>We train AST from scratch on each of the 9 datasets using the same hyperparameters as <ref type="bibr" target="#b51">[51]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LSM Models</head><p>We use pre-trained checkpoints for both English-Only and Multilingual models: Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, XLS-R from the Wav2Vec 2.0 family, and Whisper Small (EN) (Whisper Small pretrained only on English data), Whisper Small, Whisper Medium from the Whisper family. The LSM backbones are kept frozen and used exclusively as feature extractors.</p><p>Training We follow the same hyperparameters settings as <ref type="bibr" target="#b14">[15]</ref> to train the downstream classifiers. Specifically, we train for 30 epochs using the Adam optimizer with a learning rate of 5.0e-04, weight decay of 1.0e-04, betas set to (0.9, 0.98), and epsilon of 1.0e-08. The dimension of the classifier projection 𝑚 is 256.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Results</head><p>To present our results, we first compare the performance of the various classifiers (see Section 3) for each LSM utilized. This analysis provides insights into the characteristics of features extracted from Wav2Vec 2.0 and Whisper models for downstream SER tasks. After identifying the best classifier for each LSM, we then compare the performance of English-Only and Multilingual LSMs across the 8 languages covered in this study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Comparison between downstream classifiers</head><p>We examine the results in Table <ref type="table">2</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Performance of various LSM backbones using Linear, Non-Linear, and Multi-Layer classification methods. F1 scores are averaged across all 9 datasets. For each LSM, the best classifier is highlighted in bold. table shows average F1 scores across 9 datasets, highlighting the most effective classifier for each LSM in crosslingual SER tasks. For Wav2Vec 2.0 models, the Multi-Layer Classifier performs best, with F1 scores of 53.42, 57.50, and 40.89 for Wav2Vec 2.0 Base, Wav2Vec 2.0 Large, and XLS-R. The Linear and Non-Linear classifiers perform similarly, especially for Wav2Vec 2.0 Large and XLS-R, suggesting improvements are due to using features from internal Transformer layers rather than non-linear activations. For Whisper models, the Linear Classifier performs best, with F1 scores of 58.16, 60.87, and 60.72 for Whisper Small (EN), Whisper Small, and Whisper Medium. Increasing classifier complexity with non-linear activations decreases performance, likely due to general information loss caused by complex transformations. The Multi-Layer Classifier performs worse, indicating that using also features from internal layers is less effective than using features from the last layer alone.</p><p>This comparison reveals that Wav2Vec 2.0 models benefit from features extracted from internal Transformer layers and exhibit less sensitivity to classifier complexity, consistent with prior research <ref type="bibr" target="#b41">[41,</ref><ref type="bibr" target="#b39">39]</ref>. Conversely, Whisper models achieve better performance with features from the last Transformer layer when using a simple linear classifier, offering new insights into their effective-ness for SER across multiple languages. We hypothesize that this differing behavior may be related to their respective Self-Supervised and Weakly-Supervised pre-training approaches, which warrant further investigation. To gain further insights into the importance of Transformer layers in Wav2Vec 2.0 and Whisper for SER, we leverage the weights learned in the Multi-Layer classifier as follows.</p><p>Transformer Layer Weights. We analyze the weights 𝑤 1 , … , 𝑤 𝐿 from the Multi-Layer Classifier to assess Transformer layer importance. Figure <ref type="figure" target="#fig_1">2</ref> illustrates that Wav2Vec 2.0 models assign greater weight to the early and middle layers, whereas Whisper models emphasize the later layers. This observation confirms the earlier findings, suggesting that paralinguistic information in Whisper models is embedded in the features of the later Transformer layers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">Comparing English-Only and Multilingual LSMs Across Different Languages</head><p>In this section, we compare English-Only and Multilingual LSMs with the AST baseline across 9 datasets. Table <ref type="table" target="#tab_5">3</ref> displays F1 scores for the optimal classifiers found in the previous section: Multi-Layer for Wav2Vec 2.0 and Linear for Whisper models.</p><p>Transferring knowledge from LSMs proves to be effective across all datasets compared to the baseline. For instance, Wav2Vec 2.0 Large scores 53.40 in Egyptian Arabic, while Whisper Small scores 51.98 and AST scores 33.23. This indicates that LSMs are effective feature extractors for cross-lingual SER on multiple languages.</p><p>When comparing English-only and Multilingual models, we differentiate between the Wav2Vec 2.0 and Whisper families. For Wav2Vec 2.0, we observe that Wav2Vec 2.0 Base and Large generally outperform XLS-R (e.g., 87.85 and 88.31 vs. 67.71 for DEMos), except in Persian, where their performance is comparable. This indicates that multilingual pre-training may not be as effective for Wav2Vec 2.0 models across various languages. We speculate that this may be due to the limitations of SSL pre-training, which might struggle with the diverse range of languages and lose important paralinguistic features that are retained in English-only models. Further investigation with a wider range of SSL-pretrained LSMs could provide more insights. As regards to Whisper, Multilingual Whisper Small outperforms its English-only version, with the exception of Greek and Persian, likely due to limited pretraining data for these languages, which resulted in higher word error rates compared to other languages in this study <ref type="bibr" target="#b12">[13]</ref>. Multilingual Whisper models achieve best performance in Canadian French, Spanish (66.71, 73.13 with Whisper Small), Italian, German, and French (91.17, 90.64, 95.22 with Whisper Medium). This improvement is likely due to the larger pretraining datasets for these languages and the similarities between  Canadian French and French. We believe that multilingual pretraining benefits Whisper models by capturing language-specific features more effectively through WSL and multitask learning. However, further research is needed to evaluate the effectiveness of multilingual pretraining with WSL compared to SSL across a broader range of LSMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>This paper examines the capabilities of Wav2Vec 2.0 and Whisper models as feature extractors for cross-lingual SER across eight languages, considering both English-Only and Multilingual variants. Our findings reveal that LSMs are effective feature extractors compared to a full Transformer baseline trained from scratch. We observe that Whisper models encode acoustic information primarily in the features of the last Transformer layer, whereas Wav2Vec 2.0 models rely on features from middle and early layers. Furthermore, we show that multilingual pre-training benefits Whisper models, leading to strong performance in Italian, Canadian French, French, Spanish, German, and competitive results in Greek, Egyptian Arabic, and Persian. In contrast, English-Only Wav2Vec 2.0 models outperform their multilingual counterpart, XLS-R, in most languages, achieving top performance in Greek and Egyptian Arabic. We attribute the disparity in multilingual pre-training effectiveness to the differences between SSL and WSL strategies, which should be explored further.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The three downstream classifiers used in this work are: Linear (red), Non-Linear (purple), and Multi-Layer (green).The snowflake icon represents frozen weights, while the fire icon denotes trainable weights.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Greyscale map of layer weight distribution from the Multi-Layer classification method. Weights are averaged over all 9 datasets for each model. Darker shades indicate higher weights.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>The features {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 are combined into a new sequence {ℎ * 1 , … , ℎ * 𝑇 } using a learnable weighted sum. The function 𝓈 ∶ ℝ 𝐿×𝑇 ×𝑑 → ℝ 𝑇 ×𝑑 maps {ℎ 𝑙 1 , … , ℎ 𝑙 𝑇 } 𝑙=1,..,𝐿 to {ℎ</figDesc><table /><note>* 1 , … , ℎ * 𝑇 } as follows:</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1</head><label>1</label><figDesc>: This dataset includes recordings of 6 different sentences delivered by 12 actors (6 female, 6 male) portraying the Big Six emotions and a neutral state in Canadian French. It offers a high-quality version with a sampling rate of 192 kHz at 24 bits per sample, as well as Summary statistics of the 9 datasets used in this work.</figDesc><table><row><cell>Dataset</cell><cell>Language</cell><cell># Samples</cell><cell>Emotions</cell></row><row><cell>AESDD</cell><cell>Greek</cell><cell>500</cell><cell>anger, disgust, fear, happiness, and sadness</cell></row><row><cell>CaFE</cell><cell>Canadian French</cell><cell>936</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>DEMoS</cell><cell>Italian</cell><cell>9697</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>EmoDB</cell><cell>German</cell><cell>535</cell><cell>anger, disgust, fear, happiness, boredom, sadness, and neutrality</cell></row><row><cell>EmoMatch</cell><cell>Spanish</cell><cell>2005</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>EMOVO</cell><cell>Italian</cell><cell>588</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>EYASE</cell><cell>Egyptian Arabic</cell><cell>579</cell><cell>anger, happiness, sadness, and neutrality</cell></row><row><cell>Oréau</cell><cell>French</cell><cell>502</cell><cell>anger, disgust, fear, happiness, surprise, sadness, and neutrality</cell></row><row><cell>ShEMO</cell><cell>Persian</cell><cell>400</cell><cell>anger, happiness, sadness, and neutrality</cell></row><row><cell cols="4">a down-sampled version at 48 kHz and 16 bits per sample.</cell></row><row><cell cols="3">The total number of samples amounts to 936.</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>, comparing three classifier methods for Wav2Vec 2.0 and Whisper models. The</figDesc><table><row><cell>Backbone</cell><cell>Linear</cell><cell>Non-Linear Multi-Layer</cell></row><row><cell cols="3">Wav2Vec 2.0 Base 47.87 (± 0.93) 42.07 (± 5.27) 53.42 (± 1.27)</cell></row><row><cell cols="3">Wav2Vec 2.0 Large 12.09 (± 1.50) 12.93 (± 3.31) 57.50 (± 0.03)</cell></row><row><cell>XLS-R</cell><cell cols="2">5.43 (± 0.40) 5.86 (± 0.07) 40.89 (± 2.00)</cell></row></table><note>Whisper Small (EN) 58.<ref type="bibr" target="#b15">16</ref> (± 0.15) 53.50 (± 0.98) 49.73 (± 2.02) Whisper Small 60.87 (± 0.26) 54.86 (± 0.93) 45.14 (± 1.54) Whisper Medium 60.72 (± 0.16) 55.56 (± 1.09) 37.95 (± 2.27)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head></head><label></label><figDesc>.35) 52.86 (± 0.07) 58.42 (± 4.14) 82.27 (± 0.23) 32.51 (± 4.89) 92.70 (± 1.67) 95.22 (± 0.84) ShEMO (fa) 36.15 (± 0.85) 60.55 (± 3.90) 57.52 (± 9.09) 67.93 (± 0.37) 61.24 (± 8.93) 63.88 (± 1.21) 63.85 (± 1.58)</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell>English-Only</cell><cell></cell><cell></cell><cell>Multilingual</cell><cell></cell></row><row><cell>Dataset/Model</cell><cell>AST</cell><cell>Wav2Vec 2.0 Base  ‡</cell><cell>Wav2Vec 2.0 Large  ‡</cell><cell>Whisper Small  †</cell><cell>XLS-R  ‡</cell><cell>Whisper Small  †</cell><cell>Whisper Medium  †</cell></row><row><cell>AESDD (el)</cell><cell cols="7">19.84 (± 0.16) 25.45 (± 0.98) 28.89 (± 2.64) 28.04 (± 0.99) 9.16 (± 1.25) 26.34 (± 1.65) 27.62 (± 0.62)</cell></row><row><cell>CaFE (fr-ca)</cell><cell cols="7">10.96 (± 6.26) 50.52 (± 3.54) 47.74 (± 0.33) 60.66 (± 0.76) 18.66 (± 0.01) 66.71 (± 0.72) 55.03 (± 0.38)</cell></row><row><cell>DEMoS (it)</cell><cell cols="7">13.75 (± 4.26) 87.85 (± 0.01) 88.31 (± 0.74) 88.24 (± 0.21) 67.71 (± 1.47) 90.61 (± 0.14) 91.17 (± 0.20)</cell></row><row><cell>EmoDB (de)</cell><cell cols="7">46.11 (± 6.55) 81.75 (± 7.30) 88.84 (± 7.48) 83.31 (± 0.18) 67.39 (± 4.33) 87.21 (± 1.11) 90.64 (± 1.47)</cell></row><row><cell cols="8">EmoMatch (es) 36.10 (± 2.63) 69.84 (± 0.69) 71.85 (± 1.55) 67.59 (± 0.35) 44.14 (± 0.25) 73.13 (± 2.54) 68.23 (± 0.78)</cell></row><row><cell>EMOVO (it)</cell><cell cols="7">15.74 (± 1.24) 16.47 (± 0.61) 20.33 (± 1.31) 27.30 (± 0.16) 14.86 (± 2.11) 41.05 (± 1.21) 50.19 (± 0.29)</cell></row><row><cell>EYASE (ar-eg)</cell><cell cols="7">33.23 (± 4.58) 46.31 (± 3.62) 53.40 (± 1.56) 42.65 (± 0.70) 47.27 (± 1.36) 51.98 (± 0.88) 37.32 (± 3.62)</cell></row><row><cell>Oréau (fr)</cell><cell>19.01 (± 2</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 3</head><label>3</label><figDesc>Performance of Wav2Vec and Whisper models across 9 datasets, divided into English-Only and Multilingual LSMs. AST is the baseline. † indicates a Linear Classifier, ‡ a Multi-Layer Classifier. Bold values are the highest scores, and underlined values highlight the best between English-Only and Multilingual models.</figDesc><table /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Deep representation learning for affective speech signal analysis and processing: Preventing unwanted signal disparities</title>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sridhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Busso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="22" to="38" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A survey of deep learning-based multimodal emotion recognition: Speech, text, and face</title>
		<author>
			<persName><forename type="first">H</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Entropy</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page">1440</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A comprehensive review of speech emotion recognition systems</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Wani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Gunawan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A A</forename><surname>Qadri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kartiwi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ambikairajah</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="47795" to="47814" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Speech emotion recognition using cnn</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM international conference on Multimedia</title>
				<meeting>the 22nd ACM international conference on Multimedia</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="801" to="804" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Speech emotion recognition from spectrograms with deep convolutional neural network</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Badshah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Baik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 international conference on platform technology and service (PlatCon)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Speech emotion recognition using deep 1d &amp; 2d cnn lstm networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biomedical signal processing and control</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="page" from="312" to="323" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Enhancing privacy through domain adaptive noise injection for speech emotion recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hashemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Annavaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Narayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="7702" to="7706" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Speech emotion recognition using convolutional and recurrent neural networks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2016 Asia-Pacific signal and information processing association annual summit and conference (APSIPA), IEEE</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">N.-C</forename><surname>Ristea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Ionescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Khan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.09581</idno>
		<title level="m">Septr: Separable transformer for audio spectrogram processing</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Coordvit: a novel method of improve vision transformer-based speech emotion recognition using coordinate information concatenate</title>
		<author>
			<persName><forename type="first">J.-Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-H</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 International conference on electronics, information, and communication (ICEIC)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">An enhanced speech emotion recognition using vision transformer</title>
		<author>
			<persName><forename type="first">S</forename><surname>Akinpelu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Viriri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Adegun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page">13126</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Audio self-supervised learning: A survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mallol-Ragolta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Parada-Cabaleiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Patterns</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Robust speech recognition via large-scale weak supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mcleavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="28492" to="28518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Pepino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Riera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ferrer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.03502</idno>
		<title level="m">Emotion recognition from speech using wav2vec 2.0 embeddings</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Peft-ser: On the use of parameter efficient transfer learning approaches for speech emotion recognition using pre-trained speech models</title>
		<author>
			<persName><forename type="first">T</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">11th International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE</title>
				<imprint>
			<date type="published" when="2023">2023. 2023</date>
			<biblScope unit="page" from="1" to="8" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Evaluating parameter-efficient transfer learning approaches on sure benchmark for speech understanding</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mehrish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bhardwaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mihalcea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Trust-ser: On the trustworthiness of fine-tuning pre-trained speech embeddings for speech emotion recognition</title>
		<author>
			<persName><forename type="first">T</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hebbar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="11201" to="11205" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">wav2vec 2.0: A framework for self-supervised learning of speech representations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="12449" to="12460" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Multi-lingual multi-task speech emotion recognition using wav2vec 2.0</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sharma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="6907" to="6911" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title/>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Upadhyay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martinez-Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B.-H</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-C</forename></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Phonetic anchor-based transfer learning to facilitate unsupervised cross-lingual speech emotion recognition</title>
		<author>
			<persName><forename type="first">W.-S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-T</forename><surname>Chien</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C</forename><surname>Busso</surname></persName>
		</author>
		<author>
			<persName><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A M</forename><surname>Zaidi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Latif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qadir</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.13804</idno>
		<title level="m">Cross-language speech emotion recognition using multimodal dual attention transformers</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Salimans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<title level="m">Improving language understanding by generative pre-training</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Dosovitskiy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kolesnikov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Weissenborn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Unterthiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Minderer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Heigold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.11929</idno>
		<title level="m">An image is worth 16x16 words: Transformers for image recognition at scale</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Wavlm: Large-scale self-supervised pre-training for full stack speech processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yoshioka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Selected Topics in Signal Processing</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="1505" to="1518" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">S.-W</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-H</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-S</forename><surname>Chuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-I</forename><forename type="middle">J</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G.-T</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.01051</idno>
		<title level="m">Superb: Speech processing universal performance benchmark</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<title level="m" type="main">Unsupervised cross-lingual representation learning for speech recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Baevski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Collobert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.13979</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Babu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tjandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Von Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Saraf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2111.09296</idno>
		<title level="m">Xls-r: Self-supervised cross-lingual speech representation learning at scale</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Deep learning for the detection of emotion in human speech: The impact of audio sample duration and english versus italian languages</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wurst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hopwood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-D</forename><surname>Yao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 32nd Wireless and Optical Communications Conference (WOCC), IEEE</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="6" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Cross-lingual and multilingual speech emotion recognition on english and french</title>
		<author>
			<persName><forename type="first">M</forename><surname>Neumann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 IEEE international conference on acoustics, speech and signal processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="5769" to="5773" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">When low resource nlp meets unsupervised language model: Meta-pretraining then meta-learning for few-shot text classification (student abstract)</title>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="13773" to="13774" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Cross lingual speech emotion recognition: Urdu vs. western languages</title>
		<author>
			<persName><forename type="first">S</forename><surname>Latif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Qayyum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Usman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qadir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2018 International conference on frontiers of information technology (FIT), IEEE</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="88" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Emomatchspanishdb: study of speech emotion recognition machine learning models in a new spanish elicited database</title>
		<author>
			<persName><forename type="first">E</forename><surname>Garcia-Cuesta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Salvador</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Pãez</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Multimedia Tools and Applications</title>
		<imprint>
			<biblScope unit="volume">83</biblScope>
			<biblScope unit="page" from="13093" to="13112" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Zadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cambria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-P</forename><surname>Morency</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.07250</idno>
		<title level="m">Tensor fusion network for multimodal sentiment analysis</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Mao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.00800</idno>
		<title level="m">A survey on self-supervised pre-training for sequential transfer learning in neural networks</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A vector quantized masked autoencoder for speech emotion recognition</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sadok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Leglaive</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Séguier</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2023 IEEE International conference on acoustics, speech, and signal processing workshops (ICASSPW)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<author>
			<persName><forename type="first">F</forename><surname>Catania</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Speech emotion recognition in italian using wav2vec 2</title>
		<title level="s">Authorea Preprints</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Analyzing hidden representations in end-to-end automatic speech recognition systems</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Singla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Shah</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.00387</idno>
		<title level="m">What all do audio transformer models hear? probing acoustic representations for language delivery and its structure</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Layer-wise analysis of a self-supervised speech representation model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Chou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Livescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="914" to="921" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Exploration of a self-supervised speech model: A study on emotional corpora</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mohamied</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Spoken Language Technology Workshop (SLT), IEEE</title>
				<imprint>
			<date type="published" when="2022">2022. 2023</date>
			<biblScope unit="page" from="868" to="875" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Comparative layerwise analysis of self-supervised speech models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pasad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Livescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">Speech emotion recognition for performance interaction</title>
		<author>
			<persName><forename type="first">N</forename><surname>Vryzas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kotsakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liatsou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Dimoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kalliris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Audio Engineering Society</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="page" from="457" to="467" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">A canadian french emotional speech dataset</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gournay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lahaie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lefebvre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th ACM multimedia systems conference</title>
				<meeting>the 9th ACM multimedia systems conference</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="399" to="402" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Demos: An italian emotional speech corpus: Elicitation methods, machine learning, and perception</title>
		<author>
			<persName><forename type="first">E</forename><surname>Parada-Cabaleiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Costantini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Batliner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schmitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">W</forename><surname>Schuller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page" from="341" to="383" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<analytic>
		<title level="a" type="main">A database of german emotional speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Burkhardt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paeschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rolfes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">F</forename><surname>Sendlmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Weiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Interspeech</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="1517" to="1520" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">Emovo corpus: an italian emotional speech database</title>
		<author>
			<persName><forename type="first">G</forename><surname>Costantini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Iaderola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paoloni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Todisco</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ninth international conference on language resources and evaluation (LREC&apos;14), European Language Resources Association (ELRA)</title>
				<meeting>the ninth international conference on language resources and evaluation (LREC&apos;14), European Language Resources Association (ELRA)</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3501" to="3504" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Egyptian arabic speech emotion recognition using prosodic, spectral and wavelet features</title>
		<author>
			<persName><forename type="first">L</forename><surname>Abdel-Hamid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Speech Communication</title>
		<imprint>
			<biblScope unit="volume">122</biblScope>
			<biblScope unit="page" from="19" to="30" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<monogr>
		<title level="m" type="main">French emotional speech database -oréau</title>
		<author>
			<persName><forename type="first">S</forename><surname>Oréau</surname></persName>
		</author>
		<ptr target="https://zenodo.org/records/4405783" />
		<imprint>
			<date type="published" when="2021">2021</date>
			<pubPlace>Zenodo</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<analytic>
		<title level="a" type="main">Shemo: a large-scale validated database for persian speech emotion detection</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">Mohamad</forename><surname>Nezami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Karami</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Language Resources and Evaluation</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="1" to="16" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-A</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.01778</idno>
		<title level="m">Ast: Audio spectrogram transformer</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
