THAU-UPM at EmoSPeech-IberLEF2024: Efficient Adaptation of Mono-modal and Multi-modal Large Language Models for Automatic Speech Emotion Recognition Sergio Esteban-Romero1 , Jaime Bellver-Soler1 , Iván Martín-Fernández1 , Manuel Gil-Martín1 , Luis Fernando D’Haro1 and Fernando Fernández-Martínez1 1 Grupo de Tecnología del Habla y Aprendizaje Automático (THAU Group), Information Processing and Telecommunication Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid (UPM) Abstract Automatic Speech Emotion Recognition (SER) is a pivotal task across domains such as psychology, healthcare, or human-computer interaction. This study, conducted within the framework of the EmoSpeech challenge at IberLEF 2024, explores efficient adaptation techniques for mono-modal and multi-modal large language models (LLMs) for SER in Spanish. The challenge includes two tasks: one that relies solely on textual transcriptions and another that integrates audio signals. For the text-only task, we fine-tune the multilingual Gemma-2B model using the Low-Rank Adaptation (LoRA) method, optimizing low-rank adaptation parameters. For the multi-modal task, we employ Qwen-Audio-Chat, enhancing its performance using the LoRA technique, and we propose a novel Whisper-Gemma model that integrates Whisper-large-v3 audio encoder with the Gemma LLM, training only a projection layer. The metrics obtained as a result of the experimentation process demonstrate the potential of both approaches. The fine-tuned Gemma-2B model achieves an f1-macro score of 0.6094 on the text-only task, while the Qwen-Audio-Chat model reaches 0.8248, indicating significant improvements in emotional recognition capabilities when combining both modalities. Additionally, the Whisper-Gemma model achieves a competitive f1-macro score of 0.7904, underscoring the effectiveness of using pre-trained audio encoders and smaller LLMs in SER tasks. These findings highlight the value of parameter-efficient fine-tuning methods and the integration of robust audio encoders with LLMs to enhance SER performance. Keywords Speech Emotion Recognition, MultiModal Large Language Models, Low-Rank Adaptation, Parameter-efficient Fine-Tuning 1. Introduction Understanding human emotions is crucial in various domains ranging from psychology and healthcare to human-computer interaction and marketing [1]. Accurately recognizing them can provide valuable information on human behavior, preferences, or mental health, among others. Although emotions manifest in multiple modalities, our study focuses specifically on spoken Spanish audio and its transcripts as sources of emotional expression. In addition, a relationship between the topic of conversation and the emotion expressed has been studied in the literature [2]. This work contributes to the EmoSPeech challenge at Iberlef 2024 [3]. It poses the problem of automatic Speech Emotion Recognition (SER) through two distinct tasks: one that uses only textual transcriptions and the other that integrates the audio signals corresponding to such transcriptions [4]. By focusing on both modalities, we aim to capture the nature of emotional expression while leveraging the strengths of combining both modalities for enhanced accuracy and robustness. IberLEF 2024, September 2024, Valladolid, Spain $ sergio.estebanro@upm.es (S. Esteban-Romero); jaime.bellver@upm.es (J. Bellver-Soler); ivan.martinf@upm.es (I. Martín-Fernández); manuel.gilmartin@upm.es (M. Gil-Martín); luisfernando.dharo@upm.es (L. F. D’Haro); fernando.fernandezm@upm.es (F. 
Fernández-Martínez)  0009-0008-6336-7877 (S. Esteban-Romero); 0009-0006-7973-4913 (J. Bellver-Soler); 0009-0004-2769-9752 (I. Martín-Fernández); 0000-0002-4285-6224 (M. Gil-Martín); 0000-0002-3411-7384 (L. F. D’Haro); 0000-0003-3877-0089 (F. Fernández-Martínez) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings The human voice is a complex signal that condenses a wealth of information about the speaker, including their age, gender, health, and even emotional state [5]. Subtle variations in pitch, tone and rhythm capture these nuances. Analyzing these acoustic features along with textual transcripts, we can learn more about the emotional content of the spoken words, helping to recognize emotions more accurately and appropriately in different contexts. In detail, the challenge dataset consists of a set of 6 emotions: neutral, disgust, anger, joy, sadness and fear. Our approach for the challenge is based on leveraging the strengths of a family of Language Models that, although smaller in size than the top performing LLMs [6, 7], have been trained on a vast amount of data spanning multiple disciplines and have therefore acquired a solid grounding of knowledge about our world. In particular, we focus on the Gemma [8] and Qwen [9, 10] models. More specifically, we explore parameter-efficient methods for adapting this pre-training sapiens to the context of SER, either using written language-only approaches or imbuing them with audio capabilities. When only textual transcripts are considered, we explore how adapting Gemma using Low-Rank Adaptation (LoRA) [11] and SER related data can capture semantic and syntactic information related to emotional patterns. Alternatively, when incorporating audio data, we fine-tune state-of-the-art MultiModal Large Language Models (MM-LLMs) so the LLM also has the ability to comprehend not only textual content but also the acoustic features extracted by the audio encoder, thus enriching the understanding of emotions across both modalities. In particular, we choose to fine-tune Qwen Audio [10] due to its demonstrated state-of- the-art performance across various benchmarks, leveraging its pre-trained capabilities. Additionally, to evaluate the trade-off between accuracy and computational efficiency, we compare Qwen-Audio’s performance with a smaller model. This lightweight model integrates features extracted by an audio encoder into the Gemma LLM. Here, only the projection layer that maps the audio encoder output to the LLM semantic space is trained, allowing the model to effectively handle the audio data. The contributions to emotion recognition described in this work can be summarized as: • Leveraging large language models (LLMs): We fine-tune LLMs to directly predict emotions from textual transcripts. This approach allows us to obtain probability distributions for the emotions under analysis. • Enhancing classification with audio features: We investigate the potential of incorporating audio features by fine-tuning multimodal LLMs (MM-LLMs). This aims to enhance the classification capabilities for emotion recognition. • Exploring the efficiency of adapting and using small LLMs: We explore training a projection layer that allows pre-trained encoders (inspired by MM-LLMs) to be used independently with the LLM for emotion recognition. 
This approach investigates the possibility of achieving good results with computationally less expensive models.

This paper is organized as follows: In Section 2, we provide a comprehensive review of related work, discussing traditional and deep learning approaches for SER and recent advancements in multimodal strategies and LLMs. Section 3 details our proposed methods for the EmoSpeech challenge, including the fine-tuning of Gemma-2B using LoRA for text-only classification and the adaptation of the Qwen-Audio-Chat and Whisper-Gemma models for multimodal classification. Section 4 describes the materials and methods used in our experiments, including data sources, experimental setup, and hyperparameter configurations. In Section 5, we present our results and discussion, comparing in-house validation results and official challenge test results for both tasks. Finally, Section 6 concludes the paper, summarizing our findings and outlining future research directions to further improve automatic SER systems.

2. Related Work

Traditionally, SER approaches have relied on extracting meaningful descriptors of the audio signal. This process often involved determining the most effective features for identifying emotions, including but not limited to the Fast Fourier Transform (FFT) and Mel-Frequency Cepstral Coefficients (MFCC) [12]. This line of research progressed through approaches that exploit these and other features to train classical Machine Learning models such as Support Vector Machines (SVMs) [13], Linear Discriminant Analysis (LDA) [14], k-Nearest Neighbors [15] or even unsupervised clustering algorithms [16]. However, these types of approaches have struggled to capture the specifics of emotion in audio [1]. With the advent of Deep Learning methods came systems that no longer rely on manually curated features but classify emotions directly from the audio signal or its spectral representations, using architectures such as Convolutional Neural Networks (CNNs) or Long Short-Term Memory (LSTM) networks [17, 18, 19]. Studies on Spanish SER remain scarce, mainly due to the sparsity of available data. However, recent efforts go beyond the audio signal alone and incorporate textual transcriptions in hybrid approaches that combine deep audio encoders, text models and late fusion strategies [20, 2].

The recent use of multimodal strategies for SER based on textual models motivates a shift of focus towards Large Language Models (LLMs), which have been revolutionizing the field of Natural Language Processing (NLP) for the past few years. Although the most capable LLMs contain hundreds of billions of parameters, recent advancements have produced models with significantly fewer parameters that show great adaptability, enabling them to tackle complex tasks efficiently. Examples include Qwen-7B [9], developed by Alibaba, with 7 billion parameters, and Google's Gemma-2B [21], with a more compact size of 2 billion parameters. These models demonstrate impressive performance despite their relatively small size. The Qwen [9] LLM family is based on the transformer-based LLaMA architecture [22]. However, Qwen models introduce several modifications, such as untying the weights of the input embedding and the output projection, the use of rotary positional embeddings [23], the removal of bias inputs in all layers and an adjusted activation function, among others. These refinements lead the Qwen-7B model to outperform LLaMA-7B on various benchmarks.
However, a key limitation of Qwen is its training data, which consists mainly of English and Chinese text, limiting its use in multilingual settings unless it is adapted. The Gemma-2B model, despite its relatively small size of 2 billion parameters, shows competitive performance on different benchmarks when compared to larger models such as LLaMA-7B [8]. Another interesting example of a relatively small LLM is Microsoft's Phi-2, which provides outstanding results while maintaining a relatively low number of parameters [24]. It follows the training methodology presented in "Textbooks Are All You Need" [24], which emphasizes the use of high-quality data to enhance performance when using smaller models.

Recent advances have expanded the scope of LLMs by incorporating multimodal capabilities, so that they can now process and understand various data types beyond text, such as audio and images. By simultaneously processing different data modalities, MM-LLMs benefit from the semantic and syntactic understanding acquired by the LLM to extend its comprehension strengths to new domains such as image or audio processing. Typically, they are formed by encoders that convert each data type (e.g., image, audio) into a representation that is translated, using a projection layer, into a format the LLM can understand [25]. Finally, the output provided by the LLM combines the information from all modalities, capturing the relationships between them. As an example, Qwen-Audio extends the capabilities of the Qwen-7B LLM with audio understanding by integrating a fine-tuned version of Whisper-large-v2 as an audio encoder, together with a projection layer that serves as an interpreter of the audio features for the LLM.

Whisper [26] is a state-of-the-art audio encoder that uses 80-channel log-mel spectrograms for speech recognition and audio transcription with a transformer-based architecture. Although Whisper models are trained for speech recognition and translation, their audio representations perform well across a variety of tasks. Whisper obtains outstanding results on transcription tasks thanks to its powerful embedding representations, which have been shown to contain relevant information about the sounds that form an audio file [27, 28]. As a result, it provides Qwen-Audio with strong reasoning capabilities and understanding of the audios given as input. Although Qwen-Audio uses a fine-tuned version of Whisper-large-v2 as its audio encoder, an upgraded version of this speech recognition model has recently been released. Whisper-large-v3 follows an architecture similar to that of its previous version but increases the size of the input by using 128 mel frequency bins instead of 80 [26]. This architectural upgrade and its higher performance motivate its use over its predecessors when building systems that solve a downstream task.

Although impressive in size and capabilities, Whisper V3 has not been trained explicitly for the SER problem. This issue is tackled by emo2vec [29], a model that, although smaller in parameters, is pre-trained with a specific focus on SER tasks and can therefore generate audio representations that are potentially more meaningful for the challenge at hand. Emo2vec uses a CNN as a feature extractor, followed by a linear projection layer and a mask operation, to obtain the input for the backbone network, which is based on the data2vec architecture [30].
Backbone networks form the core component responsible for learning high-level features from the input data [31]. To the best of our knowledge, a comparative analysis between the strengths of a largely pre-trained, generalist audio model (Whisper) and smaller but more task-oriented approaches (emo2vec) on the SER task is yet to be carried out, which motivates further exploration of both.

While MM-LLMs have achieved outstanding results in a wide variety of tasks, training them is challenging due to their large number of parameters, which translates into a huge computational demand. LoRA [11] is a technique designed to address these challenges by enabling efficient fine-tuning of large models. It is based on the idea that most of the information contained in the matrix that stores the updates of the model is captured by matrices with a lower intrinsic rank. The update can therefore be decomposed into a product of lower-rank adaptation matrices, which significantly reduces the number of trainable parameters compared to traditional fine-tuning, leading to faster training and lower memory requirements. In this work, we focus on fine-tuning the Gemma-2B model and the Qwen-Audio model. Furthermore, we propose a novel architecture in which we explore the potential of utilizing a combination of smaller, pre-trained LLMs, like Gemma or Phi, alongside the Whisper-large-v3 and emo2vec audio encoders.

3. Proposal

3.1. Task 1 - Fine-tune Gemma using LoRA

Our approach for the Speech Emotion Recognition through text transcription task works under the hypothesis that a fairly sized LLM that has been trained on a huge amount of multilingual data, such as Gemma [21], has the potential to understand and model the emotion of the speaker through the nuances embedded in the written language with few adjustments. For that reason, we fine-tune the Gemma model using the LoRA method on the transcriptions provided with the challenge dataset. In particular, we use the google/gemma-1.1-2b-it checkpoint from the HuggingFace Hub1 as a starting point. As reported in the model card, the 1.1 family of Gemma models is instruction tuned using a novel Reinforcement Learning with Human Feedback (RLHF) method that enhances their truthfulness and ability to follow instructions.

We train our proposed system on the next token prediction task, i.e., we model the probability distribution of the next token x_n, p(x_n | {x_0, x_1, ..., x_{n-1}}), given the previous sequence of tokens, or prompt, {x_0, x_1, ..., x_{n-1}}. The prompt used for the training process can be found in Table 1, where the transcription placeholder is replaced in each case with the corresponding text transcription. Since the Gemma model has been trained as a conversational agent, we pre-process the prompt by applying a chat template that instructs the model to predict the next conversational turn: user\n[prompt]\nmodel, where [prompt] is the original instruction prompt shown in Table 1. During training, the label is appended as the first word of the model turn, so that our system learns to generate that word given the preceding context (in our case, x_n would be the emotion assigned to a given sample and {x_0, x_1, ..., x_{n-1}} represents the prompt that includes a transcription of the audio belonging to that sample). By modeling next token prediction, the LLM is able to capture the nuances involved in the language of each subject and how they translate into the perceived emotion.

1 https://huggingface.co/google/gemma-1.1-2b-it

Table 1
Prompt used both to perform zero-shot evaluation and to fine-tune the model using LoRA.

Gemma 2B: "Instruction: Given the following audio transcription, predict the emotion of the speaker. The emotion can be one of the following: [neutral, sad, angry, happy, surprise, fear, disgust, contempt] The transcription is:"

Qwen-Audio-Chat: "Select the predominant emotion from neutral, disgust, anger, joy, sadness and fear. Answer only with one of the emotion words and nothing else. Consider also its transcription:"
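As an illustration of how a training example could be assembled from the Table 1 prompt, the snippet below wraps the instruction in Gemma's chat template and appends the gold emotion as the first word of the model turn, so that it becomes the next token to predict. This is a minimal sketch under our own assumptions (the {transcription} placeholder and the helper name are ours), not the authors' released code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it")

PROMPT = (
    "Instruction: Given the following audio transcription, predict the emotion of the speaker. "
    "The emotion can be one of the following: "
    "[neutral, sad, angry, happy, surprise, fear, disgust, contempt] "
    "The transcription is: {transcription}"
)

def build_training_text(transcription: str, emotion: str) -> str:
    """Return the chat-formatted prompt with the label appended as the model's first word."""
    messages = [{"role": "user", "content": PROMPT.format(transcription=transcription)}]
    # add_generation_prompt=True appends the opening of the model turn.
    chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    return chat + emotion  # the emotion word is what the LLM learns to generate

print(build_training_text("no me lo puedo creer, es increíble", "joy"))
```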
In the spirit of leveraging the immense capabilities of state-of-the-art Language Models and adapting them to the downstream task in the most efficient manner, we use the LoRA technique to build our systems for this subtask. This way, instead of adapting the whole set of parameters that constitute the LLM, a low-rank adaptation of each weight matrix is appended and trained while the original model is kept untouched. This "addendum", which requires a significantly smaller computational budget to be trained, can be used in combination with the pre-trained LLM to modify its original behaviour, in this case with the aim of turning it into a SER system. Mathematically speaking, a forward pass over any given layer of the adapted model can be formulated as:

h = W_0 x + (α / r) ΔW x    (1)

where W_0 is the original weight matrix of the Transformer layer, which is not subject to backpropagation, ΔW is the LoRA matrix of rank r that is trained at this stage, and α is a design hyperparameter that controls the prevalence given to the transformations carried out by the LoRA surrogate with respect to the original LLM. This process enables learning a low-rank matrix representation of the original model that is easily trainable with the available resources and data, thus effectively combining large-scale pre-training knowledge with task-aware capabilities. The two design parameters associated with this technique, α (LoRA alpha) and r (LoRA rank), may be decisive for the overall performance of the final system. To evaluate this, we employ a 5-fold cross-validation (CV) scheme to compare different combinations of LoRA parameters, allowing us to identify the hyperparameter combination that yields the best average f1-macro score.
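For reference, a minimal sketch of how such a LoRA adapter could be attached to the Gemma checkpoint with the HuggingFace peft library is shown below. The target modules and dropout are illustrative assumptions rather than the authors' exact configuration; only the rank and alpha mirror the best values reported later (r = α = 128).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("google/gemma-1.1-2b-it")

lora_config = LoraConfig(
    r=128,                 # LoRA rank, best value found in the 5-fold CV
    lora_alpha=128,        # LoRA alpha, matched to the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,     # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are updated; W_0 stays frozen
```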
3.2. Task 2 - Fine-tuning Qwen-Audio using LoRA

Our proposal for the SER task that uses the audio and the corresponding text transcriptions is based on fine-tuning the Qwen-Audio-Chat [10] model. Figure 1 shows the architecture of the Qwen-Audio-Chat model used for speech emotion classification. Our interest stems from its strengths: it has been trained on huge amounts of data from diverse datasets across various tasks and has shown remarkable performance on multiple benchmarks. In addition, SER is one of its pre-training objectives, making it a strong foundation for our study. We specifically utilized the Qwen-Audio-Chat2 checkpoint, which has improved reasoning and understanding capabilities due to an additional supervised training stage. However, the output of the model often includes the predicted emotion within a sentence, requiring a post-processing step to extract the actual predicted label. In case no emotion is returned, or the returned emotion falls outside the expected set, the predominant emotion 'neutral' is assigned. The prompt used to evaluate model performance is specified in Table 1, where the transcription placeholder is replaced in each case with the transcription corresponding to the analyzed audio sample.

2 https://huggingface.co/Qwen/Qwen-Audio-Chat

Figure 1: The diagram illustrates the architecture of the Qwen-Audio [10] model used as a speech emotion classifier. The Audio Encoder and the Tokenizer provide sequences of tokens that represent the audio signal and the prompt with the transcription given as input, respectively. These sequences are concatenated and processed by the Qwen-7B LLM, which has been fine-tuned using LoRA to provide the most likely emotion based on the combined information. Snowflakes represent the components that remain frozen throughout the training process, while the flame indicates those adapted via fine-tuning. The predicted emotion is expected to be one of those specified in the input prompt (the Qwen-Audio-Chat prompt of Table 1).

To tailor the model for SER and address the output format, we fine-tuned the LLM in the Qwen-Audio-Chat architecture employing the LoRA technique [11]. The audio encoder remains frozen during the entire process. LoRA facilitates efficient training of large models, greatly reducing computational demands, as described in Section 3.1. Here, we also conducted a 5-Fold CV process to identify the LoRA rank and alpha configuration that achieves the best performance based on the average f1-macro. The chosen hyperparameters are then applied to the final model evaluated on the challenge test dataset. For this fine-tuning process, we use the Swift framework [32] due to its support for training MM-LLMs.
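The label-extraction step described earlier in this subsection (used both for zero-shot evaluation and for reading out predictions) can be as simple as the keyword-matching sketch below; the function name and matching rule are our own illustration, not the authors' code.

```python
EMOTIONS = ["neutral", "disgust", "anger", "joy", "sadness", "fear"]

def extract_emotion(model_output: str) -> str:
    """Recover the predicted label from a free-form model response."""
    text = model_output.lower()
    for emotion in EMOTIONS:
        if emotion in text:
            return emotion
    return "neutral"  # no (valid) emotion mentioned -> assign the predominant class

assert extract_emotion("The speaker expresses anger in this clip.") == "anger"
assert extract_emotion("I cannot tell.") == "neutral"
```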
3.3. Task 2 - MM-LLM: Gemma + Whisper

We propose a model architecture, shown in Figure 2, that integrates both audio and textual information, inspired by the structure of the Qwen-Audio model. However, the core difference with respect to Qwen-Audio [10] or similar approaches [9, 33] is that only the projection layer is trained, as we hypothesize that this will allow the language model to comprehend the audio features extracted by the audio encoder. Another relevant modification of our proposal with respect to Qwen-Audio [10] is that we employ Whisper-large-v3 [26] as the audio encoder, instead of Whisper-large-v2, since it processes spectrograms with a higher resolution as input (128 mel filters instead of 80) [26]. In addition, we prioritize efficiency by employing smaller LLMs, reducing the parameter count of the original Qwen-Audio model by more than half [21, 24].

Our model employs a frozen LLM as the textual understanding backbone and a frozen audio encoder tailored to the extraction of acoustic features. We employ a linear layer as a projector that enables the communication between the audio encoder and the LLM. To address the imbalanced emotion distribution present in the dataset, we applied a weighted cross-entropy loss function during training, assigning higher weights to underrepresented classes to enhance model performance and robustness. We then extract the final logits of the LLM that correspond to each emotion token, ensuring that the LLM categorizes using only the provided emotion labels. Finally, for inference, we apply a softmax layer to obtain the probability distribution over all the possible emotions.

Figure 2: Architecture for processing audio and textual modalities with an LLM. On the left, the audio encoder processes the audio signal and a projection layer maps the extracted audio features into the LLM domain. This Linear (Projector) module is trained from scratch. Then, information from both modalities is combined and used as input to the LLM, represented on the right, which remains frozen during the whole process. Finally, the logits corresponding to all the emotions considered are gathered to obtain the probability distribution over all emotions. The prompt depicted in the figure reads: "Instruction: Given the following audio information and the transcription, predict the emotion of the speaker. The emotion can be one of the following: [neutral, disgust, anger, joy, sadness, fear] The transcription is: {transcripts} Audio information: \n Output: emotion =".

In the end, audio content and temporal dynamics are captured and processed through the Mel spectrograms received by Whisper-large-v3 as input, providing a set of features that are merged with the hidden representations provided by the LLM. The purpose of the introduced projection layer is to obtain a unified representation by mapping audio features into the semantic space of the LLM. This enables the LLM to effectively interpret the audio data and leverage any latent emotional patterns present.
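A condensed PyTorch sketch of this design is given below: a frozen Whisper-large-v3 encoder, a frozen Gemma-2B backbone, and a single trainable linear projector whose output is prepended to the embedded prompt; class scores are read from the next-token logits of the emotion words. Checkpoint names, the use of the first sub-token of each emotion word, and the stated dimensions are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, WhisperModel

EMOTIONS = ["neutral", "disgust", "anger", "joy", "sadness", "fear"]

class WhisperGemmaSER(nn.Module):
    def __init__(self, audio_id="openai/whisper-large-v3", llm_id="google/gemma-1.1-2b-it"):
        super().__init__()
        self.audio_encoder = WhisperModel.from_pretrained(audio_id).encoder  # frozen
        self.llm = AutoModelForCausalLM.from_pretrained(llm_id)              # frozen
        self.tokenizer = AutoTokenizer.from_pretrained(llm_id)
        for module in (self.audio_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False
        # Only trainable component: maps Whisper features into Gemma's embedding space.
        self.projector = nn.Linear(self.audio_encoder.config.d_model,   # 1280 for large-v3
                                   self.llm.config.hidden_size)         # 2048 for Gemma-2B
        # First sub-token id of each emotion word, used to select the class logits.
        ids = [self.tokenizer(e, add_special_tokens=False).input_ids[0] for e in EMOTIONS]
        self.register_buffer("emotion_token_ids", torch.tensor(ids))

    def forward(self, input_features, prompt_ids):
        # input_features: (B, 128, T) log-mel spectrograms; prompt_ids: (B, L) tokenized prompt.
        audio_states = self.audio_encoder(input_features).last_hidden_state  # (B, S, 1280)
        audio_embeds = self.projector(audio_states)                          # (B, S, 2048)
        text_embeds = self.llm.get_input_embeddings()(prompt_ids)            # (B, L, 2048)
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        next_token_logits = self.llm(inputs_embeds=inputs_embeds).logits[:, -1, :]
        return next_token_logits[:, self.emotion_token_ids]                  # (B, 6) emotion logits

# Training: weighted cross-entropy over the 6 selected logits to counter class imbalance,
# e.g. loss = nn.CrossEntropyLoss(weight=class_weights)(model(feats, ids), labels).
# Inference: emotion probabilities are obtained with a softmax over the 6 selected logits.
```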
4. Materials and Methods

4.1. Data Source

The official data for the challenge come from the Multimodal speech-text corpus for Emotion Analysis in Spanish from natural environments (MEACorpus) 2023 Dataset [2]. A pioneering source of multimodal data annotated in terms of emotion in Spanish, it consists of a collection of audio segments extracted from YouTube videos uploaded to Spanish channels. The audios, lasting approximately 9 seconds on average, are annotated in terms of Ekman's six basic emotions: anger, disgust, sadness, joy, fear, and surprise, as well as a neutral class. A subset of the Spanish MEACorpus 2023 Dataset is used for the task, comprising a grand total of 3,750 audios split into a train (3,000 samples) and a test (750 samples) subset. The corpus includes both processed audio segments and automatically generated transcriptions, as well as the golden labels for the train split. The distribution of labels for the training data is shown in Figure 3. Firstly, as originally reported by the authors, there are no examples of the "surprise" class in the dataset. Secondly, there are clear signs of class imbalance in the data, with "fear" being heavily underrepresented.

Figure 3: Label distribution for the train partition of the MEACorpus 2023.

4.2. Experimental Setup

In order to validate our approaches in-house before submitting the challenge runs, a Stratified 5-Fold Cross-Validation scheme is used in all our experiments, which benefits robustness and statistical relevance. In this way, the training data are split 5 times into training and validation sets, with the validation splits of the folds being disjoint. Furthermore, the data splits are stratified, i.e., they contain approximately the same distribution of labels as the entire training corpus, thus enforcing similar experimental conditions across the folds. As the validation metric, we use the f1-macro score, the official figure of merit for the task, which weights all classes equally regardless of class imbalance. For the challenge runs, we utilized the entire train dataset to obtain our final models.

For Task 1, all models are trained for 6 epochs using the Adam optimizer, a batch size of 4 and a learning rate of 10⁻⁴ with a linear scheduler decreasing to a final value of 0. For Task 2, Qwen-Audio-Chat is fine-tuned for 6 epochs using the Adam optimizer with a weight decay of 0.1, a batch size of 3 and a learning rate of 10⁻⁴ with a decreasing linear scheduler. All experiments were carried out on a single NVIDIA A100 40GB GPU.
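The validation protocol just described can be reproduced with scikit-learn as sketched below; train_fold and evaluate_fold are placeholders for the actual fine-tuning and inference routines, which depend on the model being validated.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def cross_validate(samples, labels, train_fold, evaluate_fold, seed=42):
    """Stratified 5-fold CV returning the mean and per-fold macro-averaged F1."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(samples, labels):
        model = train_fold([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        preds = evaluate_fold(model, [samples[i] for i in val_idx])
        scores.append(f1_score([labels[i] for i in val_idx], preds, average="macro"))
    return float(np.mean(scores)), scores
```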
5. Results and Discussion

5.1. In-house results

5.1.1. Task 1 - Text only classification

When adapting LLMs to a downstream task using LoRA, the alpha and rank parameters become crucial in the model design. These parameters effectively control the complexity of the "surrogate model" learned by LoRA and how much it overrides the original LLM. To find the optimal settings for our task, we performed a 5-fold cross-validation (CV) on the challenge's training data using various alpha and rank combinations. The results are visualized in Figure 4 as a heatmap of the mean f1-macro scores over the 5 splits in our CV setup for each configuration.

Figure 4: Average f1-macro score after 5-Fold CV for different LoRA parameter configurations when fine-tuning the Gemma 2B model.

The figure reveals a clear sensitivity to the alpha parameter. The best performance occurs when alpha is close to the chosen rank, but it drops significantly at a value of 512. This suggests that overly influential LoRA adaptations can diminish the value of the pre-trained LLM knowledge. Conversely, very low alpha values lead to suboptimal results, as the task-specific knowledge is underutilized. Regarding the rank parameter, for the optimal alpha, higher values generally improve performance. A peak f1-macro score of 0.6694 is achieved with both rank and alpha set to 128. However, this benefit decreases at rank 256. Therefore, it can be empirically inferred that, for this specific problem and model configuration, the original weight matrices are best described using a rank-128 decomposition.

5.1.2. Task 2 - Multimodal classification

To evaluate the performance of Qwen-Audio-Chat [10], we performed a zero-shot evaluation on the entire train dataset, achieving an f1-macro of 0.27, which indicates a misalignment with the task requirements. As shown in Table 1, our prompts specified that only the emotion should be provided. However, the model struggled to follow these instructions accurately, often adding unexpected information to its responses. Consequently, a post-processing step was required to retrieve the actual emotion provided. This suggests difficulty following instructions and a potential language mismatch, as Qwen-Audio (based on Qwen-7B [9]) was trained primarily on Chinese and English data. Additionally, its pre-training data for SER tasks might use different emotion taxonomies than this study.

To address these issues, we fine-tuned Qwen-Audio-Chat using LoRA. Figure 5 presents the results of a 5-Fold CV for different LoRA configurations. The f1-macro score generally increased with the LoRA alpha parameter, indicating the benefit of task-specific adaptation. However, excessively high alpha values (e.g., 4,096) yielded lower f1-macro scores, suggesting that the importance given to the newly learned weights is so high that the original model knowledge is being sidelined. Regarding rank, values above 64 showed minimal performance improvement. In fact, the best configuration achieved an f1-macro score of 0.8164 with a LoRA rank and alpha of 64 and 1,024, respectively.

Figure 5: Average f1-macro score after 5-Fold CV for different LoRA parameter configurations (ranks 16 to 256, alphas 512 to 4,096) when fine-tuning the Qwen-Audio-Chat model.

We further explored multimodal approaches by combining audio encoders with the Gemma LLM. First, we evaluated both the Whisper-large-v3 and Emo2vec encoders with Gemma-2B using a batch size of 4 and a learning rate of 10⁻⁴. Our results on the official test set showed that the Whisper-based model significantly outperformed Emo2vec (0.6636 f1-macro vs. 0.5134). To optimize the Whisper-Gemma model, we conducted a hyperparameter search over batch size and learning rate. Table 2 shows the validation f1-macro scores obtained for the different combinations of hyperparameters. Despite the potential benefits of larger batch sizes in terms of convergence speed and computational efficiency, hardware limitations restricted us to batch sizes of 4 and 6. Interestingly, our experimentation reveals that the model trained with a batch size of 4 achieved a higher validation f1-macro score than the one trained with a batch size of 6. Our best model scored an average f1-macro of 0.8125 after 5-fold CV.

Table 2
Average f1-macro results after 5-Fold CV for different hyperparameter configurations of the Whisper-Gemma model.

Batch size    Learning rate    Validation f1-macro
4             10⁻³             0.8125
4             10⁻⁴             0.7660
6             10⁻³             0.7515
6             10⁻⁴             0.7404

Table 3
Official challenge f1-macro results over the 750 samples of the test set for various models and tasks, along with 95% confidence intervals.

Task 1
Model                                           Test f1-macro
Gemma 2B (Alpha = 32 | Rank = 32)               0.5814 ± 0.0352
Gemma 2B (Alpha = 128 | Rank = 128)*            0.6094 ± 0.0349

Task 2
Model                                           Test f1-macro
Qwen-Audio-Chat (Alpha = 2,048 | Rank = 256)    0.8248 ± 0.0272
Qwen-Audio-Chat (Alpha = 1,024 | Rank = 64)     0.8072 ± 0.0282
Whisper - Gemma                                 0.7904 ± 0.0291
Emo2Vec - Gemma                                 0.5134 ± 0.0359

* This result was obtained at the post-evaluation stage.

5.2. Challenge runs

Table 3 summarizes the performance metrics achieved on the official test set containing 750 examples. For Task 1, the best fine-tuned Gemma-2B model achieved an f1-macro score of 0.6094. This configuration used LoRA rank and alpha values of 128. A configuration with a lower rank and alpha (32) yielded a slightly lower score (0.5814). These results are aligned with the trends observed during validation.

For Task 2, the test results highlight the significance of the audio encoder choice in multimodal models [34]. Specifically, the Whisper-Gemma model achieved a strong f1-macro score of 0.7904, demonstrating consistent performance across the different emotion classes. In contrast, the Emo2vec-Gemma model scored a lower f1-macro of 0.5134. Furthermore, for Task 2 we also tested the Qwen-Audio model with LoRA rank and alpha values of 256 and 2,048, respectively, which achieved a higher f1-macro score of 0.8248 on the official test split. Notice that this value is higher than that of the best configuration obtained in our in-house experimentation (LoRA rank of 64 and alpha of 1,024), which achieved an f1-macro score of 0.8072; however, the reader may note that the difference between both configurations was not statistically significant (see Figure 5). We hypothesize that the improvement may be because higher-rank decomposition matrices result in a better adaptation to the target distribution being modeled.
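Table 3 reports 95% confidence intervals, but the estimation method is not described in the text; a common way to obtain such intervals for f1-macro on a fixed test set is a non-parametric bootstrap over the 750 samples, sketched below under that assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and bootstrap (1 - alpha) confidence interval for macro F1."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return f1_score(y_true, y_pred, average="macro"), (lo, hi)
```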
6. Conclusions and Future Work

This work explored the potential of MM-LLMs to capture the intrinsic patterns of emotions in spoken language, specifically in the context of the EmoSpeech IberLEF 2024 challenge. For Task 1, we explored parameter-efficient fine-tuning of LLMs using audio transcripts, working under the hypothesis that the vast generic knowledge held by this kind of architecture can be leveraged to build a strong emotion predictor with few adjustments. We obtained a top result of 0.6094 with a Gemma-2B model fine-tuned using the LoRA technique with rank and alpha parameters of 128 (0.6694 according to our in-house cross-validation-based experimental setup). This promising result opens the door to further exploration of text-only models for automatic SER by efficiently tuning large models.

For Task 2, we fine-tuned the Qwen-Audio model, efficiently adapting the LLM using LoRA to improve its emotional recognition capabilities. It achieved an f1-macro of 0.8248 on the challenge test dataset, the second best result, showing that the audio features extracted by the pre-trained audio encoder contain relevant emotional information that can be adapted so that the LLM can accurately process it. Interestingly, a performance comparable to our best Qwen-Audio result was achieved with the Whisper-Gemma model, which only fine-tuned a projection layer while keeping both pre-trained models frozen. This result suggests that smaller models (8.4 billion parameters for Qwen-Audio vs. 3.2 billion parameters for Whisper-Gemma) can be effective when paired with appropriate audio encoders (Whisper achieved a 0.7904 f1-macro compared to 0.5134 with Emo2Vec).

These findings suggest several promising directions for future work. First, we plan to investigate the effect of fine-tuning the Qwen-Audio audio encoder specifically for the target language, potentially leading to further performance improvements. Additionally, this could be extended to using models fine-tuned for specific tasks, followed by the joint fine-tuning of the projection layer to create even more capable SER systems. Overall, this work demonstrates the effectiveness of MM-LLMs for emotion recognition in spoken language, contributing valuable insights to the EmoSpeech IberLEF 2024 challenge. Our findings pave the way for further research into efficient and accurate models for this task.

7. Acknowledgments

Sergio Esteban-Romero's research was supported by the Spanish Ministry of Education (FPI grant PRE2022-105516). Iván Martín-Fernández's research was supported by the Universidad Politécnica de Madrid (Programa Propio I+D+i). This work was funded by Project ASTOUND (101071191, HORIZON-EIC-2021-PATHFINDERCHALLENGES-01) of the European Commission and by the Spanish Ministry of Science and Innovation through the projects GOMINOLA (PID2020-118112RB-C22) and BeWord (PID2021-126061OB-C43), funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR". We also thank the JCyL for the equipment provided by the INFRARED program under grant IR2020-1-ULE01.

References

[1] Y. Ulgen Sonmez, A. Varol, In-depth investigation of speech emotion recognition studies from past to present – The importance of emotion recognition from speech signal for AI, Intelligent Systems with Applications 22 (2024) 200351. URL: https://www.sciencedirect.com/science/article/pii/S2667305324000279. doi:10.1016/j.iswa.2024.200351.
[2] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023: A multimodal speech-text corpus for emotion analysis in Spanish from natural environments, Computer Standards & Interfaces (2024) 103856.
[3] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[4] R. Pan, J. A. García-Díaz, M. A. Rodríguez-García, F. García-Sánchez, R. Valencia-García, Overview of EmoSpeech 2024@IberLEF: Multimodal Speech-text Emotion Recognition in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
[5] Y. U. Sonmez, A. Varol, In-Depth Analysis of Speech Production, Auditory System, Emotion Theories and Emotion Recognition, in: 2020 8th International Symposium on Digital Forensics and Security (ISDFS), 2020, pp. 1–8. doi:10.1109/ISDFS49300.2020.9116231.
[6] OpenAI, GPT-4 Technical Report, 2024. arXiv:2303.08774.
[7] Anthropic AI, The Claude 3 Model Family: Opus, Sonnet, Haiku, Claude-3 Model Card (2024).
[8] Gemma Team, Google DeepMind, Gemma: Open Models Based on Gemini Research and Technology, 2024. arXiv:2403.08295.
[9] Qwen Team, Qwen Technical Report, 2023. arXiv:2309.16609.
[10] Y. Chu, J. Xu, X. Zhou, et al., Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models, 2023. arXiv:2311.07919.
[11] E. J. Hu, Y. Shen, P. Wallis, et al., LoRA: Low-Rank Adaptation of Large Language Models, 2021. arXiv:2106.09685.
[12] S. Bou-Ghazale, J. Hansen, A comparative study of traditional and newly proposed features for recognition of speech under stress, IEEE Transactions on Speech and Audio Processing 8 (2000) 429–442. doi:10.1109/89.848224.
[13] B. Schuller, R. Müller, M. Lang, G. Rigoll, Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles (2005).
[14] S. Wu, T. H. Falk, W.-Y. Chan, Automatic speech emotion recognition using modulation spectral features, Speech Communication 53 (2011) 768–785.
[15] S. Kuchibhotla, H. D. Vankayalapati, R. Vaddi, K. R. Anne, A comparative analysis of classifiers in emotion recognition through acoustic features, International Journal of Speech Technology 17 (2014) 401–408.
[16] C. Yogesh, M. Hariharan, R. Yuvaraj, et al., Bispectral features and mean shift clustering for stress and emotion recognition from natural speech, Computers & Electrical Engineering 62 (2017) 676–691.
[17] Mustaqeem, M. Sajjad, S. Kwon, Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM, IEEE Access 8 (2020) 79861–79875. doi:10.1109/ACCESS.2020.2990405.
[18] R. Jahangir, Y. W. Teh, G. Mujtaba, et al., Convolutional neural network-based cross-corpus speech emotion recognition with data augmentation and features fusion, Machine Vision and Applications 33 (2022) 41. doi:10.1007/s00138-022-01294-x.
[19] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control 47 (2019) 312–323. URL: https://www.sciencedirect.com/science/article/pii/S1746809418302337. doi:10.1016/j.bspc.2018.08.035.
[20] I. Zubiaga Amar, R. Justo Blanco, M. De Velasco Vázquez, M. I. Torres Barañano, Speech emotion recognition in Spanish TV Debates, in: Proc. IberSPEECH 2022, ISCA, 2022, pp. 186–190.
[21] Gemini Team, Google DeepMind, Gemini: A Family of Highly Capable Multimodal Models, 2024. arXiv:2312.11805.
[22] H. Touvron, T. Lavril, G. Izacard, et al., LLaMA: Open and Efficient Foundation Language Models, 2023. arXiv:2302.13971.
[23] J. Su, Y. Lu, S. Pan, et al., RoFormer: Enhanced Transformer with Rotary Position Embedding, 2023. arXiv:2104.09864.
[24] S. Gunasekar, Y. Zhang, J. Aneja, et al., Textbooks Are All You Need, 2023. arXiv:2306.11644.
[25] R. Zhang, J. Han, C. Liu, et al., LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention, 2023. arXiv:2303.16199.
[26] A. Radford, J. W. Kim, T. Xu, et al., Robust Speech Recognition via Large-Scale Weak Supervision, 2022. arXiv:2212.04356.
[27] Y. Gong, S. Khurana, L. Karlinsky, J. Glass, Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers, in: INTERSPEECH 2023, ISCA, 2023. doi:10.21437/Interspeech.2023-2193.
[28] D. Zhang, S. Li, X. Zhang, et al., SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities, 2023. arXiv:2305.11000.
[29] Z. Ma, Z. Zheng, J. Ye, et al., emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation, 2023. arXiv:2312.15185.
[30] A. Baevski, W.-N. Hsu, Q. Xu, et al., data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, 2022. arXiv:2202.03555.
[31] O. Elharrouss, Y. Akbari, N. Almaadeed, S. Al-Maadeed, Backbones-Review: Feature Extraction Networks for Deep Learning and Deep Reinforcement Learning Approaches, 2022. arXiv:2206.08016.
[32] The ModelScope Team, SWIFT: Scalable lightWeight Infrastructure for Fine-Tuning, https://github.com/modelscope/swift, 2024.
[33] C. Tang, W. Yu, G. Sun, et al., SALMONN: Towards Generic Hearing Abilities for Large Language Models, 2024. arXiv:2310.13289.
[34] B. McKinzie, Z. Gan, J.-P. Fauconnier, et al., MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, 2024. arXiv:2403.09611.