Team ITST at EmoSPeech-IberLEF2024: Multimodal Speech-text Emotion Recognition in Spanish

Mario Andrés Paredes-Valverde and María del Pilar Salas-Zárate
Tecnológico Nacional de México / I.T.S. Teziutlán, Fracción I y II S/N, 73960 Teziutlán, Puebla, Mexico
mario.pv@teziutlan.tecnm.mx (M. A. Paredes-Valverde); maria.sz@teziutlan.tecnm.mx (M. P. Salas-Zárate)

Abstract
This work describes the participation of the ITST team in the EmoSPeech 2024 Task - Multimodal Speech-text Emotion Recognition in Spanish. To address Task 1, Text AER (Automatic Emotion Recognition), this work proposes a fine-tuning strategy that adapts a set of pre-trained transformer models to a specific task. Regarding Task 2, Multimodal AER, an ensemble learning process is proposed. This approach combines multiple individual models, specifically a wav2vec model and a BETO model, to improve predictive performance compared to the performance of each model separately. The mean and maximum probability measures are used to provide the final prediction.

Keywords
transformers, fine-tuning, wav2vec

1. Introduction

With the ever-increasing use of electronic devices such as computers and smartphones, emotion recognition has become a significant area of research within the field of human-computer interaction. This area involves the identification and analysis of human emotions through various data inputs, such as facial expressions, voice intonations, textual content, and physiological signals. In the literature, there are several efforts to categorize emotions. For example, Ekman (Ekman, 1992) proposed a taxonomy of six discrete emotions that are recognized across different cultures, namely anger, disgust, fear, happiness, sadness, and surprise. These emotions are popular because they are easily recognizable through facial expressions and other physiological responses.

In the context of Natural Language Processing (NLP), transformers are a type of deep learning model architecture that aims to solve sequence-to-sequence tasks through a mechanism called attention (Vaswani et al., 2017). Transformers have been used to build NLP-based solutions for problems such as machine translation, conversational agents, question answering, and text generation, as well as emotion detection. Although transformers were originally designed for NLP tasks, they have been successfully adapted to a wide range of domains such as image processing, speech processing, time series forecasting, reinforcement learning, and multimodal applications.

The scientific context surrounding AER is rich and multifaceted, encompassing various approaches to recognizing emotions from different modalities. Traditional methods primarily relied on handcrafted features extracted from facial expressions, voice signals, and text, with machine learning algorithms such as support vector machines (SVMs) and hidden Markov models (HMMs) classifying these features into discrete emotion categories.
With the advent of deep learning, more sophisticated approaches, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks, have emerged to capture spatial and temporal dependencies in data, significantly improving emotion recognition accuracy. In recent years, multimodal approaches that combine data from multiple sources, such as text, speech, and facial expressions, have gained prominence. These models leverage complementary information from each modality, leading to more robust emotion recognition, with transformers playing a key role due to their ability to handle diverse data types through attention mechanisms.

This paper describes our participation in the EmoSPeech 2024 Task - Multimodal Speech-text Emotion Recognition in Spanish (Pan, García-Díaz, Rodríguez-García, García-Sánchez, et al., 2024), which is part of the IberLEF 2024 workshop (Chiruzzo et al., 2024). This work implements a fine-tuning strategy, which involves adapting a pre-trained transformer model to a specific downstream task, such as named entity recognition, sentiment analysis, or automatic emotion recognition (AER). Specifically, the fine-tuning process leverages the knowledge the transformer model has already acquired during pre-training on a large corpus, thereby requiring less task-specific data and fewer computational resources than training a model from scratch. The corpus used for this task was a multimodal speech-text corpus for emotion analysis in Spanish from natural environments (Pan, García-Díaz, Rodríguez-García, & Valencia-García, 2024). It is important to mention that our approach obtained first place in the AER from text task, which uses a dataset created from real-life situations, and fourth place in the multimodal AER task, which uses a dataset consisting of more than 13.16 hours of audio segments annotated with five of Ekman's six emotions.

The next section outlines the fine-tuning-based strategies developed to automatically identify five of Ekman's six basic emotions (anger, disgust, fear, joy, and sadness) from text as well as from a combination of text and speech cues. In the final remarks, we discuss our findings and propose directions for future research.

2. Developed Strategies

2.1. Task 1: Automatic Emotion Recognition from Text

The aim of the first challenge of the EmoSpeech 2024 Task is to explore the field of AER from text, i.e., extracting features and identifying the most representative features of each emotion. The dataset used for this task was created from real-life situations; specifically, it consists of 3000 comments for the training phase and 750 for the testing phase. Figure 1 shows the flowchart of the fine-tuning process for automatic emotion recognition proposed in this work. As can be seen, this process consists of five main phases, briefly described below.

Figure 1: Flow diagram of the fine-tuning process for automatic emotion recognition from text.

1. Select pre-trained models for Spanish. In this case, the models selected are BETO (Cañete et al., 2023), BERTIN (De La Rosa et al., 2022), and MarIA (Gutiérrez-Fandiño et al., 2021). BETO is a BERT-based language model pre-trained exclusively on Spanish data. BERTIN is a model pre-trained using perplexity sampling. Meanwhile, MarIA is a family of Spanish language models that includes RoBERTa-base, RoBERTa-large, GPT-2, and GPT-2-large models.
2. Train the three pre-trained models with the dataset provided by the challenge.
3. Split the training dataset into training and validation subsets in a 90-10 ratio for the fine-tuning process.
4. Use the training subset to perform the fine-tuning process and adapt the pre-trained models to the automatic emotion recognition task.
5. Use the validation subset to tune the models' hyperparameters and to evaluate their performance on unseen data. In this case, the hyperparameters were configured as follows: a learning rate of 2e-5, a training batch size of 16, and an epoch-based evaluation strategy.
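A minimal sketch of this fine-tuning setup is shown below. It assumes the Hugging Face transformers and datasets libraries and the publicly available BETO checkpoint (dccuchile/bert-base-spanish-wwm-cased); the CSV file name, column names, and number of training epochs are illustrative assumptions rather than details of the official challenge setup.

```python
# Illustrative sketch: fine-tune a Spanish pre-trained transformer (BETO)
# for six-class emotion classification with the hyperparameters reported
# in the paper (learning rate 2e-5, batch size 16, epoch-based evaluation).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # BETO checkpoint
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

# Hypothetical CSV with "text" and "label" columns; 90-10 train/validation split.
data = load_dataset("csv", data_files="train.csv")["train"]
data = data.train_test_split(test_size=0.1, seed=42)

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length",
                    max_length=128)
    enc["labels"] = [LABELS.index(label) for label in batch["label"]]
    return enc

encoded = data.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="beto-text-aer",
    learning_rate=2e-5,                # reported hyperparameter
    per_device_train_batch_size=16,    # reported hyperparameter
    evaluation_strategy="epoch",       # evaluate once per epoch
    num_train_epochs=3,                # assumed; not reported in the paper
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["test"])
trainer.train()
```

The same procedure applies to the BERTIN and MarIA models by changing the checkpoint name.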
The BETO model achieves an overall accuracy of 73.7%, indicating a good level of performance for the AER from text task. Table 1 shows the classification report obtained by this model. As can be seen, BETO performs reasonably well across emotion classes, with particularly strong performance for the "neutral" emotion. However, performance varies across emotions, with lower precision and recall for classes such as "anger" and "sadness".

Table 1
Classification report of the BETO model for the AER from text task.

Emotion        Precision   Recall     F1-score
anger          0.512195    0.525000   0.518519
disgust        0.656716    0.619718   0.637681
fear           0.666667    1.000000   0.800000
joy            0.789474    0.833333   0.810811
neutral        0.832000    0.888889   0.859504
sadness        0.769231    0.588235   0.666667
accuracy                              0.736667
macro avg      0.704380    0.742529   0.715530
weighted avg   0.734556    0.736667   0.733446

The MarIA model demonstrates solid performance on the proposed challenge, achieving an overall accuracy of 74.7%. Table 2 shows the results obtained by this model, where it can be seen that the model performs particularly well for the "neutral" emotion class, with high precision, recall, and F1-score. The "joy" and "disgust" classes also show strong performance. However, MarIA's performance on the "anger" and "sadness" classes is somewhat lower. It should be noted that the "fear" class shows perfect precision but lower recall due to its small support. Overall, the model shows good generalizability, with reasonably balanced precision and recall for most emotion classes.

Table 2
Classification report of the MarIA model for the AER from text task.

Emotion        Precision   Recall     F1-score
anger          0.460000    0.575000   0.511111
disgust        0.656716    0.619718   0.637681
fear           1.000000    0.500000   0.666667
joy            0.870968    0.750000   0.805970
neutral        0.877049    0.914530   0.895397
sadness        0.758621    0.647059   0.698413
accuracy                              0.746667
macro avg      0.770559    0.667718   0.702540
weighted avg   0.755965    0.746667   0.748585

After fine-tuning for AER from text, the BERTIN model achieves an overall accuracy of 71.3%. Table 3 shows the classification report obtained by this model. Notably, BERTIN achieves perfect precision for the "fear" class, although with lower recall. However, there is room for improvement in recognizing "anger" and "sadness" instances, as indicated by their lower precision and recall values.

The BETO, MarIA, and BERTIN models achieved varying levels of performance in AER from text. Overall, all models have strengths and areas for improvement, suggesting the need for ongoing refinement, particularly for less common emotion classes.

Table 3
Classification report of the BERTIN model for the AER from text task.

Emotion        Precision   Recall     F1-score
anger          0.400000    0.400000   0.400000
disgust        0.626667    0.661972   0.643836
fear           1.000000    0.500000   0.666667
joy            0.800000    0.777778   0.788732
neutral        0.842975    0.871795   0.857143
sadness        0.714286    0.588235   0.645161
accuracy                              0.713333
macro avg      0.730655    0.633297   0.666923
weighted avg   0.714024    0.713333   0.712204
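The per-class metrics reported in Tables 1-3 follow the format of a standard classification report. A minimal sketch of how such a report can be produced with scikit-learn is shown below; the y_true and y_pred lists are illustrative placeholders for the gold and predicted labels of the validation split.

```python
# Illustrative sketch: per-class precision, recall, and F1 plus accuracy and
# macro/weighted averages, as shown in Tables 1-3.
from sklearn.metrics import classification_report

LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]

# Placeholder gold labels and model predictions for the validation split.
y_true = ["anger", "neutral", "joy", "sadness", "fear", "disgust"]
y_pred = ["anger", "neutral", "joy", "neutral", "fear", "disgust"]

print(classification_report(y_true, y_pred, labels=LABELS,
                            digits=6, zero_division=0))
```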
2.2. Task 2: Multimodal Automatic Emotion Recognition

The second challenge of the EmoSpeech 2024 Task was multimodal AER, which aims to analyze the performance of language models in solving this classification problem. The dataset used for this task was the Spanish MEACorpus 2023, consisting of more than 13.16 hours of audio divided into segments annotated with five of Ekman's six emotions. The dataset comprises about 3500-4000 audio segments, divided into training and test sets with an 80%-20% split. The ensemble learning process followed for automatic emotion recognition from text and speech is shown in Figure 2 and described next.

Figure 2: Flow diagram of the ensemble learning process for automatic emotion recognition from text and speech.

1. Select a Wav2Vec pre-trained model for Spanish. A Wav2Vec model is trained on large amounts of unlabeled audio data with the aim of improving acoustic model training (Schneider et al., 2019). In this work, the wav2vec2-large-xlsr-53-spanish model (Grosman, 2021) was selected.
2. Apply ensemble learning, a technique that combines multiple individual models to improve predictive performance compared to the performance of each model separately. In this work, an ensemble combining the best fine-tuned text model (BETO) and a model fine-tuned from Wav2Vec 2.0 was used for emotion classification.
   a. Combine the probabilities obtained for each emotion by the two models using Mean, i.e., the mean of the classification probabilities of both sources (text and audio) for each emotion class.
   b. Combine the probabilities obtained for each emotion by the two models using Max, i.e., the maximum probability of each emotion class across the two models.
3. Prediction. In both cases, the class with the maximum fused probability is considered the final prediction.
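A minimal sketch of the Mean and Max fusion strategies is shown below. It assumes that text_probs and audio_probs hold the softmax outputs of the fine-tuned BETO and Wav2Vec 2.0 models for a single segment; the variable names and probability values are illustrative.

```python
# Illustrative sketch of the two late-fusion strategies: average the class
# probabilities of both modalities (Mean) or take their element-wise maximum
# (Max), then predict the class with the highest fused probability.
import numpy as np

LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]

# Placeholder softmax outputs of the text (BETO) and audio (Wav2Vec 2.0) models.
text_probs = np.array([0.05, 0.10, 0.02, 0.08, 0.70, 0.05])
audio_probs = np.array([0.10, 0.05, 0.05, 0.15, 0.55, 0.10])

mean_fused = (text_probs + audio_probs) / 2.0     # Mean fusion
max_fused = np.maximum(text_probs, audio_probs)   # Max fusion

print("Mean fusion prediction:", LABELS[int(np.argmax(mean_fused))])
print("Max fusion prediction:", LABELS[int(np.argmax(max_fused))])
```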
The ensemble that combines the Wav2Vec pre-trained model with the BETO model achieves an overall accuracy of 75.0% on this challenge when the mean of the classification probabilities of both sources is used. As can be seen in Table 4, this approach performs well across most emotion classes, particularly for "neutral," "joy," and "disgust." It recognizes all "fear" instances (perfect recall), although with more moderate precision. However, recall is relatively lower for "sadness" instances. Overall, the ensemble model shows promise in accurately identifying emotions from both text and speech inputs.

Table 4
Classification report of the ensemble of Wav2Vec and BETO using the mean of the classification probabilities of both sources.

Emotion        Precision   Recall     F1-score
anger          0.50000     0.50000    0.50000
disgust        0.64789     0.64789    0.64789
fear           0.66667     1.00000    0.80000
joy            0.80556     0.80556    0.80556
neutral        0.85600     0.91453    0.88430
sadness        0.84000     0.61765    0.71186
accuracy                              0.75000
macro avg      0.71935     0.74760    0.72493
weighted avg   0.75015     0.75000    0.74755

As can be seen in Table 5, the ensemble combining the Wav2Vec pre-trained model with the BETO model using the maximum probability of each emotion class shows decent precision and recall for some emotion classes, such as "disgust" and "neutral," but struggles with others, such as "anger," "joy," and "sadness," where precision, recall, or both are relatively lower. Notably, it fails to predict any instances of "fear," indicating significant room for improvement. This approach achieves an overall accuracy of 70.3%.

Table 5
Classification report of the ensemble of Wav2Vec and BETO using the maximum probability of each emotion class of the two models.

Emotion        Precision   Recall     F1-score
anger          0.64286     0.22500    0.33333
disgust        0.57426     0.81690    0.67442
fear           0.00000     0.00000    0.00000
joy            0.64286     0.50000    0.56250
neutral        0.81343     0.93162    0.86853
sadness        0.73913     0.50000    0.59649
accuracy                              0.70333
macro avg      0.56876     0.49559    0.50588
weighted avg   0.69977     0.70333    0.67788

In summary, the ensemble learning process combining the Wav2Vec pre-trained model with the BETO model exhibits promising performance in automatic emotion recognition from both text and speech. Although it recognizes certain emotions accurately, the ensemble model has clear scope for refinement across a broader range of emotions. Further development and optimization are needed to maximize the ensemble's effectiveness in capturing diverse emotional expressions from both text and speech inputs.

3. Final remarks

This paper presents a fine-tuning approach for the EmoSPeech 2024 Task - Multimodal Speech-text Emotion Recognition in Spanish. For AER from text, the BETO, MarIA, and BERTIN pre-trained models were used, achieving varying levels of performance. Some of the models obtained lower precision and recall values for less common emotion classes such as "anger" and "sadness," which suggests the need for ongoing refinement to improve their performance. Regarding the ensemble learning approach described in this work, the ensemble based on the mean of the classification probabilities achieves an overall accuracy of 75.0%, demonstrating strong performance across multiple emotion classes, whereas the ensemble based on the maximum probability achieves a lower accuracy of 70.3%, indicating areas for improvement, particularly in recognizing emotions such as "anger," "joy," and "sadness." The obtained results highlight the potential of ensemble learning approaches for automatic emotion recognition, while also underlining the ongoing need for optimization to achieve more robust and comprehensive emotional understanding from diverse textual and audio inputs.

Acknowledgements

We are grateful to the Tecnológico Nacional de México (TecNM) for supporting this work. This research was also sponsored by Mexico's National Council of Humanities, Sciences and Technologies (CONAHCYT).

References

Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., & Pérez, J. (2023). Spanish Pre-trained BERT Model and Evaluation Data. https://arxiv.org/abs/2308.02976v1

Chiruzzo, L., Jiménez-Zafra, S. M., & Rangel, F. (2024). Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org.

De La Rosa, J., Ponferrada, E. G., Villegas, P., González de Prado Salas, P., Romero, M., & Grandury, M. (2022). BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling. Procesamiento del Lenguaje Natural, 68, 13–23. https://doi.org/10.26342/2022-68-1
Ekman, P. (1992). Facial Expressions of Emotion: New Findings, New Questions. Psychological Science, 3(1), 34–38. https://doi.org/10.1111/j.1467-9280.1992.tb00253.x

Grosman, J. (2021). Fine-tuned XLSR-53 large model for speech recognition in Spanish.

Gutiérrez-Fandiño, A., Armengol-Estapé, J., Pàmies, M., Llop-Palao, J., Silveira-Ocampo, J., Carrino, C. P., Gonzalez-Agirre, A., Armentano-Oller, C., Rodriguez-Penagos, C., & Villegas, M. (2021). MarIA: Spanish Language Models. Procesamiento del Lenguaje Natural, 68, 39–60. https://doi.org/10.26342/2022-68-3

Pan, R., García-Díaz, J. A., Rodríguez-García, M. Á., García-Sánchez, F., & Valencia-García, R. (2024). Overview of EmoSPeech at IberLEF 2024: Multimodal Speech-text Emotion Recognition in Spanish. Procesamiento del Lenguaje Natural, 73.

Pan, R., García-Díaz, J. A., Rodríguez-García, M. Á., & Valencia-García, R. (2024). Spanish MEACorpus 2023: A multimodal speech-text corpus for emotion analysis in Spanish from natural environments. Computer Standards & Interfaces, 103856.

Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-training for Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH 2019), 3465–3469. https://doi.org/10.21437/Interspeech.2019-1873

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information Processing Systems, 30.