UAE at EmoSPeech–IberLEF2024: Integrating Text and Audio Features with SVM for Emotion Detection

Katty Lagos-Ortiz1, José Medina-Moreira2 and Oscar Apolinario-Arzube3
1 Facultad de Ciencias Agrarias, Universidad Agraria del Ecuador, Av. 25 de Julio, Guayaquil, Ecuador
2 Universidad Bolivariana del Ecuador, R59R+838, Durán, Ecuador
3 Instituto Superior Tecnológico Guayaquil, Carlos Gómez Rendón 1403, Guayaquil 090308, Ecuador
klagos@uagraria.edu.ec (K. Lagos-Ortiz); jjmedinam@ube.edu.ec (J. Medina-Moreira); oapolinario@istg.edu.ec (O. Apolinario-Arzube)
ORCID: 0000-0002-2510-7416 (K. Lagos-Ortiz); 0000-0003-1728-1462 (J. Medina-Moreira); 0000-0003-4059-9516 (O. Apolinario-Arzube)

Abstract
Automatic emotion recognition (AER) has long been a significant challenge and is becoming increasingly important in fields such as health, psychology, the social sciences, and marketing. The EmoSPeech shared task at IberLEF 2024 aims to advance AER by addressing its main classification challenges, including feature selection for emotion discrimination, the scarcity of real-world multimodal datasets, and the complexity of combining different features. The task comprises two subtasks, text-based AER and multimodal AER, and its novelty lies in evaluating language models for multimodal AER on authentic datasets. This paper presents the contributions of the UAE team to both subtasks. For Task 1, we used text embeddings from the pre-trained language model BETO and classified emotions with the SVM algorithm, achieving an M-F1 score of 0.51, outperforming the baseline and ranking 9th. For Task 2, we extended this approach by incorporating audio features from the Wav2Vec 2.0 model, obtaining an M-F1 score of 0.56 and ranking 7th. Both results outperform the baseline, demonstrating that audio features complement text features and improve the performance of the unimodal model.

Keywords
Speech Emotion Recognition, Automatic Emotion Recognition, Natural Language Processing, Transformers, SVM

1. Introduction

Automatic emotion recognition has been a significant challenge for many years and is becoming increasingly important in fields as diverse as health, psychology, the social sciences, and marketing. Using algorithms and artificial intelligence, this technology aims to detect and interpret emotions expressed through various channels, including verbal language, body language, facial expressions, and speech prosody [1, 2]. For example, [3] demonstrated the relationship between emotion and mental illness and the importance of emotion recognition in health care.

Within the field of automatic emotion recognition, speech emotion recognition focuses specifically on identifying emotions conveyed through speech. This process involves analyzing acoustic and prosodic features, such as fundamental frequency, intensity, rhythm, intonation, and phoneme duration, to detect patterns associated with different emotional states, and then categorizing speech into emotional labels such as happiness, sadness, anger, fear, disgust, and others. In addition, multimodal approaches fuse data from multiple sources, such as speech, facial expressions, body language, and written text, to comprehensively capture the emotions conveyed by an individual [4].
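Although the approach described in Section 3 relies on learned Wav2Vec 2.0 representations rather than hand-crafted descriptors, the following minimal sketch illustrates how two of the prosodic cues mentioned above (fundamental frequency and intensity) can be estimated with the librosa library on a hypothetical 16 kHz speech segment; the file name and the summary statistics are purely illustrative.

```python
import librosa
import numpy as np

# Load a hypothetical speech segment (mono, resampled to 16 kHz).
y, sr = librosa.load("segment.wav", sr=16000)

# Fundamental frequency (F0) contour estimated with the pYIN algorithm;
# unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Intensity approximated by the frame-wise root-mean-square (RMS) energy.
rms = librosa.feature.rms(y=y)[0]

# Utterance-level statistics summarizing the prosodic contours.
prosodic_features = {
    "f0_mean_hz": float(np.nanmean(f0)),
    "f0_std_hz": float(np.nanstd(f0)),
    "rms_mean": float(rms.mean()),
    "rms_std": float(rms.std()),
    "duration_s": len(y) / sr,
}
print(prosodic_features)
```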
The EmoSPeech shared task [5] at IberLEF 2024 [6] aims to delve into the field of Automatic Emotion Recognition (AER) and to address the associated classification hurdles, including feature selection for emotion discrimination, the paucity of real-world multimodal datasets, and the complexities arising from the combination of different features. The challenge delineates two subtasks, text-based AER and multimodal AER, reflecting the burgeoning interest in this area as evidenced by numerous collaborative efforts. In particular, the novelty of this challenge lies in its focus on multimodal AER, assessing the performance of language models on authentic datasets, an aspect previously unexplored in shared tasks.

This paper describes the UAE team's contributions to both subtasks, which combine a conventional classification algorithm, SVM, with audio features extracted by Wav2Vec 2.0 [7] and text embeddings from the pre-trained language model BETO [8]. The following sections provide an overview of the task and the dataset (Section 2), outline the methodology used to address Subtask 1 and Subtask 2 (Section 3), present the results obtained (Section 4), and conclude with lessons learned and avenues for future exploration (Section 5).

2. Task description

The task at hand consists of two subtasks, each of which approaches the problem of Automatic Emotion Recognition differently: i) the first subtask deals with the identification of emotions from textual input, and ii) the second subtask addresses the more complex challenge of multimodal automatic emotion recognition. In recent years there has been a surge of interest in AER in the research community, as demonstrated by collaborative events such as WASSA [9], EmoRecCom [10], and EmoEvalEs [11]. What sets this task apart is its multimodal approach to AER, evaluating how language models perform on real-world datasets.

To facilitate this, the organizers provided the Spanish MEACorpus 2023 dataset, which includes audio segments collected from various Spanish YouTube channels, amounting to over 13.16 hours of annotated audio spanning six emotions: disgust, anger, happiness, sadness, neutral, and fear. The dataset was annotated in two phases. For this task, about 3,500-4,000 audio segments were selected and divided into training and test sets in an 80%-20% ratio. To develop the model, we further divided the training set into two subsets with a 90%-10% ratio: one for training and the other for validation. The validation set was used to fine-tune the hyperparameters and to evaluate the performance of the model during training. The distribution of the dataset provided by the organizers is shown in Table 1.

Table 1: Distribution of the datasets

Dataset      Total   Neutral   Disgust   Anger   Joy   Sadness   Fear
Train        2,700   1,070     616       355     330   308       21
Validation     300      96      89        44      37    32        2
Test           750     291     177       100      90    86        6

3. Methodology

Figure 1 shows the general architecture of our approach for the two tasks. For Task 1, which consists of identifying emotions from text, we used an approach that obtains text embeddings from a pre-trained model, BETO, and then applies a Support Vector Machine (SVM) classification algorithm. BETO was chosen for its ability to generate highly contextualized word embeddings, capturing the semantic nuances of the text, while SVM was selected for its effectiveness in handling high-dimensional data and its robust classification capabilities. This approach focuses on leveraging the advanced text representation capabilities of BETO for emotion classification, evaluating performance with appropriate metrics. The pre-trained language model used for this task is BETO [8], a BERT model trained on a large Spanish corpus. BETO is similar in size to BERT base and was trained using the Whole Word Masking technique. In addition, this model has been shown to perform particularly well in classification and author-profiling tasks [12, 13].
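A minimal sketch of this text-only pipeline is shown below. It assumes the publicly available BETO checkpoint dccuchile/bert-base-spanish-wwm-cased and uses the [CLS] token as the sentence embedding; the pooling strategy, the toy examples, and the default SVM hyperparameters are illustrative assumptions (in the actual system the hyperparameters were tuned on the validation split described in Section 2).

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.svm import SVC

# Assumed BETO checkpoint; the text above only states that BETO [8] is used.
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def text_embedding(text: str) -> list:
    """Encode a transcription and return the [CLS] vector as its text embedding."""
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0).tolist()


# Toy training data; the real system uses the MEACorpus 2023 transcriptions and labels.
train_texts = ["estoy muy contento hoy", "esto me da mucha rabia"]
train_labels = ["joy", "anger"]

X_train = [text_embedding(t) for t in train_texts]
clf = SVC()  # hyperparameters are tuned on the validation split in practice
clf.fit(X_train, train_labels)

print(clf.predict([text_embedding("qué alegría verte")]))
```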
For Task 2, which focuses on identifying emotions from both audio and text, we used an approach that relies on a pre-trained Transformer-based model, Wav2Vec 2.0, specifically the facebook/wav2vec2-large-xlsr-53-spanish model, to obtain vector representations of the audio. These vectors are combined with the text embeddings from BETO and used as input to an SVM classification model. The goal is to identify emotions from a combination of audio and text, taking advantage of the rich representations provided by both pre-trained models and the robust classification capabilities of SVM. Wav2Vec 2.0, developed by Facebook AI Research (FAIR), is designed for self-supervised learning in audio processing and generates high-quality vector representations of audio that are particularly useful for classification tasks. By combining audio and text embeddings, this approach aims to improve emotion identification accuracy, leveraging the strengths of both modalities and the effectiveness of SVM in data classification.

Figure 1: Overall system architecture.
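As an illustration of the multimodal pipeline summarized in Figure 1, the sketch below extracts an utterance-level audio vector with the facebook/wav2vec2-large-xlsr-53-spanish checkpoint, mean-pools the frame-level hidden states (a pooling choice assumed for this example), concatenates the result with the BETO text embedding produced by the text_embedding helper from the previous sketch, and trains an SVM on the fused vectors. The audio file names and labels are placeholders.

```python
import numpy as np
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

# Checkpoint named in the text; mean-pooling over time is an assumption of this sketch.
AUDIO_MODEL = "facebook/wav2vec2-large-xlsr-53-spanish"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(AUDIO_MODEL)
wav2vec = Wav2Vec2Model.from_pretrained(AUDIO_MODEL)
wav2vec.eval()


@torch.no_grad()
def audio_embedding(path: str) -> np.ndarray:
    """Return an utterance-level vector by mean-pooling Wav2Vec 2.0 hidden states."""
    waveform, _ = librosa.load(path, sr=16000)  # Wav2Vec 2.0 expects 16 kHz audio
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    hidden = wav2vec(**inputs).last_hidden_state  # shape: (1, frames, 1024)
    return hidden.mean(dim=1).squeeze(0).numpy()


def multimodal_features(path: str, text: str) -> np.ndarray:
    """Concatenate the audio vector with the BETO text embedding (previous sketch)."""
    return np.concatenate([audio_embedding(path), np.array(text_embedding(text))])


# Toy example; the real system uses the MEACorpus 2023 segments and transcriptions.
X_train = [
    multimodal_features("clip1.wav", "estoy muy contento hoy"),
    multimodal_features("clip2.wav", "esto me da mucha rabia"),
]
clf = SVC()
clf.fit(X_train, ["joy", "anger"])
```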
4. Results

The performance of the SVM model on the test split is reported in Table 2. The metrics include macro precision (M-P), macro recall (M-R), and macro F1-score (M-F1). For Task 1, the SVM model achieved 0.540 in M-P, 0.512 in M-R, and 0.518 in M-F1.

Table 2: Results of the SVM model on the test split for Task 1 and Task 2

         Model   M-P        M-R        M-F1
Task 1   SVM     0.540791   0.512783   0.518242
Task 2   SVM     0.571862   0.553863   0.558898

When compared with the official leaderboard for Task 1 (Table 3), the SVM model ranks 9th out of 10, with a macro F1-score of 0.518242. This places it ahead of the baseline model (0.496829) but behind the top-performing team "TEC_TEZUITLAN", which obtained a macro F1-score of 0.671856. These results indicate that, while the SVM model performs better than the baseline, there is significant room for improvement to reach the performance levels of the leading teams.

Table 3: Official leaderboard for Task 1

#    Team Name       M-F1
1    TEC_TEZUITLAN   0.671856
2    CogniCIC        0.657527
3    UNED-UNIOVI     0.655287
4    UKR             0.648417
-    -               -
9    UAE             0.518242
10   Baseline        0.496829

For Task 2, the SVM model achieved a macro F1-score of 0.558898. This result places the model 7th on the official leaderboard (Table 4), surpassing the baseline score of 0.530757. However, it remains far behind the leading team, "BSC-UPC", which achieved a macro F1-score of 0.866892.

Table 4: Official leaderboard for Task 2

#   Team Name       M-F1
1   BSC-UPC         0.866892
2   THAU-UPM        0.824833
3   CogniCIC        0.712259
4   TEC_TEZUITLAN   0.712259
-   -               -
7   UAE             0.558898
8   Baseline        0.530757

In summary, the SVM model, while outperforming the baseline in both tasks, does not achieve top-tier performance: it ranks 9th in Task 1 with a macro F1-score of 0.518242 and 7th in Task 2 with a macro F1-score of 0.558898. These outcomes suggest that further optimization, and potentially the exploration of more sophisticated models or additional feature engineering, is necessary to compete with the leading approaches on the leaderboard.

The classification report for Task 1, shown in Table 5, provides detailed performance metrics for each emotion class. Our approach achieves an accuracy of 0.682667. The macro averages for precision, recall, and F1-score are 0.540791, 0.512783, and 0.518242, respectively, indicating a balanced but not exceptional performance across classes. The weighted averages are higher, reflecting the model's better performance on the more prevalent classes.

Table 5: Classification report of the SVM model in Task 1

               precision   recall     f1-score
anger          0.408163    0.200000   0.268456
disgust        0.565421    0.683616   0.618926
fear           0.000000    0.000000   0.000000
joy            0.794872    0.688889   0.738095
neutral        0.765766    0.876289   0.817308
sadness        0.710526    0.627907   0.666667
accuracy       0.682667    0.682667   0.682667
macro avg      0.540791    0.512783   0.518242
weighted avg   0.661836    0.682667   0.663992

The classification report for Task 2, shown in Table 6, provides the same metrics for each emotion class when both audio and text are considered. The overall accuracy for Task 2 is 0.717333. The macro averages for precision, recall, and F1-score are 0.571862, 0.553863, and 0.558898, respectively, reflecting a more balanced and slightly improved performance compared to Task 1. As in Task 1, the weighted averages show that the model handles the more frequent classes better.

Table 6: Classification report of the SVM model in Task 2

               precision   recall     f1-score
anger          0.485294    0.330000   0.392857
disgust        0.586538    0.689266   0.633766
fear           0.000000    0.000000   0.000000
joy            0.786517    0.777778   0.782123
neutral        0.829582    0.886598   0.857143
sadness        0.743243    0.639535   0.687500
accuracy       0.717333    0.717333   0.717333
macro avg      0.571862    0.553863   0.558898
weighted avg   0.704614    0.717333   0.707209

Thus, performance varies across emotion classes: classes such as joy and neutral show high precision and recall, whereas fear shows very poor performance. The addition of audio data in Task 2 generally improves the model's performance, as evidenced by the higher accuracy and macro metrics. However, the SVM model still struggles with less frequent and potentially more ambiguous emotions such as anger and fear.
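Tables 5 and 6 follow the layout of scikit-learn's classification report. As a minimal sketch, the reported per-class metrics and the official macro-averaged F1 can be computed from gold labels and system predictions as follows; the label lists shown here are placeholders, not our actual test-set output.

```python
from sklearn.metrics import classification_report, f1_score

# Placeholder gold labels and predictions; the real values come from the SVM
# applied to the 750-segment test split.
y_true = ["neutral", "disgust", "anger", "joy", "sadness", "fear"]
y_pred = ["neutral", "disgust", "neutral", "joy", "sadness", "neutral"]

# Per-class precision/recall/F1 plus macro and weighted averages (as in Tables 5 and 6).
print(classification_report(y_true, y_pred, zero_division=0))

# The official ranking metric of the shared task is the macro-averaged F1 (M-F1).
print("M-F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```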
5. Conclusion

This paper describes the participation of the UAE team in the IberLEF EmoSPeech 2024 shared task. The task explores the field of Automatic Emotion Recognition (AER) through two subtasks: i) a textual approach, which uses only textual content to identify the expressed emotion; and ii) a multimodal approach, which combines audio and text to identify the emotion.

For Task 1, we classified emotions using text embeddings obtained with the pre-trained language model BETO and the SVM algorithm, obtaining a score of 0.51 in M-F1, beating the baseline and reaching 9th place in the ranking. For Task 2, we extended the approach used for Task 1 by adding audio features obtained with a pre-trained audio model based on Wav2Vec 2.0. With this approach, we obtained a score of 0.56 in M-F1, ranking 7th. The results of both tasks exceed the baselines proposed by the organizers, and the results obtained in Task 2 allow us to conclude that the audio features complement the text features and improve the performance of the unimodal model.

As future work, we propose to improve the approach using fine-tuning techniques and to test other classification algorithms, such as Recurrent Neural Networks (RNN), Random Forest (RF), and Convolutional Neural Networks (CNN). We also plan to test other pre-trained Transformer-based models. In addition, we propose to add a sentiment feature to the model to enrich its ability to understand and analyze text in different contexts, since sentiment indicates the polarity of sentences and is complementary to emotion; as shown in [14], sentiment analysis is useful in domains such as politics, marketing, and healthcare, among others.

References

[1] F. Chenchah, Z. Lachiri, Speech emotion recognition in noisy environment, in: 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2016, pp. 788–792. doi:10.1109/ATSIP.2016.7523189.
[2] A. A. Varghese, J. P. Cherian, J. J. Kizhakkethottam, Overview on emotion recognition system, in: 2015 International Conference on Soft-Computing and Networks Security (ICSNS), 2015, pp. 1–5. doi:10.1109/ICSNS.2015.7292443.
[3] A. Salmerón-Ríos, J. A. García-Díaz, R. Pan, R. Valencia-García, Fine grain emotion analysis in Spanish using linguistic features and transformers, PeerJ Computer Science 10 (2024) e1992. doi:10.7717/peerj-cs.1992.
[4] R. Pan, J. A. García-Díaz, M. A. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments, Computer Standards & Interfaces 90 (2024) 103856. URL: https://www.sciencedirect.com/science/article/pii/S0920548924000254. doi:10.1016/j.csi.2024.103856.
[5] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sanchez, R. Valencia-García, Overview of EmoSPeech at IberLEF 2024: Multimodal Speech-text Emotion Recognition in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
[6] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[7] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[8] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish Pre-Trained BERT Model and Evaluation Data, in: PML4DC at ICLR 2020, 2020.
[9] S. Mohammad, F. Bravo-Marquez, WASSA-2017 shared task on emotion intensity, in: A. Balahur, S. M. Mohammad, E. van der Goot (Eds.), Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 34–49. URL: https://aclanthology.org/W17-5205. doi:10.18653/v1/W17-5205.
[10] N.-V. Nguyen, X.-S. Vu, C. Rigaud, L. Jiang, J.-C. Burie, ICDAR 2021 competition on multimodal emotion recognition on comics scenes, in: International Conference on Document Analysis and Recognition, Springer, 2021, pp. 767–782.
[11] F. M. Plaza-del-Arco, S. M. Jiménez-Zafra, A. Montejo-Ráez, M. D. Molina-González, L. A. Ureña-López, M. T. Martín-Valdivia, Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021, Procesamiento del Lenguaje Natural 67 (2021) 155–161. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6385.
[12] J. A. García-Díaz, S. M. Jiménez-Zafra, M. T. Martín-Valdivia, F. García-Sánchez, L. A. Ureña-López, R. Valencia-García, Overview of PoliticEs 2022: Spanish Author Profiling for Political Ideology, Procesamiento del Lenguaje Natural 69 (2022) 265–272. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6446.
[13] J. A. García-Díaz, G. Beydoun, R. Valencia-García, Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish, Data & Knowledge Engineering 151 (2024) 102307. URL: https://www.sciencedirect.com/science/article/pii/S0169023X24000314. doi:10.1016/j.datak.2024.102307.
[14] F. Ramírez-Tinoco, G. Alor-Hernández, J. Sánchez-Cervantes, M. Salas-Zarate, R. Valencia-García, Use of Sentiment Analysis Techniques in Healthcare Domain, 2019, pp. 189–212. doi:10.1007/978-3-030-06149-4_8.