=Paper=
{{Paper
|id=Vol-3756/EmoSPeech2024_paper7
|storemode=property
|title=UAE at EmoSPeech–IberLEF2024: Integrating Text and Audio Features with SVM for Emotion Detection
|pdfUrl=https://ceur-ws.org/Vol-3756/EmoSPeech2024_paper7.pdf
|volume=Vol-3756
|authors=Katty Lagos-Ortiz,José Medina-Moreira,Oscar Apolinario-Arzube
|dblpUrl=https://dblp.org/rec/conf/sepln/Lagos-OrtizMA24
}}
==UAE at EmoSPeech–IberLEF2024: Integrating Text and Audio Features with SVM for Emotion Detection==
Katty Lagos-Ortiz1, José Medina-Moreira2 and Oscar Apolinario-Arzube3
1 Facultad de Ciencias Agrarias, Universidad Agraria del Ecuador, Av. 25 de Julio, Guayaquil, Ecuador
2 Universidad Bolivariana del Ecuador, R59R+838, Durán, Ecuador
3 Instituto Superior Tecnológico Guayaquil, Carlos Gómez Rendón 1403, Guayaquil 090308, Ecuador
Abstract
Automatic emotion recognition (AER) has long been a significant challenge and is becoming increasingly important
in various fields such as health, psychology, social sciences, and marketing. The EmoSPeech shared task at
IberLEF 2024 aims to advance AER by addressing classification challenges, including feature selection for emotion
discrimination, the scarcity of real-world multimodal datasets, and the complexity of combining different features.
This task includes two subtasks: text-based AER and multimodal AER, emphasizing the novel aspect of multimodal
AER by evaluating language models on authentic datasets. This paper presents the contributions of the UAE team
to both subtasks. For Task 1, we used text embeddings from the pre-trained language model BETO and classified
emotions using the SVM algorithm, achieving an M-F1 score of 0.51, outperforming the baseline and ranking 9th.
For Task 2, we extended this approach by incorporating audio features from the Wav2Vec 2.0 model, resulting in
an M-F1 score of 0.56 and a ranking of 7th. These results outperformed the baseline, demonstrating that audio
features complement text features and improve the performance of the unimodal model.
Keywords
Speech Emotion Recognition, Automatic Emotion Recognition, Natural Language Processing, Transformers, SVM
1. Introduction
Automatic emotion recognition has been a significant challenge for many years and is becoming
increasingly important in fields as diverse as health, psychology, social sciences, and marketing. Using
algorithms and artificial intelligence, this technology aims to detect and interpret emotions expressed
through various channels, including verbal language, body language, facial expressions, and speech
prosody [1, 2]. For example, [3] demonstrated the relationship between emotion and mental illness and the importance of emotion recognition in health care. Within the field of automatic emotion recognition, speech emotion recognition focuses on identifying emotions conveyed through speech. This process involves analyzing acoustic and prosodic features such as fundamental frequency, intensity, rhythm, intonation, and phoneme duration to detect patterns associated with different emotional states, and then categorizing speech into emotional labels such as happiness, sadness, anger, fear, disgust,
and others. In addition, multimodal approaches fuse data from multiple sources, such as speech, facial
expressions, body language, and written text, to comprehensively capture the emotions conveyed by an
individual [4].
The EmoSPeech shared task [5] in IberLEF 2024 [6] aims to delve into the field of Automatic Emotion
Recognition (AER) and address the associated classification hurdles, including feature selection for
emotion discrimination, the paucity of real-world multimodal datasets, and the complexities arising
from the combination of different features. This challenge delineates two subtasks: text-based AER and
multimodal AER, reflecting the burgeoning interest in this area as evidenced by numerous collaborative
efforts. In particular, the novelty of this challenge lies in its focus on multimodal AER, assessing the
performance of language models on authentic datasets, an aspect previously unexplored in shared tasks.
This paper describes the UAE team's contributions to both subtasks, combining the conventional SVM algorithm with Wav2Vec 2.0 [7] for audio feature extraction and text embeddings from the pre-trained language model BETO [8]. The following sections provide an overview of the task
and the dataset (Section 2), outline the methodology used to address Subtask 1 and Subtask 2 (Section
3), present the results obtained (Section 4), and conclude with lessons learned and avenues for future
exploration (Section 5).
2. Task description
The task at hand consists of two different subtasks, each of which presents a different way of approaching
the problem of Automatic Emotion Recognition: i) the first subtask deals with the identification of emotions from textual input; and ii) the second subtask addresses the more complex challenge of multimodal
automatic emotion recognition. In recent years, there has been a surge of interest in AER in the research
community, with collaborative events such as WASSA [9], EmoRec-Com [10], and EmoEvalES [11]
demonstrating this growing fascination. What sets this work apart is that it takes a multimodal
approach to AER, evaluating how language models perform on real-world datasets. To facilitate this,
the organizers provided the Spanish MEACorpus 2023 dataset, which includes audio segments collected
from various Spanish YouTube channels, amounting to over 13.16 hours of annotated audio spanning
six emotions: disgust, anger, happiness, sadness, neutral, and fear. The dataset was annotated in two
phases. For this task, approximately 3,500–4,000 audio segments were selected and divided into training and test
sets in an 80%-20% ratio.
To develop the model, the training set was divided into two subsets with a 90-10 ratio: one for
training and the other for validation. The validation set was used to fine-tune the hyperparameters and
to evaluate the performance of the model during training. The distribution of the dataset provided by
the organizers is shown in Table 1.
Table 1
Distribution of the datasets
Dataset Total Neutral Disgust Anger Joy Sadness Fear
Train 2,700 1,070 616 355 330 308 21
Validation 300 96 89 44 37 32 2
Test 750 291 177 100 90 86 6
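As an illustration, the following minimal sketch shows how the 90/10 train/validation split described above could be reproduced with scikit-learn. The file name, column names, and the use of stratification are assumptions, since the paper does not describe these details.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; the actual release format is not
# described in the paper.
data = pd.read_csv("emospeech_train.csv")  # columns: "text", "audio_path", "emotion"

# 90/10 split, stratified by emotion (an assumption) so that the validation
# set mirrors the class imbalance reported in Table 1.
train_df, val_df = train_test_split(
    data,
    test_size=0.10,
    stratify=data["emotion"],
    random_state=42,
)

print(len(train_df), len(val_df))        # roughly 2,700 / 300 segments
print(val_df["emotion"].value_counts())  # per-class counts of the validation set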
3. Methodology
Figure 1 shows the general architecture of our approach for the two tasks. For Task 1, which is to identify emotions from text, we obtained text embeddings with the pre-trained model BETO and then applied a Support Vector Machine (SVM) classification algorithm.
BETO was chosen for its ability to generate highly contextualized word embeddings, capturing the
semantic nuances of the text. SVM was selected for its effectiveness in handling high-dimensional
data and its robust classification capabilities. This approach focuses on leveraging the advanced text
representation capabilities of BETO for emotion classification, evaluating performance using appropriate
metrics. The pre-trained language model used for this task is BETO [8], which is a BERT model trained on
a large Spanish corpus. BETO is similar in size to BERT-base and was trained using the Whole Word Masking technique. In addition, this model has been shown to perform especially well in classification and author profiling tasks [12, 13].
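A minimal sketch of this text-only pipeline is given below, assuming the dccuchile/bert-base-spanish-wwm-cased checkpoint for BETO, mean pooling of the last hidden state as the sentence representation, and simple SVM hyperparameters; none of these details are specified in the paper, so they are illustrative choices only.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import SVC

# Assumed BETO checkpoint; pooling strategy and SVM settings are illustrative.
MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def text_embeddings(texts, batch_size=16):
    """Mean-pooled BETO embeddings for a list of Spanish sentences."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                              max_length=128, return_tensors="pt")
            hidden = model(**batch).last_hidden_state      # (B, T, 768)
            mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
            vectors.append((hidden * mask).sum(1) / mask.sum(1))  # mean over real tokens
    return torch.cat(vectors).numpy()

# train_texts / train_labels come from the training split (e.g. train_df above).
X_text = text_embeddings(train_texts)
svm = SVC(kernel="rbf", C=1.0)  # illustrative hyperparameters
svm.fit(X_text, train_labels)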
For Task 2, which focuses on identifying emotions from audio and text, we used the pre-trained Transformer-based model Wav2Vec 2.0, specifically the facebook/wav2vec2-large-xlsr-53-spanish checkpoint, to obtain vector representations of the audio. These vectors are combined with the text embeddings from BETO and used as input to an SVM classification model. The goal
is to identify emotions from a combination of audio and text, taking advantage of the rich semantic
representations provided by both pre-trained models and the robust classification capabilities of SVM.
Wav2Vec 2.0, developed by Facebook AI Research (FAIR), is designed for self-supervised learning in
audio processing, generating high-quality vector representations of audio that are particularly useful for
classification tasks. By combining audio and text embeddings, this approach aims to enhance emotion
identification accuracy, leveraging the strengths of both modalities and the effectiveness of SVM in data
classification.
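The fusion step could look like the following minimal sketch, which mean-pools the Wav2Vec 2.0 hidden states, concatenates the result with the BETO text embeddings from the previous snippet, and trains the SVM on the joint vector. The pooling and fusion details are assumptions, as the paper does not specify them.

import numpy as np
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.svm import SVC

# Spanish Wav2Vec 2.0 checkpoint named in the paper; mean pooling is an assumption.
AUDIO_MODEL = "facebook/wav2vec2-large-xlsr-53-spanish"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(AUDIO_MODEL)
wav2vec = Wav2Vec2Model.from_pretrained(AUDIO_MODEL)
wav2vec.eval()

def audio_embedding(path):
    """Mean-pooled Wav2Vec 2.0 embedding for one audio segment."""
    waveform, sr = torchaudio.load(path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = wav2vec(**inputs).last_hidden_state   # (1, T, 1024)
    return hidden.mean(dim=1).squeeze(0).numpy()       # (1024,)

# Early fusion: concatenate text and audio vectors, then train the SVM.
# X_text comes from text_embeddings(); audio_paths lists the training segments.
X_audio = np.stack([audio_embedding(p) for p in audio_paths])
X_multimodal = np.concatenate([X_text, X_audio], axis=1)

svm = SVC(kernel="rbf", C=1.0)  # illustrative hyperparameters
svm.fit(X_multimodal, train_labels)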
Figure 1: Overall system architecture.
4. Results
The performance of the SVM model on the test split for Task 1 is presented in Table 2. The metrics
reported include macro precision (M-P), macro recall (M-R), and macro F1-score (M-F1). The SVM
model achieved the following scores: 0.540 in M-P, 0.512 in M-R, and 0.518 in M-F1.
When compared to the official leaderboard for Task 1 (Table 3), the SVM model ranks 9th out of 10,
with a macro F1-score of 0.518242. This places it ahead of the baseline model (0.496829) but behind the
top-performing team “TEC_TEZUITLAN” with a macro F1-score of 0.671856. The results indicate that
while the SVM model performs better than the baseline, there is significant room for improvement to
reach the performance levels of the leading teams.
For Task 2, the SVM model achieved a macro F1-score of 0.558898. This result places the model 7th
on the official leaderboard (Table 4), surpassing the baseline score of 0.530757. However, it remains
significantly behind the leading team “BSC-UPC”, which achieved a macro F1-score of 0.866892.
In summary, the SVM model, while outperforming the baseline in both tasks, does not achieve top-tier performance. The model for Task 1 ranks 9th with a macro F1-score of 0.518242, whereas for Task 2 it ranks 7th with a macro F1-score of 0.558898. These outcomes suggest that
further optimization and potentially the exploration of more sophisticated models or additional feature
engineering are necessary to improve performance and compete with the leading approaches on the
leaderboard.
The classification report for Task 1, as shown in Table 5, provides detailed performance metrics for
each emotion class. Our approach achieves an accuracy of 0.682667. The macro averages for precision,
recall, and F1-score are 0.540791, 0.512783, and 0.518242 respectively, indicating a balanced but not
exceptional performance across different classes. The weighted averages are higher, reflecting the
model’s better performance on more prevalent classes.
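For reference, per-class figures like those in Tables 5 and 6 and the macro-averaged scores correspond to scikit-learn's standard classification report; the brief sketch below assumes hypothetical y_true and y_pred arrays holding the gold and predicted emotion labels for the test split.

from sklearn.metrics import classification_report, f1_score

# y_true / y_pred: gold and predicted emotion labels for the test split (hypothetical names).
print(classification_report(y_true, y_pred, digits=6))

# Macro F1 is the unweighted mean of the per-class F1 scores, so the tiny
# "fear" class weighs as much as the large "neutral" class.
print("M-F1:", f1_score(y_true, y_pred, average="macro"))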
The classification report for Task 2, as shown in Table 6, provides similar performance metrics for each emotion class when both audio and text data are considered.
Table 2
Results of the SVM model on the test split for Task 1 and Task 2 are reported. The metrics include macro precision
(M-P), macro recall (M-R), and macro F1-score (M-F1).
Model M-P M-R M-F1
Task 1
SVM 0.540791 0.512783 0.518242
Task 2
SVM 0.571862 0.553863 0.558898
Table 3
Official leaderboard for Task 1
Task 1
# Team Name M-F1
1 TEC_TEZUITLAN 0.671856
2 CogniCIC 0.657527
3 UNED-UNIOVI 0.655287
4 UKR 0.648417
- - -
9 UAE 0.518242
10 Baseline 0.496829
Table 4
Official leaderboard for Task 2
Task 2
# Team Name M-F1
1 BSC-UPC 0.866892
2 THAU-UPM 0.824833
3 CogniCIC 0.712259
4 TEC_TEZUITLAN 0.712259
- - -
7 UAE 0.558898
8 Baseline 0.530757
The overall accuracy for Task 2 is 0.717333. The macro averages for precision, recall, and F1-score are 0.571862, 0.553863, and 0.558898, respectively, reflecting a more balanced and slightly improved performance compared to Task 1. The weighted averages show that the model handles the more frequent classes better, similar to Task 1.
Thus, our approach's performance varies across different emotion classes, with some classes like joy
and neutral showing high precision and recall, while others like fear show very poor performance.
The addition of audio data in Task 2 generally improves the model’s performance, as evidenced by
the higher accuracy and macro metrics. However, the SVM model struggles with less frequent and
potentially more ambiguous emotions like anger and fear.
5. Conclusion
This paper describes the participation of UAE in the IberLEF EmoSPeech 2024 shared task. This task
focuses on exploring the field of Automatic Emotion Recognition (AER) through two subtasks: i) a
textual approach, which uses only textual content to identify the expressed emotion; and ii) a multimodal
approach, which combines audio and text to identify the emotion.
Table 5
Classification report of the SVM model for Task 1
precision recall f1-score
anger 0.408163 0.200000 0.268456
disgust 0.565421 0.683616 0.618926
fear 0.000000 0.000000 0.000000
joy 0.794872 0.688889 0.738095
neutral 0.765766 0.876289 0.817308
sadness 0.710526 0.627907 0.666667
accuracy 0.682667 0.682667 0.682667
macro avg 0.540791 0.512783 0.518242
weighted avg 0.661836 0.682667 0.663992
Table 6
Classification report of the SVM model for Task 2
precision recall f1-score
anger 0.485294 0.330000 0.392857
disgust 0.586538 0.689266 0.633766
fear 0.000000 0.000000 0.000000
joy 0.786517 0.777778 0.782123
neutral 0.829582 0.886598 0.857143
sadness 0.743243 0.639535 0.687500
accuracy 0.717333 0.717333 0.717333
macro avg 0.571862 0.553863 0.558898
weighted avg 0.704614 0.717333 0.707209
For Task 1, we classified emotions using text embeddings obtained with the pre-trained language model BETO and the SVM algorithm, obtaining an M-F1 score of 0.51, beating the baseline and ranking 9th on the leaderboard. For Task 2, we extended the Task 1 approach by adding audio features extracted with a pre-trained model based on Wav2Vec 2.0. With this approach, we obtained an M-F1 score of 0.56, ranking 7th on the leaderboard. The results of both tasks exceeded the baselines proposed by the organizers, and the Task 2 results allow us to conclude that the audio features complement the text features and improve the performance of the unimodal model.
As future work, we propose improving the approach with fine-tuning techniques and testing other classification algorithms, such as Recurrent Neural Networks (RNN), Random Forest (RF), and Convolutional Neural Networks (CNN). We also plan to test other pre-trained Transformer-based models. In addition, we propose adding a sentiment feature to the model to enrich its ability to understand and analyze text in different contexts, since sentiment indicates the polarity of sentences and is complementary to emotion; as shown in [14], sentiment analysis is useful in domains such as politics, marketing, and healthcare.
References
[1] F. Chenchah, Z. Lachiri, Speech emotion recognition in noisy environment, in: 2016 2nd Interna-
tional Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2016, pp.
788–792. doi:10.1109/ATSIP.2016.7523189.
[2] A. A. Varghese, J. P. Cherian, J. J. Kizhakkethottam, Overview on emotion recognition system, in:
2015 International Conference on Soft-Computing and Networks Security (ICSNS), 2015, pp. 1–5.
doi:10.1109/ICSNS.2015.7292443.
[3] A. Salmerón-Ríos, J. A. García-Díaz, R. Pan, R. Valencia-García, Fine grain emotion analysis
in Spanish using linguistic features and transformers, PeerJ Computer Science 10 (2024) e1992.
doi:10.7717/peerj-cs.1992.
[4] R. Pan, J. A. García-Díaz, M. A. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023:
A multimodal speech–text corpus for emotion analysis in Spanish from natural environments,
Computer Standards & Interfaces 90 (2024) 103856. URL: https://www.sciencedirect.com/science/article/pii/S0920548924000254. doi:https://doi.org/10.1016/j.csi.2024.103856.
[5] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sanchez, R. Valencia-García, Overview
of EmoSPeech at IberLEF 2024: Multimodal Speech-text Emotion Recognition in Spanish, Proce-
samiento del Lenguaje Natural 73 (2024).
[6] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Process-
ing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages
Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for
Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[7] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning
of speech representations, Advances in neural information processing systems 33 (2020) 12449–
12460.
[8] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish Pre-Trained BERT Model
and Evaluation Data, in: PML4DC at ICLR 2020, 2020.
[9] S. Mohammad, F. Bravo-Marquez, WASSA-2017 shared task on emotion intensity, in: A. Balahur,
S. M. Mohammad, E. van der Goot (Eds.), Proceedings of the 8th Workshop on Computational
Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational
Linguistics, Copenhagen, Denmark, 2017, pp. 34–49. URL: https://aclanthology.org/W17-5205.
doi:10.18653/v1/W17-5205.
[10] N.-V. Nguyen, X.-S. Vu, C. Rigaud, L. Jiang, J.-C. Burie, ICDAR 2021 competition on multimodal
emotion recognition on comics scenes, in: International Conference on Document Analysis and
Recognition, Springer, 2021, pp. 767–782.
[11] F. M. Plaza-del Arco, S. M. Jiménez-Zafra, A. Montejo-Ráez, M. D. Molina-González, L. A. Ureña-
López, M. T. Martín-Valdivia, Overview of the EmoEvalEs task on emotion detection for Spanish
at IberLEF 2021, Procesamiento del Lenguaje Natural 67 (2021) 155–161. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6385.
[12] J. A. García-Díaz, S. M. Jiménez-Zafra, M. T. Martín-Valdivia, F. García-Sánchez, L. A. Ureña-López,
R. Valencia-García, Overview of PoliticEs 2022: Spanish Author Profiling for Political Ideology,
Proces. del Leng. Natural 69 (2022) 265–272. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6446.
[13] J. A. García-Díaz, G. Beydoun, R. Valencia-García, Evaluating Transformers and Linguistic Features
integration for Author Profiling tasks in Spanish, Data & Knowledge Engineering 151 (2024)
102307. URL: https://www.sciencedirect.com/science/article/pii/S0169023X24000314. doi:https://doi.org/10.1016/j.datak.2024.102307.
[14] F. Ramírez-Tinoco, G. Alor-Hernández, J. Sánchez-Cervantes, M. Salas-Zarate, R. Valencia-García,
Use of Sentiment Analysis Techniques in Healthcare Domain, 2019, pp. 189–212. doi:10.1007/978-3-030-06149-4_8.