UTP at EmoSPeech–IberLEF2024: Using Random Forest with FastText and Wav2Vec 2.0 for Emotion Detection

Denis Cedeño-Moreno¹, Miguel Vargas-Lombardo¹, Alan Delgado-Herrera¹, Camilo Caparrós-Láiz² and Tomás Bernal-Beltrán¹

¹ Universidad Tecnológica de Panamá, Ciudad de Panamá, Panamá
² Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Murcia, Spain

denis.cedeno@utp.ac.pa (D. Cedeño-Moreno); miguel.vargas@utp.ac.pa (M. Vargas-Lombardo); alan.delgado@utp.ac.pa (A. Delgado-Herrera); camilo.caparrosl@um.es (C. Caparrós-Láiz); tomas.bernalb@um.es (T. Bernal-Beltrán)
ORCID: 0000-0002-9640-1284 (D. Cedeño-Moreno); 0000-0002-2074-2939 (M. Vargas-Lombardo); 0000-0002-5191-7500 (C. Caparrós-Láiz); 0009-0006-6971-1435 (T. Bernal-Beltrán)

Abstract
Automatic emotion recognition (AER) has become increasingly important in fields such as health, psychology, social sciences, and marketing. Within AER, speech emotion recognition focuses on identifying emotions expressed through speech by analyzing features such as fundamental frequency, intensity, rhythm, intonation, and phoneme duration. Multimodal approaches combine information from speech, facial expressions, body language, and text to enhance emotion identification. The goal of the EmoSPeech shared task at IberLEF 2024 is to advance AER by addressing challenges such as feature identification, the scarcity of multimodal datasets, and the complexity of integrating multiple features. The shared task includes two subtasks: text-based AER and multimodal AER. Its novelty lies in the multimodal approach, analyzing the performance of language models on a real-world dataset, which no previous collaborative task has addressed. This paper presents the contribution of the UTP team to both subtasks. For Task 1, we used text embeddings from a FastText model and classified emotions with the Random Forest algorithm, achieving an M-F1 score of 0.41 and ranking 10th. For Task 2, we extended this approach by incorporating audio features from a pre-trained Wav2Vec 2.0 model, obtaining an M-F1 score of 0.48 and ranking 8th. Although these results did not surpass the baseline, they show that audio features complement text embeddings and improve performance.

Keywords
Speech Emotion Recognition, Automatic Emotion Recognition, Natural Language Processing, Transformers, Random Forest, FastText

1. Introduction

Automatic emotion recognition has been a significant problem for many years, and its importance has grown in recent years due to its impact on fields such as health, psychology, social sciences, and marketing. For example, [1] shows the relationship between emotions and mental illness, as well as the importance of automatic recognition in the health field. AER uses algorithms and artificial intelligence techniques to identify and understand the emotions expressed by people through modalities such as verbal language, body language, facial expressions, and speech prosody.

Within automatic emotion recognition, speech emotion recognition refers to the identification of the emotions a person expresses through speech [2, 3]. The recognition process involves analyzing acoustic and prosodic features of speech, such as fundamental frequency, intensity, rhythm, intonation, and phoneme duration, to identify patterns associated with different emotional states. These patterns are then used to classify speech into emotional categories such as happiness, sadness, anger, fear, and disgust, among others. There are also multimodal approaches, which combine information from different sources, such as speech, facial expression, body language, and written text, to identify and understand the emotions expressed by a person [4].
Thus, the goal of the EmoSPeech shared task [5] at IberLEF 2024 [6] is to explore the field of Automatic Emotion Recognition (AER). It addresses the challenges associated with this classification problem, including the identification of meaningful features to distinguish between emotions, the scarcity of multimodal datasets covering real-life scenarios, and the added complexity introduced by combining multiple features. Two challenges are presented: text-based AER and multimodal AER. AER has received considerable attention in the research community, and several shared tasks demonstrate the growing interest in the field. The novelty of this challenge lies in its multimodal approach to AER, which analyzes the performance of language models on real-world datasets; no previous collaborative task has focused on this specific challenge. This paper presents the UTP team's contribution to both subtasks, based on a traditional classifier, Random Forest, combined with text embeddings from a FastText model and audio features extracted with Wav2vec 2.0 [7].

The rest of the paper is organized as follows. Section 2 presents the task and the dataset provided. Section 3 describes the methodology of our proposed system for addressing Subtask 1 and Subtask 2. Section 4 shows the results obtained. Finally, Section 5 concludes the paper with some findings and possible future work.

2. Task description

The task is divided into two subtasks with two approaches to the AER problem: i) identifying emotions from texts, and ii) multimodal automatic emotion recognition, which requires a more complex architecture to solve this classification problem. In recent years, AER has received considerable attention from the research community, with several shared tasks such as WASSA [8], EmoRecCom [9], and EmoEvalEs [10] highlighting the growing interest in this area. The novelty of this work lies in its multimodal approach to AER, analyzing the performance of language models on real datasets.

For this purpose, the organizers provided us with the Spanish MEACorpus 2023 dataset, which consists of audio segments collected from different Spanish YouTube channels. The dataset contains over 13.16 hours of audio annotated with six emotions: disgust, anger, joy, sadness, neutral, and fear, and was annotated in two phases. For this task, approximately 3,750 audio segments were selected and divided into training and test sets in a ratio of 80%-20%. To build the model, the training set was further divided into two subsets, training and validation, in a ratio of 90%-10%. We used the validation set to adjust the hyperparameters of the model and to evaluate its performance during development (a minimal split sketch is shown after Table 1). Table 1 shows the distribution of the dataset provided by the organizers.

Table 1
Distribution of the datasets

Dataset      Total   Neutral   Disgust   Anger   Joy   Sadness   Fear
Train        2,700   1,070     616       355     330   308       21
Validation     300      96      89        44      37    32        2
Test           750     291     177       100      90    86        6
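As an illustration of the split described above, the following minimal sketch separates the released training portion into 90% training and 10% validation data. The file name and the label column name are hypothetical, and stratification by emotion is an assumption; the paper does not state how the split was performed.

# Illustrative sketch of the 90%/10% training/validation split described above.
# The file name and column name are hypothetical; stratification is an assumption.
import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.read_csv("meacorpus_train.csv")   # hypothetical file name

train_split, val_split = train_test_split(
    train_df,
    test_size=0.10,                 # 90% training / 10% validation
    stratify=train_df["label"],     # assumed emotion-label column
    random_state=42,
)
print(len(train_split), len(val_split))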
3. Methodology

Figure 1 shows the general architecture of our approach for the two tasks. For Task 1, which consists of identifying emotions from text, we used the FastText model to obtain text embeddings and then applied a Random Forest (RF) classification algorithm. FastText was chosen for its efficiency in generating word embeddings, while RF was chosen for its robust classification capability. This approach focuses on capturing the semantic meaning of the texts and using it for emotion classification, taking into account the complexity of the model and evaluating performance with appropriate metrics.

For Task 2, which focuses on identifying emotions from audio and text, we used a pre-trained Transformer-based model, Wav2vec 2.0 (specifically the facebook/wav2vec2-large-xlsr-53-spanish model), to obtain vector representations of the audio. These vectors are combined with the text embeddings and used as input to an RF classifier. The goal is to identify emotions from a combination of audio and text, taking advantage of the representation capabilities of the pre-trained model and the robustness of RF for classification. A simplified sketch of this pipeline is given at the end of this section.

Wav2vec 2.0 is a deep learning model developed by Facebook AI Research (FAIR) for self-supervised learning of speech representations. It is primarily used to generate high-quality vector representations (embeddings) of audio, which makes it particularly useful for classification tasks.

Figure 1: Overall system architecture.
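The following minimal sketch illustrates the pipeline described above: FastText text embeddings for Task 1, concatenated with Wav2Vec 2.0 audio embeddings for Task 2, and a Random Forest classifier on top. The Spanish FastText vectors (cc.es.300.bin), the mean pooling of the Wav2Vec 2.0 hidden states, the use of librosa for audio loading, and the classifier hyperparameters are assumptions made for this sketch; the paper does not specify these details.

# Minimal sketch of the Task 1 and Task 2 pipelines described in Section 3.
# Assumptions not specified in the paper: the Spanish FastText vectors
# ("cc.es.300.bin"), mean pooling over the Wav2Vec 2.0 hidden states,
# librosa for audio loading, and the Random Forest hyperparameters.
import numpy as np
import fasttext
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.ensemble import RandomForestClassifier

# Text embeddings with FastText (Task 1).
ft_model = fasttext.load_model("cc.es.300.bin")

def text_embedding(text: str) -> np.ndarray:
    # FastText sentence vector (newlines removed, as required by the library).
    return ft_model.get_sentence_vector(text.replace("\n", " "))

# Audio embeddings with Wav2Vec 2.0 (Task 2), using the model named in the paper.
W2V_NAME = "facebook/wav2vec2-large-xlsr-53-spanish"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(W2V_NAME)
w2v_model = Wav2Vec2Model.from_pretrained(W2V_NAME)
w2v_model.eval()

def audio_embedding(path: str) -> np.ndarray:
    # Load the segment as 16 kHz mono and mean-pool the last hidden states
    # into a fixed-size vector (the pooling strategy is an assumption).
    waveform, _ = librosa.load(path, sr=16000, mono=True)
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = w2v_model(**inputs).last_hidden_state   # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def build_features(texts, audio_paths=None):
    # Task 1 uses only text embeddings; Task 2 concatenates text and audio vectors.
    text_feats = np.vstack([text_embedding(t) for t in texts])
    if audio_paths is None:
        return text_feats
    audio_feats = np.vstack([audio_embedding(p) for p in audio_paths])
    return np.hstack([text_feats, audio_feats])

# Random Forest classifier used for both subtasks (default hyperparameters assumed).
clf = RandomForestClassifier(random_state=42)
# clf.fit(build_features(train_texts, train_audio_paths), train_labels)   # Task 2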
4. Results

Tables 2 and 3 show the results of the RF classification model in the two tasks, as well as the scores obtained by the different teams in the official ranking of each task.

Table 2
Results of the RF model on the test split for Task 1 and Task 2. The macro precision (M-P), macro recall (M-R), and macro F1-score (M-F1) are reported.

Task     Model   M-P        M-R        M-F1
Task 1   RF      0.450811   0.409147   0.410227
Task 2   RF      0.538039   0.479356   0.481559

Table 3
Official leaderboard for Task 1 and Task 2

     Task 1                            Task 2
#    Team Name        M-F1        #    Team Name        M-F1
1    TEC_TEZUITLAN    0.671856    1    BSC-UPC          0.87
2    CogniCIC         0.657527    2    THAU-UPM         0.866892
3    UNED-UNIOVI      0.655287    3    CogniCIC         0.824833
4    UKR              0.648417    4    TEC_TEZUITLAN    0.712259
...                               ...
10   Baseline         0.496829    9    Baseline         0.530757
10   UTP              0.410227    8    UTP              0.481559

For Task 1, which consists of identifying emotions from text, the RF model obtains a macro F1 score of 0.4102. This score reflects moderate macro precision and recall, suggesting that the model can identify emotions in text with some effectiveness, although there is room for improvement. On the official leaderboard, our team UTP is ranked 10th with this score; several teams achieved considerably higher scores, and we did not outperform the baseline.

For Task 2, which focuses on identifying emotions from audio and text, the RF model achieves a macro F1 score of 0.4816. This score is higher than the one obtained in Task 1, indicating that the model performs better when audio information is included in addition to text. On the official leaderboard, our team UTP ranks 8th with this score. Again, there is still considerable room for improvement to reach the highest scores, and we did not surpass the baseline.

Overall, the results show that the RF model performs acceptably in both tasks, but there is still room for improvement, especially in identifying emotions from text in Task 1. To better understand the behavior of the model, we extracted the classification report on the test set for each task (a sketch of how these metrics are computed is given at the end of this section). Table 4 shows the classification report for emotion identification from text (Task 1), while Table 5 shows the report for the combined audio and text approach (Task 2).

Table 4
Classification report of the RF model in Task 1

               Precision   Recall     F1-score
anger          0.300000    0.060000   0.100000
disgust        0.472393    0.435028   0.452941
fear           0.000000    0.000000   0.000000
joy            0.704225    0.555556   0.621118
neutral        0.598109    0.869416   0.708683
sadness        0.630137    0.534884   0.578616
accuracy       0.576000    0.576000   0.576000
macro avg      0.450811    0.409147   0.410227
weighted avg   0.540314    0.576000   0.536079

In Task 1, the RF model shows variable precision and recall across the emotion classes. It stands out for its relatively high precision and recall on the joy and neutral classes, but shows lower performance on anger and sadness. The weighted average precision is 54.03%, suggesting an overall acceptable performance, but with room for improvement.

Table 5
Classification report of the RF model in Task 2

               Precision   Recall     F1-score
anger          0.476190    0.100000   0.165289
disgust        0.516807    0.694915   0.592771
fear           0.000000    0.000000   0.000000
joy            0.833333    0.611111   0.705128
neutral        0.748571    0.900344   0.817473
sadness        0.653333    0.569767   0.608696
accuracy       0.665333    0.665333   0.665333
macro avg      0.538039    0.479356   0.481559
weighted avg   0.650820    0.665333   0.633524

For Task 2, the RF model also shows variable results across the emotion classes. It stands out for its notably high precision on the joy class, but fails to predict the fear class at all (with a precision, recall, and F1 score of 0). The weighted average precision is 65.08%, which is better than in Task 1 but still leaves room for improvement.
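For reference, the macro-averaged scores in Table 2 and the per-class reports in Tables 4 and 5 correspond to the standard scikit-learn metrics. The following sketch shows how they can be computed; the label lists are placeholders for the gold labels and the RF predictions on the test split.

# Sketch of how the macro-averaged scores (Table 2) and per-class reports
# (Tables 4 and 5) can be computed with scikit-learn.
from sklearn.metrics import classification_report, precision_recall_fscore_support

y_test = ["neutral", "joy", "anger"]     # placeholder gold labels
y_pred = ["neutral", "joy", "neutral"]   # placeholder RF predictions

print(classification_report(y_test, y_pred, zero_division=0))

m_p, m_r, m_f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"M-P={m_p:.4f}  M-R={m_r:.4f}  M-F1={m_f1:.4f}")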
5. Conclusion

This paper describes the participation of UTP in the EmoSPeech 2024 shared task at IberLEF. The task explores the field of Automatic Emotion Recognition (AER) from two approaches: i) a textual approach, which uses only textual content to identify the expressed emotion; and ii) a multimodal approach, which combines audio and text to identify the emotion. The shared task is therefore divided into two subtasks corresponding to these approaches.

For Task 1, we classified emotions using text embeddings obtained with a FastText model and the Random Forest algorithm, obtaining an M-F1 score of 0.41 and reaching 10th position in the leaderboard. For Task 2, we modified the approach used for Task 1 by adding audio features extracted with a pre-trained Wav2Vec 2.0 model. With this approach, we obtained an M-F1 score of 0.48, ranking 8th in the leaderboard. Although the results of both tasks did not exceed the baseline, we can see that the audio features obtained with Wav2Vec 2.0 complement the text embeddings and improve performance.

As future work, we plan to improve the approach using fine-tuning techniques and to test other classification algorithms, such as recurrent neural networks (RNN), support vector machines (SVM), and convolutional neural networks (CNN). We also propose to test different pre-trained language models such as BETO and MarIA, since [11], [12], and [13] have demonstrated the good performance of these models in classification tasks across different domains. Finally, we suggest exploring whether sentiment features can enhance emotion detection, given their complementary nature; as demonstrated in [14], this approach has proven effective in domains such as politics, marketing, and healthcare.

References

[1] A. Salmerón-Ríos, J. A. García-Díaz, R. Pan, R. Valencia-García, Fine grain emotion analysis in Spanish using linguistic features and transformers, PeerJ Computer Science 10 (2024) e1992. doi:10.7717/peerj-cs.1992.
[2] A. A. Varghese, J. P. Cherian, J. J. Kizhakkethottam, Overview on emotion recognition system, in: 2015 International Conference on Soft-Computing and Networks Security (ICSNS), 2015, pp. 1–5. doi:10.1109/ICSNS.2015.7292443.
[3] F. Chenchah, Z. Lachiri, Speech emotion recognition in noisy environment, in: 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2016, pp. 788–792. doi:10.1109/ATSIP.2016.7523189.
[4] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments, Computer Standards & Interfaces 90 (2024) 103856. URL: https://www.sciencedirect.com/science/article/pii/S0920548924000254. doi:10.1016/j.csi.2024.103856.
[5] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sanchez, R. Valencia-García, Overview of EmoSPeech at IberLEF 2024: Multimodal Speech-text Emotion Recognition in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
[6] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[7] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, Advances in Neural Information Processing Systems 33 (2020) 12449–12460.
[8] S. Mohammad, F. Bravo-Marquez, WASSA-2017 shared task on emotion intensity, in: A. Balahur, S. M. Mohammad, E. van der Goot (Eds.), Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 34–49. URL: https://aclanthology.org/W17-5205. doi:10.18653/v1/W17-5205.
[9] N.-V. Nguyen, X.-S. Vu, C. Rigaud, L. Jiang, J.-C. Burie, ICDAR 2021 competition on multimodal emotion recognition on comics scenes, in: International Conference on Document Analysis and Recognition, Springer, 2021, pp. 767–782.
[10] F. M. Plaza-del Arco, S. M. Jiménez-Zafra, A. Montejo-Ráez, M. D. Molina-González, L. A. Ureña-López, M. T. Martín-Valdivia, Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021, Procesamiento del Lenguaje Natural 67 (2021) 155–161. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6385.
[11] R. Pan, J. García-Díaz, F. Garcia-Sanchez, R. Valencia-García, Evaluation of transformer models for financial targeted sentiment analysis in Spanish, PeerJ Computer Science 9 (2023) e1377. doi:10.7717/peerj-cs.1377.
[12] J. A. García-Díaz, S. M. Jiménez-Zafra, M. T. Martín-Valdivia, F. García-Sánchez, L. A. Ureña-López, R. Valencia-García, Overview of PoliticEs 2022: Spanish Author Profiling for Political Ideology, Procesamiento del Lenguaje Natural 69 (2022) 265–272. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6446.
[13] J. A. García-Díaz, G. Beydoun, R. Valencia-García, Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish, Data & Knowledge Engineering 151 (2024) 102307. URL: https://www.sciencedirect.com/science/article/pii/S0169023X24000314. doi:10.1016/j.datak.2024.102307.
[14] F. Ramírez-Tinoco, G. Alor-Hernández, J. Sánchez-Cervantes, M. Salas Zarate, R. Valencia-García, Use of Sentiment Analysis Techniques in Healthcare Domain, 2019, pp. 189–212. doi:10.1007/978-3-030-06149-4_8.