SINAI at EmoSPeech-IberLEF2024: Evaluating Popular Tools and Transformers Models for
Multimodal Speech-Text Emotion Recognition in Spanish
Daniel García-Baena1,*, Miguel Ángel García-Cumbreras1 and Salud María Jiménez-Zafra1
1 Computer Science Department, SINAI, CEATIC, Universidad de Jaén, 23071, Spain


Abstract
This work presents the participation of the SINAI team in the EmoSPeech-IberLEF2024 shared task,
Multimodal Speech-Text Emotion Recognition in Spanish. We addressed the first of the proposed
tasks, focused on extracting features and identifying the most representative ones of each emotion in a
dataset created from real-life situations compiled from YouTube videos. For emotion analysis, we
evaluated some of the most popular transformer models and specific open source emotion analysis
transformers publicly available on Hugging Face. In total, 14 systems participated (including the
baseline provided by the organizers). The best run submitted by our team was placed 8th
with an F1-score of 0.5200, while the best result in the first task ranking was 0.6719.

Keywords
emotion analysis, text emotion recognition, transformers, natural language processing


1. Introduction
IberLEF is a shared evaluation campaign for Natural Language Processing (NLP) systems in
Spanish and other Iberian languages [1]. In an annual cycle that starts in December (with
the call for task proposals) and ends in September (with an IberLEF meeting collocated with
SEPLN), several challenges are run with large international participation from research groups in
academia and industry. Specifically, this shared task was titled EmoSPeech 2024 Task - Multimodal
Speech-Text Emotion Recognition in Spanish [2], and aims to explore multimodal speech-text
emotion recognition for texts written in Spanish [3].
   We found it interesting to take part in this shared task especially because being able to
recognize human emotions is crucial for building positive relationships, whether in person or
through interactions with computers [4]. Automatic Emotion Recognition (AER) is of growing
importance due to its impact on fields such as healthcare, psychology, social sciences and
marketing [5]. AER software can help provide personalized responses and recommendations,
leading to improved user engagement and satisfaction. AER can be addressed
using several taxonomies and commonly focuses on recognizing six basic emotional expressions:
anger, disgust, fear, happiness, sadness and surprise [6]. By automatically recognizing
emotions, a system could identify, interpret and respond to different forms of human
communication, such as text, facial expressions, tone of voice or even body language. Different
features can be used to identify emotions [7] but, in this work, our team focused exclusively on
text for developing automatic emotion analysis systems. The competition is available on CodaLab:
https://codalab.lisn.upsaclay.fr/competitions/17647

IberLEF 2024, September 2024, Valladolid, Spain
* Corresponding author.
Email: daniel.gbaena@gmail.com (D. García-Baena); magc@ujaen.es (M. García-Cumbreras); sjzafra@ujaen.es (S. M. Jiménez-Zafra)
ORCID: 0000-0002-3334-8447 (D. García-Baena); 0000-0003-1867-9587 (M. García-Cumbreras); 0000-0003-3274-8825 (S. M. Jiménez-Zafra)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073


2. Task description
As previously noted, our team concentrated on the first subtask of this challenge:
analyzing a given text and identifying the emotion it conveys based on five of Ekman’s six basic
emotions (anger, disgust, fear, joy and sadness), plus one neutral category. Therefore, we
developed several AER systems that worked exclusively with text written in Spanish. The dataset,
a corpus created from real-life situations, consists of about 3500-4000 transcripts of audio
segments collected from YouTube by the shared task organizers and divided into training and test
sets in an 80%-20% split (see Table 1). From these texts we extracted different features in order
to identify the most representative emotion present in each of them.
   The aim of the first task was to analyze texts and identify the emotion that they convey
based on Ekman’s six basic emotions: anger, disgust, fear, joy, sadness and surprise.
It is important to note that the organizers removed surprise from the list, due to the small
number of examples, and added one neutral category in order to classify those texts that
do not relate to any of the emotions in Ekman’s list. The evaluation measures for this subtask
were precision, recall and F1-score. Consequently, emotion classification systems in this work
were ranked according to their macro-F1 scores.
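
As an illustration of the ranking metric, the sketch below computes macro-averaged precision, recall and F1 over the six task labels in plain Python. It is a minimal sketch, not the organizers' evaluation script: the function name `macro_scores` and the convention of scoring a class as 0.0 when its denominator is empty are assumptions made here.

```python
from typing import Dict, Sequence

# The six labels of task 1: five of Ekman's emotions plus neutral.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]


def macro_scores(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """Return macro-averaged precision, recall and F1 (hypothetical helper)."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # Assumption: a class with an empty denominator contributes 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Macro averaging: every class weighs the same, regardless of its size.
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Because every class contributes equally to the average, a rare class such as fear weighs as much as a frequent one such as neutral.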


3. Methodology
We evaluated several publicly available models from Hugging Face for task 1. With this task, the
organizers aimed to classify texts extracted from YouTube according to
five of Ekman’s basic emotions plus one additional neutral option. For this purpose, we chose
ten of the most popular open source transformer models from Hugging Face and evaluated their
results when they were trained with the shared dataset.
   To select the ten LLMs from Hugging Face, we used the website's public filter to choose
those specifically indicated for emotion analysis on texts written in Spanish. As
we intended to discover which of the most popular models gave the best results for this first
task, we sorted the results from the Hugging Face filter to show the most downloaded
transformer models on top.
   As can be seen in Table 2, we worked with several types of models while trying to perform
emotion analysis over the shared dataset. Thus, we did not only evaluate LLMs pretrained on texts
written in Spanish, but also those pretrained only on text written in
English and those developed using several languages, in this last case always
including Spanish. In addition, we did not focus exclusively on models made
specifically for emotion analysis, but also considered the most popular general-purpose
options and LLMs made for similar areas such as sentiment analysis.

Table 1
Dataset distribution
 Set          Emotion    Total texts
 dev-train    anger               51
 dev-train    disgust             91
 dev-train    fear                 3
 dev-train    joy                 46
 dev-train    neutral            149
 dev-train    sadness             44
 dev-test     anger               13
 dev-test     disgust             22
 dev-test     fear                 1
 dev-test     joy                 12
 dev-test     neutral             37
 dev-test     sadness             11
 train        anger              399
 train        disgust            705
 train        fear                23
 train        joy                362
 train        neutral           1166
 train        sadness            345
 test         anger              100
 test         disgust            177
 test         fear                 6
 test         joy                 90
 test         neutral            291
 test         sadness             86
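
The 80%-20% split mentioned in the task description can be checked directly against the counts in Table 1. The following is a small worked example; the dictionary name `TABLE_1` and the helper `train_share` are hypothetical, while the numbers are copied from the table.

```python
# Counts copied from Table 1, in the order:
# anger, disgust, fear, joy, neutral, sadness.
TABLE_1 = {
    "dev-train": [51, 91, 3, 46, 149, 44],
    "dev-test": [13, 22, 1, 12, 37, 11],
    "train": [399, 705, 23, 362, 1166, 345],
    "test": [100, 177, 6, 90, 291, 86],
}


def train_share(train_set: str, test_set: str) -> float:
    """Fraction of texts that fall in the training part of a split."""
    n_train = sum(TABLE_1[train_set])
    n_test = sum(TABLE_1[test_set])
    return n_train / (n_train + n_test)


# Totals per set, followed by the training share of each split.
print({name: sum(counts) for name, counts in TABLE_1.items()})
print(train_share("train", "test"))
print(train_share("dev-train", "dev-test"))
```

The train and test sets add up to 3000 and 750 texts respectively, so both the final split and the development split keep 80% of the texts for training.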


4. Experimental setup
It is important to note that we did not perform any data pre-processing on the shared
dataset before running the experiments.
   All models were downloaded from their public profiles on Hugging
Face. For fine-tuning, we always used Google Colab under a Pro
configuration in order to use its GPU-based hardware options.
   Finally, concerning the hyperparameters, we did not perform any hyperparameter search,
so all model configurations are the default ones.
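
The setup described above (a public Hugging Face checkpoint fine-tuned on raw text with default hyperparameters) could look roughly like the following sketch. It is not the authors' code: the use of the `Trainer` API, the helper name `fine_tune`, the `output_dir` value and the choice to re-initialize the classification head with `ignore_mismatched_sizes=True` are assumptions made for illustration.

```python
from typing import Dict, List

# Label set of task 1: five of Ekman's emotions plus neutral.
LABELS: List[str] = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]
LABEL2ID: Dict[str, int] = {label: i for i, label in enumerate(LABELS)}
ID2LABEL: Dict[int, str] = {i: label for label, i in LABEL2ID.items()}


def fine_tune(model_name: str, texts: List[str], labels: List[str],
              output_dir: str = "out"):
    """Fine-tune a checkpoint with default Trainer settings (hypothetical helper)."""
    # Heavy dependencies are imported lazily so the label mapping above
    # can be reused without torch/transformers installed.
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
        # Assumption: checkpoints whose head has another number of
        # classes get a freshly initialized six-way head.
        ignore_mismatched_sizes=True,
    )
    # No text pre-processing: raw transcripts are tokenized as they are.
    encodings = tokenizer(texts, truncation=True, padding=True)

    class TextDataset(torch.utils.data.Dataset):
        def __len__(self) -> int:
            return len(labels)

        def __getitem__(self, idx: int):
            item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
            item["labels"] = torch.tensor(LABEL2ID[labels[idx]])
            return item

    # Only output_dir is set; every other hyperparameter keeps its default.
    trainer = Trainer(model=model,
                      args=TrainingArguments(output_dir=output_dir),
                      train_dataset=TextDataset())
    trainer.train()
    return trainer
```

A call such as `fine_tune("finiteautomata/beto-emotion-analysis", train_texts, train_labels)` would then train on the shared dataset, where `train_texts` and `train_labels` stand for the transcripts and their gold emotions.
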
Table 2
Rank list for the training phase
 Model                                                                Main language     F1-score
 finiteautomata/beto-emotion-analysis                                 Spanish             0.8172
 pysentimiento/robertuito-emotion-analysis                            Spanish             0.5486
 finiteautomata/beto-sentiment-analysis                               Spanish             0.5064
 lxyuan/distilbert-base-multilingual-cased-sentiments-student         Multilingual        0.4634
 nlptown/bert-base-multilingual-uncased-sentiment                     Multilingual        0.4342
 somosnlp/bertin_base_climate_detection_spa_v2                        Spanish             0.4063
 mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis    English             0.3878
 distilbert/distilbert-base-uncased-finetuned-sst-2-english           English             0.3535
 distilbert-base-uncased-finetuned-sst-2-english                      English             0.3104
 papluca/xlm-roberta-base-language-detection                          Multilingual        0.2408


Table 3
Rank list for the test phase
    Model                                                          Main language     F1-score
    finiteautomata/beto-emotion-analysis                           Spanish             0.5200
    finiteautomata/beto-sentiment-analysis                         Spanish             0.4919
    pysentimiento/robertuito-emotion-analysis                      Spanish             0.4704
    somosnlp/bertin_base_climate_detection_spa_v2                  Spanish             0.4321
    nlptown/bert-base-multilingual-uncased-sentiment               Multilingual        0.4074
    lxyuan/distilbert-base-multilingual-cased-sentiments-student   Multilingual        0.3931


5. Results and discussion
This section presents the results obtained in the evaluation phase of the shared task EmoSPeech
[2], Multimodal Speech-Text Emotion Recognition in Spanish, at IberLEF 2024. The organizers
selected the F1-score to rank the systems in task 1. Each participating team could
submit a maximum of ten runs through CodaLab, from which each team had to select the best
one for the ranking. We selected our top runs based on the experiments carried out in the
training phase. The models and their results for each of the test runs are shown, sorted by
F1-score, in Table 3.
   Firstly, we would like to highlight that the Spanish-based systems outperformed the
multilingual ones. Training results for the English-only LLMs were especially disappointing, so we
focused on those models that achieved an F1-score over 0.4000 in the training phase (see Table 2).
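
The selection rule can be reproduced from the training-phase scores in Table 2. The snippet below is only a worked example: the names `TRAINING_F1`, `THRESHOLD` and `selected` are hypothetical, and the scores are copied from the table.

```python
# Training-phase F1-scores copied from Table 2.
TRAINING_F1 = {
    "finiteautomata/beto-emotion-analysis": 0.8172,
    "pysentimiento/robertuito-emotion-analysis": 0.5486,
    "finiteautomata/beto-sentiment-analysis": 0.5064,
    "lxyuan/distilbert-base-multilingual-cased-sentiments-student": 0.4634,
    "nlptown/bert-base-multilingual-uncased-sentiment": 0.4342,
    "somosnlp/bertin_base_climate_detection_spa_v2": 0.4063,
    "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis": 0.3878,
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english": 0.3535,
    "distilbert-base-uncased-finetuned-sst-2-english": 0.3104,
    "papluca/xlm-roberta-base-language-detection": 0.2408,
}

THRESHOLD = 0.4000  # cut-off stated in the text

# Keep the models above the threshold, best first.
ranked = sorted(TRAINING_F1.items(), key=lambda kv: kv[1], reverse=True)
selected = [name for name, f1 in ranked if f1 > THRESHOLD]

for name in selected:
    print(f"{TRAINING_F1[name]:.4f}  {name}")
```

Six models pass the cut-off, and they are the same six that appear in Table 3; all three English-only models fall below it.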
   Moreover, models made specifically for emotion analysis generally outperformed
those with a general or related but different purpose (sentiment analysis). Nevertheless, one
sentiment analysis model, finiteautomata/beto-sentiment-analysis [8, 9], achieved
a better F1-score in the test phase than the emotion analysis LLM
pysentimiento/robertuito-emotion-analysis [10, 9, 11].
   Turning to Table 2, the best training result, obtained by finiteautomata/beto-emotion-analysis
[10, 9], is surprisingly better than its result in Table 3 (test phase): 0.8172 versus
0.5200. We cannot confirm this categorically, but it seems that the training phase
subset distributed by the shared task organizers contained several texts that were similar to,
or directly extracted from, the dataset that was used for training the
finiteautomata/beto-emotion-analysis model. In contrast, the test subset from this shared task
presumably does not contain texts so similar to the finiteautomata/beto-emotion-analysis
training dataset.
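
To put the 0.8172 versus 0.5200 gap in context, the drop from the training phase to the test phase can be computed for every model that appears in both Table 2 and Table 3. This is a small arithmetic sketch; the variable names are hypothetical and the scores are copied from the two tables. It does not by itself confirm or refute the overlap hypothesis above.

```python
# F1-scores copied from Table 2 (training phase) and Table 3 (test phase).
TRAIN_F1 = {
    "finiteautomata/beto-emotion-analysis": 0.8172,
    "pysentimiento/robertuito-emotion-analysis": 0.5486,
    "finiteautomata/beto-sentiment-analysis": 0.5064,
    "lxyuan/distilbert-base-multilingual-cased-sentiments-student": 0.4634,
    "nlptown/bert-base-multilingual-uncased-sentiment": 0.4342,
    "somosnlp/bertin_base_climate_detection_spa_v2": 0.4063,
}
TEST_F1 = {
    "finiteautomata/beto-emotion-analysis": 0.5200,
    "finiteautomata/beto-sentiment-analysis": 0.4919,
    "pysentimiento/robertuito-emotion-analysis": 0.4704,
    "somosnlp/bertin_base_climate_detection_spa_v2": 0.4321,
    "nlptown/bert-base-multilingual-uncased-sentiment": 0.4074,
    "lxyuan/distilbert-base-multilingual-cased-sentiments-student": 0.3931,
}

# Positive values mean the model scored lower on the test set.
drops = {name: round(TRAIN_F1[name] - TEST_F1[name], 4) for name in TEST_F1}
largest = max(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{drop:+.4f}  {name}")
```

The drop of finiteautomata/beto-emotion-analysis (0.2972) is by far the largest of the six; the remaining models change by less than 0.08 between phases.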
   As we expected, both the finiteautomata/beto-emotion-analysis
and pysentimiento/robertuito-emotion-analysis models, which were trained with all
six of Ekman’s basic emotions, ranked among the top three in Table 3. However, models such as
finiteautomata/beto-sentiment-analysis, somosnlp/bertin_base_climate_detection_spa_v2 and
lxyuan/distilbert-base-multilingual-cased-sentiments-student achieved similar F1-scores with a
positive, negative and neutral configuration.
   In addition, we attribute the low F1-scores to the large number of labels that had to be
considered during classification. Systems needed to distinguish between six
options (anger, disgust, fear, joy, sadness and neutral), and this
requirement makes the task much harder than developing a simple binary classifier.
   In this regard, we find it important to highlight that even with a presumably low top
F1-score of 0.5200, we were just 0.1519 points behind the best-performing team in the first
task, placing 8th. This small difference supports our view
that classification with six categories is a hard task for the most popular current open source
LLMs.


6. Conclusions and future work
In this paper we have presented the participation of the SINAI team in the shared task
EmoSPeech, Multimodal Speech-Text Emotion Recognition in Spanish, at IberLEF 2024. The objective
of our experiments, for the emotion analysis task, was to test the performance of the most
popular emotion analysis transformer-based models. The main conclusion is that the most popular
transformer-based solutions are not accurate when they have to distinguish between six
options.
   In the future, we want to continue evaluating additional resources in order to further
improve our systems by analyzing the contribution of each LLM, testing different transfer
learning systems and using different data preprocessing systems to generate new datasets
and/or augment existing ones.


Acknowledgments
This work has been partially supported by Project CONSENSO (PID2021-122263OB-C21),
Project MODERATES (TED2021-130145B-I00) and Project SocialTox (PDC2022-133146-C21)
funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR,
Project PRECOM (SUBV-00016) funded by the Ministry of Consumer Affairs
of the Spanish Government, Project FedDAP (PID2020-116118GA-I00) supported by
MICINN/AEI/10.13039/501100011033, Project PID2020-119478GB-I00 supported by
MICINN/AEI/10.13039/501100011033, and WeLee project (1380939, FEDER Andalucía 2014-2020)
funded by the Andalusian Regional Government. The research work conducted by Salud
María Jiménez-Zafra has been supported by Action 7 from Universidad de Jaén under the
Operational Plan for Research Support 2023-2024.


References
 [1] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language
     Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the
     Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference
     of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org,
     2024.
 [2] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sánchez, R. Valencia-García,
     Overview of EmoSPeech at IberLEF 2024: Multimodal Speech-text Emotion Recognition
     in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
 [3] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Spanish MEACorpus
     2023: A multimodal speech-text corpus for emotion analysis in Spanish from natural
     environments, Computer Standards & Interfaces (2024) 103856.
 [4] A. Varghese, J. Cherian, J. Kizhakkethottam, Overview on emotion recognition system,
     2015, pp. 1–5. doi:10.1109/ICSNS.2015.7292443.
 [5] F. Chenchah, Z. Lachiri, Speech emotion recognition in noisy environment, 2016, pp.
     788–792. doi:10.1109/ATSIP.2016.7523189.
 [6] E. Rolls, P. Ekman, D. Perrett, H. Ellis, Facial expressions of emotion: An old controversy
     and new findings: Discussion, Royal Society of London Philosophical Transactions Series
     B 335 (1992) 69–. doi:10.1098/rstb.1992.0008.
 [7] M. S. Fahad, A. Ranjan, J. Yadav, A. Deepak, A survey of speech emotion recognition in
     natural environment, Digital Signal Processing 110 (2020) 102951.
     doi:10.1016/j.dsp.2020.102951.
 [8] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT
     model and evaluation data, PML4DC at ICLR 2020 (2020) 1–10.
 [9] J. M. Pérez, J. C. Giudici, F. Luque, pysentimiento: A python toolkit for sentiment analysis
     and socialnlp tasks, 2021. arXiv:2106.09462.
[10] F. M. P. del Arco, C. Strapparava, L. A. Ureña-López, M. T. Martín-Valdivia, Emoevent:
     A multilingual emotion corpus based on different events, in: Proceedings of the 12th
     Language Resources and Evaluation Conference, 2020, pp. 1492–1498.
[11] J. M. Pérez, D. A. Furman, L. Alonso Alemany, F. M. Luque, RoBERTuito: a pre-trained
     language model for social media text in Spanish, in: Proceedings of the Thirteenth
     Language Resources and Evaluation Conference, European Language Resources Association,
     Marseille, France, 2022, pp. 7235–7243. URL: https://aclanthology.org/2022.lrec-1.785.