=Paper=
{{Paper
|id=Vol-3756/EmoSPeech2024_paper3
|storemode=property
|title=Team ITST at EmoSPeech-IberLEF2024: Multimodal Speech-text Emotion Recognition in Spanish Forum
|pdfUrl=https://ceur-ws.org/Vol-3756/EmoSPeech2024_paper3.pdf
|volume=Vol-3756
|authors=Mario Andrés Paredes-Valverde,María del Pilar Salas-Zárate
|dblpUrl=https://dblp.org/rec/conf/sepln/Paredes-Valverde24
}}
==Team ITST at EmoSPeech-IberLEF2024: Multimodal Speech-text Emotion Recognition in Spanish Forum==
Mario Andrés Paredes-Valverde1,*,† and María del Pilar Salas-Zárate1,†
1 Tecnológico Nacional de México/I.T.S. Teziutlán, Fracción I y II SN, 73960 Teziutlán, Puebla, Mexico
Abstract
This work describes the participation of the ITST team in the EmoSPeech 2024 Task - Multimodal
Speech-text Emotion Recognition in Spanish. To address Task 1, Text AER (Automatic Emotion
Recognition), this work proposes a fine-tuning strategy that involves adapting a set of pre-trained
transformer models to a specific task. Regarding Task 2, Multimodal AER, an ensemble learning
process is proposed. This approach combines multiple individual models, specifically a wav2vec
model and the BETO model, to improve predictive performance compared to the performance of
each model separately. Furthermore, the mean and maximum probability measures were used to
provide the final prediction.
Keywords
transformers, fine-tuning, wav2vec
1. Introduction
With the ever-increasing use of electronic devices such as computers and smartphones,
emotion recognition has become a significant area of research within the field of human-
computer interaction. This area involves the identification and analysis of human emotions
through various data inputs, such as facial expressions, voice intonations, textual content,
and physiological signals.
In the literature, there are several efforts to categorize emotions. For example, Ekman (1992)
proposed a taxonomy of six discrete emotions that are recognized across different cultures,
namely anger, disgust, fear, happiness, sadness, and surprise. These
emotions are popular because they are easily recognizable through facial expressions and
other physiological responses.
In the context of Natural Language Processing (NLP), transformers are a type of deep
learning model architecture that aims to solve sequence-to-sequence tasks through a
mechanism called attention (Vaswani et al., 2017). Transformers have been used to build
NLP-based solutions that solve problems such as machine translation, conversational
agents, question-answering, and text generation, as well as emotion detection.
IberLEF 2024, September 2024, Valladolid, Spain
∗ Corresponding author.
† These authors contributed equally.
mario.pv@teziutlan.tecnm.mx (M.A. Paredes-Valverde); maria.sz@teziutlan.tecnm.mx (M.P. Salas-Zárate)
0000-0001-9508-9818 (M.A. Paredes-Valverde); 0000-0003-1818-3434 (M.P. Salas-Zárate)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
Although transformers were originally designed for NLP tasks, they have been successfully adapted
to a wide range of domains such as image processing, speech processing, time series
forecasting, reinforcement learning, and multimodal applications.
The scientific context surrounding AER is rich and multifaceted, encompassing various
approaches to recognize emotions from different modalities. Traditional methods primarily
relied on handcrafted features extracted from facial expressions, voice signals, and text,
with machine learning algorithms such as support vector machines (SVMs) and hidden
Markov models (HMMs) classifying these features into discrete emotion categories. With
the advent of deep learning, more sophisticated approaches, including Convolutional Neural
Networks (CNNs) and Recurrent Neural Networks (RNNs) such as Long Short-Term
Memory (LSTM) networks, have emerged to capture spatial and temporal dependencies in
data, significantly improving emotion recognition accuracy. In recent years, multimodal
approaches that combine data from multiple sources, such as text, speech, and facial
expressions, have gained prominence. These models leverage complementary information
from each modality, leading to more robust emotion recognition, with transformers playing
a key role due to their ability to handle diverse data types through attention mechanisms.
This paper describes our participation in the EmoSPeech 2024 Task - Multimodal Speech-
text Emotion Recognition in Spanish (Pan, García-Díaz, Rodríguez-García, García-Sánchez,
et al., 2024), which is part of the IberLEF 2024 workshop (Chiruzzo et al., 2024). This work
implements a fine-tuning strategy that involves adapting a pre-trained transformer model
to a specific downstream task, such as named entity recognition, sentiment analysis, or
automatic emotion recognition (AER). Specifically, the fine-tuning process leverages the
knowledge the transformer model has already acquired during pre-training on a large
corpus, thereby requiring less task-specific data and fewer computational resources than
training a model from scratch. The corpus used for this task was a multimodal speech-text
corpus for emotion analysis in Spanish from natural environments (Pan, García-Díaz,
Rodríguez-García, & Valencia-García, 2024).
It is important to mention that our approach obtained first place in the AER from text
task, which uses a dataset created from real-life situations, and fourth place in the
multimodal AER task, which uses a dataset consisting of more than 13.16 hours of audio
segments annotated with five of Ekman's six emotions.
The next section outlines the fine-tuning-based strategies developed to automatically
identify five of Ekman's six basic emotions (anger, disgust, fear, joy, and sadness) from
text as well as from a combination of text and speech cues. In the final remarks, we discuss
our findings and propose directions for future research.
2. Developed Strategies
2.1. Task 1: Automatic Emotion Recognition
The aim of the first challenge of the EmoSPeech 2024 Task is to explore the field of AER
from text, i.e., extracting features and identifying the most representative feature of each
emotion. The dataset used for this task was created from real-life situations; specifically, it
consists of 3000 comments for the training phase and 750 for the testing phase. Figure 1
shows the flowchart of the fine-tuning process for automatic emotion recognition proposed
in this work. As can be seen, this process consists of five main phases. A brief description of
this process is provided below.
Figure 1: Flow diagram of the fine-tuning process for automatic emotion recognition from
text.
1. Select pre-trained models for Spanish. In this case, the models selected are BETO
(Cañete et al., 2023), BERTIN (De La Rosa et al., 2022), and MarIA (Gutiérrez-Fandiño
et al., 2021). BETO is a BERT-based language model pre-trained exclusively on
Spanish data. BERTIN is a model pre-trained using perplexity sampling. Meanwhile,
MarIA is a family of Spanish language models that includes the RoBERTa-base,
RoBERTa-large, GPT2, and GPT2-large Spanish language models.
2. Train the three pre-trained models with the dataset provided by the challenge.
3. Split the training dataset into training and validation subsets in a 90-10 ratio for the fine-tuning process.
4. Use the training dataset to perform the fine-tuning process and adapt the pre-trained
models to the automatic emotion recognition task.
5. Use the validation dataset to tune the model's hyperparameters and to evaluate the
model's performance on unseen data. In this case, the hyperparameters were
configured as follows: a learning rate of 2e-5, a training batch size of 16, and an
evaluation strategy based on epochs (a minimal sketch of this configuration is shown below).
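The following is a minimal sketch of this fine-tuning setup using the Hugging Face transformers and datasets libraries. The BETO checkpoint and the hyperparameters (learning rate 2e-5, batch size 16, per-epoch evaluation, 90-10 split) follow the description above; the column names ("text", "label"), the placeholder data, and the number of epochs are illustrative assumptions, not values reported in this paper.

```python
# Minimal sketch of the fine-tuning process (steps 1-5), not the exact
# training script used for the challenge.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-cased"  # BETO checkpoint
EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(EMOTIONS))

def tokenize(batch):
    # Convert raw comments into fixed-length token id sequences.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# Placeholder rows standing in for the 3000 training comments.
dataset = Dataset.from_dict({
    "text": ["example comment"] * 10,  # assumption: replace with real corpus
    "label": [4] * 10,                 # 4 = "neutral" in EMOTIONS
})
splits = dataset.train_test_split(test_size=0.1)  # step 3: 90-10 split
train_ds = splits["train"].map(tokenize, batched=True)
val_ds = splits["test"].map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="beto-aer",
    learning_rate=2e-5,              # step 5: hyperparameters
    per_device_train_batch_size=16,
    eval_strategy="epoch",           # transformers >= 4.41; older versions
                                     # use evaluation_strategy instead
    num_train_epochs=3,              # assumption: not reported in the paper
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```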
The BETO model achieves an overall accuracy of 73.7%, indicating a good level of
performance for the AER task from text. Table 1 shows the classification report obtained by
this model. As can be seen, BETO performs reasonably well across emotion classes, with
particularly strong performance for the “neutral” emotion. However, performance varies
across different emotions, with lower precision and recall for classes such as “fear” and
“sadness”.
Table 1
Classification report of the BETO model for AER task from text.
Emotion        Precision   Recall     F1-score
anger          0.512195    0.525000   0.518519
disgust        0.656716    0.619718   0.637681
fear           0.666667    1.000000   0.800000
joy            0.789474    0.833333   0.810811
neutral        0.832000    0.888889   0.859504
sadness        0.769231    0.588235   0.666667
accuracy                              0.736667
macro avg      0.704380    0.742529   0.715530
weighted avg   0.734556    0.736667   0.733446
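The layout of Tables 1-3 (including the accuracy, macro avg, and weighted avg rows) corresponds to scikit-learn's classification_report; a minimal sketch of producing such a report follows, where the labels are placeholders rather than the actual predictions on the 750 test comments.

```python
# Minimal sketch: generating a per-class report in the style of Tables 1-3.
from sklearn.metrics import classification_report

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]
y_true = ["anger", "joy", "neutral", "neutral"]   # placeholder gold labels
y_pred = ["anger", "neutral", "neutral", "fear"]  # placeholder predictions

# Prints precision, recall, F1-score, and support per class, plus the
# accuracy, macro avg, and weighted avg summary rows.
print(classification_report(y_true, y_pred, labels=EMOTIONS, zero_division=0))
```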
The MarIA model demonstrates solid performance on the proposed challenge, achieving
an overall accuracy of 74.7%. Table 2 shows the results obtained by this model, where it can
be seen that the model performs particularly well for the "neutral" emotion class, with
high precision, recall, and F1-score. The "joy" and "disgust" classes also show strong
performance. However, MarIA's performance on the "anger" and "sadness" classes is
somewhat lower. It should be noted that the "fear" class shows perfect precision but lower
recall due to its small support. Overall, the model shows good generalizability, with
reasonably balanced precision and recall for most emotion classes.
Table 2
Classification report of the MarIA model for AER task from text.
Emotion        Precision   Recall     F1-score
anger          0.460000    0.575000   0.511111
disgust        0.656716    0.619718   0.637681
fear           1.000000    0.500000   0.666667
joy            0.870968    0.750000   0.805970
neutral        0.877049    0.914530   0.895397
sadness        0.758621    0.647059   0.698413
accuracy                              0.746667
macro avg      0.770559    0.667718   0.702540
weighted avg   0.755965    0.746667   0.748585
After fine-tuning for AER from text, the BERTIN model achieves an overall accuracy of
71.3%. Table 3 shows the classification report obtained by the BERTIN model. Notably, this
model performs well in recognizing "fear" instances. However, there is room for
improvement in recognizing "anger" and "sadness" instances, as indicated by lower
precision and recall values.
The BETO, MarIA, and BERTIN models achieved varying levels of performance in AER
from text. Overall, all models have strengths and areas for improvement, suggesting the
need for ongoing refinement to improve their performance, specifically for less common
emotion classes.
Table 3
Classification report of the BERTIN model for AER task from text.
Emotion        Precision   Recall     F1-score
anger          0.400000    0.400000   0.400000
disgust        0.626667    0.661972   0.643836
fear           1.000000    0.500000   0.666667
joy            0.800000    0.777778   0.788732
neutral        0.842975    0.871795   0.857143
sadness        0.714286    0.588235   0.645161
accuracy                              0.713333
macro avg      0.730655    0.633297   0.666923
weighted avg   0.714024    0.713333   0.712204
2.2. Task 2: Multimodal Automatic Emotion Recognition
The second challenge of the EmoSPeech 2024 Task was multimodal AER, which aims to
analyze the performance of language models in solving this classification problem. The
dataset used for this task was the Spanish MEACorpus 2023, which consists of more than
13.16 hours of audio segments annotated with five of Ekman's six emotions. The dataset
comprises about 3500-4000 audio segments, divided into training and test sets in an
80%-20% split. The ensemble learning process followed for automatic emotion recognition from
text and speech is shown in Figure 2. This process is described next.
Figure 2: Flow diagram of the ensemble learning process for automatic emotion
recognition from text and speech.
1. Select a Wav2Vec pre-trained model for Spanish. A Wav2Vec model is trained on large
amounts of unlabeled audio data with the aim of improving acoustic model training
(Schneider et al., 2019). In this work, the wav2vec2-large-xlsr-53-spanish model
(Grosman, 2021) was selected.
2. Apply ensemble learning, a technique that combines multiple individual models to
improve predictive performance compared to the performance of each model
separately. In this work, an ensemble that combines the best fine-tuned text model
(BETO) and a fine-tuned Wav2Vec 2.0 model was used for emotion classification.
a. Combine the probabilities obtained for each emotion from the two models
using the mean, i.e., the average of the classification probabilities of both
sources (text and audio) for each emotion class.
b. Combine the probabilities obtained for each emotion from the two models
using the max, i.e., the maximum probability of each emotion class across
the two models.
3. Prediction. For both previous cases, the class with the maximum combined
probability is considered the final prediction (see the sketch below).
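A minimal sketch of this combination step is shown below, assuming each fine-tuned model (BETO for text, wav2vec2-large-xlsr-53-spanish for audio) already outputs a probability distribution over the six emotion classes for a given segment; the probability values in the usage example are made up for illustration.

```python
# Minimal sketch of the ensemble step (2a, 2b, and 3 above), assuming the two
# fine-tuned models already yield per-class probability distributions.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]

def ensemble_predict(p_text: np.ndarray, p_audio: np.ndarray,
                     strategy: str = "mean") -> str:
    """Combine text and audio probabilities and return the emotion with the
    maximum combined probability (step 3)."""
    if strategy == "mean":       # step 2a: mean of the two distributions
        combined = (p_text + p_audio) / 2.0
    elif strategy == "max":      # step 2b: element-wise maximum
        combined = np.maximum(p_text, p_audio)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return EMOTIONS[int(np.argmax(combined))]

# Illustrative (made-up) probabilities showing that the two strategies can
# disagree: "max" rewards a single confident model, "mean" rewards agreement.
p_text = np.array([0.02, 0.48, 0.02, 0.02, 0.44, 0.02])
p_audio = np.array([0.55, 0.40, 0.01, 0.01, 0.02, 0.01])
print(ensemble_predict(p_text, p_audio, "mean"))  # -> "disgust" (mean 0.44)
print(ensemble_predict(p_text, p_audio, "max"))   # -> "anger" (max 0.55)
```

Note that under the max strategy a class is predicted only when one of the two models assigns it the single highest probability, which may help explain its failure on the "fear" class reported below.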
The ensemble learning process that combines the Wav2Vec pre-trained model with the
BETO model achieves an overall accuracy of 75.0% for this challenge using the mean of the
classification probabilities of both sources. As can be seen in Table 4, this approach
performs well across most emotion classes, particularly for "neutral," "joy," and "disgust."
It demonstrates the ability to recognize instances of "fear" with high precision and recall.
However, the recall is relatively lower in recognizing "sadness" instances. Overall, the
ensemble model shows promise in accurately identifying emotions from both text and
speech inputs.
Table 4
Classification report of the ensemble learning process based on a Wav2Vec and BETO using
mean of the classification probabilities of both sources.
Emotion        Precision   Recall     F1-score
anger          0.50000     0.50000    0.50000
disgust        0.64789     0.64789    0.64789
fear           0.66667     1.00000    0.80000
joy            0.80556     0.80556    0.80556
neutral        0.85600     0.91453    0.88430
sadness        0.84000     0.61765    0.71186
accuracy                              0.75000
macro avg      0.71935     0.74760    0.72493
weighted avg   0.75015     0.75000    0.74755
As can be seen in Table 5, while the ensemble learning process combining the Wav2Vec
pre-trained model with the BETO model using the maximum probability of each emotion
class of the two models demonstrates decent precision and recall for some emotion classes,
such as "disgust" and "neutral," it struggles with others, such as "anger," "joy," and
"sadness," where precision, recall, or both are relatively lower. Notably, it fails to predict
any instances of "fear," indicating significant room for improvement. Furthermore, it
should be mentioned that this approach achieves an overall accuracy of 70.3%.
Table 5
Classification report of the ensemble learning process based on a Wav2Vec and BETO using
the maximum probability of each emotion class of the two models.
Emotion        Precision   Recall     F1-score
anger          0.64286     0.22500    0.33333
disgust        0.57426     0.81690    0.67442
fear           0.00000     0.00000    0.00000
joy            0.64286     0.50000    0.56250
neutral        0.81343     0.93162    0.86853
sadness        0.73913     0.50000    0.59649
accuracy                              0.70333
macro avg      0.56876     0.49559    0.50588
weighted avg   0.69977     0.70333    0.67788
The ensemble learning process combining the Wav2Vec pre-trained model with the
BETO model exhibits promising performance in automatic emotion recognition from both
text and speech. Despite showing promise in certain areas, such as recognizing specific
emotions accurately, the ensemble model has clear scope for refinement to enhance its
performance across a broader range of emotions. There is a need for further development
and optimization to maximize the ensemble's effectiveness in accurately capturing and
understanding diverse emotional expressions from both text and speech inputs.
3. Final remarks
This paper presents a fine-tuning approach for the EmoSPeech 2024 Task - Multimodal
Speech-text Emotion Recognition in Spanish. For AER from text, the BETO, MarIA, and
BERTIN pre-trained models were used, achieving varying levels of performance. Some of
the models used in this work obtained lower precision and recall values for less common
emotion classes such as "anger" and "sadness", which suggests the need for ongoing
refinement to improve their performance. Regarding the ensemble learning approach
described in this work, the ensemble based on the mean of the probabilities achieves an
overall accuracy of 75.0%, demonstrating strong performance across multiple emotion
classes, whereas the ensemble based on the maximum probability achieves a lower
accuracy of 70.3%, indicating areas for improvement, particularly in recognizing certain
emotions such as "anger," "joy," and "sadness." The obtained results emphasize the
potential of ensemble learning approaches in advancing automatic emotion recognition
technology, while also highlighting the ongoing need for optimization and development to
achieve more robust and comprehensive emotional understanding from diverse textual
and audio inputs.
Acknowledgements
We are grateful to the Tecnológico Nacional de México (TecNM, by its Spanish acronym) for
supporting this work. This research was also sponsored by Mexico’s National Council of
Humanities, Sciences and Technologies (CONAHCYT).
References
Cañete, J., Chaperon, G., Fuentes, R., Ho, J.-H., Kang, H., & Pérez, J. (2023). Spanish Pre-trained
BERT Model and Evaluation Data. https://arxiv.org/abs/2308.02976v1
Chiruzzo, L., Jiménez-Zafra, S. M., & Rangel, F. (2024). Overview of IberLEF 2024: Natural
Language Processing Challenges for Spanish and other Iberian Languages. Proceedings
of the Iberian Languages Evaluation Forum (IberLEF 2024), Co-Located with the 40th
Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-
WS.Org.
De La Rosa, J., Ponferrada, E. G., Villegas, P., González De Prado Salas, P., Romero, M., &
Grandury, M. (2022). BERTIN: Efficient Pre-Training of a Spanish Language Model
using Perplexity Sampling. Procesamiento Del Lenguaje Natural, 68, 13–23.
https://doi.org/10.26342/2022-68-1
Ekman, P. (1992). Facial Expressions of Emotion: New Findings, New Questions.
Psychological Science, 3(1), 34–38. https://doi.org/10.1111/j.1467-9280.1992.tb00253.x
Grosman, J. (2021). Fine-tuned XLSR-53 large model for speech recognition in Spanish.
Gutiérrez-Fandiño, A., Armengol-Estapé, J., Pàmies, M., Llop-Palao, J., Silveira-Ocampo, J.,
Carrino, C. P., Gonzalez-Agirre, A., Armentano-Oller, C., Rodriguez-Penagos, C., &
Villegas, M. (2021). MarIA: Spanish Language Models. Procesamiento Del Lenguaje
Natural, 68, 39–60. https://doi.org/10.26342/2022-68-3
Pan, R., García-Díaz, J. A., Rodríguez-García, M. Á., García-Sánchez, F., & Valencia-García, R.
(2024). Overview of EmoSPeech at IberLEF 2024: Multimodal Speech-text Emotion
Recognition in Spanish. Procesamiento Del Lenguaje Natural, 73.
Pan, R., García-Díaz, J. A., Rodríguez-García, M. Á., & Valencia-García, R. (2024). Spanish
MEACorpus 2023: A multimodal speech-text corpus for emotion analysis in Spanish
from natural environments. Computer Standards & Interfaces, 103856.
Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). wav2vec: Unsupervised Pre-
training for Speech Recognition. Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH, 3465–3469.
https://doi.org/10.21437/Interspeech.2019-1873
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., &
Polosukhin, I. (2017). Attention is All you Need. Advances in Neural Information
Processing Systems, 30.