=Paper=
{{Paper
|id=Vol-3756/EmoSPeech2024_paper8
|storemode=property
|title=UAH-UVA in EmoSpeech-IberLEF2024: A Transfer Learning Approach for Emotion Recognition in Spanish Texts based on a Pre-trained DistilBERT Model
|pdfUrl=https://ceur-ws.org/Vol-3756/EmoSPeech2024_paper8.pdf
|volume=Vol-3756
|authors=Andrea Chaves-Villota,Ana Jimenez,Alfonso Bahillo
|dblpUrl=https://dblp.org/rec/conf/sepln/Chaves-VillotaJ24
}}
==UAH-UVA in EmoSpeech-IberLEF2024: A Transfer Learning Approach for Emotion Recognition in Spanish Texts based on a Pre-trained DistilBERT Model==
Andrea Chaves-Villota¹*, Ana Jimenez¹ and Alfonso Bahillo²
¹ Electronics Department, University of Alcala, E.P.S. Campus universitario s/n, E-28805, Alcalá de Henares (Madrid), Spain.
² University of Valladolid, Escuela Técnica Superior de Ingenieros de Telecomunicación, Campus Miguel Delibes, 47011, Valladolid, Spain.
Abstract
Emotion recognition is a key component in numerous domains, given its role in understanding
human behavior, enhancing communication technologies, and facilitating personalized user experiences. In this
study, we present the methodology used in the EmoSPeech 2024 Task to train two classification models capable of
identifying five of Ekman’s six basic emotions, plus a neutral class, from Spanish text transcripts extracted from
the Spanish MEA Corpus 2023 database (Task 1: Text Automatic Emotion Recognition). The methodology follows a
transfer learning approach using a pre-trained model based on the DistilBERT architecture. To handle the class
imbalance of the dataset and to avoid biasing the model towards the majority classes, we use a
class weighting technique in which a higher weight is given to the minority class (fear) and a lower weight
to the majority class (neutral). We then compare the performance of the two models in classifying emotions:
the weighted model outperforms the unweighted one, with a macro F1-score of 0.63 as opposed to 0.61. Furthermore, we
discuss the strengths and weaknesses of our approach and the variables that influence
its effectiveness. Our findings highlight the feasibility of developing emotion identification systems from voice
transcriptions by using pre-trained models.
Keywords
Emotion Recognition, Spanish Text Classification, DistilBERT, Transfer Learning
1. Introduction
Emotion recognition plays a fundamental role in the understanding of cognitive processes, human
behaviors, and social dynamics faced by human beings in different facets of their lives. Nowadays,
with the rise of artificial intelligence, the study of emotion recognition systems has received great
attention from the scientific community [1], since they allow a better understanding of how emotions
are expressed, experienced, and regulated across different cultures and contexts. This is also due
to their wide range of practical applications which include developing therapies for mental health
disorders, improving user experiences in areas like virtual assistants and educational tools, enhancing
human-computer interaction through more intuitive and sympathetic technologies, and improving
marketing strategies by understanding consumer emotions [2]. In addition, emotion recognition
systems can reveal patterns of collective behavior, which benefits groups of individuals and
communities. For example, in education, these systems can
enhance learning experiences by adapting the content and providing support for the learning process
based on students’ emotional states [3, 4]. Therefore, research in emotion recognition systems allows
a better understanding of human nature and improves the design of technologies and interventions
aimed at promoting mental health and social well-being [5].
To develop more robust models for emotion recognition, the latest research focuses on training
multimodal models that take into account different aspects of human behavior, including facial expressions,
body language, speech patterns, and physiological responses [6, 7, 8, 9]. In particular, voice patterns
provide valuable information for emotion recognition, such as text and its extracted features like words
or phrases, as well as certain linguistic features such as adjectives, adverbs, and intensifiers that can be
associated with a particular emotion [10]. Hence, it is important to use different data sources to support
a model’s robust classification performance.
Figure 1: Distribution of Spanish MEA Corpus 2023
IberLEF 2024, September 2024, Valladolid, Spain
* Corresponding author.
Email: andrea.chaves@uah.es (A. Chaves-Villota); ana.jimenez@uah.es (A. Jimenez); alfonso.bahillo@uva.es (A. Bahillo)
ORCID: 0000-0002-9272-9426 (A. Chaves-Villota); 0000-0003-0713-0054 (A. Jimenez); 0000-0003-3370-33388 (A. Bahillo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
In this study, we use a machine learning technique known as transfer learning, in which a
pre-trained model is adapted to perform a related text classification task. The main contribution
of the paper is the methodology for training an emotion classification model on texts
from the Spanish MEA Corpus 2023 (Task 1: Text AER). This methodology is based on a transfer learning
approach and compares two models: a weighted model, which weights classes in the training stage to deal with the
problem of unbalanced data, and an unweighted one, which gives equal weight to all classes. The
paper is organized as follows: Section 2 gives a brief description of the corpus, models, and techniques
used, Section 3 describes the main results, and Section 4 discusses the conclusions.
2. Materials and Methods
2.1. Dataset
To fine-tune the pre-trained model, we use the transcriptions provided by the multimodal
Spanish MEA Corpus 2023 dataset [11], made available by the EmoSPeech 2024 Task [12, 13]. The corpus
consists of approximately 13 hours of audio segments extracted from different Spanish YouTube channels,
covering political, sports, and entertainment topics. The audio
segments are labeled with five of Ekman’s basic emotions (disgust, anger, joy, sadness, and fear) or with a
neutral state [14]. Table 1 and Figure 1 show the distribution of the training data by emotion. It can
be seen that the target classes are not balanced. In particular, fear is represented by only 0.76% of
the samples, a far smaller proportion than any other class. The neutral state is the most frequent,
with 38.86% of the total dataset, followed by disgust with 23.50%. The numbers of samples for anger,
joy, and sadness are approximately balanced.
Unbalanced datasets are a common problem faced by emotion recognition systems [15]. To address
this challenge, this paper proposes a model that weights the classes during the training stage. This
approach aims to prevent the model from being biased towards the majority class (neutral) while
ensuring adequate performance in classifying the minority class (fear) as well.
Table 1
Dataset distribution
Emotion    No.     %
neutral    1166    38.86
disgust     705    23.50
anger       399    13.30
joy         362    12.06
sadness     345    11.50
fear         23     0.76
Total      3000    100
2.2. Training based on Pre-trained DistilBERT Architecture and Class Weighting Technique
In this paper, we propose a transfer learning approach for emotion
classification from Spanish texts. As shown in Fig. 2, the first step is to select a suitable pre-trained
model that already solves a classification problem (Task 1: tweet classification). This pre-trained model
serves as the starting point for the new classification task (Task 2: emotion classification),
so the task it was trained on must be related to the new problem. In
addition, a class weighting technique is proposed to deal with the unbalanced dataset problem. These
phases are explained in the following subsections.
2.2.1. Pre-trained DistilBERT Architecture
We selected a pre-trained DistilBERT model (Task 1 in Fig. 2) [16] to obtain better results in the emotion
classification task. Its architecture is based on DistilBERT, the compressed version of BERT (Bidirectional Encoder
Representations from Transformers), which achieves performance similar to BERT in
NLP tasks such as text classification, question answering, and named entity recognition.
Its main advantages are that it can be used with reduced computational resources, has a faster
inference time, and uses fewer parameters (up to 40% fewer), making it a more practical model for developing
real-world applications [17]. This pre-trained model was selected because it solves a classification
problem with patterns and features similar to those of the emotion classification task of
the new target model (Task 2 in Fig. 2), so that knowledge can be transferred appropriately between the two
models. Specifically, we selected the DistilBERT base model fine-tuned with Spanish
tweets, whose objective is to classify tweets in Spanish into three categories: positive, negative, and
neutral. We thus take advantage of a source task that, like the target task, assigns Spanish text
to discrete classes.
The main difference between the model architectures is centered on the head, which for the source
model corresponds to an output layer capable of categorizing Spanish tweets into three different labels
(Positive, Negative, Neutral), while the new head of the target model is randomly initialized and trained
to categorize into the 6 new classes (Disgust, Anger, Joy, Sadness, Fear, Neutral). The remaining
hyperparameters are kept constant to preserve the learned features and prevent them from being
updated during training. This approach maximizes the use of the model’s existing knowledge. These
hyperparameters are detailed in Table 2.
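The setup described above can be sketched with Simple Transformers, the package used for training in this work. This is a minimal sketch rather than the authors' script: the checkpoint name is taken from the URL in reference [16], the three argument values come from Table 2, and the names `TABLE2_ARGS`, `LABELS`, `SOURCE_CHECKPOINT`, and `build_model` are hypothetical. Depending on the installed library versions, swapping the three-class head for a six-class one may need extra handling of the mismatched output layer, which is not shown here.

```python
# Training values from Table 2; the key names follow the argument names
# used by Simple Transformers for classification models.
TABLE2_ARGS = {
    "num_train_epochs": 12,
    "train_batch_size": 8,
    "learning_rate": 4e-5,
}

# Target classes of the new head (Task 2 in Fig. 2).
LABELS = ["neutral", "disgust", "anger", "joy", "sadness", "fear"]

# Source model fine-tuned on Spanish tweets (name taken from reference [16]).
SOURCE_CHECKPOINT = (
    "francisco-perez-sorrosal/"
    "distilbert-base-uncased-finetuned-with-spanish-tweets-clf-cleaned-ds"
)


def build_model(class_weights=None, use_cuda=True):
    """Create a DistilBERT classifier with a six-class head.

    class_weights: optional list of six floats ordered as in LABELS;
    None gives the unweighted model.
    """
    # Imported lazily so the configuration above can be inspected
    # without the package being installed.
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel(
        "distilbert",
        SOURCE_CHECKPOINT,
        num_labels=len(LABELS),
        weight=class_weights,
        args=dict(TABLE2_ARGS),
        use_cuda=use_cuda,
    )
```

Calling `build_model()` would give the unweighted configuration, while passing a list of six class weights, ordered as in `LABELS`, would give a weighted one.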
2.2.2. Class Weighting Technique
Furthermore, to deal with the class imbalance of the dataset, avoid
biasing the model towards the majority class, and thus obtain a potentially more robust classifier, we propose
evaluating a simple and common technique that weights the classes during training:
errors on the minority class are penalized more heavily by assigning it a higher weight, while the weights of
the majority classes are reduced. The components of the class weighting vector 𝑤 were set in the range 𝑤𝑖 ∈ [0.5, 2] according
to the percentage of data belonging to each class 𝑖, under the convention given by (1). The resulting
weights are listed in Table 3. It is important to
highlight that the selected weights 𝑤 were approximated from the amount of data per label; these values
could be adjusted using optimization techniques that might
improve the model performance.
$$
w_i = \begin{cases}
2, & \% < 5 \\
1.5, & 5 < \% < 20 \\
1, & 20 < \% < 35 \\
0.5, & \% > 35
\end{cases} \qquad (1)
$$
Table 2
Hyperparameters of model
Hyperparameter             Value
No. Transformers Layers    6
Attention heads            12
Transformer Activation     Gaussian Error Linear Unit (GELU)
Dropout                    0.1
Optimizer                  AdamW
Loss                       Cross Entropy
Epochs                     12
Batch size                 8
Learning rate              4e-5
Table 3
Class weighting
Emotion    %        𝑤𝑖
neutral    38.86    0.5
disgust    23.50    1
anger      13.30    1.5
joy        12.06    1.5
sadness    11.50    1.5
fear        0.76    2
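The banding rule in (1) can be sketched as a small function and checked against the weights in Table 3. The function and variable names are hypothetical. Because (1) uses strict inequalities, it does not say what happens at exactly 5%, 20%, or 35%; this sketch assumes those boundary values fall into the next, lower-weight band.

```python
def rule_weight(percent):
    """Return the class weight for a class holding `percent` % of the samples.

    Follows the bands of Eq. (1). The boundary values 5, 20 and 35 are not
    covered by the equation; assigning them to the next (lower-weight) band
    is an assumption of this sketch.
    """
    if percent < 5:
        return 2.0
    if percent < 20:
        return 1.5
    if percent < 35:
        return 1.0
    return 0.5


# Training-set shares (%) as listed in Table 3.
TRAIN_SHARE = {
    "neutral": 38.86,
    "disgust": 23.50,
    "anger": 13.30,
    "joy": 12.06,
    "sadness": 11.50,
    "fear": 0.76,
}

weights = {emotion: rule_weight(share) for emotion, share in TRAIN_SHARE.items()}
for emotion, weight in weights.items():
    print(f"{emotion:8s} {weight}")
```

With the shares above, the function returns the same six weights as Table 3.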
Both the weighted and unweighted models were trained on a Tesla T4 GPU with 15 GB of RAM
provided by the Google Colaboratory cloud service, using the Simple Transformers package, which is based on
the Transformers library by HuggingFace.
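To make the effect of the class weights concrete, the sketch below computes a class-weighted cross-entropy in plain Python. It assumes the convention of PyTorch's `CrossEntropyLoss` with a `weight` vector and mean reduction, where the weighted sum is divided by the sum of the weights of the target classes. The paper does not state which normalisation was applied, and the function name and the toy probabilities are hypothetical.

```python
import math

# Class order and weights as listed in Table 3.
LABELS = ["neutral", "disgust", "anger", "joy", "sadness", "fear"]
TABLE3_WEIGHTS = [0.5, 1.0, 1.5, 1.5, 1.5, 2.0]


def weighted_cross_entropy(probs, targets, weights):
    """Class-weighted cross-entropy over a batch.

    probs   : list of per-sample probability lists (one value per class)
    targets : list of true class indices
    weights : list of per-class weights
    The weighted sum is normalised by the summed weights of the targets.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sample_probs, target in zip(probs, targets):
        w = weights[target]
        weighted_sum += -w * math.log(sample_probs[target])
        weight_total += w
    return weighted_sum / weight_total


# Toy batch (hypothetical): a confident, correct neutral sample and a fear
# sample to which the model assigns only 10% probability.
neutral_idx, fear_idx = LABELS.index("neutral"), LABELS.index("fear")
probs = [
    [0.90, 0.02, 0.02, 0.02, 0.02, 0.02],
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
]
targets = [neutral_idx, fear_idx]

plain = weighted_cross_entropy(probs, targets, [1.0] * len(LABELS))
weighted = weighted_cross_entropy(probs, targets, TABLE3_WEIGHTS)
print(f"unweighted: {plain:.3f}, weighted: {weighted:.3f}")
```

In this toy batch the weighted loss is larger than the unweighted one, because the poorly predicted fear sample counts four times as much as the well-predicted neutral sample.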
3. Results and discussion
3.1. Convergence analysis
Figure 3 shows the training loss of the two models at each step, i.e., after each
training batch. Both models converge
in approximately 3000 steps to a loss close to 0. Both show high variability
during training, which is more evident for the weighted model than for
the unweighted one. These fluctuations decrease at approximately 5000 steps,
leaving loss values close to zero.
3.2. Model performances
Figure 4a and Figure 4b show the confusion matrices obtained in the test stage for the unweighted
and weighted models, respectively. In both cases, the neutral class is classified best
and the fear label worst. This
performance is expected due to the unbalanced nature of the data. The weighted model achieves better
classification performance for the neutral, sadness, and disgust classes compared to the unweighted
one. However, the unweighted model obtains better metrics for the anger and joy classes. Both models
also misclassify a high percentage of anger samples as disgust: 39% and 35%
for the weighted and unweighted models, respectively. Additionally, they behave in the same way when
classifying fear, confusing it with joy, neutral, and sadness.
Figure 2: Methodology based on transfer learning for emotion classification.
Figure 3: Loss convergence in model training
Overall, both models score high on the main diagonal, which represents the instances where the
predicted emotions match the true ones, as is desirable in a classification task; the exception is
the fear label. For fear in particular, the
weighted model does not differ sufficiently from the unweighted model.
We therefore propose studying an optimization of the weight vector 𝑤, which could include a higher
weight for this class, to verify a possible improvement in the performance of the model.
Table 4 reports the metrics obtained on the test data by the two models. In general, the weighted
one achieves higher macro F1-score and recall, with 0.63 and 0.61, respectively, in contrast to
the unweighted model, which achieved an F1-score of 0.61 and a recall of 0.60. Furthermore, considering
that the dataset is unbalanced, the weighted model is considered to have a more robust classification
performance. This is notable given that both models used the same computational
resources.
Figure 4: Confusion matrices. (a) Unweighted model. (b) Weighted model.
Figure 5: Comparison of F1-score between models for each emotion
To gain insight into emotion-specific performance on the test data, Figure 5 shows the F1-score achieved
for each class by the two models. The weighted model achieves better results in most
cases; the exception is joy. In general, the differences between the F1-scores obtained
by the two models for each emotion are small. Nevertheless, the models perform better in
classifying joy, neutral, sadness, and disgust, with F1-scores above 0.6, than in
identifying anger and fear. This is not necessarily related to the amount
of data per class: joy and sadness made up smaller shares of the training data
(12.06% and 11.50%) than disgust (23.50%), yet their F1-scores are
higher than 0.7 for both models. The two models clearly struggle to classify
fear, for which the F1-score is below 0.3 in both cases and even lower for the unweighted
model. Hence, we propose using class balancing techniques such as oversampling of the
minority class or undersampling of the majority ones, as well as evaluating synthetic data generation
techniques that could help to improve the robustness of the model.
Table 4
F1-score and recall with test data
Emotion       Unweighted Model    Weighted Model
              recall    F1        recall    F1
neutral       0.86      0.84      0.87      0.84
disgust       0.63      0.63      0.71      0.67
anger         0.49      0.47      0.44      0.48
joy           0.79      0.80      0.76      0.78
sadness       0.67      0.73      0.71      0.74
fear          0.17      0.20      0.17      0.29
Macro avg.    0.60      0.61      0.61      0.63
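The macro averages in the last row of Table 4 are the unweighted means of the per-class scores, so every emotion counts equally regardless of how many test samples it has. A minimal sketch that recomputes them from the table follows; the names `TABLE4`, `COLUMNS`, and `macro_average` are hypothetical.

```python
from statistics import mean

# Per-class scores from Table 4:
# (unweighted recall, unweighted F1, weighted recall, weighted F1)
TABLE4 = {
    "neutral": (0.86, 0.84, 0.87, 0.84),
    "disgust": (0.63, 0.63, 0.71, 0.67),
    "anger":   (0.49, 0.47, 0.44, 0.48),
    "joy":     (0.79, 0.80, 0.76, 0.78),
    "sadness": (0.67, 0.73, 0.71, 0.74),
    "fear":    (0.17, 0.20, 0.17, 0.29),
}
COLUMNS = ["unweighted recall", "unweighted F1", "weighted recall", "weighted F1"]


def macro_average(column_index):
    """Unweighted mean of one score column over all classes."""
    return mean(row[column_index] for row in TABLE4.values())


macro = {name: round(macro_average(i), 2) for i, name in enumerate(COLUMNS)}
print(macro)
```

Rounded to two decimals, the recomputed values agree with the Macro avg. row of the table.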
Table 5
Class weighting
Emotion    %        𝑤𝑖
neutral    38.86    0.4
disgust    23.50    0.7
anger      13.30    1.2
joy        12.06    1.3
sadness    11.50    1.4
fear        0.76    21.7
Table 6
F1-score and recall achieved by New Weighted Model
Emotion       New Weighted Model
              recall    F1
neutral       0.82      0.85
disgust       0.63      0.63
anger         0.45      0.45
joy           0.83      0.79
sadness       0.78      0.77
fear          1.00      0.29
Macro avg.    0.75      0.63
3.2.1. Out-of-competition results
In addition, outside the competition we evaluated another common class weighting scheme, in which the vector 𝑤 is set
inversely proportional to the frequency of the classes in the data, according to (2), where 𝑛 is the
total number of samples, 𝑛𝑐 is the total number of classes (emotions), and 𝑛𝑖 is the
number of samples belonging to class 𝑖. The resulting weights for each class are shown in Table 5.
$$
w_i = \frac{n}{n_c \, n_i} \qquad (2)
$$
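As a worked check of (2), the sketch below recomputes the weights from the training counts in Table 1. The function names are hypothetical. The values in Table 5 appear to be truncated, not rounded, to one decimal (for example, 3000 / (6 × 399) ≈ 1.253 is listed as 1.2), so the sketch truncates as well; that reading of the table is an assumption. The same formula is what scikit-learn applies for its `balanced` class-weight option.

```python
import math

# Training-set counts from Table 1.
TRAIN_COUNTS = {
    "neutral": 1166,
    "disgust": 705,
    "anger": 399,
    "joy": 362,
    "sadness": 345,
    "fear": 23,
}


def inverse_frequency_weights(counts):
    """w_i = n / (n_c * n_i), as in Eq. (2)."""
    n = sum(counts.values())   # total number of samples
    n_c = len(counts)          # number of classes
    return {label: n / (n_c * n_i) for label, n_i in counts.items()}


def truncate(value, decimals=1):
    """Drop digits beyond `decimals` places without rounding."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor


weights = inverse_frequency_weights(TRAIN_COUNTS)
table5 = {label: truncate(w) for label, w in weights.items()}
print(table5)
```

The fear weight comes out more than fifty times larger than the neutral weight, which mirrors the ratio between the two class counts.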
Fig. 6 shows the resulting confusion matrix on the test data (not seen in the training phase).
According to recall (main diagonal), a higher weight for the fear class
improves the model’s ability to recognize it. However, this improvement comes at a cost:
the prediction of other emotions such
as anger and disgust is degraded relative to weighted model I (see Fig. 7). The recall and F1-score
achieved by this model for each emotion, and their respective averages, are shown in Table 6.
Compared with weighted model I, this new class weighting clearly improves
the rate of true positives, especially for the classes anger, fear, joy, and sadness.
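One way to read the fear row of Table 6 is to recover the precision that its recall and F1-score jointly imply. Since F1 = 2PR / (P + R), it follows that P = F1 · R / (2R − F1). The sketch below applies this identity; the function name is hypothetical, and because the table values are rounded to two decimals, the result is only approximate.

```python
def implied_precision(recall, f1):
    """Precision implied by recall and F1, from F1 = 2PR / (P + R)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Fear row of Table 6 (new weighted model).
fear_precision = implied_precision(recall=1.00, f1=0.29)
print(f"implied precision for fear: {fear_precision:.2f}")
```

Under these rounded inputs, the implied precision for fear is roughly 0.17. This would mean that the new weighting recovers every fear sample in the test set at the cost of many false fear predictions, which is consistent with the unchanged F1-score of 0.29.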
Figure 6: Confusion Matrix of the New Weighted Model
Figure 7: Comparison of F1-score achieved by models for each emotion
We can observe that according to the results achieved by the two models with class weighting, the
use of balancing techniques such as the one used in this study, allows us to achieve a better result in
the task of emotion classification, highlighting that the computational cost both in the training and test
phases used by the weighted models does not present a major difference with respect to the unweighted
one. We further emphasise the importance of using optimization techniques to find optimal weights
that could improve the model performance in this task.
4. Conclusion
In this work, we developed a methodology based on transfer learning to explore the advantage of
employing pre-trained models for emotion recognition from speech transcriptions. The Spanish MEA
Corpus 2023 was used as a benchmark dataset for the training and test phases. We evaluated
the performance of two models: a weighted model, which uses class weighting to address
the problem of emotion imbalance and avoid biasing the model towards the majority class (neutral),
and an unweighted model, which does not adjust the weights between classes. Through
experimentation, we observed favorable results, with macro F1-scores of 0.63 and 0.61 for the weighted and
unweighted models, respectively. Even though the two models exhibit comparable behavior, the
present research identified key factors in the use of the class weighting technique that could yield
improvements in handling emotion imbalance. These findings highlight the
potential of leveraging pre-trained models as a viable approach for emotion recognition from text.
Moving forward, efforts should focus on refining and optimizing the weights for the weighted model to
enhance emotion recognition prediction. Additionally, exploring multimodal methods that incorporate
other input sources such as speech patterns would offer alternative ways to improve model performance.
Acknowledgments
This work was supported by FrailAlert (SBPLY/21/180501/000216, co-financed by the Junta
de Comunidades de Castilla-La Mancha and the European Union through the European Regional
Development Fund), ActiTracker (TED2021-130867B-I00), and INDRI
(PID2021-122642OB-C41/AEI/10.13039/501100011033/FEDER, UE).
References
[1] E. Cambria, D. Das, S. Bandyopadhyay, A. Feraco, Affective computing and sentiment analysis, A
practical guide to sentiment analysis (2017) 1–10.
[2] A. Kołakowska, A. Landowska, M. Szwoch, W. Szwoch, M. R. Wróbel, Emotion Recognition
and Its Applications, Springer International Publishing, Cham, 2014, pp. 51–62. URL:
https://doi.org/10.1007/978-3-319-08491-6_5. doi:10.1007/978-3-319-08491-6_5.
[3] W. Wang, K. Xu, H. Niu, X. Miao, Emotion recognition of students based on facial expressions in
online education based on the perspective of computer simulation, Complexity 2020 (2020) 1–9.
[4] D. Yang, A. Alsadoon, P. C. Prasad, A. K. Singh, A. Elchouemi, An emotion recognition model
based on facial recognition in virtual learning environment, Procedia Computer Science 125 (2018)
2–10.
[5] S. G. Koolagudi, K. S. Rao, Emotion recognition from speech: a review, International journal of
speech technology 15 (2012) 99–117.
[6] M. Ragot, N. Martin, S. Em, N. Pallamin, J.-M. Diverrez, Emotion recognition using physiological
signals: laboratory vs. wearable sensors, in: Advances in Human Factors in Wearable Technologies
and Game Design: Proceedings of the AHFE 2017 International Conference on Advances in Human
Factors and Wearable Technologies, July 17-21, 2017, The Westin Bonaventure Hotel, Los Angeles,
California, USA 8, Springer, 2018, pp. 15–22.
[7] N. Sebe, I. Cohen, T. S. Huang, Multimodal emotion recognition, in: Handbook of pattern
recognition and computer vision, World Scientific, 2005, pp. 387–409.
[8] A. B. Ingale, D. Chaudhari, Speech emotion recognition, International Journal of Soft Computing
and Engineering (IJSCE) 2 (2012) 235–238.
[9] P. Tarnowski, M. Kołodziej, A. Majkowski, R. J. Rak, Emotion recognition using facial expressions,
Procedia Computer Science 108 (2017) 1175–1184.
[10] V. Ramanarayanan, A. C. Lammert, H. P. Rowe, T. F. Quatieri, J. R. Green, Speech as a biomarker:
Opportunities, interpretability, and challenges, Perspectives of the ASHA Special Interest Groups
7 (2022) 276–283.
[11] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Spanish meacorpus 2023:
A multimodal speech-text corpus for emotion analysis in spanish from natural environments,
Computer Standards & Interfaces (2024) 103856.
[12] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Overview of EmoSPeech
2024@IberLEF: Multimodal Speech-text Emotion Recognition in Spanish, Procesamiento del
Lenguaje Natural 73 (2024).
[13] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language
Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages
Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for
Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[14] P. Ekman, et al., Basic emotions, Handbook of cognition and emotion 98 (1999) 16.
[15] A. Batliner, B. Schuller, D. Seppi, S. Steidl, L. Devillers, L. Vidrascu, T. Vogt, V. Aharonson, N. Amir,
The automatic recognition of emotions in speech, Springer, 2011.
[16] F. Perez-Sorrosal, Distilbert base uncased fine-tuned with spanish tweets,
https://huggingface.co/francisco-perez-sorrosal/distilbert-base-uncased-finetuned-with-spanish-tweets-clf-cleaned-ds,
2023.
[17] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).