CogniCIC at EmoSPeech-IberLEF2024: Exploring Multimodal Emotion Recognition in Spanish: Deep Learning Approaches for Speech-Text Analysis

Miguel Soto1,†, Cesar Macias1,*,†, Marco Cardoso-Moreno1,†, Tania Alcántara1,†, Omar García1 and Hiram Calvo1
1 Instituto Politécnico Nacional, Center for Computing Research, Cognitive Sciences Laboratory, Mexico City, 07700, Mexico

Abstract
Human emotion recognition, which encompasses both verbal and non-verbal signals such as body language and facial expressions, remains a complex challenge for computing systems. This recognition can be derived from various sources, including audio, text, and physiological responses, making multimodal approaches particularly effective. The importance of this task has grown significantly due to its potential to improve human-computer interaction, providing better feedback and usability in applications such as social media, education, robotics, marketing, and entertainment. Despite this potential, emotional expression is a heterogeneous phenomenon influenced by factors such as age, gender, sociocultural origin, and mental health. Our study addresses these complexities and presents our findings from the recent EmoSPeech competition. Our system achieved an F1 score of 0.7256 and a precision of 0.7043, with a further validation score of 0.7013. In Task 1 (text-based emotion recognition), our CogniCIC team ranked second with an official F1 score of 0.657527, and in Task 2 (multimodal emotion recognition) it ranked third with an F1 score of 0.712259. These results underline the effectiveness of our approach in multimodal emotion recognition and its potential for practical applications.

Keywords
Emotion Recognition, LLM, Classification, Speech, Text

1. Introduction
Even though human beings can automatically recognize emotions in other humans by considering verbal and non-verbal expressions (body language and facial expressions), this is still a complicated task for computer systems [1]. Human emotion recognition can be analyzed from several verbal and non-verbal sources, such as audio, text [2], and physiological responses [3]; therefore, emotion recognition tasks are well suited to multimodal approaches [2, 4]. This task has received increasing attention in recent years [1], since human-computer interaction systems would benefit from the identification of emotions for better feedback [5, 6, 7, 8]. The range of applications is wide, spanning social media, education, robotics, marketing, and the entertainment industry in general [3, 9].
Nevertheless, it is important to understand that emotional expression is, in general, a heterogeneous phenomenon; variations may arise from factors such as age, gender, sociocultural background, and even mental health [10, 6]. There are several approaches to human emotion recognition, from physiological signals [11, 12] to text-based [13, 14] and speech-based approaches [15, 16]. Moreover, multimodal approaches have recently been proposed, for instance, speech-text based solutions [1, 17].

IberLEF 2024, September 2024, Valladolid, Spain
* Corresponding author.
† These authors contributed equally.
Email: msotoh2021@cic.ipn.mx (M. Soto); cmaciass2021@cic.ipn.mx (C. Macias); mcardosom2021@cic.ipn.mx (M. Cardoso-Moreno); talcantaram2020@cic.ipn.mx (T. Alcántara); ogarciav2024@cic.ipn.mx (O. García); hcalvo@cic.ipn.mx (H. Calvo)
Web: https://mashd3v.github.io/ (M. Soto); https://github.com/MACI-dev-96 (C. Macias); https://cardoso1994.github.io/ (M. Cardoso-Moreno); https://github.com/talcantaram (T. Alcántara);
https://www.linkedin.com/in/omar-garcia-vazquez-093128219/ (O. García); http://hiramcalvo.com/ (H. Calvo)
ORCID: 0009-0006-4619-9352 (M. Soto); 0009-0005-1708-5359 (C. Macias); 0009-0001-1072-2985 (M. Cardoso-Moreno); 0009-0001-4391-6225 (T. Alcántara); 0009-0001-4391-6225 (O. García); 0000-0003-2836-2102 (H. Calvo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The EmoSPeech 2024 task [4] is a prominent part of IberLEF 2024 [18], a forum that focuses on advancements in language processing and related fields. The remainder of this paper is structured as follows: Section 2 briefly reviews the literature related to this work. Section 3 provides an overview of the dataset employed. Section 4 explains the proposed preprocessing, the models, and the metrics used to evaluate the results. Section 5 presents the results and a discussion after a careful evaluation of the models' performance. Finally, Section 6 highlights the strengths of this work, concludes, and points out future directions for this research.

2. Literature Review
In this section, a brief literature review on emotion recognition is presented, covering text-only proposals, speech-only proposals, and speech-text multimodal approaches.

2.1. Text Based Emotion Recognition
The proposal by de Velasco and colleagues [1] involved the creation of a new Spanish dataset by selecting speech fragments from a Spanish TV show, which were also transcribed to text. Their solution relies on a paradigm known as VAD: Valence (related to polarity), Arousal (related to calmness or excitement), and Dominance (the degree of control over a situation). VAD encodes every emotion label as a new three-dimensional vector, where each element is real valued and corresponds to one of the VAD parameters. This approach allows the emotion recognition problem to be stated as a regression task instead of a classification one. Their solution for text-based emotion recognition used FastText embeddings of 300 dimensions; the proposed model was a Deep Neural Network (DNN), which yielded an MSE of 0.1196.
In [19], emotions were analyzed by gathering tweets, each of them carrying an emotional hashtag. In a first stage, the authors used a ConvNet for emotion classification. During the learning process, an embedding model was extracted, which was then used to classify emotions in ROC story texts, based on Plutchik's model of emotions. The accuracy of the model ranged from 28% to 73%, depending on the emotion.

2.2. Speech Based Emotion Recognition
In [1], the speech-based solution consisted of extracting several acoustic features from the audio signals, such as pitch, energy, spectral centroid, and spectral spread, among others. The implemented model was a Long Short-Term Memory (LSTM) cell followed by a Multi-Layer Perceptron (MLP) to solve the regression task under the VAD paradigm; their model achieved an MSE ranging from 0.14 to 0.16 for various subsets of acoustic features.
In [20], two models were proposed for speech emotion recognition, both combinations of ConvNets and an LSTM cell; the main difference between them is the dimensionality of the convolution layers, one-dimensional and two-dimensional, respectively.
This proposal obtained significant results on different benchmark datasets, including IEMOCAP, where it achieved an accuracy of 89.16% in the speaker-dependent configuration and 52.14% in the speaker-independent experiments.
Furthermore, in [21], Chatziagapi and colleagues proposed a Generative Adversarial Network (GAN) for synthetic data generation, particularly for balancing the minority classes. Once the datasets were balanced, a VGG19 model [22] was used to perform the classification task. The model achieved an average UAR of 53.6%.

2.3. Multimodal Emotion Recognition
In [23], a Multimodal Dual Recurrent Encoder (MDRE) was proposed. The model consisted of two separate encoders: one for text, a Text Recurrent Encoder (TRE), and one for audio, an Audio Recurrent Encoder (ARE). Both encoders consist of a Gated Recurrent Unit (GRU); for the ARE, Mel-frequency cepstral coefficients (MFCCs) are the input features, whereas for the TRE, text was tokenized using the Natural Language Toolkit (NLTK) [24] and passed through an embedding layer, yielding vectors of 300 dimensions. The dataset used was the IEMOCAP dataset [25]. The proposal achieved accuracy values from 68.8% to 71.8%, with the multimodal approach performing best.
Hazarika et al. [26] proposed a method to fuse text and audio features at early stages by means of self-attention mechanisms. Text feature extraction was performed with a Convolutional Neural Network (ConvNet) model, while audio feature extraction was performed with the openSMILE library. Performance was also tested on the IEMOCAP dataset, yielding values around 72% in metrics such as accuracy, F1-score, and UAR.
The work by Krishna and colleagues [27] used cross-modal attention mechanisms, so that audio features attend to text features and vice versa, in addition to ConvNets. Features were extracted with two different autoencoders: for audio, the autoencoder processed the raw signal to extract high-level features; for text, the encoder extracted high-level semantic features. The model achieved an unweighted accuracy of 72.82% on the IEMOCAP dataset.

3. Dataset
The dataset for this competition is the MEACorpus 2023 dataset [28], consisting of audio segments and transcripts from YouTube videos. After the videos were downloaded, the audio was extracted in segments; then, an annotation procedure was carried out taking into account the following set of five emotions: disgust, anger, sadness, joy, and fear, based on Ekman's [29] findings (surprise was not considered in the dataset due to the difficulty of extracting it from videos, and a neutral class was added). The EmoSPeech competition [4] consisted of two tasks, and we decided to participate in both. The aim of Task 1 was to perform text-based emotion recognition only, whereas Task 2 focused on multimodal (text and speech) emotion recognition.

4. Proposal
For each task, we have proposed different approaches, which are described in the following sections.

4.1. Task 1: Text AER
The dataset for Task 1 consisted of text transcripts of audio segments, written in Spanish. Our approach for this task was to use transformer-based large language models (LLMs). Since the texts in the dataset are in Spanish, we chose to fine-tune BETO [30], a BERT model trained on a very large Spanish corpus. We did not preprocess the data for this approach. The parameters used to fine-tune the BETO model were the following: 15 training epochs, saving the model with the best performance across all epochs to make the predictions; a split of the training dataset into 90% for training and 10% for validation to monitor the model's performance during training; a macro average for computing the metrics; and the AdamW optimizer, with its learning rate and epsilon set to 3 × 10⁻⁶ and 1 × 10⁻⁹, respectively. To make the experiments replicable, the random seed was set to 42.
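As a rough illustration of this setup (not our exact training script), the sketch below fine-tunes BETO for six-way sequence classification with the Hugging Face Transformers library. The checkpoint name, the Trainer-based loop, and the CSV column names are assumptions; the hyperparameters mirror the values reported above.

```python
# Hedged sketch of the Task 1 fine-tuning setup. The checkpoint name, the
# Trainer-based loop, and the "text"/"label" CSV columns are assumptions; the
# epochs, learning rate, epsilon, 90/10 split, macro averaging, and seed follow
# the values reported in the paper.
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "dccuchile/bert-base-spanish-wwm-uncased"  # a commonly used BETO checkpoint
NUM_LABELS = 6  # five Ekman emotions plus neutral

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

# Hypothetical training file with a "text" column and an integer "label" column
dataset = Dataset.from_pandas(pd.read_csv("train.csv"))
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128),
    batched=True,
)
splits = dataset.train_test_split(test_size=0.1, seed=42)  # 90% train / 10% validation

def compute_metrics(eval_pred):
    # Macro-averaged F1, matching the competition's ranking metric
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

args = TrainingArguments(
    output_dir="beto-emospeech-task1",
    num_train_epochs=15,
    learning_rate=3e-6,
    adam_epsilon=1e-9,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the epoch with the best validation macro F1
    metric_for_best_model="f1_macro",
    seed=42,
)

trainer = Trainer(model=model, args=args, compute_metrics=compute_metrics,
                  train_dataset=splits["train"], eval_dataset=splits["test"])
trainer.train()
```

Under this kind of setup, the checkpoint with the highest validation macro F1 is the one used for the test predictions; in our experiments this was the model from the 12th epoch (see Section 5.1).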
4.2. Task 2: Multimodal AER
The dataset provided for Task 2 consisted of text transcripts and associated audio segments. To prepare the data for further analysis, the following pre-processing steps were carried out:

4.2.1. Data Loading. The first stage of the process consisted of loading the data provided in CSV files, which contain the transcripts and labels corresponding to the training and test samples.

4.2.2. Data Pre-processing. This step focused on cleaning the textual transcripts. Specifically, extra blanks at the beginning and end of each transcript were removed. This step was essential to ensure consistency and accuracy in the textual features subsequently extracted.

4.2.3. Feature Extraction
• Audio feature extraction. For audio feature extraction, the Librosa library was used. Audio files were loaded with a sampling rate of 22050 Hz. From these files, Mel-Frequency Cepstral Coefficients (MFCCs) were extracted. MFCCs are widely used in audio signal processing due to their ability to represent relevant acoustic features [31]. Subsequently, the mean of each MFCC matrix was calculated to obtain a compact and efficient representation of the audio features.
• Textual feature extraction. For textual feature extraction, the pre-trained BERT (Bidirectional Encoder Representations from Transformers) [32] model was used. The BertTokenizer was used to tokenize the transcripts and the BertModel was used to obtain the vector representations. The BERT outputs were averaged to obtain a fixed-size representation of each transcript, thus effectively capturing the semantics of the texts.

4.2.4. Label Encoding. The textual labels provided in the data were transformed into integer values using Scikit-Learn's LabelEncoder. This encoding allows machine learning models to work with the categorical labels efficiently.

4.2.5. Feature and Label Integration. Once the audio and text features were extracted, they were combined into a final dataset. Each sample was represented as a pair of audio and text features. This integration was carried out for both the training and test sets. The encoded labels were associated with the training samples for use in the modelling phase.

4.2.6. Model Definition. The model implemented for the multimodal emotion recognition task is a neural network architecture designed to process and combine audio and text features as follows:
• Audio subnet. A linear layer that receives the audio features and transforms them into a 256-dimensional feature space.
• Text subnet. A linear layer that receives the textual features and transforms them into a 256-dimensional feature space.
• Feature combination. The outputs of the audio and text subnetworks are concatenated and passed through a fully connected network with a ReLU activation and a Dropout layer for regularization. Finally, a linear layer reduces the dimensions to the number of emotion classes, enabling classification. A sketch of this architecture is shown after this list.
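As a reference point rather than the exact implementation, the following minimal PyTorch sketch reproduces the fusion network described in this subsection, using the dimensions detailed in Sections 4.2.7 and 4.2.8; the class and variable names are ours, and the exact layer ordering inside the combined block is an assumption.

```python
# Hedged sketch of the multimodal fusion network described in Sections 4.2.6-4.2.8.
# The dimensions (128-d audio, 768-d text, 256-d hidden, 6 classes, dropout 0.5)
# follow the paper; the names and exact layer ordering are assumptions.
import torch
import torch.nn as nn

class MultimodalEmotionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256, num_classes=6):
        super().__init__()
        self.audio_net = nn.Linear(audio_dim, hidden_dim)   # audio subnet: 128 -> 256
        self.text_net = nn.Linear(text_dim, hidden_dim)     # text subnet: 768 -> 256
        self.combined = nn.Sequential(                      # fusion head: 512 -> 256 -> 6
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_feats, text_feats):
        a = self.audio_net(audio_feats)
        t = self.text_net(text_feats)
        fused = torch.cat([a, t], dim=1)   # concatenate the two 256-d projections
        return self.combined(fused)        # logits over the six emotion classes

# Training setup mirroring the specifications in Section 4.2.7
model = MultimodalEmotionClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

Concatenating the two 256-dimensional projections before a shared classification head keeps the fusion stage simple and lets each modality be projected independently into the same feature space.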
4.2.7. Model Specifications. The model was configured with several key specifications to optimize performance. Firstly, the audio feature size was set to 128, providing a detailed representation of the audio data. The dimension of the text features was 768, allowing for a comprehensive capture of the textual information. The model was designed to classify into 6 distinct emotion classes. In the architecture, the intermediate layer of combined features consisted of 256 neurons, enabling effective integration and processing of the audio and text features. For the loss function, cross-entropy loss was chosen due to its suitability for multi-class classification tasks. Model optimization was handled by the Adam optimizer, known for its efficiency and adaptive learning rate capabilities. The learning rate was set to 0.001, a value selected to balance convergence speed and training stability.

4.2.8. Model Summary Interpretation. The model summary reveals a detailed architecture designed to process and combine audio and text features efficiently. The first layer for audio processing, denoted as Linear (audio), transforms the audio features from their original 128 dimensions to 256 dimensions. Similarly, the Linear (text) layer handles the text features, reducing their dimensionality from 768 to 256 dimensions. Following these initial transformations, the model employs a Sequential (combined) network to integrate the concatenated audio and text features. This sequential network consists of several layers: firstly, a Linear layer that reduces the combined feature dimensions from 512 to 256, ensuring a more manageable and computationally efficient size. This is followed by a ReLU activation function, which introduces non-linearity into the model, enhancing its ability to capture complex patterns within the data. To prevent overfitting, a Dropout layer is included with a dropout rate of 50%, effectively regularizing the model by randomly omitting half of the neurons during training. Finally, a second Linear layer produces the output, which matches the number of emotion classes, set to 6 in this case.
The training process ran for 150 epochs. This approach allowed the audio and text features to be effectively combined, achieving accurate classification of emotions in the dataset provided for the EmoSPeech 2024 competition.
In addition to the basic architecture described above, variants of the model were tested to explore different configurations and improve performance:
• Implementation of attention modules. Attention layers were added for each modality (audio and text) in order to highlight the most relevant features before combining them.
• Multi-head attention. Multi-head attention layers were implemented to improve the capture of long-term dependencies in the audio and text features (see the sketch after this list).
• Modification of audio vector dimensions. We experimented with different audio feature dimensions, specifically 56 and 256, to assess their impact on model performance.
• Use of BETO as a textual feature extractor. BETO [30], a BERT model pre-trained on Spanish, was used as a textual feature extractor instead of the original BERT model, in order to evaluate improvements in the semantic representation of the transcripts.
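To make the attention-based variants more concrete, the sketch below shows one possible (hypothetical) realization: the projected audio and text vectors are treated as a two-token sequence and passed through multi-head self-attention before the fusion head. The specific formulation, the head count, and all other hyperparameters are assumptions, not the exact configuration we trained.

```python
# Hedged sketch of an attention-based fusion variant. Attending across the audio
# and text projections follows the variant list above; the use of a two-token
# "sequence" with nn.MultiheadAttention and all hyperparameters are assumptions.
import torch
import torch.nn as nn

class AttentionFusionClassifier(nn.Module):
    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256,
                 num_heads=4, num_classes=6):
        super().__init__()
        self.audio_net = nn.Linear(audio_dim, hidden_dim)
        self.text_net = nn.Linear(text_dim, hidden_dim)
        # Multi-head attention over the two modality tokens
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, audio_feats, text_feats):
        a = self.audio_net(audio_feats).unsqueeze(1)  # (batch, 1, hidden)
        t = self.text_net(text_feats).unsqueeze(1)    # (batch, 1, hidden)
        tokens = torch.cat([a, t], dim=1)             # (batch, 2, hidden)
        attended, _ = self.attention(tokens, tokens, tokens)  # self-attention across modalities
        fused = attended.flatten(start_dim=1)         # (batch, 2 * hidden)
        return self.classifier(fused)                 # logits over the emotion classes
```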
These variants were implemented and evaluated for their effectiveness in the multimodal emotion recognition task, providing further insight into suitable practices and configurations for this type of task. However, the original network performed best.

5. Results
5.1. Task 1: Text AER
As mentioned earlier, we used the model with the best performance on the validation partition. To select this model, we focused on the one that achieved the best F1-score, as this was the evaluation metric used to rank the participants. This model was obtained at the 12th training epoch. The official result for the F1-score (macro average) is shown in Table 1.
To gain a deeper understanding of our model's performance, we examined the confusion matrix for Task 1, shown in Figure 1. This analysis reveals the model's proficiency in accurately identifying each emotion category. For example, the model demonstrates a high precision of 0.92 in recognizing 'neutral' emotions. Nonetheless, there are significant misclassifications, such as the frequent confusion of 'fear' with 'neutral' and 'sadness'.

Table 1
Task 1, official results over the test data
Team | Metric | Score | Placement
CogniCIC | F1-score | 0.657527 | 2nd

Figure 1: Confusion matrix for Task 1

5.2. Task 2: Multimodal AER
Even though several models were tested, the one achieving the best performance on the validation set was the initial configuration, i.e., two plain fully connected subnetworks whose outputs are concatenated prior to the actual classification stage. The results yielded by this model on the test set are shown in Table 2.
To further evaluate the performance of our model, we analyzed the confusion matrix for Task 2, depicted in Figure 2. The confusion matrix provides detailed insight into the model's ability to correctly classify each emotion category. For instance, the model shows high accuracy in recognizing 'neutral' emotions, with a precision of 0.92. However, there are noticeable misclassifications, such as 'fear' being frequently confused with 'anger' and 'sadness'. These insights are crucial for understanding the strengths and weaknesses of our model and guiding future improvements.

Table 2
Task 2, official results over the test data
Team | Metric | Score | Placement
CogniCIC | F1-score | 0.712259 | 3rd

Figure 2: Confusion matrix for Task 2

6. Conclusions and future work
Automatic Emotion Recognition has proven to be a difficult task for machine learning algorithms. In this context, the EmoSPeech competition provides a valuable environment in which to test hypotheses and models. In this regard, we have presented two different proposals for the tasks at hand: a fine-tuned BETO LLM for text-only emotion recognition with no preprocessing, leveraging the contextual capabilities of LLMs; and, for multimodal recognition, a model consisting of two multi-layer fully connected networks, each dedicated to a different data modality, whose transformed outputs are concatenated just before the classification stage. We believe our proposal to be a significant contribution to the field of single-modal and multimodal emotion recognition, since both implementations need minimal configuration and preprocessing, allowing for easy development and understanding of the process. Additionally, our second proposal stands out for its simplicity, both in preprocessing and in architecture. Moreover, the relevance of our contribution is supported by the positions obtained in the contest: second and third place for Task 1 and Task 2, respectively.
Given the promising results of our proposals, with second and third place finishes in Tasks 1 and 2 respectively, we see several avenues for future work to build upon these findings. Firstly, exploring other pretrained models beyond BETO could provide additional insights and potentially improve performance. Enhanced multimodal integration methods, such as attention mechanisms or transformer-based architectures, could capture deeper interactions between audio and text features. Incorporating advanced data augmentation techniques could produce more robust models by increasing the diversity and size of the training dataset.
Expanding the scope of emotion recognition to include multiple languages and cultural contexts could enhance the generalizability of our models. This involves training and evaluating models on diverse datasets to ensure they can accurately recognize emotions across different populations. Developing and optimizing models for real-time emotion recognition applications, such as virtual assistants or customer service bots, is another significant direction, requiring low-latency and efficient processing of multimodal data. Incorporating user-specific adaptations and personalization mechanisms could improve the accuracy of emotion recognition systems by accounting for individual differences in emotional expression. Additionally, interdisciplinary collaborations with experts in psychology and cognitive science could better inform the development of more sophisticated and accurate emotion recognition models. By addressing these areas, we aim to further advance the field of emotion recognition, making models more robust, versatile, and applicable to a wider range of real-world scenarios.

Acknowledgments
The authors wish to thank the support of the Instituto Politécnico Nacional (COFAA, SIP-IPN, Grant SIP 20240610) and the Mexican Government (CONAHCyT, SNI).

References
[1] M. de Velasco, R. Justo, J. Antón, M. Carrilero, M. I. Torres, Emotion detection from speech and text, in: IberSPEECH, 2018, pp. 68–71.
[2] S. Zhang, Y. Yang, C. Chen, X. Zhang, Q. Leng, X. Zhao, Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects, Expert Systems with Applications 237 (2024) 121692. URL: https://www.sciencedirect.com/science/article/pii/S0957417423021942. doi:10.1016/j.eswa.2023.121692.
[3] A. Dzedzickis, A. Kaklauskas, V. Bucinskas, Human emotion recognition: Review of sensors and methods, Sensors 20 (2020). URL: https://www.mdpi.com/1424-8220/20/3/592. doi:10.3390/s20030592.
[4] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sánchez, R. Valencia-García, Overview of EmoSPeech 2024 at IberLEF: Multimodal Speech-text Emotion Recognition in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
[5] A. A. Varghese, J. P. Cherian, J. J. Kizhakkethottam, Overview on emotion recognition system, in: 2015 International Conference on Soft-Computing and Networks Security (ICSNS), 2015, pp. 1–5. doi:10.1109/ICSNS.2015.7292443.
[6] S. K. Pandey, H. S. Shekhawat, S. R. M. Prasanna, Deep learning techniques for speech emotion recognition: A review, in: 2019 29th International Conference Radioelektronika (RADIOELEKTRONIKA), 2019, pp. 1–6. doi:10.1109/RADIOELEK.2019.8733432.
[7] M. A. Cardoso-Moreno, C. Macias, T. Alcantara, M. Soto, H. Calvo, C. Yañez-Marquez, Convolving emotions: A compact CNN for EEG-based emotion recognition, in: 2023 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2023, pp. 1472–1476.
[8] E. Lieskovská, M. Jakubec, R. Jarina, M. Chmulík, A review on speech emotion recognition using deep learning and attention mechanism, Electronics 10 (2021). URL: https://www.mdpi.com/2079-9292/10/10/1163. doi:10.3390/electronics10101163.
[9] D. Wang, X. Zhao, Affective video recommender systems: A survey, Frontiers in Neuroscience 16 (2022). URL: https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2022.984404. doi:10.3389/fnins.2022.984404.
[10] S. Narayanan, P. G. Georgiou, Behavioral signal processing: Deriving human behavioral informatics from speech and language, Proceedings of the IEEE 101 (2013) 1203–1233. doi:10.1109/JPROC.2012.2236291.
[11] S. Jerritta, M. Murugappan, R. Nagarajan, K. Wan, Physiological signals based human emotion recognition: a review, in: 2011 IEEE 7th International Colloquium on Signal Processing and its Applications, 2011, pp. 410–415. doi:10.1109/CSPA.2011.5759912.
[12] A. Dzedzickis, A. Kaklauskas, V. Bucinskas, Human emotion recognition: Review of sensors and methods, Sensors 20 (2020). URL: https://www.mdpi.com/1424-8220/20/3/592. doi:10.3390/s20030592.
[13] N. Alswaidan, M. E. B. Menai, A survey of state-of-the-art approaches for emotion recognition in text, Knowledge and Information Systems 62 (2020) 2937–2987.
[14] P. Thakur, D. R. Shrivastava, A. DR, A review on text based emotion recognition system, International Journal of Advanced Trends in Computer Science and Engineering 7 (2018).
[15] T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, E. Ambikairajah, A comprehensive review of speech emotion recognition systems, IEEE Access 9 (2021) 47795–47814. doi:10.1109/ACCESS.2021.3068045.
[16] M. Swain, A. Routray, P. Kabisatpathy, Databases, features and classifiers for speech emotion recognition: a review, International Journal of Speech Technology 21 (2018) 93–120.
[17] K. Sailunaz, M. Dhaliwal, J. Rokne, R. Alhajj, Emotion detection from text and speech: a survey, Social Network Analysis and Mining 8 (2018) 28.
[18] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[19] S.-H. Park, B.-C. Bae, Y.-G. Cheong, Emotion recognition from text stories using an emotion embedding model, in: 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), 2020, pp. 579–583. doi:10.1109/BigComp48618.2020.00014.
[20] J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control 47 (2019) 312–323.
[21] A. Chatziagapi, G. Paraskevopoulos, D. Sgouropoulos, G. Pantazopoulos, M. Nikandrou, T. Giannakopoulos, A. Katsamanis, A. Potamianos, S. Narayanan, Data augmentation using GANs for speech emotion recognition, in: Interspeech, 2019, pp. 171–175.
[22] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[23] S. Yoon, S. Byun, K. Jung, Multimodal speech emotion recognition using audio and text, in: 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 112–118. doi:10.1109/SLT.2018.8639583.
[24] S. Bird, NLTK: the Natural Language Toolkit, in: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 2006, pp. 69–72.
[25] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, IEMOCAP: Interactive emotional dyadic motion capture database, Language Resources and Evaluation 42 (2008) 335–359.
[26] D. Hazarika, S. Gorantla, S. Poria, R. Zimmermann, Self-attentive feature-level fusion for multimodal emotion detection, in: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2018, pp. 196–201. doi:10.1109/MIPR.2018.00043.
[27] D. Krishna, A. Patil, Multimodal emotion recognition using cross-modal attention and 1D convolutional neural networks, in: Interspeech, 2020, pp. 4243–4247.
[28] R. Pan, J. A. García-Díaz, M. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023: A multimodal speech-text corpus for emotion analysis in Spanish from natural environments, Computer Standards & Interfaces (2024) 103856.
[29] P. Ekman, Facial expressions of emotion: New findings, new questions, 1992.
[30] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[31] Z. K. Abdul, A. K. Al-Talabani, Mel frequency cepstral coefficient and its applications: A review, IEEE Access 10 (2022) 122136–122158.
[32] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).