                         LACELL at EmoSPeech-IberLEF2024: Combining
                         Linguistic Features and Contextual Sentence Embeddings
                         for Detecting Emotions from Audio Transcriptions
                         Ángela Almela1,*,† , Pascual Cantos-Gómez1,† , Daniel Granados-Meroño1,† and
                         Gema Alcaraz-Mármol2,†
                         1
                             Facultad de Letras, Universidad de Murcia, Campus de La Merced, 30001, Murcia (Spain)
                         2
                             Facultad de Educación, Universidad de Castilla-La Mancha, 45004, Toledo (Spain)


                                        Abstract
                                        These working notes summarize the participation of the LACELL team in the EmoSPeech 2024 shared task,
                                        focused on multimodal emotion recognition, which combines textual and intonation features to comprehensively
                                        understand human emotions. Its application in Spanish is crucial due to the language’s vast global presence,
                                        enabling more accurate emotion recognition and fostering better cross-cultural communication and emotional
                                        insight in diverse Spanish-speaking communities. We participated in the textual task with a combination of
                                        linguistic features from LIWC and sentence embeddings from MarIA using ensemble learning, achieving the
                                        7th position with a macro f1-score of 52.882%. This result outperformed the baseline by 3.199 points.

                                        Keywords
                                        LIWC, Linguistic Features, Emotion Classification, Natural Language Processing




                         1. Introduction
                         Emotion Recognition (ER) is an essential task for building positive relationships, whether in person or
                         through computer interactions [1]. ER is not an easy task, as there is not even scientific consensus on
                         the definition of emotion, much less on the operationalization of this research construct. Due to the
                         inherent difficulty of defining observable and measurable components of emotional behavior, Automatic
                         Emotion Recognition (AER) has been a significant challenge for many years. It is gaining importance
                         due to its impact on healthcare, psychology, social sciences, and marketing [2], as AER can provide
                         personalized responses and recommendations, thereby increasing user engagement and satisfaction.
                            AER can be approached using different taxonomies, with the most popular recognizing six basic
                          emotions: anger, disgust, fear, happiness, sadness, and surprise [3]. In this regard, it is worth noting that,
                          even though researchers are increasingly split over the validity of Ekman’s conclusions on universality
                          and his assumptions on the non-verbal expression of emotions [4], this debate does not affect the linguistic
                          expression of emotions in a specific language.
                             The EmoSpeech 2024 shared task [5] from IberLEF 2024 [6] aims to deepen the AER field by addressing
                          its inherent challenges. A key issue is identifying the features that are relevant for discriminating between
                          emotions. Beyond this, a major challenge is the scarcity of multimodal datasets that reflect real-life
                          scenarios, as many existing datasets are derived from artificial situations that lack genuine emotional
                          expressions.
                         IberLEF 2024, September 2024, Valladolid, Spain
                          * Corresponding author.
                          † These authors contributed equally.
                         $ angelalm@um.es (Á. Almela); pcantos@um.es (P. Cantos-Gómez); daniel.granadosm@um.es (D. Granados-Meroño);
                         gema.alcaraz@uclm.es (G. Alcaraz-Mármol)
                         € https://portalinvestigacion.um.es/investigadores/331758/detalle (Á. Almela);
                         https://portalinvestigacion.um.es/investigadores/330963/detalle (P. Cantos-Gómez);
                         https://portalinvestigacion.um.es/investigadores/332724/detalle (D. Granados-Meroño);
                         https://www.researchgate.net/profile/Gema-Alcaraz-Marmol (G. Alcaraz-Mármol)
                          ORCID: 0000-0002-1327-8410 (Á. Almela); 0000-0001-6329-2352 (P. Cantos-Gómez); 0000-0002-5305-1376 (D. Granados-Meroño);
                          0000-0001-7703-3829 (G. Alcaraz-Mármol)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Furthermore, the complexity of the classification problem is increased by the combined use of multiple
features, making it difficult to design advanced architectures that can integrate a wide range of features.
Indeed, multimodal AER can identify, interpret, and respond to emotions expressed through different
modalities such as text, images, and audio. Image modalities can capture data from facial expressions and
body language, while speech modalities can capture data from voice tone, intensity, duration, or rhythm.
The integration of these features in a multimodal paradigm, combining text and speech data, improves
performance in emotion recognition tasks.
   Nonetheless, instead of adopting a multimodal approach to the task, our team focused exclusively on
the text task with a combination of linguistic features from LIWC and sentence embeddings from MarIA,
achieving the 7th position with a macro f1-score of 52.882%. This result outperformed the baseline by
3.199 points.


2. Dataset
According to the organizers, the EmoSpeech 2024 dataset consists of audio segments from different
Spanish YouTube channels. The underlying assumption is that certain topics elicit different emotional
responses from content creators when they express their opinions. For example, it was observed that
politicians on politics channels often conveyed disgust towards opposing parties, while interviews with
athletes in sports contexts often showed anger after a loss.
   The dataset is a subset of 3k audio segments of a larger corpus named Spanish MEACorpus 2023
[7]. The organizers of the task first released a development dataset, but we did not use it. Instead, we
set aside 25% of the training annotations to build a custom development split for testing and
hyperparameter optimization. Table 1 summarizes the statistics of the dataset. The dataset is
unbalanced, with more documents expressing disgust and neutral emotions. Fear is the emotion with
fewer examples.
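   A minimal sketch of this split follows, assuming the training annotations are loaded as a pandas
DataFrame with hypothetical "text" and "emotion" columns; the stratified hold-out keeps the class
imbalance of Table 1 comparable across splits.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical filename for the official training annotations.
    train_df = pd.read_csv("emospeech2024_train.csv")

    # Hold out 25% of the training annotations, stratified by emotion,
    # as our custom development split.
    train_split, val_split = train_test_split(
        train_df,
        test_size=0.25,
        stratify=train_df["emotion"],
        random_state=42,
    )
    print(len(train_split), len(val_split))  # roughly 2247 / 753 documents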

Table 1
EmoSpeech 2024 statistics
                                 Emotion    Train   Val   Test   Total
                                 Anger        299   100   100     499
                                 Disgust      528   177   177     882
                                 Fear          17     6     6      29
                                 Joy          271    91    90     452
                                 Neutral      874   292   291    1457
                                 Sadness      258    87    86     431
                                 Total       2247   753   750    3750

   To analyze the dataset, we used the UMUTextStats tool [8] to obtain the distribution of linguistic
features by emotion (see Figure 1). We observed that features related to part-of-speech (nouns,
conjunctions, articles, and pronouns) are relevant, as well as features related to spelling errors, the use
of title case (especially relevant for documents annotated as fear and sadness), and forms of politeness,
which are not common in texts expressing disgust or sadness but very common in documents expressing
fear and joy.
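   The ranking in Figure 1 can be approximated with mutual information, a standard estimator of
information gain; a hedged sketch follows, where X stands for the per-document UMUTextStats feature
matrix and feature_names for its category labels (both assumed, as the tool is queried separately).

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def top_features(X, y, feature_names, k=16):
        """Return the k features with the highest mutual information with y."""
        scores = mutual_info_classif(X, y, random_state=0)
        ranked = np.argsort(scores)[::-1][:k]
        return [(feature_names[i], scores[i]) for i in ranked]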


3. System description
We evaluated LIWC [9] as a source of linguistic features (LFs). On the one hand, the 2022 version of
LIWC is the de facto linguistic analysis tool, extracting a vector of psychological dimensions of language
from text documents. It is worth noting that the last version available for Spanish is from 2007 [10], as
the subsequent versions of the software for English (LIWC2015 and LIWC-22) have not yet been translated
into Spanish. On the other hand, UMUTextStats [8] is a linguistic feature extraction tool designed for
Spanish language analysis, addressing specific linguistic phenomena that conventional tools like LIWC
overlook. Unlike LIWC, UMUTextStats is tailored to take into account nuances such as grammatical
gender and the different verb tenses inherent to the Spanish language. Furthermore, UMUTextStats has
been successfully applied in various research areas, including hate speech [11] and satire [12] detection,
among others.
   Before extracting the LFs with LIWC, a preprocessed version of the transcriptions is generated. This
second version is used to extract Part-of-Speech (PoS) features. It lacks hyperlinks, hashtags, mentions,
digits, and percentages: some of these symbols are replaced with a fixed token and others are removed.
Expressive lengthening is also removed, and misspellings are fixed using the ASPELL tool1 . It is worth
noting that we keep the original audio transcriptions to extract LFs concerning correction and style.
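   A minimal sketch of this preprocessing follows, under the assumption that regular expressions suffice
for the token replacements; spell-checking with ASPELL is run separately on the output.

    import re

    def preprocess(text: str) -> str:
        """Clean a transcription before PoS-oriented feature extraction."""
        text = re.sub(r"https?://\S+", "URL", text)   # hyperlinks -> fixed token
        text = re.sub(r"#\w+", "HASHTAG", text)       # hashtags -> fixed token
        text = re.sub(r"@\w+", "MENTION", text)       # mentions -> fixed token
        text = re.sub(r"\d+\s?%", "PERCENT", text)    # percentages -> fixed token
        text = re.sub(r"\d+", "DIGIT", text)          # remaining digits
        text = re.sub(r"(\w)\1{2,}", r"\1", text)     # expressive lengthening
        return text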
   As for the LLMs, we focused on two Spanish Large Language Models: MarIA [13] and BETO [14],
which are based on the RoBERTa and BERT architectures, respectively. We use Sentence-BERT [15] to
extract sentence embeddings from the audio transcriptions.
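   A sketch of the embedding extraction with the sentence-transformers library [15] follows. The
checkpoint name is the MarIA base model published on the Hugging Face Hub; mean pooling is our
assumption rather than a detail fixed by the task.

    from sentence_transformers import SentenceTransformer, models

    word_model = models.Transformer("PlanTL-GOB-ES/roberta-base-bne")  # MarIA
    pooling = models.Pooling(word_model.get_word_embedding_dimension(),
                             pooling_mode="mean")
    encoder = SentenceTransformer(modules=[word_model, pooling])

    embeddings = encoder.encode(["Estoy muy contento con el resultado."])
    print(embeddings.shape)  # (1, 768) for a RoBERTa-base encoder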

Table 2
Hyperparameters for fine-tuning the LLMs
                                        LLM                lr              epochs                warmup steps         weight decay
                                        BETO               4.5e-05         4                                   250                 0.19
                                        MARIA              1.8e-05         5                                     0                0.031

   As both feature sets (LFs and sentence embeddings) are encoded as vectors, we can combine them
to build stronger models. Specifically, we evaluated ensemble learning, combining the outputs of models
trained on a single feature set each. We compared three strategies: taking the mode of the predicted
labels, averaging the class probabilities, and selecting the emotion predicted with the highest probability
(see the sketch below).
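   A sketch of the three strategies, assuming each base model (LIWC, BETO, MarIA) outputs an
(n_docs, n_classes) matrix of class probabilities:

    import numpy as np
    from scipy import stats

    def ensemble(probas: list, strategy: str) -> np.ndarray:
        """Combine per-model class probabilities into final label indices."""
        stacked = np.stack(probas)              # (n_models, n_docs, n_classes)
        if strategy == "mode":                  # majority vote over hard labels
            labels = stacked.argmax(axis=2)
            return stats.mode(labels, axis=0, keepdims=False).mode
        if strategy == "mean":                  # average the probabilities
            return stacked.mean(axis=0).argmax(axis=1)
        if strategy == "highest":               # most confident single model
            winner = stacked.max(axis=2).argmax(axis=0)
            docs = np.arange(stacked.shape[1])
            return stacked.argmax(axis=2)[winner, docs]
        raise ValueError(f"unknown strategy: {strategy}")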
   In order to adjust the LLMs for this task, we first fine-tuned the models on the training dataset
using hyperparameter tuning. For each LLM, we evaluated 10 configurations with variations on the
learning rate, the warm-up steps, the weight decay, the number of epochs, and the batch size. Table
2 depicts the best configuration for both models, resulting in a larger number of epochs (4 for BETO,
5 for MarIA) and little or no warm-up, as sketched below.
1 http://aspell.net/
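   A hedged sketch of this fine-tuning with the Hugging Face Trainer, using the MarIA row of Table 2;
the checkpoint name and the pre-tokenized train_ds/val_ds dataset objects are assumptions, not
artifacts released by the task.

    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    model = AutoModelForSequenceClassification.from_pretrained(
        "PlanTL-GOB-ES/roberta-base-bne", num_labels=6)  # six emotions

    args = TrainingArguments(
        output_dir="maria-emospeech",
        learning_rate=1.8e-5,   # Table 2, MarIA row
        num_train_epochs=5,
        warmup_steps=0,
        weight_decay=0.031,
    )
    # train_ds and val_ds are assumed pre-tokenized Dataset objects.
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_ds, eval_dataset=val_ds)
    trainer.train()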


Figure 1: Information gain of the dataset with the stacked values organised by emotion. The top-ranked
features are (MOR) nouns, (ERR) orthographics, (MOR) conjunctions, (MOR) articles, (MOR)
pronouns-personal-plural, (MOR) pronouns-personal-second, (MOR) pronouns-personal-gender, (STY)
titlecase, (MOR) pronouns-personal-person-third, (MOR) pronouns-personal-number, (MOR) pronouns,
(STY) words-length-avg, (MOR) prepositions-individual, (LEX) social-cognitive, (PRA)
courtesy-forms-general, and (MOR) nouns-proper.
   In order to combine the LLMs and LIWC, we train a traditional neural network on these inputs with
another round of hyperparameter tuning. The results of this process are shown in Table 3. As can be
observed, all the resulting neural networks are shallow, composed of one or two layers, even in the case
of the LIWC features. For the LLMs, the simplicity of the networks is expected, as the sentence embeddings
were already adjusted for each emotion. A sketch of the resulting architecture follows Table 3.

    Table 3
    Best hyperparameters per model
             features   shape      # of layers   neurons   dropout       lr   batch size     activation
             LIWC       brick      2                   8         0     0.01          64      linear
             BETO       brick      1                  16     False     0.01          32      linear
             MARIA      brick      2                 128       0.3     0.01          64      sigmoid
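   A sketch of such a network in Keras for the MarIA row of Table 3, reading the "brick" shape as
equal-width hidden layers (our interpretation); the 768-dimensional input matches a RoBERTa-base
sentence embedding.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(768,)),                      # MarIA embedding
        tf.keras.layers.Dense(128, activation="sigmoid"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(128, activation="sigmoid"),  # "brick": equal widths
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(6, activation="softmax"),    # six emotions
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])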

   First, we present the experiments with the custom validation split in Table 4. The results are organized
into the LIWC linguistic features in the first set of rows, the sentence embeddings of the LLMs in
the second set, and the feature integration strategies in the last set. From the results, it can be
observed that LIWC-22 achieved limited results compared with the sentence embeddings. Among the
sentence embeddings, the performance of BETO and MarIA is similar, with MarIA achieving a better
macro f1-score and precision but a slightly lower recall. Concerning the feature integration strategies,
the best results are achieved using the ensemble based on the highest probability. However, in our
previous experience such gains have not carried over from custom validation to official test sets, so
we decided to submit the ensemble based on the mode as our final submission.


4. Results
In this section, we report the results with our custom validation split (see Section 4.1), the official
leaderboard (see Section 4.2), and an error analysis of the custom validation split (see Section 4.3).

4.1. Validation

    Table 4
    Results with the validation split
                        Strategy                          precision     recall   f1-score
                        LIWC                                 41.643     40.814     39.997
                        BETO                                 70.959     72.856     71.520
                        MarIA                                76.348     71.117     73.117
                        Ensemble Learning / HIGHEST          76.240     68.364     70.855
                        Ensemble Learning / MEAN             75.715     67.623     70.211
                        Ensemble Learning / MODE             64.923     59.946     60.431


   Next, we show the detailed classification report of the ensemble learning based on the mode with
the custom validation split in Table 5. This report includes the precision, recall, and f1-score for all
emotions, as well as the macro and weighted averages. The model achieved similar weighted and macro
f1-scores, which indicates that it performs well regardless of the emotion, including fear, which was the
most underrepresented one. However, the precision of some emotions is not very high, as is the case
for anger and joy.
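   The figures in Table 5 can be reproduced with scikit-learn's classification report; y_val and y_pred
stand for the gold and predicted labels of the custom validation split (names assumed).

    from sklearn.metrics import classification_report

    EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]
    print(classification_report(y_val, y_pred, labels=EMOTIONS,
                                target_names=EMOTIONS, digits=3))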
4.2. Official results
Table 6 depicts the official leaderboard for the competition. Our team ranked 7th out of a total of
12 participants and improved on the baseline (52.882% vs. 49.683% macro f1-score). It is worth noting
that the CICIPN team outperformed our best result with 54.993%, but they were not considered for the
official leaderboard as they submitted their runs a few hours late, according to the organizers.
   As can be observed from Table 6, we achieved the 7th position with a macro f1-score of 52.882%
using a combination of LIWC features and MarIA with an ensemble based on the mode. This result
outperformed the proposed baseline, based on TF–IDF statistical features, by 3.199 points, but it was
14.304 points lower than the 1st team, which achieved a macro f1-score of 67.186%. It is worth noting
that we would have achieved the 8th position if the CICIPN team had submitted their runs on time, as
they achieved slightly better results than our approach.

4.3. Error Analysis
To conduct the error analysis, we obtained the confusion matrix of the MarIA and LIWC ensemble
learning based on the mode with the custom validation split (see Figure 2).
   As expected, documents annotated as neutral are hard to classify. When our model output was neutral,
there were 8 documents tagged as anger, 13 as disgust, 1 as fear, 4 as joy, and 12 as sadness; conversely,
there was a larger number of misclassifications for the actual neutral documents, as 71 of them were
identified as disgust, 32 as joy, and 18 as anger. We also observed that our model tends to confuse
anger and disgust.


Table 5
Classification report of the ensemble learning strategy based on the mode with the custom validation split.
                                                   precision     recall   f1-score
                                   anger             44.531     57.000     50.000
                                   disgust           46.457     66.667     54.756
                                   fear              80.000     66.667     72.727
                                   joy               54.386     68.132     60.488
                                   neutral           81.553     57.534     67.470
                                   sadness           82.609     43.678     59.363
                                   macro avg         64.923     59.946     60.431
                                   weighted avg      65.213     59.363     60.166


Table 6
Official leader-board for Task 1
                                   #    Team              MACRO F1-SCORE
                                   1    TEC_TEZUITLAN             67.186
                                   2    mashd3v                   65.753
                                   3    UNED-UNIOVI               65.529
                                   4    UKR                       64.842
                                   5    AndreaJohanaCV            61.751
                                   6    jaime                     58.314
                                   7    LACELL                    52.882
                                   8    SINAI                     52.000
                                   9    UAE                       51.824
                                   -    Baseline                  49.683
                                   10   UTP                       41.023
                                   11   adri28                    37.852
                                   12   Iris5                     33.459
                                   -    CICIPN                    54.993
Figure 2: Confusion matrix of the ensemble model based on the mode (rows: actual emotion; columns:
predicted emotion)

                    anger   disgust   fear   joy   neutral   sadness
        anger          57        31      1     2         8         1
        disgust        42       118      0     3        13         1
        fear            0         0      4     1         1         0
        joy             6        16      0    62         4         3
        neutral        18        71      0    32       168         3
        sadness         5        18      0    14        12        38


5. Conclusions and further work
In these working notes, we have described the participation of the LACELL team in the first task of
the EmoSpeech 2024 competition, devoted to textual emotion analysis. Our proposal is grounded on
the integration of sentence embeddings from MarIA, a Spanish LLM, with linguistic features from
LIWC. We reached the 7th position in the official ranking with a macro f1-score of 52.882%,
outperforming the baseline by 3.199 points.
   As further work, we plan to include features from novel acoustic LLMs in order to participate in
multimodal tasks. Specifically, we will evaluate models such as Wav2Vec 2.0, as suggested in [16].
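   As an illustration of the kind of acoustic features we have in mind, the sketch below mean-pools
Wav2Vec 2.0 hidden states into one vector per audio segment; the multilingual checkpoint name is an
assumption, and any Spanish-compatible wav2vec2 model would do.

    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    CKPT = "facebook/wav2vec2-xls-r-300m"  # assumed checkpoint
    extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
    model = Wav2Vec2Model.from_pretrained(CKPT)

    def audio_embedding(waveform, sampling_rate=16_000):
        """Mean-pool the last hidden state into one vector per segment."""
        inputs = extractor(waveform, sampling_rate=sampling_rate,
                           return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, frames, 1024)
        return hidden.mean(dim=1).squeeze(0)            # (1024,)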


Acknowledgments
This work is part of the research projects LaTe4PoliticES (PID2022-138099OB-I00), funded by
MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF), A Way to Make
Europe, and LT-SWM (TED2021-131167B-I00), funded by MICIU/AEI/10.13039/501100011033 and by the
European Union NextGenerationEU/PRTR. This work is also part of the research project "Services based
on language technologies for political microtargeting" (22252/PDC/23), funded by the Autonomous
Community of the Region of Murcia through the Regional Support Program for the Transfer and
Valorization of Knowledge and Scientific Entrepreneurship of the Seneca Foundation, Science and
Technology Agency of the Region of Murcia.


References
 [1] A. A. Varghese, J. P. Cherian, J. J. Kizhakkethottam, Overview on emotion recognition system, in:
     2015 international conference on soft-computing and networks security (ICSNS), IEEE, 2015, pp.
     1–5.
 [2] F. Chenchah, Z. Lachiri, Speech emotion recognition in noisy environment, in: 2016 2nd Interna-
     tional Conference on Advanced Technologies for Signal and Image Processing (ATSIP), IEEE, 2016,
     pp. 788–792.
 [3] P. Ekman, Lie catching and microexpressions, The philosophy of deception 1 (2009) 5.
 [4] C. Crivelli, J. A. Russell, S. Jarillo, J. M. Fernández-Dols, Recognizing spontaneous facial expressions
     of emotion in a small-scale society of Papua New Guinea, Emotion 17 (2017) 337.
 [5] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, F. García-Sanchez, R. Valencia-García, Overview
     of EmoSPeech 2024@IberLEF: Multimodal Speech-text Emotion Recognition in Spanish, Proce-
     samiento del Lenguaje Natural 73 (2024).
 [6] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Process-
     ing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages
     Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for
     Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
 [7] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023:
     A multimodal speech-text corpus for emotion analysis in Spanish from natural environments,
     Computer Standards & Interfaces (2024) 103856.
 [8] J. A. García-Díaz, P. J. Vivancos-Vicente, A. Almela, R. Valencia-García, UMUTextStats: A linguistic
     feature extraction tool for Spanish, in: Proceedings of the Thirteenth Language Resources and
     Evaluation Conference, 2022, pp. 6035–6044.
 [9] R. L. Boyd, A. Ashokkumar, S. Seraj, J. W. Pennebaker, The development and psychometric
     properties of LIWC-22, Austin, TX: University of Texas at Austin (2022) 1–47.
[10] N. Ramírez-Esparza, J. W. Pennebaker, F. A. García, R. Suriá, La psicología del uso de las palabras:
     Un programa de computadora que analiza textos en español, Revista mexicana de psicología (2007)
     85–99.
[11] J. A. García-Díaz, S. M. Jiménez-Zafra, M. A. García-Cumbreras, R. Valencia-García, Evaluating
     feature combination strategies for hate-speech detection in Spanish using linguistic features and
     transformers, Complex & Intelligent Systems 9 (2023) 2893–2914.
[12] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the Spanish SatiCorpus 2021
     for satire identification using linguistic features and transformers, Complex & Intelligent Systems
     8 (2022) 1723–1736.
[13] A. Gutiérrez Fandiño, J. Armengol Estapé, M. Pàmies, J. Llop Palao, J. Silveira Ocampo, C. Pio Car-
     rino, C. Armentano Oller, C. Rodriguez Penagos, A. Gonzalez Agirre, M. Villegas, MarIA: Spanish
     language models, Procesamiento del Lenguaje Natural 68 (2022).
[14] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and
     evaluation data, in: PML4DC at ICLR 2020, 2020.
[15] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in:
     Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
     the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019,
     pp. 3982–3992.
[16] L. Pepino, P. Riera, L. Ferrer, Emotion recognition from speech using wav2vec 2.0 embeddings,
     Proc. Interspeech 2021 (2021) 3400–3404.