<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emotion Hunters at EMit: Categorical Emotion Detection combining BERT and ChatGPT models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianluca Calò</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Massafra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Berardina De Carolis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Corrado Loglisci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Emotion detection in text plays a crucial role in various applications, such as customer feedback analysis, social media monitoring, and the analysis of the verbal part of human communication. Deep learning techniques have shown promising results in accurately recognizing and classifying emotions in textual data. This paper describes the approach of the Emotion Hunters team to categorical emotion detection. After a pre-processing phase, a model fine-tuned from AlBERTo, together with the ChatGPT APIs, was used to address the challenge. The results show that our approach performed better on the out-of-domain test set than on the in-domain one, thus showing a good generalization capability.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion Detection</kwd>
        <kwd>BERT</kwd>
        <kwd>ChatGPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Emotion detection in texts has gained significant importance in recent years due to the pervasive presence of digital communication platforms and the wealth of user-generated content. The ability of software to understand and analyze the emotions expressed in a text has numerous applications across various domains, including marketing, customer service, mental health, and the social sciences. Emotion detection involves the use of Natural Language Processing (NLP) techniques and machine learning algorithms to automatically identify and classify the emotions conveyed in textual content. By accurately detecting and interpreting emotions, we can gain a deeper understanding of human experiences, opinions, and attitudes.</p>
      <p>Emotion detection for the Italian language presents distinctive features, ranging from morphological to lexical viewpoints. Italian has many words formed of particles in two or three units (e.g., verbal groups), which are difficult to label. Words used to express the same idea can belong to different grammatical categories (they can be nouns or verbs), and they can be associated with general or specific semantic categories. Italian is also a very rich language, with many words holding more than one meaning, which may mislead an automatic emotion detector. Moreover, while many linguistic resources and annotated texts have been generated for wide-coverage languages, such as English, Chinese, and Arabic, the same cannot be said for other less-resourced Indo-European languages, such as Italian.</p>
      <p>
        The EMit (Emotions in Italian) Subtask A [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] at EVALITA 2023 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims at detecting emotions in social media messages about TV shows, music videos, and advertisements. Given a message, a system has to decide which emotions are expressed in it, or whether the message is neutral. According to the annotation of the dataset, the problem to address is a multilabel classification one: given a message, the system must classify it and return all the labels denoting the emotions it contains. In particular, in Subtask A, a message can be classified as neutral or as expressing one or more emotions from the following set of 10 labels: anger, anticipation, disgust, fear, joy, sadness, surprise, trust, neutral [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], plus the additional label love.
      </p>
      <p>
        Our team, the EmotionHunters, addressed this challenge with a two-step model. After a pre-processing phase, we fine-tuned a model based on AlBERTo [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and tested it on the validation set, where its performance exceeded the baseline by about 16%, reaching a weighted F1-score of 0.56. However, when we ran the model on the provided in-domain test set, we noticed that in some cases it made no prediction, and that neutral predictions accounted for a high share of the results. Since ChatGPT had become very popular after the beginning of the call for this challenge, for these two cases we integrated the ChatGPT 3.5 APIs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which improved the predictions of the model by 1%. The proposed model was tested on the two test sets provided by the challenge: an in-domain set, including tweets of the same textual genre and subjects as the training set, and an out-of-domain set, including social data of different genres and subjects. Our approach showed a better performance on the out-of-domain test set, showing that it is able to generalize with respect to the topic.
      </p>
      <p>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT. g.calo26@studenti.uniba.it (G. Calò); f.massafra.7@studenti.uniba.it (F. Massafra); berardina.decarolis@uniba.it (B. De Carolis); corrado.loglisci@uniba.it (C. Loglisci). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>This is for us an interesting result, because we want to apply the model in different domains, such as conversational experiences with intelligent agents. Within the challenge deadline, we did not have the time to train a model based only on LLMs like ChatGPT; this is part of our future work.</p>
      <sec id="sec-1-1">
        <title>2. Description of the system</title>
        <p>In the following, we first describe the pipeline of text pre-processing and then provide details on the classification algorithms used and configured for the task at hand.</p>
        <sec id="sec-1-1-1">
          <title>2.1. Analysis of the Dataset</title>
          <p>The provided training dataset consists of a collection of 5966 labeled tweets, each identified with one or more labels related to the annotated emotions (among the 10 emotions mentioned above) (see Figure 2). The class distribution is not homogeneous: the trust and neutral classes are predominant, while the fear class is the least frequent (see Figure 1).</p>
          <p>
            To augment the number of sentences of the fear class, we integrated the training dataset with sentences taken from the MultiEmotions-It dataset [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]; moreover, using the affective dictionary proposed in [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], we replaced affective terms with synonyms. In this way, the fear class was upsampled to 400 sentences.
          </p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.2. Pre-processing</title>
          <p>
            A pipeline of preliminary operations is performed in order to first clean and standardize the input messages and then prepare them in a format suitable for the selected learning algorithms [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. More precisely, we remove the symbols used in social media communication (emoticons, URLs, links, mentions) without discarding their content; instead, we assign them to semantic categories denoted as tags. For instance, mentions such as "@NickName" are converted into tokens with the uniform and generic tag &lt;mention&gt;. Each emoticon is converted into the textual description of its meaning, taken from a predefined collection of emoticon-description pairs we made for the purpose of this work. For instance, the grinning-face-with-big-eyes emoticon would be converted into the description &lt;faccina con un gran sorriso e occhi spalancati&gt; ("face with a big smile and wide-open eyes", in Italian). Also, hashtagged words are split into single tokens, which are then enclosed between the open and close tags &lt;hashtag&gt; and &lt;/hashtag&gt;. For instance, the hashtag "#Sanremo2020" would be converted into the tagged sequence &lt;hashtag&gt; Sanremo 2020 &lt;/hashtag&gt;, composed of two single tokens. The rationale behind these operations is to make the tokens and symbols expressing emotions explicit and, jointly, to augment the features describing the original text, so that the learning process can work on multiple sources of information and better capture the emotive content.
          </p>
          <p>Next, we perform a conversion operation to represent the pre-processed tweets in the input format of the selected learning algorithms, that is, BERT models and variants (as explained in the following). All the tokens produced for the pre-processed tweets were indexed and used to create a dictionary for the input vectors of the learning process. Considering the typical length of a tweet, usually much shorter than the maximum admissible input size, we chose an input length of 128 (elements of the vectors) and prepared an attention mask to decrease the importance of the elements inserted into the input data as padding. Each input vector is in binary code, and each element represents the presence/absence of the corresponding indexed token.</p>
        </sec>
      </sec>
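        <p>The tagging and vectorization steps described above can be sketched in Python as follows. This is a minimal illustration, not the team's actual code: the emoticon dictionary holds only two sample entries, the helper names are ours, and square-bracket tags stand in for the angle-bracket tags of the paper.</p>
        <preformat>
```python
import re

# Illustrative emoticon-to-description pairs; the paper's collection is
# larger and uses Italian descriptions of each emoticon's meaning.
EMOTICON_DESCRIPTIONS = {
    ":D": "faccina con un gran sorriso e occhi spalancati",
    ":(": "faccina triste",
}

def preprocess(tweet):
    """Clean a tweet, turning social-media symbols into semantic tags."""
    # Mentions like "@NickName" become one uniform, generic tag.
    tweet = re.sub(r"@\w+", "[mention]", tweet)

    # Hashtags are split into single tokens between open/close tags,
    # e.g. "#Sanremo2020" becomes "[hashtag] Sanremo 2020 [/hashtag]".
    def split_hashtag(match):
        parts = re.findall(r"[A-Z]?[a-z]+|\d+|[A-Z]+(?![a-z])", match.group(1))
        return "[hashtag] " + " ".join(parts) + " [/hashtag]"
    tweet = re.sub(r"#(\w+)", split_hashtag, tweet)

    # Emoticons become the textual description of their meaning.
    for emoticon, description in EMOTICON_DESCRIPTIONS.items():
        tweet = tweet.replace(emoticon, description)
    return tweet

def encode(tokens, vocab, max_len=128):
    """Index tokens and pad/truncate to the fixed input length of 128,
    with an attention mask that down-weights the padding positions."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens][:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    return ids + [0] * (max_len - len(ids)), mask

vocab = {}
cleaned = preprocess("@NickName #Sanremo2020 :D")
ids, mask = encode(cleaned.split(), vocab)
print(cleaned)
print(len(ids), sum(mask))
```
        </preformat>
        <p>The attention mask marks the real tokens with 1 and the padding positions with 0, mirroring the mask the paper prepares for the 128-element input vectors.</p>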
      <sec id="sec-1-2">
        <title>2.3. Selected Models</title>
        <p>The experimentation is based on BERT models (and variants), which have achieved state-of-the-art results in text classification tasks. BERT utilizes the special tokens [CLS] and [SEP] to indicate the beginning of the input sequence and the separation between sentences.</p>
        <p>In this specific case, the contextualized embedding associated with the [CLS] token is used as the embedding for the entire sentence. Thanks to the multi-head attention mechanism, it can capture the semantics of the entire sentence effectively.</p>
        <p>
          Several instances of pre-trained BERT models that contain at least some Italian text in their training corpus were considered. For each model, a fully connected layer was added to perform multi-label classification and fine-tune the model on the specific dataset. The models considered are:
• dbmdz/bert-base-italian-xxl-cased [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: the MDZ Digital Library team released "Italian BERT cased XXL", a BERT version pre-trained on two corpora. The first corpus consists of texts obtained from a Wikipedia dump and various texts from the OPUS collection (http://opus.nlpl.eu/), with a total corpus size of approximately 13 GB and over 2 billion tokens. The second corpus is the Italian part of the OSCAR corpus (https://traces1.inria.fr/oscar/), with a final corpus size of approximately 81 GB and over 13 billion tokens. The "cased" version was chosen as it aligns better with the chosen pre-processing method, as previously done in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
• AlBERTo-Base, Italian Twitter lower-cased [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: a BERT model trained on a corpus of 200 million Italian tweets.
• UmBERTo-Commoncrawl-Cased [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]: a RoBERTa model (a variant of BERT) trained on an Italian sub-corpus of OSCAR. It uses a deduplicated version of the Italian corpus, which consists of 70 GB of raw text data, 210 million sentences, and 11 billion words. The sentences were filtered, shuffled at the line level, and utilized for NLP research.
• MilaNLProc/feel-it-italian-emotion [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]: an adapted version of UmBERTo for classification on the FEEL-IT dataset.
• bert-base-multilingual-uncased [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]: a multilingual uncased BERT model.
        </p>
        <p>A significant contribution can be attributed to the translation of emojis into their Italian descriptions, which can be discriminative in emotion recognition. As a resource, we used the emoji list in [15], whose CLDR Short Names were translated into Italian.</p>
        <p>The best trials were achieved with the AlBERTo model, with learning rates ranging from 2e-05 to 3e-05, a patience value of 3 allowing for training for 4 epochs, a batch size of 16, and the "transform" pre-processing strategy.</p>
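        <p>The classification setup described above (a fully connected layer on top of the encoder's [CLS] embedding, with one independent sigmoid per label) can be illustrated with a toy sketch. The 8-dimensional embedding and random weights below are placeholders, not the trained model: the real [CLS] output is 768-dimensional in BERT-base models.</p>
        <preformat>
```python
import math
import random

LABELS = ["anger", "anticipation", "disgust", "fear", "joy",
          "love", "neutral", "sadness", "surprise", "trust"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_head(cls_embedding, weights, biases, threshold=0.5):
    # One logit per label; each label gets an independent sigmoid
    # (multi-label classification, not a softmax over classes).
    predicted = []
    for label, w, b in zip(LABELS, weights, biases):
        logit = sum(x * wi for x, wi in zip(cls_embedding, w)) + b
        if sigmoid(logit) >= threshold:
            predicted.append(label)
    return predicted

# Toy stand-ins for the frozen encoder's [CLS] embedding and the
# trainable weights of the added fully connected layer.
random.seed(0)
dim = 8
cls_emb = [random.uniform(-1.0, 1.0) for _ in range(dim)]
weights = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in LABELS]
biases = [0.0] * len(LABELS)
print(multilabel_head(cls_emb, weights, biases))
```
        </preformat>
        <p>Because each label is thresholded independently, a sentence can receive zero, one, or several emotion labels, which is exactly the multilabel behaviour the task requires.</p>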
      </sec>
      <sec id="sec-1-3">
        <title>2.4. Training the Models</title>
        <p>The challenge provides two baselines and the corresponding code to reproduce their execution. The first baseline uses count vectors to represent the documents based on token frequency, while the second baseline uses TF-IDF vectors. Both implementations limit the vector dimensions to 5000. In the emotion recognition task, the TF-IDF baseline performs better, achieving an F1 score of 41.48%.</p>
        <p>For both the baselines and the experiments conducted in this work, a seed was fixed to ensure reproducibility.</p>
        <p>
          Taking into consideration the work presented in GoEmotions [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we decided to freeze the layers of the pre-trained BERT model and train only the additional classification layers.
        </p>
        <p>Various preliminary experiments were conducted by manually modifying the model's hyperparameters, such as the number of epochs, batch size, and learning rate, to understand which approach was better for this task. To improve the results, the Optuna library was adopted to systematically test different combinations of hyperparameters, and the MLflow library was used to track intermediate (F1) and final results (F1 and metrics for each class).</p>
        <p>The hyperparameter search space was defined as follows:
• Learning rate: between 2e-05 and 5e-05
• Epsilon (AdamW optimizer): between 1e-8 and 1e-6
• Hidden dropout probability: between 0.1 and 0.3
• Patience for early stopping: a discrete interval between 1 and 5
• Batch size: 16, 32, or 64</p>
        <p>
          The numerous trials conducted with Optuna allowed us to observe that the models pre-trained consistently on an Italian text corpus performed better than those pre-trained on a multilingual corpus including Italian. In general, trainings exceeding the fourth epoch often resulted in a degradation of performance in terms of F1 score, and the ideal batch size was found to be 16. On average, the performance of all models benefited from the pre-processing step, as this strategy likely retains more informative content and renders typical social media expressions in standard Italian [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. AlBERTo and dbmdz had similar performance; however, we selected AlBERTo, with an average F1 score of 0.562, also because it performed better in classifying fear, which was one of the most problematic classes due to the limited number of examples (see Figure 3).
        </p>
        <sec id="sec-1-3-1">
          <title>2.5. Prediction</title>
          <p>During the prediction phase on the test dataset, using the model fine-tuned from AlBERTo, we noticed a high number of neutral examples, and in some cases the model was unable to determine the emotion, leaving the result field empty. We noticed this problem only on the test set and not on the validation set. Therefore, to address this issue, in addition to using our trained model, we integrated the ChatGPT APIs to obtain additional results for the neutral or undetermined sentences. Each sentence in the test set is pre-processed and given as input to the fine-tuned model, with a classification threshold set at 0.5. At the end of the prediction phase, every sentence that did not receive any label, or that was classified as neutral, is passed to a Python program that utilizes the ChatGPT 3.5 APIs. Below, we show the prompt used and some of the examples provided to ChatGPT. Due to the limitations of the free APIs, the number of tokens and the amount of input examples are limited.</p>
          <preformat>prompt = "You are an emotion recognition tool for tweets and your task is to " \
    "analyze them and give a comma-separated list of the emotions that you " \
    "might think are expressed in the current tweet, and you should use only " \
    "the emotions from this list ['anger', 'anticipation', 'disgust', 'fear', " \
    "'joy', 'love', 'neutral', 'sadness', 'surprise', 'trust']"

examples = """ \
Here some examples:
Input "Io ancora non ho capito se la voce mentre cantano sia modificata o meno #IlCantanteMascherato" Output: neutral
Input "RT @user ... zoccola enorme #chilhavisto" Output: disgust
"""</preformat>
        </sec>
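        <p>The two-step prediction routine above can be sketched as follows. This is an illustrative sketch, not the team's program: <code>ask_chatgpt</code> is a hypothetical placeholder for the actual call to the ChatGPT 3.5 APIs with the prompt and examples shown above.</p>
        <preformat>
```python
LABELS = ["anger", "anticipation", "disgust", "fear", "joy",
          "love", "neutral", "sadness", "surprise", "trust"]

def threshold_labels(scores, threshold=0.5):
    # Keep every emotion whose sigmoid score reaches the 0.5 threshold.
    return [lab for lab, s in zip(LABELS, scores) if s >= threshold]

def ask_chatgpt(sentence):
    # Placeholder fallback: in the paper, a Python program sends the
    # prompt and few-shot examples to the ChatGPT 3.5 APIs and parses
    # the comma-separated emotions in the reply.
    return ["joy"]

def predict(sentence, scores):
    labels = threshold_labels(scores)
    # Sentences with no label, or labelled only as neutral, are handed
    # to the ChatGPT-based fallback.
    if not labels or labels == ["neutral"]:
        labels = ask_chatgpt(sentence)
    return labels

print(predict("che bello!", [0.1] * 10))
print(predict("ho paura", [0, 0, 0, 0.9, 0, 0, 0, 0, 0, 0]))
```
        </preformat>
        <p>Only the undetermined and neutral cases trigger the fallback, so the fine-tuned model remains the primary classifier and the API budget of the free tier is spent sparingly.</p>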
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Results</title>
      <p>The results in the following Tables 1 and 2 suggest that our approach, even if it is not the best model of the challenge, generalizes quite well to a domain different from the training one. This is a good result for us, since we aim at applying the model in contexts different from social media analysis. In particular, we are working on the multimodal analysis of human communication with conversational agents, in which the analysis of the textual part of verbal communication can be very important to fully understand the emotional state of the user. We are currently exploring the performance of models based on LLMs by fine-tuning the most popular ones on a dataset of texts denoting emotion expressions not taken from tweets, which is more in line with our final goal.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O.</given-names>
            <surname>Araque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          , V. Patti, EMit at EVALITA 2023:
          <article-title>Overview of the Categorical Emotion Detection in Italian Social Media Task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi, EVALITA 2023:
          <article-title>Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Plutchik</surname>
          </string-name>
          ,
          <article-title>The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice</article-title>
          ,
          <source>American Scientist</source>
          <volume>89</volume>
          (
          <year>2001</year>
          )
          <fpage>344</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , M. de Gemmis, G. Semeraro,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <article-title>Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets</article-title>
          ,
          <source>in: Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it), November 13-15</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] OpenAI, GPT 3.5 API large language model,
          <year>2023</year>
          . URL: https://chat.openai.com/chat.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] Emoji list, last consulted May
          <year>2023</year>
          . URL: https://unicode.org/emoji/charts/full-emoji-list.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <article-title>Multiemotions-it: a new dataset for opinion polarity and emotion analysis for italian</article-title>
          ,
          <source>in: Proceedings of the Seventh Italian Conference on Computational Linguistics</source>
          , CLiC-it 2020, Bologna, Italy, March 1-3,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <article-title>Evaluating context selection strategies to build emotive vector space models</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC</source>
          <year>2016</year>
          , Portorož, Slovenia, May 23-28,
          <year>2016</year>
          , European Language Resources Association (ELRA),
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ventura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Catelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Esposito</surname>
          </string-name>
          ,
          <article-title>An effective bert-based pipeline for twitter sentiment analysis: A case study in italian</article-title>
          ,
          <source>Sensors</source>
          <volume>21</volume>
          (
          <year>2021</year>
          )
          <fpage>133</fpage>
          . doi:10.3390/s21010133.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>[9] MDZ Digital Library team, "Italian BERT XXL models"</article-title>
          ,
          <source>Hugging Face</source>
          ,
          <year>2020</year>
          . URL: https://huggingface.co/dbmdz/bert-base-italian-xxl-cased.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Parisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Francia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Magnani</surname>
          </string-name>
          ,
          <article-title>Umberto: an italian language model trained with whole word masking</article-title>
          ,
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , et al.,
          <article-title>Feel-it: Emotion and sentiment classification for the italian language</article-title>
          ,
          <source>in: Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis</source>
          , Association for Computational Linguistics,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Movshovitz-Attias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cowen</surname>
          </string-name>
          , G. Nemade,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <article-title>Goemotions: A dataset of fine-grained emotions</article-title>
          , arXiv preprint arXiv:2005.00547 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ventura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Catelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Esposito</surname>
          </string-name>
          ,
          <article-title>An effective bert-based pipeline for twitter sentiment analysis: A case study in italian</article-title>
          ,
          <source>Sensors</source>
          <volume>21</volume>
          (
          <year>2021</year>
          ) 133.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>