
App2Check at EMit: Large Language Models for Multilabel Emotion Classification

Gioele Cageggi

Emanuele Di Rosa

Asia Uboldi

Chief Technology Officer at App2Check srl, Via XX Settembre, 14 - 16121, Genoa, Italy

Data Scientist at App2Check srl, Via XX Settembre, 14 - 16121, Genoa, Italy

In this paper we compare the performance of three state-of-the-art LLM-based approaches for multilabel emotion classification: fine-tuned multilingual T5 and two few-shot prompting approaches, plain FLAN and ChatGPT. In our experimental analysis we show that FLAN T5 is the worst performer and that our fine-tuned MT5 is the best performer on our dev set and, overall, is better than ChatGPT3.5 on the test set of the competition. Moreover, we show that MT5 and ChatGPT3.5 have complementary performance on different emotions and that A2C-best, our unsubmitted system that combines our best-performing models for each emotion, has a macro F1 that is 0.02 greater than that of the competition winner on the out-of-domain benchmark. Finally, we suggest that a perspectivist approach is more suitable for evaluating systems on emotion detection.

Keywords: Emotion Detection, Large Language Model, ChatGPT, FLAN, mT5, Prompt Engineering

1. Introduction

Categorical Emotions Detection refers to the machine learning task of detecting the presence of specific emotions in a text. Detecting customers' emotions, for example, is a useful task with many practical applications in industry, from customer experience analysis to customer churn prevention.

The categories of emotions used may vary. In this paper we consider the 8 main emotions of Plutchik’s wheel [2] (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), plus the emotion "love", which is one of the dyads, according to the EMit 2023 competition [3], and Neutral, which is the absence of emotions.

In this paper, we:

1. present three approaches for detecting emotions in a text, all based on large language models (LLMs);
2. show that, on the dev set, FLAN T5 is the worst performer and our fine-tuned MT5 is the best performer;
3. show that, among our models, MT5 is overall better than ChatGPT3.5 on the dev set and on the test set of the competition;
4. show that MT5 and ChatGPT3.5 have complementary performance on different emotions.

2. Approaches Adopted

In this paper, we study two different approaches to solve the Categorical Emotion Detection task, both Transformer-based:

• LLM Fine-tuning: Starting from a pre-trained LLM model, we use the competition dataset to fine-tune the model in order to solve the specific task
• Few-Shot Prompting: Using an Instruction Tuned LLM, prompts are designed to properly guide the model in defining its behavior for the task.
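As a rough illustration of the few-shot prompting idea, the sketch below assembles an instruction, a handful of labeled examples, and the query text into one prompt, and then parses a JSON reply into binary labels. The Text/Answer layout follows the format reported in Section 2.3; the function names, the instruction string, and the example sentence are hypothetical placeholders, and treating a malformed reply as "no emotion detected" is our assumption rather than something stated in the paper.

```python
import json

EMOTIONS = ["Anger", "Anticipation", "Disgust", "Fear", "Joy",
            "Love", "Neutral", "Sadness", "Surprise", "Trust"]


def build_few_shot_prompt(instruction, examples, text):
    """Join the instruction, the labeled examples, and the query text.

    `examples` is a list of (example_text, set_of_present_emotions) pairs.
    """
    parts = [instruction]
    for example_text, present in examples:
        answer = {emotion: int(emotion in present) for emotion in EMOTIONS}
        parts.append(f"Text: <{example_text}>\nAnswer: {json.dumps(answer)}")
    # The query is left open so that the model completes the answer.
    parts.append(f"Text: <{text}>\nAnswer:")
    return "\n\n".join(parts)


def parse_answer(reply):
    """Map a JSON reply to a 0/1 label per emotion.

    Assumption: invalid JSON or missing keys are read as 0 (emotion absent).
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {emotion: 1 if data.get(emotion) == 1 else 0 for emotion in EMOTIONS}


prompt = build_few_shot_prompt(
    "Determine the emotions in the text provided, which is delimited by <>.",
    [("che bella giornata", {"Joy"})],   # placeholder example, not from the dataset
    "non me lo aspettavo",
)
labels = parse_answer('{"Surprise": 1, "Joy": 0}')
```

The same builder can be reused for a one-emotion-at-a-time setup by passing a shorter emotion list and a different instruction.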

Briefly, the main differences between these two approaches are:

• While fine-tuned models require a larger labeled dataset for training, prompt-based models work even with a smaller few-shot dataset
• Fine-tuning requires high computational and resource capacity to complete the training. Few-shot prompting focuses on the refinement of prompts and instructions without changing the model parameters
• The carbon footprint of the two approaches is quite different. Fine-tuning an LLM can be computationally expensive and energy-intensive. Its environmental impact generally tends to be higher than that of prompt tuning, which is considered more eco-friendly because it avoids a full-scale fine-tuning process
• Fine-tuned models achieve better accuracy values when there is an abundance of labeled data, while prompt tuning can offer reasonable performance even with a limited amount of labeled data
• While fine-tuned models make LLMs specialized for a specific task, prompt tuning allows for a more flexible approach to solving different tasks with minimal changes to the prompt.

Moreover, as an internal reference, we build a system called A2C-Baseline. It combines multiple ML models, such as Decision Trees [5] and KNN models [6], where we select for each emotion the best one from a pool of models. The input text is vectorized using the tf-idf methodology.

Finally, we define a voting system, A2C-Voting, that combines the predictions for each sentence from A2C-mT5-r1, A2C-GPT-r2 and A2C-Baseline. For each prediction it chooses the result with the largest agreement. A majority is always guaranteed, since the vote is based on a binary decision for each individual emotion (present/not present) and on three different predictions.

2.1. Fine-tuned LLM

Fine-tuning LLMs has proved to be an effective approach for text classification problems, and in [7] we showed it to be the winning approach in all tasks of the ABSITA competition. MT5 [8] is the LLM we decide to use here. It is a text-to-text model released by Google in 2021 as a multilingual variant of T5 [9]. T5 uses a transformer-based architecture and can be fine-tuned to return text labels for classification tasks. MT5 has been pre-trained on mC4, which is a version of Common Crawl’s multilingual web crawl corpus containing 101 languages. This makes it possible to exploit the potential of the T5 model on a task involving Italian text.

In this paper, in order to use this model, we use the Hugging Face API [10] wrapped by the Simple Transformers [11] library. From the available models, we choose the google/mt5-base version, which has 580 million parameters. We tried to apply google/mt5-xxl, but an out-of-memory exception prevented us from using it in a Google Colaboratory cloud environment. More specifically, the model has been trained using an Nvidia A100 GPU with 40GB of memory. Training is performed for 20 epochs on 90% of the competition training dataset with a stratified split strategy. In this paper this model is referred to as A2C-mT5-r1.

2.2. Plain FLAN

FLAN-T5 [12] is one of the two Few-Shot Prompting approaches that we experiment with in this paper.

It is a model based on T5 [9], to which instruction fine-tuning has been applied. This process entails training the model using an instruction set that describes how to perform over 1000 additional tasks. The instruction fine-tuning process involves providing the model with an instruction set and executing the tasks specified in the instructions.

In this paper, we use Hugging Face’s transformers library to import and run the google/flan-t5-xl model.

Then, through prompt engineering techniques, we develop a prompt to associate an input text with one or more emotions. In the first iteration of the solution, we use a single prompt to identify and associate all possible emotions present in the input text. However, the model does not support this compact approach. Thus, we modify the prompt to identify a single emotion at a time. We obtain better outputs with this last approach. We then develop ten prompts, one per emotion.

The prompts start with "Detect if the text provided contains EmotionX as emotion. If the emotion is available in the input text, the value will be 1; 0 otherwise", where EmotionX is the emotion to look for. Then two sentences follow, one of which contains the emotion and the other does not. In this paper this model is referred to as A2C-FlanT5.

2.3. ChatGPT

ChatGPT 3.5 is the second of the two Few-Shot Prompting approaches that we experiment with in this paper. The model version we use is gpt-3.5-turbo-0301 [13]. The specifics of the model have not been publicly disclosed yet. It is similar to the previous GPT-3 model [13], trained on a set of text and code created before Q4 2021. It is then trained using a reinforcement learning method with rewards derived from human comparison.

In this paper, we use the OpenAI library [14] to process requests to the model. Unlike the approach chosen for FLAN T5, we develop a prompt to simultaneously identify all emotions for each text input. We prepared a prompt with six examples of text inputs, taken from the competition training dataset. All emotions are covered by the text examples. The output requested within the prompt is structured as a JSON with as many keys as emotions, with a value of 1 if a given emotion is present, 0 otherwise.

The prompt used is the following: "Determine the emotions in the text provided, which is delimited by <>. The available emotions are: Anger, Anticipation, Disgust, Fear, Joy, Love, Neutral, Sadness, Surprise, Trust. Provide the answer in JSON format, with the following keys: Anger, Anticipation, Disgust, Fear, Joy, Love, Neutral, Sadness, Surprise, Trust. If that emotion is present inside the input text, the value will be 1; 0 otherwise." A series of examples then follows, in the format: Text: <...> Answer: {"Anger":0, "Anticipation":0, "Disgust":0, "Fear":0, "Joy":0, "Love":0, "Neutral":0, "Sadness":0, "Surprise":1, "Trust":1}

Note that this model makes it possible to identify all emotions simultaneously, unlike FLAN T5, in which emotions are identified one at a time. In this paper, this model is referred to as A2C-GPT-r2.

2.4. Description of our best approach: A2C-best

A2C-mT5-r1 and A2C-GPT-r2 prove to be complementary in their ability to accurately detect emotions in the evaluation sets. Specifically, on the dev set, A2C-mT5-r1 outperforms A2C-GPT-r2 overall, while the latter exhibits better performance on Anger, Disgust, Fear, and Sadness. Based on these findings, in the following we show A2C-best, which combines the top-performing A2C models for each individual emotion.

In Sections 3.1 and 3.2 we show its results and its ranking on the test set of the competition as an unsubmitted system, since we believe that its results are interesting for the research community.

3. Experimental Analysis

In this paper, we refer to two types of datasets: the development dataset and the competition test set. The development dataset is used to select the best A2C models to submit to the competition, while the competition test set consists of both in-domain and out-of-domain data. The dev set is split from the competition training set using the stratified technique [15], which ensures that the original proportions of labels are maintained in each subset. Training is performed on 80% of the training dataset; models are selected on 10% of the dataset and tested on the remaining 10%. Once the models to submit have been selected, we retrain them on 100% of the training data. From here on, Dev set refers to the model selection set, while In-domain test set and Out-of-domain test set refer, respectively, to the in-domain and out-of-domain competition datasets.

Tables 2 and 3 show the A2C models that participated in the competition applied on the Dev set, together with additional models developed post-deadline, which are highlighted in italics so that they can be fairly identified. Tables 6 and 7 include all models from both A2C and other competitors applied on the competition test set. All tables display the Macro F1 and the F1 metrics for individual emotions across all models.
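For readers who want to reproduce the kind of figures reported in the results, the sketch below computes the F1 score of each emotion from binary gold and predicted labels and averages them into a Macro F1 (MF1). It is a minimal illustration on made-up labels, not the official scorer of the competition; the function names are ours, and returning 0.0 when an emotion has no true positives, false positives, or false negatives is an assumed convention.

```python
def f1_per_label(gold, pred, labels):
    """Per-label F1 from two parallel lists of {label: 0/1} dictionaries."""
    scores = {}
    for label in labels:
        pairs = [(g[label], p[label]) for g, p in zip(gold, pred)]
        tp = sum(1 for g, p in pairs if g == 1 and p == 1)
        fp = sum(1 for g, p in pairs if g == 0 and p == 1)
        fn = sum(1 for g, p in pairs if g == 1 and p == 0)
        denominator = 2 * tp + fp + fn
        # Assumed convention: F1 is 0.0 when the label never occurs at all.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores.values()) / len(scores)


# Toy data with two emotions only (made up for illustration).
gold = [{"Joy": 1, "Anger": 0}, {"Joy": 1, "Anger": 1}, {"Joy": 0, "Anger": 0}]
pred = [{"Joy": 1, "Anger": 0}, {"Joy": 0, "Anger": 1}, {"Joy": 1, "Anger": 0}]

scores = f1_per_label(gold, pred, ["Joy", "Anger"])
mf1 = macro_f1(scores)
```

Because the macro average weights every emotion equally, a rare emotion with an F1 of 0 lowers MF1 as much as a frequent one would.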

3.1. Results on Dev Set

In Tables 2 and 3, we show the results of our models on the Dev set, where unsubmitted models are shown in italics. The worst performer is A2C-FlanT5 with an MF1 of 0.27: it shows its worst performance on the Neutral label, with an F1 score of 0. The best performer among the models we evaluated for submission is A2C-mT5-r1, with an MF1 of 0.45, showing better performance on 6 out of 10 emotions when compared to the models that are not highlighted in italics. For the second run, we decide to select A2C-GPT-r2 instead of A2C-Baseline, since it performs in a complementary way compared to A2C-mT5-r1, and in order to pursue a more innovative approach. More specifically, A2C-mT5-r1 and A2C-GPT-r2 clearly exhibit complementary performance on different emotions: A2C-GPT-r2 excels in Anger, Disgust, and Sadness, while A2C-mT5-r1 performs better in Anticipation, Joy, Neutral, Surprise, and Trust. This complementary performance is almost entirely preserved in the competition test sets as well. Based on this observation, we synthesize a post-deadline system called A2C-best, which selects the model with the best performance for each emotion.
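The per-emotion selection behind A2C-best can be sketched as taking, for each emotion, the model with the highest F1 in a score table. As sample input the sketch uses the five per-emotion columns (Ang, Ant, Dis, Fea, Joy) of the score table printed at the end of the paper, restricted to the individual A2C systems. Which split the authors used to make the selection is not specified here, so the function simply operates on whatever table it receives; its name is hypothetical.

```python
# Per-emotion F1 of the individual A2C systems, copied from the score table
# printed at the end of the paper (columns Ang, Ant, Dis, Fea, Joy).
SCORES = {
    "A2C-Voting":   {"Ang": 0.39, "Ant": 0.60, "Dis": 0.65, "Fea": 0.00, "Joy": 0.25},
    "A2C-mT5-r1":   {"Ang": 0.27, "Ant": 0.43, "Dis": 0.47, "Fea": 0.00, "Joy": 0.42},
    "A2C-GPT-r2":   {"Ang": 0.64, "Ant": 0.33, "Dis": 0.68, "Fea": 0.18, "Joy": 0.25},
    "A2C-Baseline": {"Ang": 0.23, "Ant": 0.45, "Dis": 0.46, "Fea": 0.00, "Joy": 0.14},
    "A2C-FlanT5":   {"Ang": 0.51, "Ant": 0.22, "Dis": 0.59, "Fea": 0.00, "Joy": 0.26},
}


def best_model_per_emotion(scores):
    """Return {emotion: name of the model with the highest F1 for it}.

    Ties go to the model listed first in `scores`.
    """
    emotions = list(next(iter(scores.values())))
    return {e: max(scores, key=lambda model: scores[model][e]) for e in emotions}


selection = best_model_per_emotion(SCORES)
combined = {e: SCORES[model][e] for e, model in selection.items()}
```

On these five columns the combined scores coincide with the A2C-best row of the same table (0.64, 0.60, 0.68, 0.18, 0.42).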

3.2. Results on Competition Test Sets

In-domain test set. In Tables 4 and 5, we compare the competitors' systems and all our models on the in-domain Test Set of the competition. When we look at the individual emotions, ExtremITA run 2 almost always achieves the highest scores, except for Joy, where ABCD run 1 is the best one, and Love, where A2C-GPT-r2 is the best performer.

We also include in the tables the results obtained by A2C-best, which ranks second after ExtremITA’s solutions, with an MF1 score at a distance of 0.005 from its first run.

The complementarity observed in the dev set between A2C-mT5-r1 and A2C-GPT-r2 also holds true within this test set, except for the emotion of Fear. We include an upper bound benchmark, Best-All, to define the potential margin of improvement by combining all the competition models.

Out-of-domain test set. In Tables 6 and 7, we show the results of our systems and of the other participants on the out-of-domain Test Set of the competition. Looking at individual emotions, A2C-GPT-r2 shows the best score on Anger, Disgust, and Fear, while A2C-Voting shows the best score on Sadness.
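A2C-Voting, as described in Section 2, decides each emotion by majority over the binary outputs of three systems. A minimal sketch of that rule follows; the function name and the toy predictions are hypothetical, and the check for an odd number of voters is our addition to make the "majority is always guaranteed" property explicit.

```python
def majority_vote(predictions):
    """Combine binary predictions emotion by emotion.

    `predictions` is a list of {emotion: 0/1} dictionaries, one per system.
    With an odd number of systems a strict majority always exists.
    """
    n_systems = len(predictions)
    if n_systems % 2 == 0:
        raise ValueError("an odd number of systems is needed to avoid ties")
    return {
        emotion: int(2 * sum(p[emotion] for p in predictions) > n_systems)
        for emotion in predictions[0]
    }


# Toy outputs of three systems for one sentence (made up for illustration).
mt5_pred = {"Joy": 1, "Anger": 0, "Sadness": 0}
gpt_pred = {"Joy": 1, "Anger": 1, "Sadness": 0}
baseline_pred = {"Joy": 0, "Anger": 0, "Sadness": 0}

voted = majority_vote([mt5_pred, gpt_pred, baseline_pred])
```

Here Joy is kept because two of the three systems predict it, while Anger is dropped because only one does.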

We obtain A2C-best by incorporating the best results of our models into one system that selects the best of our models for each emotion. A2C-best outperforms all the submitted runs, improving on the winner by 0.02 MF1.

Once again, complementarity on emotions is clear between A2C-mT5-r1 and A2C-GPT-r2, except for Love. As an upper bound, Best-All shows that the potential margin of improvement is more significant in the out-of-domain test set.

3.3. Error analysis

In order to improve the performance of our systems, we randomly selected instances on which all systems disagree, to analyze the most difficult cases. However, during our error analysis, we noticed that many times we did not agree with the annotation of the samples. In Table 1, we show just 3 samples (out of many) in which we disagree with the golden standard (two different people plus a referee). The goal is to highlight whether disagreement between the systems is simply due to systems that cannot correctly match the ground truth, or whether such instances may be interpreted in multiple ways and thus require multiple, equally correct, labelings. As we can see in Table 1, there are differences between the Golden Standard (Gold column) and our classification (A2C team column). The research community is moving in the direction of perspectivist approaches (see [16] and [17]), which take into account the well-known issues of having just one single ground truth, especially in Natural Language Processing (NLP), and propose multiple, equally correct, labelings of the samples. In our opinion, categorical emotions detection is one relevant example of an NLP task in which it is very difficult to agree on just one golden standard.

4. Conclusion

In this paper we presented the system runs we submitted to the EMit 2023 competition for emotion detection in text, as well as our post-deadline system called A2C-best. In particular, we presented the performance of three different LLM-based approaches, namely fine-tuned multilingual T5 and two few-shot prompting techniques, A2C-GPT-r2 and FLAN T5. Our A2C-best model shows a significant improvement over our official run and comparable performance with the first-ranked system of the competition in the out-of-domain run. A2C-best scores 0.099 below the winner in the in-domain run. Finally, after relabeling difficult instances where all systems and humans disagree, we suggested that a perspectivist approach is more suitable for evaluating systems on emotion detection.

MF1    Model         Ang   Ant   Dis   Fea   Joy
0.564  Best-All      0.64  0.60  0.68  0.18  0.44
0.518  A2C-best      0.64  0.60  0.68  0.18  0.42
0.498  extremITA2    0.41  0.49  0.67  0.00  0.44
0.484  A2C-r1r2      0.64  0.43  0.68  0.18  0.42
0.449  extremITA1    0.50  0.37  0.62  0.00  0.32
0.438  A2C-Voting    0.39  0.60  0.65  0.00  0.25
0.402  A2C-mT5-r1    0.27  0.43  0.47  0.00  0.42
0.373  A2C-GPT-r2    0.64  0.33  0.68  0.18  0.25
0.303  A2C-Baseline  0.23  0.45  0.46  0.00  0.14
0.295  A2C-FlanT5    0.51  0.22  0.59  0.00  0.26

References

[1] Lai, Menini, Polignano, Russo, Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[2] K. K. Imbir, Psychoevolutionary Theory of Emotion (Plutchik), Springer International Publishing, Cham, 2017, pp. 1-9. URL: https://doi.org/10.1007/978-3-319-28099-8_547-1. doi:10.1007/978-3-319-28099-8_547-1.

[3] Araque, Frenda, Sprugnoli, Nozza, V. Patti, EMit at EVALITA 2023: Overview of the Categorical Emotion Detection in Italian Social Media Task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[4] Basile, Cabitza, Campagner, Fell, Toward a perspectivist turn in ground truthing for predictive computing, CoRR abs/2109.04270 (2021). URL: https://arxiv.org/abs/2109.04270. arXiv:2109.04270.

[5] Rokach, Maimon, Top-down induction of decision trees classifiers - a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 35 (2005) 476-487. doi:10.1109/TSMCC.2004.843247.

[6] C. D. M. et al., Introduction to information retrieval, Cambridge University Press, 2008. URL: https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf. doi:10.1017/CBO9780511809071.

[7] E. D. Rosa, A. Durante, App2check @ ate_absita 2020: Aspect term extraction and aspect-based sentiment analysis, in: V. Basile, D. Croce, M. D. Maro, L. C. Passaro (Eds.), Proceedings of (EVALITA 2020), volume 2765 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: https://ceur-ws.org/Vol-2765/paper122.pdf.

[8] L. X. et al., mt5: A massively multilingual pretrained text-to-text transformer, in: K. T. et al. (Ed.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 483–498. URL: https://doi.org/10.18653/v1/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.

[9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.

[10] Hugging Face website, 2023. URL: https://huggingface.co/.

[11] T. C. Rajapakse, Simple transformers, https://github.com/ThilinaRajapakse/simpletransformers, 2019.

[12] H. W. C. et al., Scaling instruction-finetuned language models, 2022. arXiv:2210.11416.

[13] L. O. et al., Training language models to follow instructions with human feedback, 2022. arXiv:2203.02155.

[14] Openai website, 2023. URL: https://openai.com/.

[15] K. Sechidis, G. Tsoumakas, I. P. Vlahavas, On the stratification of multi-label data, in: ECML/PKDD, 2011.

[16] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, Washington DC, USA, 2023.

[17] The perspectivist data manifesto, 2023. URL: https://pdai.info/.