<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>App2Check at EMit: Large Language Models for Multilabel Emotion Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gioele Cageggi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Rosa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asia Uboldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chief Technology Officer at App2Check srl</institution>
          ,
          <addr-line>Via XX Settembre, 14 - 16121, Genoa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Scientist at App2Check srl</institution>
          ,
          <addr-line>Via XX Settembre, 14 - 16121, Genoa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we compare the performance of three state-of-the-art LLM-based approaches for multilabel emotion classification: fine-tuned multilingual T5 and two few-shot prompting approaches, plain FLAN and ChatGPT. In our experimental analysis we show that FLAN T5 is the worst performer and our fine-tuned MT5 is the best performer on our dev set and, overall, is better than ChatGPT3.5 on the test set of the competition. Moreover, we show that MT5 and ChatGPT3.5 have complementary performance on different emotions and that A2C-best, our unsubmitted system that combines our best-performing models for each emotion, has a macro F1 that is 0.02 greater than the winner of the competition in the out-of-domain benchmark. Finally, we suggest that a perspectivist approach is more suitable for evaluating systems on emotion detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion Detection</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>FLAN</kwd>
        <kwd>mT5</kwd>
        <kwd>Prompt Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Categorical Emotion Detection refers to the machine learning task of detecting the presence of specific emotions in a text. Detecting customers' emotions, for example, is useful in many practical industry applications, from customer experience analysis to customer churn prevention.</p>
        <p>
          The categories of emotions used may vary. In this paper we consider the 8 main emotions of Plutchik’s wheel [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), plus the emotion "love", which is one of the dyads, as in the EMit 2023 competition [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and Neutral, which denotes the absence of emotions.
        </p>
        <p>In this paper, we:
1. present three approaches for detecting emotions in a text, all based on large language models (LLMs);
2. show that, on the dev set, FLAN T5 is the worst performer and our fine-tuned MT5 is the best performer;
3. show that, among our models, MT5 is overall better than ChatGPT3.5 on both the dev set and the test set of the competition;
4. show that MT5 and ChatGPT3.5 have complementary performance on different emotions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Approaches Adopted</title>
      <sec id="sec-2-1">
        <p>In this paper, we study two different approaches to solve the Categorical Emotion Detection task, both Transformer-based:
• LLM Fine-tuning: starting from a pre-trained LLM, we use the competition dataset to fine-tune the model to solve the specific task.
• Few-Shot Prompting: using an instruction-tuned LLM, prompts are designed to properly guide the model in defining its behavior for the task.</p>
        <p>Briefly, the main differences between these two approaches are:
• While fine-tuned models require a larger labeled dataset for training, prompt-based models work even with a smaller few-shot dataset.
• Fine-tuning requires high computational and resource capacity to complete the training. Few-shot prompting focuses on the refinement of prompts and instructions without changing the model parameters.
• The carbon footprint of the two approaches is quite different. Fine-tuning an LLM can be computationally expensive and energy-intensive, so its environmental impact generally tends to be higher than that of prompt tuning, which is considered more eco-friendly because it avoids a full-scale fine-tuning process.
• Fine-tuned models achieve better accuracy values when there is an abundance of labeled data, while prompt tuning can offer reasonable performance even with a limited amount of labeled data.
• While fine-tuned models make LLMs specialized for a specific task, prompt tuning allows for a more flexible approach to solving different tasks with minimal changes to the prompt.</p>
        <p>For the fine-tuned model (Section 2.1), we use the google/mt5-base version, which has 580 million parameters. We tried to apply google/mt5-xxl, but an out-of-memory exception prevented us from using it in a Google Colaboratory cloud environment. More specifically, the model has been trained using an Nvidia A100 GPU with 40GB of memory. Training is performed for 20 epochs on 90% of the competition training dataset with a stratified split strategy. In this paper this model is referred to as A2C-mT5-r1.</p>
        <p>
          Moreover, as an internal reference, we build a system called A2C-Baseline. It combines multiple ML models, such as Decision Trees [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and KNN models [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], where we select for each emotion the best one from a pool of models. The input text is vectorized using the tf-idf methodology.
        </p>
        <p>Finally, we define a voting system, A2C-Voting, that combines the predictions for each sentence from A2C-mT5-r1, A2C-GPT-r2 and A2C-Baseline. It chooses for each prediction the result with the largest agreement. A majority is always guaranteed, since the vote is based on a binary decision for each individual emotion (present/not present) and on three different predictions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Plain FLAN</title>
        <p>FLAN-T5 [12] is one of the two few-shot prompting approaches that we experiment with in this paper.</p>
        <p>It is a model based on T5 [9], on which instruction fine-tuning has been performed. This process entails training the model using an instruction set that describes how to perform over 1000 additional tasks. The instruction fine-tuning process involves providing the model with an instruction set and executing the tasks specified in the instructions.</p>
        <p>In this paper, we use Hugging Face’s transformers library to load and run the google/flan-t5-xl model.</p>
        <p>Then, through prompt engineering techniques, we develop a prompt to associate an input text with one or more emotions. In the first iteration of the solution, we use a single prompt to identify and associate all possible emotions if present in the input text. However, the model does not support this compact approach. Thus, we modify the prompt to identify a single emotion at a time, and we find better outputs with this approach. We then develop ten prompts, one per emotion. The prompts start with "Detect if the text provided contains EmotionX as emotion. If the emotion is available in the input text, the value will be 1; 0 otherwise", where EmotionX is the emotion to look for. Two example sentences follow, one of which contains the emotion while the other does not. In this paper this model is referred to as A2C-FlanT5.</p>
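        <p>The one-prompt-per-emotion scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instruction sentence is the one quoted in the text and the ten emotion names are those listed in the ChatGPT prompt of this paper, but the helper names build_flan_prompt and build_all_prompts, the Text/Answer layout of the two examples and the placeholder example sentences are hypothetical, since the paper does not specify them.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Tuple

# The ten labels used in this paper (eight Plutchik emotions, Love, Neutral).
EMOTIONS: List[str] = [
    "Anger", "Anticipation", "Disgust", "Fear", "Joy",
    "Love", "Neutral", "Sadness", "Surprise", "Trust",
]

# Instruction sentence quoted in the text; {emotion} stands for EmotionX.
INSTRUCTION = (
    "Detect if the text provided contains {emotion} as emotion. "
    "If the emotion is available in the input text, the value will be 1; "
    "0 otherwise"
)


def build_flan_prompt(emotion: str, shots: Tuple[str, str], text: str) -> str:
    """Build one single-emotion prompt.

    shots holds two example sentences: the first contains the emotion,
    the second does not. The Text/Answer layout is an assumption.
    """
    positive, negative = shots
    return "\n".join([
        INSTRUCTION.format(emotion=emotion),
        f"Text: {positive}\nAnswer: 1",
        f"Text: {negative}\nAnswer: 0",
        f"Text: {text}\nAnswer:",
    ])


def build_all_prompts(shots_by_emotion: Dict[str, Tuple[str, str]], text: str) -> Dict[str, str]:
    """Return ten prompts, one per emotion, for the same input text."""
    return {
        emotion: build_flan_prompt(emotion, shots_by_emotion[emotion], text)
        for emotion in EMOTIONS
    }


# Hypothetical placeholder shots; real ones would come from the training set.
placeholder_shots = {
    emotion: (f"[sentence expressing {emotion}]", f"[sentence without {emotion}]")
    for emotion in EMOTIONS
}
prompts = build_all_prompts(placeholder_shots, "Che bella giornata!")
print(prompts["Joy"])
```
]]></preformat>
        <p>Each of the ten prompts would then be sent to the model separately, yielding one binary decision per emotion.</p>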
      </sec>
      <sec>
        <title>2.1. Fine-tuned LLM</title>
        <p>Fine-tuning LLMs has proved to be an effective approach for text classification problems, and in [7] we showed it to be the winning approach in all tasks of the ABSITA competition. MT5 [8] is the LLM we decide to use here. It is a multilingual variant of T5 [9], a text-to-text model released by Google in 2021. T5 uses a transformer-based architecture and can be fine-tuned to return text labels for classification tasks. MT5 has been pre-trained on mC4, a version of Common Crawl’s multilingual web crawl corpus covering 101 languages. This makes it possible to exploit the potential of the T5 model on a task involving Italian text.</p>
        <p>In this paper, in order to use this model, we use the Hugging Face API [10] wrapped by the Simple Transformers [11] library. From the available models, we choose the google/mt5-base version.</p>
      </sec>
      <sec>
        <title>2.3. ChatGPT</title>
        <p>ChatGPT 3.5 is the second of the two few-shot prompting approaches that we experiment with in this paper. The version we use is gpt-3.5-turbo-0301 [13]. The specifics of the model have not been publicly disclosed yet. It is similar to the previous GPT-3 model [13], trained on a set of text and code created before Q4 2021, and then trained using a reinforcement learning method with rewards derived from human comparisons.</p>
        <p>In this paper, we use the OpenAI library [14] to send requests to the model. Unlike the approach chosen for FLAN T5, we develop a prompt to simultaneously identify all emotions for each text input. We prepared a prompt with six examples of text inputs, taken from the competition training dataset. All emotions are represented within the text examples. The output requested within the prompt is structured as a JSON object with as many keys as emotions, with a value of 1 if a given emotion is present, 0 otherwise.</p>
        <p>The prompt used is the following: "Determine the emotions in the text provided, which is delimited by &lt;&gt;. The available emotions are: Anger, Anticipation, Disgust, Fear, Joy, Love, Neutral, Sadness, Surprise, Trust. Provide the answer in JSON format, with the following keys: Anger, Anticipation, Disgust, Fear, Joy, Love, Neutral, Sadness, Surprise, Trust. If that emotion is present inside the input text, the value will be 1; 0 otherwise." A series of examples then follows, in the format: Text: &lt;...&gt; Answer: {"Anger":0, "Anticipation":0, "Disgust":0, "Fear":0, "Joy":0, "Love":0, "Neutral":0, "Sadness":0, "Surprise":1, "Trust":1}</p>
        <p>Note that this model identifies all emotions simultaneously, unlike FLAN T5, for which emotions are identified one at a time. In this paper, this model is referred to as A2C-GPT-r2.</p>
      </sec>
      <sec>
        <title>2.4. Description of our best approach: A2C-best</title>
        <p>A2C-mT5-r1 and A2C-GPT-r2 prove to be complementary in their ability to accurately detect emotions in the evaluation sets. Specifically, on the dev set, A2C-mT5-r1 outperforms A2C-GPT-r2 overall, while the latter exhibits better performance on Anger, Disgust, Fear, and Sadness. Based on these findings, in the following we present A2C-best, which combines the top-performing A2C models for each individual emotion.</p>
        <p>In Sections 3.1 and 3.2 we report how it would rank on the test set of the competition as an unsubmitted system, since we believe that its results are interesting for the research community.</p>
      </sec>
      <sec id="sec-2-3">
        <title>3. Experimental Analysis</title>
        <p>In this paper, we refer to two types of datasets: the development dataset and the competition test set. The development dataset is used to select the best A2C models to submit to the competition, while the competition test set consists of both in-domain and out-of-domain data. The dev set is split from the competition training set using the stratified technique [15], which ensures that the original proportions of labels are maintained in each subset. Training is performed on 80% of the training dataset; models are selected on 10% of the dataset and tested on the remaining 10%. Once the models to submit have been selected, we retrained them on 100% of the training data. From here on, Dev set refers to the model selection set, while In-domain test set and Out-of-domain test set refer, respectively, to the in-domain and out-of-domain competition datasets.</p>
        <p>Tables 2 and 3 show the A2C models that participated in the competition applied to the Dev set, together with additional models developed post-deadline, which are highlighted in italics for fairness. Tables 6 and 7 include all models from both A2C and other competitors applied to the competition test set. All tables display the Macro F1 and the F1 metrics for individual emotions across all models.</p>
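        <p>Macro F1 (MF1), the headline metric in all tables, is the unweighted mean of the per-emotion F1 scores. The following Python sketch makes the computation explicit. The function names and the toy gold and predicted labels are illustrative assumptions and are not taken from the competition data; returning 0 when an emotion has neither gold nor predicted positives is one common convention and not necessarily the one used by the official scorer.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Tuple


def f1_score(gold: List[int], pred: List[int]) -> float:
    """Binary F1 for one emotion, from 0/1 gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0 when there is nothing to find or predict.
    return 2 * tp / denom if denom else 0.0


def macro_f1(
    gold_by_label: Dict[str, List[int]],
    pred_by_label: Dict[str, List[int]],
) -> Tuple[float, Dict[str, float]]:
    """Return (macro F1, per-emotion F1) over all labels in gold_by_label."""
    per_label = {
        label: f1_score(gold_by_label[label], pred_by_label[label])
        for label in gold_by_label
    }
    return sum(per_label.values()) / len(per_label), per_label


# Toy data for two emotions and four texts (hypothetical values).
gold = {"Joy": [1, 0, 1, 1], "Fear": [0, 0, 1, 0]}
pred = {"Joy": [1, 0, 0, 1], "Fear": [0, 0, 0, 0]}
mf1, per_emotion = macro_f1(gold, pred)
print(per_emotion, mf1)
```
]]></preformat>
        <p>In this toy case an emotion that is never predicted (Fear) contributes an F1 of 0 and pulls the macro average down as much as any other label, which is why rare emotions weigh heavily on MF1.</p>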
        <sec id="sec-2-3-1">
          <title>3.1. Results on Dev Set</title>
          <p>In Tables 2 and 3, we show the results of our models on the Dev set, where unsubmitted models are shown in italics. The worst performer is A2C-FlanT5, with an MF1 of 0.27; it shows its worst performance on the Neutral label, with an F1 score of 0. The best performer among the models we evaluated for the submission is A2C-mT5-r1, with an MF1 of 0.45, showing better performance on 6 out of 10 emotions when compared to the models that are not highlighted in italics. For the second run, we decided to select A2C-GPT-r2 instead of A2C-Baseline, since it performs in a complementary way compared to A2C-mT5-r1, and to pursue a more innovative approach. More specifically, A2C-mT5-r1 and A2C-GPT-r2 exhibit complementary performance on different emotions: A2C-GPT-r2 excels in Anger, Disgust, and Sadness, while A2C-mT5-r1 performs better in Anticipation, Joy, Neutral, Surprise, and Trust. This complementary performance is almost entirely preserved in the competition test sets as well. Based on this observation, we synthesize a post-deadline system called A2C-best, which selects the model with the best performance for each emotion.</p>
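          <p>The per-emotion selection behind A2C-best can be sketched as follows. This is a minimal illustration, assuming that selection simply takes, for each emotion, the A2C model with the highest F1 on a given evaluation set; the function name select_best_per_emotion is hypothetical. The example input uses the F1 values that this paper reports for five emotions (Anger, Anticipation, Disgust, Fear, Joy) in its per-emotion results rows listing Best-All and A2C-best, not the Dev-set values.</p>
          <preformat><![CDATA[
```python
from typing import Dict


def select_best_per_emotion(f1_by_model: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each emotion, return the name of the model with the highest F1.

    Ties go to the model listed first (max keeps the first maximum).
    """
    emotions = next(iter(f1_by_model.values())).keys()
    return {
        emotion: max(f1_by_model, key=lambda model: f1_by_model[model][emotion])
        for emotion in emotions
    }


# Per-emotion F1 of the A2C systems for five emotions, as reported in the paper.
f1_by_model = {
    "A2C-Voting":   {"Anger": 0.39, "Anticipation": 0.60, "Disgust": 0.65, "Fear": 0.00, "Joy": 0.25},
    "A2C-mT5-r1":   {"Anger": 0.27, "Anticipation": 0.43, "Disgust": 0.47, "Fear": 0.00, "Joy": 0.42},
    "A2C-GPT-r2":   {"Anger": 0.64, "Anticipation": 0.33, "Disgust": 0.68, "Fear": 0.18, "Joy": 0.25},
    "A2C-Baseline": {"Anger": 0.23, "Anticipation": 0.45, "Disgust": 0.46, "Fear": 0.00, "Joy": 0.14},
    "A2C-FlanT5":   {"Anger": 0.51, "Anticipation": 0.22, "Disgust": 0.59, "Fear": 0.00, "Joy": 0.26},
}

best_model = select_best_per_emotion(f1_by_model)
best_f1 = {emotion: f1_by_model[model][emotion] for emotion, model in best_model.items()}
print(best_model)
print(best_f1)
```
]]></preformat>
          <p>With this input the selected scores are 0.64, 0.60, 0.68, 0.18 and 0.42, which coincide with the A2C-best row reported for the same five emotions.</p>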
        </sec>
        <sec id="sec-2-3-2">
          <title>3.2. Results on Competition Test Sets</title>
          <p>In-domain test set. In Tables 4 and 5, we compare both the competitors' systems and all our models on the in-domain Test Set of the competition. Looking at the individual emotions, ExtremITA run 2 almost always achieves the highest scores, except for Joy, where ABCD run 1 is the best one, and Love, where A2C-GPT-r2 is the best performer.</p>
          <p>We also include in the tables the results obtained by A2C-best, which ranks second after ExtremITA’s solutions, with an MF1 score at a distance of 0.005 from its first run.</p>
          <p>The complementarity observed in the dev set between
A2C-mT5-r1 and A2C-GPT-r2 also holds true within this
test set, except for the emotion of Fear. We include an
upper bound benchmark, Best-All, to define the potential
margin of improvement by combining all the competition
models.</p>
          <p>Out-of-domain test set. In Tables 6 and 7, we show the results of our systems and the other participants on the out-of-domain Test Set of the competition. Observing individual emotions, A2C-GPT-r2 shows the best score on Anger, Disgust, and Fear, while A2C-Voting shows the best score on Sadness.</p>
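          <p>A2C-Voting, which obtains the best Sadness score here, decides each emotion by majority among the binary outputs of A2C-mT5-r1, A2C-GPT-r2 and A2C-Baseline. A minimal Python sketch of such a per-emotion vote follows; the function name majority_vote and the toy predictions are hypothetical and only illustrate why three binary votes always yield a majority.</p>
          <preformat><![CDATA[
```python
from typing import Dict, List


def majority_vote(predictions: List[Dict[str, int]]) -> Dict[str, int]:
    """Per-emotion majority over 0/1 predictions from an odd number of systems."""
    if len(predictions) % 2 == 0:
        raise ValueError("an odd number of systems is needed to guarantee a majority")
    emotions = predictions[0].keys()
    return {
        # An emotion is kept when more than half of the systems predict it.
        emotion: int(2 * sum(p[emotion] for p in predictions) > len(predictions))
        for emotion in emotions
    }


# Toy predictions for one text from three systems (hypothetical values).
mt5 = {"Joy": 1, "Sadness": 0, "Trust": 1}
gpt = {"Joy": 1, "Sadness": 1, "Trust": 0}
baseline = {"Joy": 0, "Sadness": 1, "Trust": 0}

voted = majority_vote([mt5, gpt, baseline])
print(voted)  # {'Joy': 1, 'Sadness': 1, 'Trust': 0}
```
]]></preformat>
          <p>Because every vote is binary and the number of voters is odd, no tie-breaking rule is needed.</p>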
          <p>We obtain A2C-best by incorporating the best results of our models into one system that selects the best of our models for each emotion. A2C-best outperforms all the submitted runs, improving on the winner by 0.02 of MF1.</p>
          <p>Once again, complementarity on emotions is clear between A2C-mT5-r1 and A2C-GPT-r2, except for Love. As an upper bound, Best-All shows that the potential margin of improvement is more significant in the out-of-domain test set.</p>
          <table-wrap>
            <table>
              <thead>
                <tr><th>MF1</th><th>Model</th><th>Anger</th><th>Anticipation</th><th>Disgust</th><th>Fear</th><th>Joy</th></tr>
              </thead>
              <tbody>
                <tr><td>0.564</td><td>Best-All</td><td>0.64</td><td>0.60</td><td>0.68</td><td>0.18</td><td>0.44</td></tr>
                <tr><td>0.518</td><td>A2C-best</td><td>0.64</td><td>0.60</td><td>0.68</td><td>0.18</td><td>0.42</td></tr>
                <tr><td>0.498</td><td>extremITA2</td><td>0.41</td><td>0.49</td><td>0.67</td><td>0.00</td><td>0.44</td></tr>
                <tr><td>0.484</td><td>A2C-r1r2</td><td>0.64</td><td>0.43</td><td>0.68</td><td>0.18</td><td>0.42</td></tr>
                <tr><td>0.449</td><td>extremITA1</td><td>0.50</td><td>0.37</td><td>0.62</td><td>0.00</td><td>0.32</td></tr>
                <tr><td>0.438</td><td>A2C-Voting</td><td>0.39</td><td>0.60</td><td>0.65</td><td>0.00</td><td>0.25</td></tr>
                <tr><td>0.402</td><td>A2C-mT5-r1</td><td>0.27</td><td>0.43</td><td>0.47</td><td>0.00</td><td>0.42</td></tr>
                <tr><td>0.373</td><td>A2C-GPT-r2</td><td>0.64</td><td>0.33</td><td>0.68</td><td>0.18</td><td>0.25</td></tr>
                <tr><td>0.303</td><td>A2C-Baseline</td><td>0.23</td><td>0.45</td><td>0.46</td><td>0.00</td><td>0.14</td></tr>
                <tr><td>0.295</td><td>A2C-FlanT5</td><td>0.51</td><td>0.22</td><td>0.59</td><td>0.00</td><td>0.26</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
        <sec>
          <title>3.3. Error analysis</title>
          <p>In order to improve our systems' performance, we randomly selected instances on which all systems disagree, to analyze the most difficult cases. However, during our error analysis, we noticed that many times we did not agree with the annotation of the samples. In Table 1, we show just 3 samples (out of many) in which we disagree with the gold standard (two different people plus a referee). The goal is to highlight whether disagreement between the systems is simply due to systems that cannot correctly match the ground truth, or whether such instances may be interpreted in multiple ways and thus require multiple, equally correct, labelings. As we can see in Table 1, there are differences between the gold standard (Gold column) and our classification (A2C team column). The research community is moving towards perspectivist approaches (see [16] and [17]), in which well-known issues of having just one single ground truth are taken into account, especially in Natural Language Processing (NLP), and which propose multiple equally correct labelings of samples. In our opinion, categorical emotion detection is one relevant example of an NLP task in which it is very difficult to agree on just one gold standard.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <sec id="sec-3-1">
        <p>In this paper we presented the system runs we submitted at the EMit 2023 competition for emotion detection in text, and also our post-deadline system called A2C-best. In particular, we presented the performance of three different LLM-based approaches: fine-tuned multilingual T5 and two few-shot prompting techniques, A2C-GPT-r2 and FLAN T5. Our A2C-best model shows a significant improvement over our official run and comparable performance with the first-ranked system of the competition in the out-of-domain run. A2C-best scores 0.099 below the winner in the in-domain run. Finally, after relabeling difficult instances where all systems and humans disagree, we suggested that a perspectivist approach is more suitable for evaluating systems on emotion detection.</p>
        <p>[7] E. D. Rosa, A. Durante, App2check @ ate_absita 2020: Aspect term extraction and aspect-based sentiment analysis, in: V. Basile, D. Croce, M. D. Maro, L. C. Passaro (Eds.), Proceedings of (EVALITA 2020), volume 2765 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: https://ceur-ws.org/Vol-2765/paper122.pdf.</p>
        <p>[8] L. X. et al., mt5: A massively multilingual pre-trained text-to-text transformer, in: K. T. et al. (Ed.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 483–498. URL: https://doi.org/10.18653/v1/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.</p>
        <p>[9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.</p>
        <p>[10] Hugging Face website, 2023. URL: https://huggingface.co/.</p>
        <p>[11] T. C. Rajapakse, Simple transformers, https://github.com/ThilinaRajapakse/simpletransformers, 2019.</p>
        <p>[12] H. W. C. et al., Scaling instruction-finetuned language models, 2022. arXiv:2210.11416.</p>
        <p>[13] L. O. et al., Training language models to follow instructions with human feedback, 2022. arXiv:2203.02155.</p>
        <p>[14] OpenAI website, 2023. URL: https://openai.com/.</p>
        <p>[15] K. Sechidis, G. Tsoumakas, I. P. Vlahavas, On the stratification of multi-label data, in: ECML/PKDD, 2011.</p>
        <p>[16] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, Washington DC, USA, 2023.</p>
        <p>[17] The perspectivist data manifesto, 2023. URL: https://pdai.info/.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          , G. Venturi,
          <article-title>Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian</article-title>
          , in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
          , CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Imbir</surname>
          </string-name>
          ,
          <source>Psychoevolutionary Theory of Emotion (Plutchik)</source>
          , Springer International Publishing, Cham,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://doi.org/10.1007/978-3-319-28099-8_547-1. doi:10.1007/978-3-319-28099-8_547-1.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Araque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          , V. Patti,
          <article-title>EMit at EVALITA 2023: Overview of the Categorical Emotion Detection in Italian Social Media Task</article-title>
          , in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>
          , CEUR.org, Parma, Italy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cabitza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Campagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fell</surname>
          </string-name>
          ,
          <article-title>Toward a perspectivist turn in ground truthing for predictive computing</article-title>
          ,
          <source>CoRR</source>
          abs/2109.04270 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2109.04270. arXiv:2109.04270.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Maimon</surname>
          </string-name>
          ,
          <article-title>Top-down induction of decision trees classifiers - a survey</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)</source>
          <volume>35</volume>
          (
          <year>2005</year>
          )
          <fpage>476</fpage>
          -
          <lpage>487</lpage>
          . doi:10.1109/TSMCC.2004.843247.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>C. D. M</surname>
          </string-name>
          . et al., Introduction to information retrieval, Cambridge University Press,
          <year>2008</year>
          . URL: https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf. doi:10.1017/CBO9780511809071.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>