=Paper= {{Paper |id=Vol-3606/68 |storemode=property |title=An Experimental Comparison of Large Language Models for Emotion Recognition in Italian Tweets |pdfUrl=https://ceur-ws.org/Vol-3606/paper68.pdf |volume=Vol-3606 |authors=Claudia Diamantini,Alex Mircoli,Domenico Potena,Simone Vagnoni,Claudia Cavallaro,Vincenzo Cutello,Mario Pavone,Patrik Cavina,Federico Manzella,Giovanni Pagliarini,Guido Sciavicco,Eduard I. Stan,Paola Barra,Zied Mnasri,Danilo Greco,Valerio Bellandi,Silvana Castano,Alfio Ferrara,Stefano Montanelli,Davide Riva,Stefano Siccardi,Alessia Antelmi,Massimo Torquati,Daniele Gregori,Francesco Polzella,Gianmarco Spinatelli,Marco Aldinucci |dblpUrl=https://dblp.org/rec/conf/itadata/DiamantiniMPV23 }} ==An Experimental Comparison of Large Language Models for Emotion Recognition in Italian Tweets== https://ceur-ws.org/Vol-3606/paper68.pdf
                                An Experimental Comparison of Large Language
                                Models for Emotion Recognition in Italian Tweets
                                Claudia Diamantini¹,†, Alex Mircoli¹,∗,†, Domenico Potena¹,† and Simone Vagnoni¹,†
                                ¹ Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy


                                                                         Abstract
                                                                         In recent years, the advent of Large Language Models (LLMs), which are task-agnostic models trained
                                                                         on huge amounts of textual data, has given momentum to a wide variety of NLP applications, ranging
                                                                         from chatbots to sentiment classifiers. Currently, many LLMs are publicly available, each with different
                                                                         features and performance, and the selection of the best LLM for a specific task may be challenging.
                                                                         In this work, we focus on the task of emotion recognition in Italian social media content and we
                                                                         present an experimental comparison among three of the most popular LLMs: Google Bidirectional
                                                                         Encoder Representations from Transformers (BERT), OpenAI Generative Pre-trained Transformer 3
                                                                         (GPT-3) and GPT-3.5. Model specialization in emotion recognition has been achieved by using two
                                                                         different approaches, namely fine-tuning and prompt engineering with few-shot task transfer. The
                                                                         experimentation has been performed on TwIT, a corpus of about 3100 Italian tweets annotated with
                                                                         respect to six emotions. The results show that fine-tuning GPT-3 leads to the best performance on the
                                                                         considered dataset, achieving a remarkable 𝐹1 =0.90.

                                                                         Keywords
                                                                         emotion recognition, BERT, GPT-3, large language model, sentiment analysis, emotion recognition of
                                                                         tweets, emotion recognition in Italian, fine tuning, few-shot learning




                                1. Introduction
                                The advent of social networks has made available huge amounts of user-generated content,
                                whose analysis could give valuable insights into people’s feelings and opinions. For this reason,
                                several techniques for the semantic analysis of natural language have been developed. Among
                                others, emotion recognition algorithms have been proposed to analyze emotions expressed in
                                texts. Such algorithms are usually developed through a supervised learning approach and
                                hence, given the complex nature of textual data, need to be trained on very large manually
                                annotated datasets, whose creation is costly and time-consuming. A first attempt to overcome
                                this limitation is represented by techniques for the automatic creation of annotated datasets
                                that exploit noisy indicators, such as emojis [1] or facial expressions [2]. However, a more
                                promising approach has emerged in recent years thanks to the popularity gained by
                                attention-based neural networks [3] [4] such as Transformers. These architectures mitigate

                                ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
                                ∗ Corresponding author.
                                † These authors contributed equally.
                                Email: c.diamantini@univpm.it (C. Diamantini); a.mircoli@univpm.it (A. Mircoli); d.potena@univpm.it (D. Potena);
                                vagnonisimone96@gmail.com (S. Vagnoni)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073




some of the challenges associated with Recurrent Neural Networks (RNNs) [5] and Long Short-
term Memory (LSTM) networks [6] and led to the development of Large Language Models
(LLMs). LLMs are massive Transformer-based neural networks that have been trained on huge
amounts of data and offer unprecedented performance in a large variety of NLP activities. LLMs
are usually general-purpose models and hence they are not specialized in a single task. In order
to improve their performance on a complex task such as emotion recognition, a fine-tuning
phase is usually needed. This phase consists of a short training run on an emotion-annotated
dataset, with the aim of defining the admissible responses (i.e., the chosen classes) and
showing the network some samples from the chosen domain. Nevertheless, even if some
works (e.g., [7]) have shown that fine-tuning an LLM on a small dataset is often sufficient to
obtain good results in terms of classification accuracy, the customization of LLMs remains costly
as it is usually performed through paid API calls. For this reason, in the present work we also
investigate the effect of prompt engineering on LLMs. Prompt engineering is the process of
determining the input text (also known as prompt) that leads an LLM to produce the best
possible response (see footnote 1). In particular, LLMs have been shown to perform well on
previously unseen tasks when a detailed natural-language description of the task and a few
examples (shots) are included in the prompt. Such an approach has been investigated by Brown et
al. [8], who found that model performance increased on several benchmarks, demonstrating
that a task-agnostic LLM can be turned into a task-specific model through few-shot task transfer.
Although LLMs have been widely tested on English corpora, limited experimentation has been
conducted on other languages. For this reason, we focus on emotion recognition in Italian texts
and, in particular, in Twitter data, since they are usually difficult to classify.
The contributions of the present work are two-fold:
       • the experimental evaluation of three publicly-available LLMs on the emotion recognition
         task, in particular for Italian social media content. LLMs have been selected based on their
         popularity and the chosen models are: Google Bidirectional Encoder Representations
         from Transformers (BERT) [9], OpenAI Generative Pre-trained Transformer 3 (GPT-3)
         [8] and OpenAI GPT-3.5;
       • the comparison between two different techniques, namely a traditional fine-tuning and a
         few-shot task transfer through prompt engineering.
The rest of the paper is structured as follows: the next section presents some relevant related
work on emotion recognition. The description of the techniques used for fine-tuning and prompt
engineering is proposed in Section 3, while Section 4 reports the results of the experimental
evaluation of the models on a real-world dataset of Italian tweets. Finally, Section 5 draws
conclusions and discusses future work.


2. Related work
In recent literature, much research effort has focused on the task of emotion recognition,
applied to different data types: audio [10], images [11], videos [12], and texts [13]. With
respect to the latter, the majority of works on emotion recognition are limited to the English
language [14] [15]. The main differences among such works lie in the adopted emotional
framework and the architecture of the proposed classifier. As for the emotional
frameworks, the most widespread are Ekman’s theory of six archetypal emotions [16] and
Plutchik’s wheel of emotions [17]. With regard to the classification algorithm, researchers
mainly focused on word embeddings (e.g., Word2Vec [18]) until the advent of Transformer-based
architectures. Such architectures have shown unprecedented performance and have rapidly
replaced older approaches. In the last five years, a large number of Large Language Models
based on Transformers have been released: among others, OpenAI GPT-3 [8] and Google BERT
[19] have gained much popularity among researchers and practitioners. Despite the wide
availability of models for the English language (e.g., [20]), only a few resources for emotion
recognition are available for other languages, including Italian. To the best of our knowledge,
the two most recent LLM-based approaches for emotion recognition in Italian texts are [21] and
[22]. In particular, in the first work the authors propose AlBERTo, a BERT-based LLM
for the Italian language created by fine-tuning BERT-Base on a large dataset of
Italian tweets. A similar approach is followed in [22], where the authors use emojis as noisy
indicators to build an annotated dataset for fine-tuning BERT. As for experimental comparisons of
LLMs for emotion recognition, in particular for the Italian language, to the best of our knowledge
no relevant works had been published in the literature prior to the writing of this paper.

1 https://itnext.io/prompt-engineering-the-magical-world-of-large-language-models-dde7d8d043ee


3. Methodology
The proposed methodology aims to evaluate the differences both among LLMs and among
task transfer techniques, in order to empirically determine the approach that leads
to the best performance in emotion recognition of Italian texts. The comparison between LLMs
has been performed using the same task transfer technique, namely fine-tuning; the considered
models are Google BERT and OpenAI GPT-3. As for the comparison between
task transfer techniques, i.e., fine-tuning and few-shot task transfer through prompt
engineering, the same version of the LLM could not be used due to API limitations in newer
GPT versions. For this reason, the comparison has been made between the fine-tuned version
of GPT-3 and the prompt-engineered version of GPT-3.5.

3.1. Fine-tuning
In this subsection we describe the approaches used to fine-tune the two considered LLMs. The
selected BERT-based model for emotion recognition of Italian tweets, i.e., EmotionAlBERTo,
is the result of fine-tuning an Italian version of Google BERT, namely AlBERTo [21], which,
in turn, has been created by fine-tuning BERT-Base on a dataset of about 200 million Italian
tweets named TWITA [23]. This LLM has been developed without using the “next following
sentence” technique (i.e., BERT’s next sentence prediction objective), which makes it unsuitable
for tasks like question answering but perfectly appropriate for emotion recognition. In order to
fine-tune AlBERTo, we followed the approach presented in [22]. In particular, we added a final
classification stage to AlBERTo and then fine-tuned the entire network on the TwIT dataset. The
entire architecture is depicted in Figure 1: the text is fed into AlBERTo, which generates a
sentence representation, and this representation is then classified by a classification stage
consisting of a fully connected layer and a softmax layer with 6 neurons (i.e., one for each
considered emotion).
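
The classification stage described above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the weights are random, the toy hidden size of 8 stands in for the 768-dimensional sentence representation of BERT-Base, and the names `softmax`, `classify`, `HIDDEN`, `W` and `b` are hypothetical.

```python
import math
import random

EMOTIONS = ["happiness", "trust", "sadness", "anger", "fear", "disgust"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(sentence_vec, weights, bias):
    """Fully connected layer (one output per emotion) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, sentence_vec)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs

random.seed(0)
HIDDEN = 8  # toy size (assumption); BERT-Base produces 768-dimensional vectors
vec = [random.uniform(-1, 1) for _ in range(HIDDEN)]  # stand-in for the encoder output
W = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in EMOTIONS]
b = [0.0] * len(EMOTIONS)

label, probs = classify(vec, W, b)
print(label, [round(p, 3) for p in probs])
```

During fine-tuning, both the encoder and the parameters of this final stage would be updated, whereas here they are fixed random values used only to show the data flow.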




Figure 1: The architecture used for fine-tuning BERT [22].


   As for GPT-3, fine-tuning has been performed through the OpenAI APIs (see footnote 2).
Such APIs require training data to be formatted in the JSON Lines (JSONL) format, in which
each line of the file is a separate JSON value. The training dataset must be converted so
that each line is a prompt-completion pair where, in the case of emotion recognition, the
prompt corresponds to the sentence to be classified and the completion is the related
emotion. The fine-tuned model is stored on OpenAI’s cloud and cannot be downloaded: it can
only be accessed through API calls in which it is explicitly selected as the current model.
This aspect represents a potential limitation, since the API calls, which are required for
both training and inference, are paid and their cost depends on the number of processed
tokens.
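
As an illustration of the conversion step, the following sketch serializes sentence-emotion pairs as JSONL prompt-completion records. The function name `to_jsonl` and the two labeled samples are hypothetical and are not taken from TwIT; any additional conventions the API may expect (e.g., separators or leading whitespace in completions) are not covered here.

```python
import json

def to_jsonl(samples):
    """Serialize (sentence, emotion) pairs as one JSON object per line."""
    lines = []
    for text, emotion in samples:
        record = {"prompt": text, "completion": emotion}
        # ensure_ascii=False keeps accented Italian characters readable
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"

# Hypothetical samples: the labels are illustrative assumptions.
samples = [
    ("Che bella giornata!", "felicità"),
    ("Wtf? mi sento male", "paura"),
]

jsonl_text = to_jsonl(samples)
print(jsonl_text, end="")
```

Each output line can be parsed independently with `json.loads`, which is what distinguishes JSONL from a single JSON array.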

3.2. Prompt engineering
Few-shot task transfer through prompt engineering can be performed by defining an optimal
initial prompt for the LLM, which may take into account both contextual information and
training data. Regarding contextual information, it has been demonstrated that giving detailed
instructions about the task to be performed and the semantics of each considered class
improves the model’s ability to discriminate between emotions. For this reason, we started our
prompts with an accurate description of the task and a description of each considered emotion.
Subsequently, we added some sentence-class example pairs extracted from the dataset.

2 https://platform.openai.com/docs/guides/fine-tuning

  After several attempts (see Section 4.1), we found the following text to be the best initial
prompt for GPT-3.5:

    • ITA: “Sei uno sociologo esperto nell’analisi delle emozioni espresse sui social network, che
      classifica le emozioni nel testo secondo questo schema: 1) felicità: Sentimenti di piacere, con-
      tentezza, soddisfazione, o anche attrazione e desiderio. Può includere risposte a complimenti
      o manifestazioni di affetto. 2) fiducia: Sentimenti di sicurezza, affidabilità o apprezzamento
      verso gli altri. Può comprendere la fiducia in se stessi, negli altri o nelle situazioni. Può
      anche includere sentimenti di rispetto o ammirazione per qualcuno o qualcosa. 3) tristezza:
      Sentimenti di dolore, malinconia o dispiacere. Può comprendere la delusione, il dispiacere
      per una perdita o un fallimento, o la sensazione di mancanza o vuoto. 4) rabbia: Sentimenti
      di frustrazione, irritazione o ira. Può includere reazioni a ingiustizie, insoddisfazioni o
      comportamenti negativi da parte degli altri. 5) paura: Sentimenti di preoccupazione, ansia o
      paura. Può comprendere la paura di eventi futuri, l’ansia per situazioni attuali o preoccu-
      pazioni in generale. 6) disgusto: Sentimenti di avversione, repulsione o disprezzo. include
      sentimenti verso comportamenti immorali, cibi o odori sgradevoli, o qualsiasi altra cosa che
      provoca una forte avversione. Nel contesto considera gli indizi lessicali, l’uso di simboli, di
      emoji, dell’ironia. Rispondi scegliendo soltanto una tra le seguenti emozioni: felicità, fiducia,
      tristezza, rabbia, paura, disgusto. [...]”
    • ENG: “You are a sociologist expert in the analysis of emotions expressed on social networks,
      which classifies the emotions in the text according to this scheme: 1) happiness: Feelings of
      pleasure, contentment, satisfaction, or even attraction and desire. It may include responses
      to compliments or displays of affection. 2) trust: Feelings of security, trustworthiness or
      appreciation towards others. It can include trust in oneself, in others or in situations. It
      can also include feelings of respect or admiration for someone or something. 3) sadness:
      Feelings of pain, melancholy or sorrow. It can include disappointment, sorrow over a loss or
      failure, or feelings of lack or emptiness. 4) anger: Feelings of frustration, irritation or anger.
      It can include reactions to injustice, dissatisfaction or negative behavior from others. 5) fear:
      Feelings of worry, anxiety or fear. It may include fear of future events, anxiety about current
      situations, or worries in general. 6) disgust: Feelings of aversion, repulsion or contempt.
      includes feelings about immoral behavior, unpleasant foods or smells, or anything else that
      causes a strong dislike. In context, consider lexical clues, the use of symbols, emojis, irony.
      Answer by choosing only one of the following emotions: happiness, trust, sadness, anger, fear,
      disgust. [...]”

For the sake of brevity, we have omitted the part of the prompt where the example sentences were
provided.
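
The structure of such a prompt (task description, then sentence-class example pairs, then the tweet to classify) can be sketched as follows. The layout with the `Testo:` and `Emozione:` markers, the function name `build_prompt`, the shortened task description and the labeled examples are all assumptions made for illustration, since the example part of the actual prompt is not reported.

```python
def build_prompt(task_description, examples, tweet):
    """Assemble a few-shot prompt: description, labeled examples, then the query."""
    parts = [task_description.strip(), ""]
    for text, label in examples:
        parts.append(f"Testo: {text}")
        parts.append(f"Emozione: {label}")
        parts.append("")
    # The tweet to classify comes last, with the label left for the model to fill in.
    parts.append(f"Testo: {tweet}")
    parts.append("Emozione:")
    return "\n".join(parts)

# Shortened stand-in for the full task description given above (assumption).
description = (
    "Rispondi scegliendo soltanto una tra le seguenti emozioni: "
    "felicità, fiducia, tristezza, rabbia, paura, disgusto."
)
# Hypothetical sentence-class pairs; the real ones were extracted from the dataset.
examples = [
    ("Che bella giornata!", "felicità"),
    ("Non ne posso più di questo traffico", "rabbia"),
]

prompt = build_prompt(description, examples, "Un'altra cena di 3 ore? Gna faccio")
print(prompt)
```

Varying the length of `examples` and the wording of `description` corresponds to the two factors that were changed across the tested prompts.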


4. Experiments
In this section, we discuss the results of the experiments aimed at determining the best LLM
and the best specialization technique for emotion recognition in Italian texts.

4.1. Experimental setup
The LLMs have been evaluated on the TwIT dataset [22], which is a dataset of 3108 Italian
tweets labeled with regard to six emotions: happiness, trust, sadness, anger, fear, and disgust.
The dataset is available at the following URL: https://github.com/a-mircoli/twit. The chosen
emotions are consistent with those found in other datasets (e.g., MultiEmotionsIT [24]) and are
considered basic, universal emotions, with the exception of trust, which has been added since it
is quite common in social media texts. The class distribution is shown in Figure 2; it can be
noticed that the dataset is quite balanced, with a maximum difference of 70 samples between
the majority and the minority class.




Figure 2: Class distribution of the TwIT dataset.


  We tested the following LLMs:

    • BERT: we considered EmotionAlBERTo, which is the fine-tuned version of AlBERTo, and
      we used the network hyperparameters shown in Table 1, since they provided the best
      results in a previous experimentation on the same dataset, as discussed in [22].
    • GPT-3: we fine-tuned both the davinci model, which is the largest and most costly GPT-3
      model, and the curie model, which is smaller but faster than davinci. In the following, we
      only report the results of davinci, since it achieved slightly better results on the
      considered dataset.
    • GPT-3.5: we performed few-shot task transfer through prompt engineering. In particular,
      we measured the performance of 20 different prompts, in which we varied the number of
      given examples and the context description, in order to find the best prompt. The results
      presented in the following subsection refer to the best-performing prompt. We worked on
      the version available in June 2023. It should be noted that this model receives frequent
      updates that may alter its performance and hence affect the reproducibility of the
      obtained results.


Table 1
The optimal hyperparameters found for EmotionAlBERTo through the hyperparameter tuning phase.
                                      Hyperparameter                Value
                                      learning_rate                  2e-5
                                      train_batch_size               512
                                      eval_batch_size                512
                                      max_seq_length                 128
                                      num_training_epochs             10

  We evaluated the model by means of three metrics: precision, recall and F1 score. Let $x_{ij}$ be
the number of data belonging to $j$-th class which have been classified as $i$-th class and let $C$ be
the number of classes. Precision and recall of $i$-th class are determined as follows:

$$\mathrm{precision}_i = \frac{x_{ii}}{\sum_{j=1}^{C} x_{ij}} \tag{1}$$

$$\mathrm{recall}_i = \frac{x_{ii}}{\sum_{j=1}^{C} x_{ji}} \tag{2}$$

F1 score of $i$-th class is equal to:

$$F1_i = 2 \cdot \frac{\mathrm{precision}_i \cdot \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i} \tag{3}$$

Therefore, the F1 score achieved by a classification model is defined as the average of $F1_i$:

$$F1 = \frac{1}{C} \sum_{i=1}^{C} F1_i \tag{4}$$
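
Equations (1)-(4) can be computed directly from a confusion matrix whose rows are predicted classes and whose columns are actual classes. The sketch below is an illustration on a small 3-class toy matrix (hypothetical values, not experimental data); the function names `per_class_metrics` and `macro_f1` are assumptions.

```python
def per_class_metrics(x):
    """x[i][j] = number of samples of actual class j classified as class i."""
    C = len(x)
    metrics = []
    for i in range(C):
        row_sum = sum(x[i][j] for j in range(C))  # everything predicted as class i
        col_sum = sum(x[j][i] for j in range(C))  # everything actually in class i
        precision = x[i][i] / row_sum if row_sum else 0.0  # Eq. (1)
        recall = x[i][i] / col_sum if col_sum else 0.0     # Eq. (2)
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0  # Eq. (3)
        metrics.append((precision, recall, f1))
    return metrics

def macro_f1(x):
    """Unweighted average of the per-class F1 scores, Eq. (4)."""
    metrics = per_class_metrics(x)
    return sum(f1 for _, _, f1 in metrics) / len(metrics)

# Toy 3-class confusion matrix (rows: predicted, columns: actual).
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]

for i, (p, r, f) in enumerate(per_class_metrics(toy)):
    print(f"class {i}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
print(f"macro F1 = {macro_f1(toy):.3f}")  # ≈ 0.767
```

Because the average in Eq. (4) is unweighted, every class contributes equally to the final score regardless of its number of samples.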

4.2. Results
The results of the experiments are shown in Table 2. It can be noticed that, even if the
difference between the fine-tuned LLMs is quite small (0.04 in terms of F1 score), GPT-3 achieves
the best performance, with a remarkable F1 = 0.90. Conversely, the prompt-engineered version of
GPT-3.5 achieves significantly lower results, settling on an F1 score of 0.48. In particular, it
can be seen that this model has great difficulty in classifying the trust emotion (precision 0.25,
recall 0.11).
   The confusion matrix for the best model is shown in Table 3. It highlights that sadness and
anger are the classes most often predicted incorrectly, with 14 and 13 sentences respectively
wrongly assigned to them, while only 3 sentences are wrongly assigned to trust, leading to a very
high precision (0.97) for the trust class. In general, the classification of the happiness and the
trust classes seems to be easier for the model, as they have the highest values for precision and
recall.
Table 2
Results of the experiments on the TwIT dataset. 𝐹1 score and class-related precision and recall are
reported for each model.
 Model               F1 score   Metric      Happiness   Trust   Sadness   Anger   Fear   Disgust   Avg
 BERT                0.86       precision   0.95        0.96    0.82      0.79    0.86   0.76      0.86
                                recall      0.96        0.97    0.89      0.71    0.80   0.81      0.86
 GPT (fine-tuning)   0.90       precision   0.94        0.97    0.86      0.86    0.92   0.87      0.90
                                recall      0.96        0.93    0.90      0.88    0.82   0.93      0.90
 GPT (prompt eng.)   0.48       precision   0.45        0.25    0.53      0.45    0.73   0.61      0.51
                                recall      0.91        0.11    0.57      0.71    0.15   0.37      0.47


Table 3
Confusion matrix for the best classifier: GPT-3 (fine-tuning)
                     Act. Happiness Act. Trust Act. Sadness Act. Anger Act. Fear Act. Disgust
  Pred. Happiness              96              2           1            0            1          1
  Pred. Trust                   3             89           0            0            0          0
  Pred. Sadness                 2              0          78            3            8          1
  Pred. Anger                   0              3           2          110            2          6
  Pred. Fear                    0              1           6            3           93          2
  Pred. Disgust                 0              1           2            5            5         95
  Recall                      0.96           0.93        0.90         0.88         0.82       0.93
  Precision                   0.94           0.97        0.86         0.86         0.92       0.87


   On the basis of the obtained results, it can be concluded that fine-tuning offers superior
performance compared to prompt engineering, despite the fact that the latter was applied to a
newer and better-performing version of GPT. In fact, fine-tuning causes the LLM to adapt
closely to the concepts of happiness, trust, etc. as expressed in the training set, for which the
classification is far better. However, it should be noted that fine-tuning is done on TwIT, a
dataset that was annotated based on the emojis present in the text, which were subsequently
removed. This fact should not be overlooked, as emojis provide hints on how to classify
ambiguous sentences such as “Un’altra cena di 3 ore? Gna faccio” (ENG: Another 3-hour dinner?
I can’t do it), which could be associated with various classes (e.g., sadness, fear, disgust)
depending on the emojis added to the text. In this particular example, the emoji with the
disgusted face (present when the sentence was collected and removed after the class was
assigned) introduces a bias that GPT-3 is able to capture thanks to fine-tuning on the annotated
dataset, whereas GPT-3.5 cannot, because it only sees the text without the emoji. Another
example is represented by the sentence “Wtf? mi sento male” (ENG: Wtf? I feel bad), which
changes meaning depending on whether it is coupled with a scared or smiling emoji.


5. Conclusion and future work
The goal of the work was the experimental comparison of three LLMs (i.e., Google BERT, OpenAI
GPT-3 and GPT-3.5) on emotion recognition in Italian tweets. The LLMs were specialized on this
task following two different approaches, namely fine-tuning and prompt engineering with
few-shot task transfer, in order to determine the best technique in terms of classification
accuracy and training effort. The models were tested on TwIT, a corpus of 3108 Italian tweets
labeled with respect to six emotions. The experimentation showed that fine-tuning GPT-3 leads
to the best classification performance (F1 = 0.90) and, in particular, that the fine-tuned GPT-3
is better at recognizing complex and nuanced emotions such as sadness and fear, suggesting that
it is better able to capture semantic aspects of text. As for the comparison between the two
specialization approaches, fine-tuning produced significantly better results than prompt
engineering, which reached F1 = 0.48.
In future work, we plan to include in the experimentation some other popular LLMs, such as
Google PaLM 2 and Meta LLaMA, in order to carry out a more complete comparison of the
models on the market. We are also working on the creation of a larger manually-annotated
dataset with the purpose of testing the models on a broader variety of topics. Finally, we are
interested in delving into the emerging field of multilabel emotion recognition, where texts are
labeled with multiple emotion classes, providing a more nuanced representation of emotions
and considering the typical subtleties of human feelings.


References
 [1] J. Islam, R. E. Mercer, L. Xiao, Multi-channel convolutional neural network for twitter
     emotion and sentiment recognition, in: Proceedings of the 2019 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long and Short Papers), 2019, pp. 1355–1365.
 [2] C. Diamantini, A. Mircoli, D. Potena, E. Storti, Automatic annotation of corpora for
     emotion recognition through facial expressions analysis, 2020, pp. 5650–5657. URL:
     https://www.scopus.com/inward/record.uri?eid=2-s2.0-85110414516&doi=10.1109%2fICPR48806.2021.9413311&partnerID=40&md5=25ad1cb30d9f7ccda4c7854507e70429.
     doi:10.1109/ICPR48806.2021.9413311.
 [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-
     sukhin, Attention is all you need, Advances in neural information processing systems 30
     (2017).
 [4] A. F. Adoma, N.-M. Henry, W. Chen, Comparative analyses of bert, roberta, distilbert, and
     xlnet for text-based emotion recognition, in: 2020 17th International Computer Conference
     on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), IEEE,
     2020, pp. 117–121.
 [5] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, S. Khudanpur, Recurrent neural network
     based language model., in: Interspeech, volume 2, Makuhari, 2010, pp. 1045–1048.
 [6] R. C. Staudemeyer, E. R. Morris, Understanding lstm–a tutorial into long short-term
     memory recurrent neural networks, arXiv preprint arXiv:1909.09586 (2019).
 [7] X. Qin, Z. Wu, J. Cui, T. Zhang, Y. Li, J. Luan, B. Wang, L. Wang, Bert-erc: Fine-tuning
     bert is enough for emotion recognition in conversation, arXiv preprint arXiv:2301.06745
     (2023).
 [8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
     neural information processing systems 33 (2020) 1877–1901.
 [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[10] L. Schoneveld, A. Othmani, H. Abdelkawy, Leveraging recent advances in deep learning
     for audio-visual emotion recognition, Pattern Recognition Letters 146 (2021) 1–7.
[11] W. Zheng, H. Tang, T. S. Huang, Emotion recognition from non-frontal facial images,
     Emotion Recognition: A Pattern Analysis Approach (2015) 183–213.
[12] A. Mircoli, G. Cimini, Automatic extraction of affective metadata from videos through
     emotion recognition algorithms, Communications in Computer and Information Science
     909 (2018) 191–202. doi:10.1007/978-3-030-00063-9_19.
[13] E. Batbaatar, M. Li, K. H. Ryu, Semantic-emotion neural network for emotion recognition
     from text, IEEE access 7 (2019) 111866–111878.
[14] N. Alswaidan, M. E. B. Menai, A survey of state-of-the-art approaches for emotion
     recognition in text, Knowledge and Information Systems 62 (2020) 2937–2987.
[15] I. Shahin, A. B. Nassif, S. Hamsa, Emotion recognition using hybrid gaussian mixture
     model and deep neural network, IEEE access 7 (2019) 26777–26787.
[16] P. Ekman, An argument for basic emotions, Cognition & emotion 6 (1992) 169–200.
[17] R. Plutchik, Emotions: A general psychoevolutionary theory, Approaches to emotion 1984
     (1984) 197–219.
[18] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
     vector space, arXiv preprint arXiv:1301.3781 (2013).
[19] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL:
     https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[20] A. Chiorrini, C. Diamantini, A. Mircoli, D. Potena, Emotion and sentiment analysis of
     tweets using bert, in: EDBT/ICDT Workshops, 2021.
[21] M. Polignano, P. Basile, M. De Gemmis, G. Semeraro, V. Basile, Alberto: Italian bert
     language understanding model for nlp challenging tasks based on tweets, in: 6th Italian
     Conference on Computational Linguistics, CLiC-it 2019, volume 2481, CEUR, 2019, pp.
     1–6.
[22] A. Chiorrini, C. Diamantini, A. Mircoli, D. Potena, E. Storti, Emotionalberto: Emotion
     recognition of italian social media texts through bert, in: 2022 26th International
     Conference on Pattern Recognition (ICPR), 2022, pp. 1706–1711.
     doi:10.1109/ICPR56361.2022.9956403.
[23] V. Basile, M. Nissim, Sentiment analysis on italian tweets, in: Proceedings of the 4th
     Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media
     Analysis, 2013, pp. 100–107.
[24] R. Sprugnoli, Multiemotions-it: A new dataset for opinion polarity and emotion analysis for
     italian, in: 7th Italian Conference on Computational Linguistics, CLiC-it 2020, Accademia
     University Press, 2020, pp. 402–408.