
EmotivITA at EVALITA2023: Overview of the Dimensional and Multidimensional Emotion Analysis Task

Giovanni Gafà

Francesco Cutugno

Marco Venuti

University of Catania, Italy; University of Naples Federico II, Italy

EmotivITA is the first shared task for Italian Dimensional and Multidimensional Emotion Analysis, aiming to promote research in the field of emotion detection within the Italian language. We developed an Italian dataset annotated following the dimensional model of emotions and invited participants to submit systems to predict the Valence, Arousal, and Dominance associated with sentences in the corpus. Five runs were submitted by two teams. We present the dataset, the evaluation methodology, and the approaches of the participating systems.

Keywords: emotion analysis, emotion detection, VAD model, dataset, EmoITA, EmotivITA, Evalita 2023

1. Introduction and Motivation

In the last two decades, the analysis of emotions that people express in texts has become an essential area in Natural Language Processing (NLP). Such an interest springs from the awareness of the crucial role feelings have in our cognition: being able to detect and eventually simulate them could be a fundamental step to produce human-like forms of artificial intelligence [1]. For a review on possible applications of Emotion Analysis (EA), ranging from stock market predictions to the management of catastrophic events, see for example [2].

Taking into account the somewhat uncertain terminology about human feelings occasionally found in the literature (see below), we start by defining some terms. Adopting a well-known typology of affective states by Scherer [3, pp. 140–141], we use the word 'emotion' to refer to a "relatively brief episode of synchronized responses by all or most organismic subsystems to the evaluation of an external or internal event as being of major significance", whereas 'sentiments', like Scherer's 'attitudes', are "relatively enduring, affectively colored beliefs, preferences, and predispositions toward objects or persons".

Sentiment analysis has been a major interest for computational linguistics for a long time, and, over the years, it moved from the prediction of the semantic polarity towards more fine-grained modeling, as is the case in Aspect-based Sentiment Analysis [4] and Stance Detection [5]; similar studies have been conducted on Italian texts as well [6, 7].

Recently, EA started receiving more and more attention as well. Several models of emotion proposed in psychology have been used in NLP, either categorical or dimensional. The former consider feelings as discrete, and usually identify a small set of basic emotions upon which other, more subtle and complex affective states are built; the widely adopted model conceived by Ekman [8], for instance, proposes six fundamental emotions. The latter, on the contrary, describe emotions by combining a limited number of independent dimensions in a real-valued vector space. The model proposed by Russell and Mehrabian [9], probably the best-known, recognizes three dimensions: Valence (measuring pleasure or displeasure), Arousal (degree of excitement or calm), and Dominance (level of control over the situation). This is the VAD model.

Categorical models have some advantages over dimensional ones, as they allow the identification of several emotions in the same input and usually have simpler interpretations. Nevertheless, they have been criticized for their use of culture- and language-specific labels [10]; besides, different categorical models adopt different sets of emotions, making it difficult to compare studies. Concerning dimensional models, the independence of the three dimensions is yet to be ascertained [11, 12]; however, dimensional models allow easier comparisons between emotions and can describe feelings that are difficult to label.

At SemEval, the most renowned evaluation campaign of NLP, the first shared task concerning emotion detection (for three languages: English, Arabic and Spanish) was proposed in 2018 [13]. Building on earlier works, a 22,000 tweet dataset was annotated for many different affect states, following both the categorical and dimensional models of emotions (limited to the Valence dimension; see footnote 1); the sub-tasks involved emotion classification and emotion regression.

Another task of emotion classification was proposed at SemEval 2019 [14], this time leveraging a dataset containing roughly 3,000 short conversations annotated for the presence of four emotions; the purpose was to study and exploit the role of context in facilitating emotion detection.

However, EA has not yet received in Italy the same amount of interest it gained at the international level. This is probably due to the lack of resources annotated for emotions. After some investigations, we could find just a few lexica [15, 16, 17]; some are not open to the public [18] or are quite specific in scope [19]; others are the result of automatic translations from English of existing vocabularies, and have not been re-annotated by Italian speakers [20, 21]. This situation worsens when it comes to datasets, where to the best of our knowledge only domain-specific resources are available [22, 23, 24]. Another dataset [25] has been proposed at Evalita 2023 [26], containing social media messages about TV shows, TV series, music videos, and advertisements, which had been labeled following the Plutchik model of emotions [27].

As we tried to outline, existing datasets for EA in Italian are scarce and quite specialized. Moreover, the emotion formats used for annotating the corpora are exclusively categorical. Nevertheless, dimensional models are receiving increasing attention in tasks of emotion detection [28, 29]. By proposing the EmotivITA shared task at the Evalita 2023 evaluation campaign, we aim, on the one hand, to provide a new, general-purpose resource for EA in Italian, with labeling provided by Italian speakers: EmoITA, a dataset composed of a genre- and domain-balanced selection of more than 10,000 written sentences, annotated following the dimensional model of emotions; on the other hand, we intend to promote dimensional and multidimensional EA in Italian.

The rest of the paper is organized as follows: Section 2 provides a definition of the task; Section 3 describes the dataset made available to participants, and the process of its creation; Section 4 details the official evaluation measures; Section 5 reports the results obtained by participating teams; Section 6 discusses the results; in Section 7 we draw some conclusions on the outcomes of the task.

EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT. Corresponding author. giovanni.gafa@phd.unict.it (G. Gafà); cutugno@unina.it (F. Cutugno); marco.venuti@unict.it (M. Venuti). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1 As a case in point of inaccuracy when dealing with emotion-related terms, Valence was regarded as an equivalent of 'sentiment' throughout the study.

2. Definition of the task

The EmotivITA shared task consists of automatically annotating for emotions in the VAD model a collection of written sentences from a genre-balanced dataset translated into Italian. More specifically, the task has been organized into two sub-tasks whose results will be evaluated separately:

• Sub-task A: Dimensional emotion regression. Prediction of Valence, Arousal, and Dominance values based on a set of Italian sentences and annotations, using only the target annotated dimension for training. For instance, when predicting Valence, participant systems may only use Valence values annotated in the dataset for training; the same holds for Arousal and Dominance.
• Sub-task B: Multidimensional emotion regression. Prediction of Valence, Arousal, and Dominance values based on a set of Italian sentences and annotations, using all mentioned dimensions for training. Participant systems should therefore determine Valence, Arousal, and Dominance simultaneously, using values from the three dimensions for training.

Both sub-tasks are regression problems, so participating teams were asked to provide in the output the sentence id and three real numbers between 1 and 5, relative to the three predicted dimensions. Sub-task B intends to study and exploit potential correlations between Valence, Arousal, and Dominance, which have been discussed in the literature (see § 1).

Participants could carry out either both sub-tasks or only one of them, even if participation in sub-task A was strongly recommended, in order to have a common basis for comparison. Each participating team was allowed to submit a maximum of 2 runs for each sub-task. All runs could be produced according to the 'constrained' or 'unconstrained' modality (or both); however, we asked participants to specify the type of run. In constrained modality, only annotated data distributed by the organizers could be used for training and tuning the systems. Other linguistic resources (e.g., word embeddings and lexicons) were instead allowed. In unconstrained modality, annotated external data could also be employed and had to be described in the system reports.

3. Dataset

As mentioned above, the data released for the shared task derive from the Italian translation of an existing dataset, EmoBank [30]. EmoBank is the largest genre-balanced English dataset annotated employing the VAD model of emotions. As shown in Table 1, it mainly consists of the MASC: Manually Annotated Sub-Corpus of the American National Corpus [31], with roughly 10% of the sentences coming from the dataset of SemEval-2007 Task 14 [32].

The 10,062 sentences were originally annotated by English native speakers according to two different perspectives: the emotion they felt the writer meant to express, and the emotion evoked in an average reader.

At first, the Italian version of the dataset was studied as part of a Master's degree thesis discussed in 2022 at the Department of Humanities of the University of Catania. In this context, the sentences were initially translated automatically to Italian using the neural machine translation service offered by Microsoft Azure. As we were not satisfied with the results, a manual revision was performed splitting the corpus evenly between nine Italian native speakers, researchers in linguistics affiliated with the Interdepartmental Research Center Urban/Eco at the University of Naples Federico II.

We also conducted a pilot study asking two of the participants to independently annotate VAD values from the reader's perspective for a small sample of sentences (150). We chose the reader's perspective because, according to Buechel and Hahn, it yields better inter-annotator agreement (IAA). For annotation, we used the Self-Assessment Manikin (SAM), a pictographic scale to assess emotional response [33, 34] already adopted for EmoBank. SAM consists of three sets of anthropomorphic cartoons displaying differences in Valence, Arousal, and Dominance values, respectively, as shown in Figure 1.

Figure 1: The SAM scales for VAD values. Dimensions (Valence, Arousal and Dominance) are reported in rows, values (from 1 to 5) in columns. Copyright of SAM by Peter J. Lang 1994.

We asked participants to attribute a value between 1 (minimum Valence, Arousal, and Dominance) and 5 (maximum Valence, Arousal, and Dominance), with 4 intermediate steps of 0.5. This results in a 9-point scale like the one originally proposed by Bradley and Lang (Buechel and Hahn preferred a 5-point scale). Instructions were adapted from those used for EmoBank and are available for further analysis upon request.

To measure IAA we used Pearson's correlation coefficient (r) and Mean Absolute Error (MAE), as other metrics like Cohen's k are not designed for scale variables (see § 4). We obtained encouraging scores in both measures for all three dimensions, with an average of 0.593 for r, indicating a large effect (see Table 2).

Table 2: IAA for the three dimensions in the pilot study.

        V       A       D       Average
r       0.794   0.552   0.676   0.593
MAE     0.357   0.900   0.583   0.613

Therefore, we decided to ask all participants to annotate the remaining sentences individually (one annotator per sentence). We then used the new labeling to fine-tune several transformer models for dimensional EA, but the scores were significantly worse than those obtained with the original values from the EmoBank dataset (MAE was between 2 and 3 times higher, r for Valence and Arousal was respectively 33% and 13% lower). This was probably due to the lack of consistency from having a single annotation for a sentence.

Moreover, we reviewed the manual revisions of the translations and found that, in at least half of the cases, the quality was still poor, either because the translated sentence did not feel natural in Italian or because it contained some kind of error.

To produce EmoITA, we resolved to start over the entire process, only keeping the approximately 5,000 translations we considered good enough. This time, we chose 16 students from the Master's Degree in Foreign Languages at the University of Catania. All of them are Italian native speakers and are specializing in English. The sentences were split among the participants: we asked them to revise the 5,000 translations we kept from previous work and to propose new translations for the rest of the corpus. The same group of subjects also labeled each Italian sentence, and we took care never to ask a participant to annotate a sentence they had translated. Overall, we obtained 7 different annotations for each sentence, and we judge the quality of the translation to be now satisfactory, if not perfect.

To evaluate the annotations, we proceeded similarly to the original EmoBank study: we calculated r and MAE between each individual series of annotations and the aggregated values in EmoITA, and then averaged those values for each dimension (see Table 3).

The values of r indicate a large effect in every dimension, particularly for Valence. Correlation is a little higher in Dominance than in Arousal, as per our pilot study: this is somewhat unusual, as in most research we analyzed regarding the English language the opposite is true. MAE is not as good, but still acceptable (10% of the 1-5 scale).

Overall, scores are in line with those of EmoBank (r= 0.634 and MAE= 0.386, on average). They could probably get better analyzing outliers and excluding some of the annotations whose disagreement is particularly strong, a process we have not yet started at this time.
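The agreement computation described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the script used for EmoITA: the function names and the toy ratings are hypothetical, every annotator is assumed to have rated the same sentences (in EmoITA each sentence received 7 annotations from a larger pool), and the aggregated value is assumed to be the plain mean of the individual ratings.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mae(x, y):
    """Mean Absolute Error between two equal-length series."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def agreement(annotations):
    """Average r and MAE of each annotator against the aggregated ratings.

    `annotations` maps a (hypothetical) annotator name to the list of ratings
    given for one dimension; aggregation by mean is an assumption.
    """
    series = list(annotations.values())
    aggregated = [mean(ratings) for ratings in zip(*series)]
    rs = [pearson_r(s, aggregated) for s in series]
    maes = [mae(s, aggregated) for s in series]
    return mean(rs), mean(maes)


# Toy Valence ratings on the 1-5 scale with 0.5 steps (invented values).
toy = {
    "ann1": [1.0, 2.5, 4.0, 5.0],
    "ann2": [1.5, 2.0, 4.5, 4.5],
    "ann3": [1.0, 3.0, 3.5, 5.0],
}
avg_r, avg_mae = agreement(toy)
print(f"average r = {avg_r:.3f}, average MAE = {avg_mae:.3f}")
```

The same two helper functions also cover the evaluation setting, where one series holds the gold annotations and the other the values predicted by a system.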

For the shared task the dataset was randomly split into a development and a test set of 8,000 and 2,062 sentences respectively (79.5% and 20.5%), taking care to preserve the genre distribution in the corpus (with a 1% tolerance). The development set was provided as a UTF-8, comma-separated (CSV) file, reporting the following fields:

id, text, V, A, D

where:

1. 'id' denotes the unique identifier of the sentence.
2. 'text' denotes the text of the sentence.
3. 'V' denotes the average Valence value annotated for the sentence.
4. 'A' denotes the average Arousal value annotated for the sentence.
5. 'D' denotes the average Dominance value annotated for the sentence.

See Table 4 for a couple of examples.

The test set followed the same format, but labels for Valence, Arousal and Dominance were not provided.

4. Evaluation Measures

The two sub-tasks are evaluated separately comparing results obtained by participant systems with the gold standard annotations of the test set. Both constrained and unconstrained runs for a sub-task are reported in the same ranking, but we specify the type of the run.

Evaluation metrics for both sub-tasks are the standard metrics known in the literature for emotion regression that we already mentioned throughout this paper: we measure agreement with the gold annotations based on r and MAE. The first metric estimates linear dependence between two series of data points: $x = x_1, \ldots, x_n$ and $y = y_1, \ldots, y_n$. In our case, x corresponds to the values annotated in our dataset for each dimension and y to those predicted by participant systems. The formula for r is as follows:

$$ r(x, y) := \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1} $$

where $\bar{x}$ and $\bar{y}$ are respectively the mean value of x and y.

MAE is a measure of errors between a couple of observations describing the same phenomenon (in this case the annotated values of a certain emotional dimension in the dataset, and those predicted). The formula for MAE is as follows:

$$ \mathrm{MAE}(x, y) := \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i| \tag{2} $$

The baselines for both sub-tasks have been built by fine-tuning for regression a BERT model available on HuggingFace (see footnote 2), with a learning rate of 1e-05.

2 https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased, last access 06-20-2023.

5. Results

We received submissions from two teams. Both of them participated in sub-task B, and only one in sub-task A. In total, 5 runs were submitted, constrained and unconstrained. In Table 5 we report the results for r and MAE in sub-task A, in Table 6 those relative to sub-task B, along with our baselines. We appended a suffix to distinguish the ID of the submitted run and another one to identify constrained ('_C_') and unconstrained ('_U_') runs.

Regarding sub-task A, the ISTC-CNR team obtained the best r score in the Valence dimension with its second run. However, our baseline had better results in every other dimension and metric.

Concerning sub-task B, the team extremITA achieved the best results in all metrics and dimensions with their second run, with the exception of Arousal and Dominance's r, where our baseline performed slightly better.

6. Discussion

The teams of the EmotivITA challenge were invited to describe their solution in a technical report; in this section we compare participant systems based on their architectures.

The ISTC-CNR team proposed a method based on Natural Language Inference (NLI). More specifically, they used a multilingual MNLI-XLM-RoBERTa model grounded on XLM-RoBERTa [35], which was fine-tuned on a version of the MNLI dataset [36] automatically translated to Italian. The model was adapted for the regression task by replacing its last linear layer. During training, sentences from the EmoITA dataset were used as premises. Then, for sub-task A, three different models were conceived, with three different prompts acting as hypotheses for the NLI process and targeting the VAD dimensions. The prompt for Valence was "quanta positività esprime la frase?" (how much positivity does the sentence convey?), the one for Arousal "quanto è eccitante la frase?" (how exciting is the sentence?) and the one for Dominance "quanto è controllata l'emozione" (how controlled is the emotion?). For sub-task B, a single model was used adopting the prompt "valence, arousal, dominance dell'emozione?" (valence, arousal, and dominance of the emotion?). The two runs submitted for sub-task A differ in that the first one only utilized 99% of the training set made available, while the second one utilized it entirely. As we can see in Table 5, the results were better with this last configuration. The only run submitted by the team for sub-task B exploited the entire training set. All runs were produced according to the constrained modality.

The extremITA team only participated in sub-task B, with two unconstrained runs. Both their models were trained on the union of all the datasets in the shared tasks at EVALITA 2023. The first one adopts an Encoder-Decoder architecture based on IT5 [37], a T5 model [38] pre-trained on Italian texts. The model was fine-tuned concatenating the name of the shared task as a prefix, followed by an input sentence from the EmoITA dataset. The output, in the case of the EmotivITA task, was constituted by the predicted VAD values. A similar approach was used for every task in EVALITA 2023. The second architecture is a Decoder that adopts instruction-tuning, based on a large language model, LLaMA [39]. The model was trained using Low-Rank Adaptation [40] on Italian translations of the instructions originally developed for Alpaca [41], which also builds on LLaMA. It was then fine-tuned using instructions specific to the addressed EVALITA task. In the case of EmotivITA sub-task B, the sentence from the EmoITA dataset was paired with a prompt in the form of the instruction: "Scrivi quanta valenza è espressa in questo testo su una scala da 1 a 5, seguito da quanto stimolo è espresso in questo testo su una scala da 1 a 5, seguito da quanto controllo è espresso in questo testo in una scala da 1 a 5" (Rate how much valence is expressed in this text on a scale from 1 to 5, followed by how much arousal is expressed in this text on a scale from 1 to 5, followed by how much dominance is expressed in this text on a scale from 1 to 5). This second model obtained generally better performance than the first one, as shown in Table 6, but it also demanded 144 hours of training (on the entire EVALITA dataset), whereas the one based on IT5 only required 12 hours.

Quite interestingly, the model proposed by the ISTC-CNR team and the second one proposed by the extremITA team both leverage prompting in natural language and no task-specific architectural designs (with the exception of the replacement of the last layer in the MNLI-XLM-RoBERTa model), which supports the efficacy of this approach. On the other hand, one could argue that the main limitation of the ISTC-CNR method was precisely the chosen prompts, as concepts like Valence, Arousal and Dominance are not easy to describe. When evaluating the extremITA proposal, instead, one could wonder about the sustainability of a 144-hour training process.

In any case, we observe that the baselines obtained by fine-tuning the BERT model were not outperformed by the proposed systems: maybe the upper limit for the regression problem with such a large dataset as EmoITA has been reached, at least for the moment. It is also worth mentioning that the scores are in line with those of the study representing the state-of-the-art [42] for the original English dataset, EmoBank, which obtained values of 0.838, 0.573 and 0.536 for r in the three dimensions.

One last remark is due: neither team explored the possible relations between the three emotion dimensions, which was actually one of the purposes of sub-task B and remains a subject for future studies.
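As a possible starting point for such a study, the pairwise correlations between the annotated dimensions could be inspected directly on a file in the id, text, V, A, D format described in Section 3. The sketch below is only illustrative: the rows are invented toy data rather than sentences from EmoITA, and the function names are hypothetical.

```python
import csv
import io
import math
from itertools import combinations
from statistics import mean

# Invented rows in the id, text, V, A, D format (not taken from EmoITA).
TOY_CSV = """id,text,V,A,D
s1,"Che bella giornata!",4.5,3.5,3.5
s2,"Ho perso il treno.",2.0,3.0,2.5
s3,"La riunione è alle dieci.",3.0,2.0,3.0
s4,"Abbiamo vinto la finale!",4.5,4.5,4.0
"""


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def dimension_correlations(csv_text):
    """Return Pearson's r for each pair of dimensions among V, A and D."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    dims = ("V", "A", "D")
    columns = {d: [float(row[d]) for row in rows] for d in dims}
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for a, b in combinations(dims, 2)
    }


correlations = dimension_correlations(TOY_CSV)
for (a, b), r in correlations.items():
    print(f"r({a}, {b}) = {r:.3f}")
```

A strong correlation between two dimensions in the training data would suggest that a multidimensional model could share information across them, which is the hypothesis sub-task B was designed to probe.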

7. Conclusion

We presented the first shared task on Dimensional and Multidimensional Emotion Analysis for Italian and discussed the development of the first dedicated Italian dataset, EmoITA, based on the VAD model. EmoITA was obtained by manual translation and annotation of the EmoBank dataset, performed by Italian native speakers. The participating systems leveraged NLI, the Encoder-Decoder architecture and Large Language Models to address the regression problems, obtaining results that are similar to those of the state-of-the-art for the English counterpart of the dataset.

We hope that the proposal of our task and the availability of a new Italian dataset for EA will foster studies in this relevant field of NLP. In this spirit, the development and test sets, the complete dataset (licensed under CC-BY-SA 4.0), and the scripts used for the baselines and for evaluation will be made available to the public soon; more details on EmotivITA can be found on the task website (see footnote 3).

3 Repository: https://github.com/GiovanniGafa/EmoITA. Website: https://sites.google.com/view/emotivita

3758/s13428-012-0314-x.

[13] S. Mohammad, F. Bravo-Marquez, M. Salameh, S. Kiritchenko, SemEval-2018 task 1: Affect in tweets, in: Proceedings of the 12th International Workshop on Semantic Evaluation, Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1–17. URL: https://aclanthology.org/S18-1001. doi:10.18653/v1/S18-1001.

[14] A. Chatterjee, K. N. Narahari, M. Joshi, P. Agrawal, SemEval-2019 task 3: EmoContext contextual emotion detection in text, in: Proceedings of the 13th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 39–48. URL: https://aclanthology.org/S19-2005. doi:10.18653/v1/S19-2005.

[15] O. Araque, L. Gatti, J. Staiano, M. Guerini, DepecheMood++: A Bilingual Emotion Lexicon Built Through Simple Yet Powerful Techniques, IEEE Transactions on Affective Computing 13 (2022) 496–507. URL: https://ieeexplore.ieee.org/document/8798675/. doi:10.1109/TAFFC.2019.2934444.

[16] M. Montefinese, E. Ambrosini, B. Fairfield, N. Mammarella, The adaptation of the Affective Norms for English Words (ANEW) for Italian, Behavior Research Methods 46 (2014) 887–903. URL: https://link.springer.com/10.3758/s13428-013-0405-3. doi:10.3758/s13428-013-0405-3.

[17] L. Passaro, L. Pollacci, A. Lenci, ItEM: A Vector Space Model to Bootstrap an Italian Emotive Lexicon, Second Italian Conference on Computational Linguistics CLiC-it 2015 II (2015).

[18] A. Bolioli, F. Salamino, V. Porzionato, Social Media Monitoring in Real Life with Blogmeter Platform, in: C. Battaglino, C. Bosco, E. Cambria, R. Damiano, V. Patti, P. Rosso (Eds.), Proceedings of the First International Workshop on Emotion and Sentiment in Social and Expressive Media: approaches and perspectives from AI (ESSEM 2013) A workshop of the XIII International Conference of the Italian Association for Artificial Intelligence (AI*IA 2013), Turin, Italy, December 3, 2013, volume 1096 of CEUR Workshop Proceedings, CEUR-WS.org, 2013, pp. 156–163. URL: http://ceur-ws.org/Vol-1096/paper12.pdf.

[19] E. Borelli, D. Crepaldi, C. A. Porro, C. Cacciari, The psycholinguistic and affective structure of words conveying pain, PLOS ONE 13 (2018) e0199658. URL: https://dx.plos.org/10.1371/journal.pone.0199658. doi:10.1371/journal.pone.0199658.

[20] S. M. Mohammad, P. D. Turney, Crowdsourcing a word-emotion association lexicon, Computational Intelligence 29 (2013) 436–465.

[21] S. M. Mohammad, Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words, in: Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL), Melbourne, Australia, 2018.

[22] F. Celli, G. Riccardi, A. Ghosh, CorEA: Italian news corpus with emotions and agreement, in: Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014 9-11 December 2014, Pisa, Pisa University Press, 2014. URL: http://clic2014.fileli.unipi.it/proceedings/Proceedings-CLICit-2014.pdf. doi:10.12871/CLICIT2014120.

[23] Z. Shibingfeng, F. Francesco, G. Federico, B.-C. Alberto, B. Paolo, P. Angelo, AriEmozione2.0, 2022. URL: https://zenodo.org/record/7097913. doi:10.5281/ZENODO.7097913.

[24] R. Sprugnoli, MultiEmotions-It: a New Dataset for Opinion Polarity and Emotion Analysis for Italian, in: J. Monti, F. dell'Orletta, F. Tamburini (Eds.), Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021, volume 2769 of CEUR Workshop Proceedings, CEUR-WS.org, Torino, 2020. URL: http://ceur-ws.org/Vol-2769/paper_08.pdf. doi:10.4000/books.aaccademia.8910.

[25] O. Araque, S. Frenda, D. Nozza, V. Patti, R. Sprugnoli, Emit at evalita2023: Overview of the categorical emotion detection in italian social media task, in: M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi (Eds.), Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[26] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[27] R. Plutchik, A General Psychoevolutionary Theory of Emotion, in: Theories of Emotion, Elsevier, 1980, pp. 3–33. URL: https://linkinghub.elsevier.com/retrieve/pii/B9780125587013500077. doi:10.1016/B978-0-12-558701-3.50007-7.

[28] R. Mukherjee, A. Naik, S. Poddar, S. Dasgupta, N. Ganguly, Understanding the role of affect dimensions in detecting emotions from tweets: A multi-task approach, CoRR abs/2105.03983 (2021). URL: https://arxiv.org/abs/2105.03983. arXiv:2105.03983.

[29] J. Wang, L.-C. Yu, K. R. Lai, X. Zhang, Dimensional sentiment analysis using a regional CNN-LSTM model, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 225–230. URL: https://aclanthology.org/P16-2037. doi:10.18653/v1/P16-2037.

[30] S. Buechel, U. Hahn, EmoBank: Studying the Impact of Annotation Perspective and Representation Format on Dimensional Emotion Analysis, in: M. Lapata, P. Blunsom, A. Koller (Eds.), Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, Association for Computational Linguistics, 2017, pp. 578–585. URL: http://aclweb.org/anthology/E17-2092. doi:10.18653/v1/E17-2092.

[31] N. Ide, C. Baker, C. Fellbaum, C. Fillmore, R. Passonneau, MASC: the manually annotated sub-corpus of American English, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), European Language Resources Association (ELRA), Marrakech, Morocco, 2008. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/617_paper.pdf.

[32] C. Strapparava, R. Mihalcea, SemEval-2007 task 14: Affective text, in: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 70–74. URL: https://aclanthology.org/S07-1013.

[33] M. M. Bradley, P. J. Lang, Measuring emotion: The self-assessment manikin and the semantic differential, Journal of Behavior Therapy and Experimental Psychiatry 25 (1994) 49–59. URL: https://linkinghub.elsevier.com/retrieve/pii/0005791694900639. doi:10.1016/0005-7916(94)90063-9.

[34] P. J. Lang, Behavioral treatment and bio-behavioral assessment: Computer applications, in: J. B. Sidowski, J. H. Johnson, T. A. Williams (Eds.), Technology in mental health care delivery systems, Norwood, NJ: Ablex Publishing, 1980, pp. 119–137.

[35] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.

[36] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. URL: https://aclanthology.org/N18-1101. doi:10.18653/v1/N18-1101.

[37] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for italian language understanding and generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.

[38] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2020. arXiv:1910.10683.

[39] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, 2023. arXiv:2302.13971.

[40] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.

[41] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford alpaca: An instruction-following llama model, https://github.com/tatsu-lab/stanford_alpaca, 2023.

[42] S. Park, J. Kim, S. Ye, J. Jeon, H. Y. Park, A. Oh, Dimensional emotion detection from categorical emotion, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 4367–4380. URL: https://aclanthology.org/2021.emnlp-main.358. doi:10.18653/v1/2021.emnlp-main.358.