<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>EVALITA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Media Task⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oscar Araque</string-name>
          <email>o.araque@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Frenda</string-name>
          <email>simona.frenda@unito.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachele Sprugnoli</string-name>
          <email>rachele.sprugnoli@unipr.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debora Nozza</string-name>
          <email>debora.nozza@unibocconi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <email>viviana.patti@unito.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Group, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università Bocconi</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Torino</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Parma</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>aequa-tech srl</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>8</volume>
      <abstract>
<p>The Emotions in Italian (EMit) task is the first edition of a shared task on emotion analysis and opinion mining in Italian messages at EVALITA 2023. EMit presents two subtasks: (i) Subtask A, an emotion detection challenge, and (ii) Subtask B, which introduces the novel problem of detecting the target of the expressed emotion. Additionally, EMit challenges systems with a thorough in-domain and out-of-domain evaluation, probing the generalization capabilities of the submitted solutions. Overall, 4 teams participated in Subtask A, achieving macro-averaged F-scores of 0.6028 and 0.4977 on the in-domain and out-of-domain sets, respectively. In Subtask B, one team participated, obtaining macro-averaged F-scores of 0.6459 on the in-domain set and 0.3223 on the out-of-domain set. The obtained results indicate that further work is needed to solve the task, opening new avenues of research.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion detection</kwd>
        <kwd>Emotion target detection</kwd>
        <kwd>User-generated contents</kwd>
        <kwd>Sentiment Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivations</title>
      <p>0000-0003-3224-0001 (O. Araque); 0000-0002-6215-3374
(S. Frenda); 0000-0001-6861-5595 (R. Sprugnoli);
0000-0002-7998-2267 (D. Nozza); 0000-0001-5991-370X (V. Patti)
and Spanish whereas the Emotion Detection task at TASS
2020 [4] and EmoEvalEs at IberLEF 2021 [5] were only on
Spanish tweets 1. Instead, EmoContext at SemEval 2019
[6] and EmotionX at the SocialNLP workshop in 2018
[7] and 2019 focused on the emotion classification of
dialogues in English. Last year, the Emotion Classification
shared task at WASSA 2022 dealt with a diferent genre
of text proposing the classification of emotions in essays
written in reaction to news articles [8].</p>
      <p>In this context, the EMit (Emotions in Italian) task aims at providing the first evaluation framework for emotion detection in Italian texts at EVALITA [9], offering novel annotated data to the community that will foster future research. EMit adopts a comprehensive emotion model that is complemented with additional annotations regarding the scope of opinions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <sec id="sec-2-1">
        <title>Subtasks</title>
        <p>EMit is organized according to two subtasks, thus offering participants different perspectives on opinion analysis.</p>
        <sec id="sec-2-1-1">
          <title>Subtask A: Emotion Detection (Main Task)</title>
          <p>The main proposed subtask is the detection of emotions in social media messages about TV shows and series broadcast by RAI (Radiotelevisione italiana, the national public broadcasting company of Italy), music videos and advertisements. Given a message, the system decides the emotions expressed in the message or the absence of emotions.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Subtask B: Target Detection</title>
          <p>The second subtask is about the detection of the target addressed by the author of the message: the topic or the direction. In each text, it is indicated whether this refers to what the broadcast is about (the topic) or whether it refers to something that is under the control of the broadcast itself (the direction). When the target of the post is the topic, the text addresses subjects such as events, issues discussed in the TV episode/music video/advertisement, or invited guests of a TV show. On the other hand, the target encoded as direction implies that the message describes the specific directors of the shows/series, the showmen/artists, fixed guests in the TV shows, reporters, or the show/series/music video/advertisement as such. Given a message, the system decides if the target of the message is related to topic, direction, both, or none of the two.</p>
        </sec>
        <p>Both subtasks are designed as multilabel classification problems. In this way, participating systems are required to provide as output the id of the message and all the predicted labels. It is worth mentioning that in Subtask A, the message may be classified as neutral, or as expressing one or more emotions. Thus, the provided labels are: neutral when the message does not express any emotion, the 8 main emotions defined by Plutchik in [10] (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), and the additional label love, one of the primary dyads in Plutchik’s wheel of emotions, being a combination of joy and trust. Therefore, a total of 10 labels are used for Subtask A. In Subtask B the message can be classified as addressing the topic, the direction, or both or neither, thus the provided labels are: topic and direction.</p>
        <p>Considering the specific attention on the entertainment sector, we designed Subtask B particularly around the events and players involved in such contents and in their creation. Indeed, the combination of the two subtasks allows going beyond the simple detection of emotions, identifying also whether the target of the affective comments about TV programs is related to the topic or to issues under the control of the broadcasting company (the direction). Such finer-grained information can be of great importance in real application domains, for artists or broadcasters evaluating the contents delivered, when the analysis of emotions in social media is used as a social signal of the emotional reactions of the Italian television audience. In other words, this would lead to the development of an Auditel of emotions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Datasets</title>
        <p>In order to evaluate the robustness of the models proposed by participants, in EMit we release two different test sets: (i) the in-domain dataset, which includes tweets of the same textual genre and subjects as the training set, and (ii) an additional out-of-domain set that is composed of social texts of different genres and subjects. It is important to note that user data is not disclosed, since all data has been anonymized by removing all personal information such as @usernames and generating new IDs for the texts coming from Twitter. In this way, we offer participants a cross-domain evaluation setting for both subtasks A and B. Table 1 summarizes the size and distribution of the datasets used in EMit 2023.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Size and distribution of the datasets used in EMit 2023.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Learning Set</th><th>Dataset</th><th>Total (approx.)</th></tr>
            </thead>
            <tbody>
              <tr><td colspan="3">Subtask A</td></tr>
              <tr><td>Train</td><td>In-domain</td><td>5,966</td></tr>
              <tr><td>Test 1</td><td>In-domain</td><td>1,000</td></tr>
              <tr><td>Test 2</td><td>Out-of-domain</td><td>1,000</td></tr>
              <tr><td colspan="3">Subtask B</td></tr>
              <tr><td>Train</td><td>In-domain</td><td/></tr>
              <tr><td>Test 1</td><td>In-domain</td><td/></tr>
              <tr><td>Test 2</td><td>Out-of-domain</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
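        <p>To make the multilabel output format concrete, the following is a minimal Python sketch of how a system could serialize its predictions for either subtask. The label names come from the task description above; the helper function and the tab/comma serialization are illustrative assumptions, not the official submission specification.</p>

```python
# Subtask A labels: Plutchik's 8 basic emotions, plus "love" (the joy+trust
# dyad) and "neutral" for messages expressing no emotion.
EMOTIONS_A = ["anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust", "love", "neutral"]
# Subtask B labels are non-exclusive: a message can have both, or neither.
TARGETS_B = ["topic", "direction"]

def to_submission_rows(predictions):
    """Turn {message_id: set(labels)} into tab-separated output rows.

    A message with no predicted emotion falls back to "neutral",
    mirroring the Subtask A definition (neutral = absence of emotions).
    """
    rows = []
    for msg_id, labels in sorted(predictions.items()):
        labels = labels or {"neutral"}
        rows.append(f"{msg_id}\t{','.join(sorted(labels))}")
    return rows

rows = to_submission_rows({"001": {"joy", "trust"}, "002": set()})
```

        <p>On this toy input, the second message carries no predicted emotion and is therefore serialized with the neutral label.</p>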
        <sec id="sec-2-2-1">
          <title>Dataset for in-domain evaluation.</title>
          <p>This dataset is obtained from Twitter and is composed of 6,966 tweets that discuss programs of the Italian RAI TV station. The messages have been grouped into 5 sets, each set annotated by three different annotators (for a total of 15 annotators) with a multi-layered annotation scheme. As described, the emotion layer consists of 10 labels: Plutchik’s emotions, love and neutral. These emotion annotations are used for running Subtask A.</p>
          <p>The emotion labels are non-exclusive, thus a given tweet can be annotated with one or more emotions, or even solely as neutral, as shown in the examples in Table 3. The number of tweets that express at least one emotion is 78% of all tweets, which is a fairly high coverage. Also, the number of tweets that express two or more emotions represents 19% of all tweets.</p>
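          <p>Statistics such as the 78% and 19% coverage figures reported above can be computed directly from the per-tweet label sets; the short Python sketch below illustrates the computation on an invented toy sample (the function name and data are illustrative, not part of the EMit tooling).</p>

```python
def emotion_coverage(annotations):
    """annotations: one label list per tweet ("neutral" marks no emotion).

    Returns (share of tweets with at least one emotion,
             share of tweets with two or more emotions)."""
    n = len(annotations)
    # Drop the "neutral" label so only actual emotions are counted.
    emotions = [[l for l in labels if l != "neutral"] for labels in annotations]
    at_least_one = sum(1 for e in emotions if len(e) >= 1) / n
    at_least_two = sum(1 for e in emotions if len(e) >= 2) / n
    return at_least_one, at_least_two

# Invented toy sample of four annotated tweets.
toy = [["joy"], ["neutral"], ["anger", "disgust"], ["trust"]]
one, two = emotion_coverage(toy)
```

          <p>On this toy sample, 3 of 4 tweets carry at least one emotion and 1 of 4 carries two or more.</p>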
          <p>On top of this, the dataset is annotated with the innovative layer concerning the target, including the topic (describing the events of the broadcast) and direction (whether messages are directed to a specific entity related to RAI) labels.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Target annotations for Subtask B.</title>
        <p>These annotations offer a novel perspective on the data, allowing participants and, in general, the EVALITA community, to explore the effectiveness of current models on such a subtask. In total, 84% of the tweets are annotated with the “topic” or “direction” labels, and 8% of the tweets have both labels. These annotations should be used for Subtask B.</p>
        <sec id="sec-2-3-1">
          <title>Dataset for out-of-domain evaluation.</title>
          <p>We provide as a second test set 1,000 out-of-domain instances for both subtasks A and B. This additional dataset is composed of comments on music videos and advertisements posted on YouTube. The selection of the videos followed the same procedure used for the creation of the MultiEmotions-It dataset [11]. Specifically, the videos were manually chosen from the songs of the Sanremo Music Festival 2021 and from the most recent advertisements, covering different types of products and services. The annotation was performed manually using the same approach as for the in-domain dataset. Examples are given in Table 4. In this way, we propose the use of data from a variety of sources that do not directly address RAI contents, but describe other audiovisual media. As a summary, Table 2 shows the arrangement of the proposed datasets for subtasks A and B, with the detail for each class.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <sec id="sec-3-1">
        <p>In EMit 2023, participants are allowed to submit up to 2 runs for each subtask, with a mandatory run for the main Subtask A. The first run is required to be a constrained submission. That is, the only annotated data to be used for training and tuning the systems are those distributed by the organizers, with the exception of additional resources such as lexicons and word embeddings. On the contrary, the second run of each participant can be unconstrained, thus allowing participants to use additional training data.</p>
        <p>The performance of the systems is evaluated using the macro-averaged F1-score, which aggregates the classification metrics of each of the classes. Thus, in the official ranking, participants’ runs are ordered according to the aforementioned F-score.</p>
        <p>As baselines, we provide the results of three basic models. All these models compute different text representations that are fed to a logistic regression classifier. In this way, the baselines’ text representations are:</p>
        <p>• Baseline_OHE: uni- and bi-grams encoded with a one-hot schema, with a vocabulary of 5,000 tokens.</p>
        <p>• Baseline_TFIDF: uni- and bi-grams represented with the TF-IDF approach, again using a vocabulary of 5,000 tokens.</p>
        <p>Finally, we also consider the results of a simple random baseline, Baseline_random, that outputs the predictions for all classes following a uniform random distribution.</p>
        <sec id="sec-3-1-1">
          <title>5. Task Overview: Systems and Results</title>
          <p>In this first edition of EMit, very few teams participated in the competition. In particular, we received 1 submission by industry (App2Check) and 3 by academic teams (extremITA, ABCD, and EmotionHunters). Despite the few participants, the shared task also collected international interest, with the ABCD team coming from Vietnam. All 4 participating teams submitted at least one run for Subtask A, and just one team sent us predictions for Subtask B.</p>
          <sec id="sec-3-1-2">
            <title>5.1. Systems</title>
            <p>Attending to the various systems employed for the classification of emotions (multilabel) and target (binary), their design is based mainly on the use of Large Language Models (LLMs), confirming the current tendency towards, and success of, transformer-based models. However, they have been included in different architectures. The most used approach is supervised, with a predominance of fine-tuning of LLMs to address the specific classification task. Moreover, two teams also presented semi-supervised systems based specifically on few-shot prompting (extremITA and App2Check). Various LLMs are employed. For instance, some teams experimented with the classic BERT-based models for the Italian language (i.e., bert-base-italian-cased, bert-base-italian-xxl-cased, bert_uncased_L-12_H-768_A-12_italian_alberto, umberto-commoncrawl-cased-v1), others with already fine-tuned versions of BERT (i.e., feel-it-italian-emotion, polibert_sa), and the rest exploited sequence-to-sequence LLMs oriented, in this context, to perform mainly instruction solutions such as ChatGPT (gpt-3.5-turbo-0301), flan-t5-xl, mt5-base, IT5 (it5-efficient-small-el32) and the LLaMA foundational model (llama-7b-hf).</p>
            <p>In particular, EmotionHunters [12] performed a battery of experiments with classic BERT models and already fine-tuned versions of LLMs. The final system, selected on the basis of their experiments, is based on the fine-tuning of the AlBERTo model with, on top, a fully connected layer to provide a multilabel classification for each text. Both the ABCD [13] and App2Check [14] teams employed an ensemble of predictions of different LLMs, based on a soft voting method that considers the confidence score associated with each prediction (ABCD: run 1) and on the best top-performing model for each emotion (App2Check: unsubmitted run, which reported very good scores in both the in-domain setting, F1-score of 0.504, and the out-of-domain setting, F1-score of 0.518), looking at the performance on the development set of the two best implemented systems: A2C-mT5-r1 (App2Check, run 1) and A2C-GPT-r2 (App2Check, run 2). A2C-mT5-r1 is based on the fine-tuning of multilingual T5 employing the Simple Transformers library, while A2C-GPT-r2 is built using a few-shot approach with ChatGPT, prompted to simultaneously identify all emotions for each text input.</p>
            <p>A similar approach is used by extremITA [15], who employed sequence-to-sequence LLMs for Italian to solve instructions related to specific tasks.</p>
          </sec>
        </sec>
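        <p>The random baseline and the macro-averaged F1 ranking described in Section 4 can be sketched in pure Python as follows. This is an illustration, not the official evaluation script: the function names and toy gold/predicted labels are assumptions, and the one-hot and TF-IDF baselines (which additionally feed n-gram vectors into a logistic regression classifier, e.g. via scikit-learn) are omitted.</p>

```python
import random

def random_baseline(n_messages, labels, seed=0):
    """Baseline_random: predict each label independently with probability 0.5."""
    rng = random.Random(seed)
    return [{l for l in labels if rng.random() >= 0.5} for _ in range(n_messages)]

def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over per-class binary decisions.

    Each class gets its own F1 (taken as 0 when the class has no true
    positives); the macro average is their unweighted mean."""
    per_class = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(per_class) / len(per_class)

# Toy Subtask B example: gold vs. predicted target labels for three messages.
gold = [{"topic"}, {"direction"}, {"topic", "direction"}]
pred = [{"topic"}, {"topic"}, {"topic", "direction"}]
score = macro_f1(gold, pred, ["topic", "direction"])
```

        <p>On the toy example, the topic class reaches an F1 of 0.8 and the direction class 2/3, so the macro average rewards systems that do well on every class, regardless of class frequency.</p>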
      </sec>
      <sec id="sec-3-2">
        <p>They developed two systems to solve different shared tasks of EVALITA 2023: extremIT5 (extremITA, run 1) and extremITLLaMA (extremITA, run 2). The former is an Encoder-Decoder model based on IT5, trained by concatenating the task name and an example as input (i.e., “EMit: Quando ci sarà l’espulsione di Claudia #ilcollegio [url]”) and producing as output the sequence of labels; in contrast, the latter is an instruction-tuned Decoder model built upon the LLaMA foundational models, therefore the structured prompt is an instruction in natural language like “Which emotions are expressed in this text? You can choose among joy, fear, ...”. Differently from the previous editions of EVALITA, in the EMit 2023 shared task it is clear that the attention is only on the LLMs’ ability to solve tasks and on their integration into the systems’ architecture, losing focus on linguistic features that can represent or infer the emotions in the text. Also, the preprocessing of the text is limited to very few steps, regarding mainly the transformation of emojis into textual descriptions and the removal of mentions, urls and other symbols.</p>
        <sec id="sec-3-2-1">
          <title>5.2. Results</title>
          <p>Tables 5, 6, 7, and 8 report the official results obtained in EMit 2023 for both subtasks A and B. The ranking is based on the macro-averaged F1-score, and considers both the team and the run of each submission. The highest scores in each column are marked in bold, while the lowest scores are underlined.</p>
          <p>Generally, it is interesting to see that even if the classification problems of Subtasks A and B are very different, the best results for each are similar. Concretely, when considering the in-domain test set, the best submission for Subtask A obtained a macro-averaged score of 0.6028, while for Subtask B it is 0.6459. In the case of the out-of-domain evaluation, the best score obtained by a team is 0.4977 in Subtask A, and 0.4448 in Subtask B. This decrease in classification performance when comparing in-domain and out-of-domain evaluations was expected, given that training was performed only on the in-domain data. Additionally, it is worth noticing that even if Subtask A contains 10 possible labels and Subtask B has only 2, their best scores are not that different (a difference of 0.0431).</p>
          <p>In relation to the overall results achieved by participants, it can be seen that in Subtask A, in both the in-domain and out-of-domain evaluations, the teams’ submissions obtained better results than the baselines. The best baseline in the in-domain evaluation uses the TF-IDF uni- and bi-grams, while for the out-of-domain evaluation the uni- and bi-grams using one-hot encoding achieve the best result. Regarding Subtask B, the only team that submitted a run for it obtained a better score than the baselines in the in-domain evaluation.</p>
          <p>In contrast, when considering the out-of-domain evaluation in Subtask B, we see that the best baseline is the one that randomly predicts the target labels. This decrease in classification performance is seen in the runs but also in the learning-based baselines. This may be explained by considering the distribution of the out-of-domain sets in Subtask B (see Table 2). Indeed, we can observe that in the train and in-domain test sets the prevalent label is Topic but, conversely, in the out-of-domain test set the Direction label is more frequent. Consequently, it is possible to postulate that systems trained with the Subtask B training set would perform fairly well on the in-domain test set, but worse on the out-of-domain data.</p>
          <p>Finally, the detailed results of the evaluation offer interesting insights into the models’ performance. For example, when considering the effect of the number of instances for each class (Table 2), we see that in Subtask A Fear is much less frequent than the other emotions. This has an effect on the performance of the systems: in the out-of-domain evaluation (Table 6) the majority of the models obtained a null score in the Fear category, thus affecting the overall averaged score in a negative way. Similarly, the most common emotions in Subtask A (Trust and Neutral) are generally better predicted by the participants’ systems.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>6. Discussion</title>
          <p>The presence of both in-domain and out-of-domain data in the EMit task provides a valuable experimentation setting, as proved by the different classification performances between the two evaluation settings. Since these two types of datasets have been obtained from different sources (see Sect. 3), they represent a diverse collection of cases. In this way, we can evaluate the participants’ models in relation to their generalization capabilities.</p>
          <p>In fact, we observe a general reduction in the classification metrics when comparing the in-domain and out-of-domain test sets. In Subtask A, with the in-domain set, the average macro F-score of all participants’ systems is 0.4868. In comparison, the average metric drops to 0.4393 in the case of the out-of-domain dataset. We can see a similar trend when considering Subtask B, even if just one team participated. The average score in the in-domain evaluation is 0.6395 and, in the out-of-domain case, 0.3935.</p>
          <p>While participants have achieved promising results in the detection of emotions and opinion targets, there is still room for improvement. The large number of emotions considered in Subtask A is indeed a challenge for automatic systems, increasing the difficulty of the task. In comparison, Subtask B has fewer categories, but still, the proposed systems and baselines obtain rather low metrics in the task. Also, we have seen how the representation of the different emotions greatly impacts classification performance. These observations, along with the generalization difficulties in the out-of-domain set, indicate that the challenge proposed in EMit is not solved. Indeed, future works need to address the shortcomings detected and advance in the generation of systems that are more robust to the frequency of categories in the datasets, as well as in the inclusion of domain-specific knowledge that may improve overall results.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>7. Conclusions</title>
          <p>The first edition of EMit (Emotions in Italian) proposes the assessment of emotions in Italian texts by presenting an interesting challenge that revolves around two subtasks. On the one hand, the main task (Subtask A) presents a comprehensive emotion annotation set using Plutchik’s model, with the addition of the love emotion. On the other hand, Subtask B introduces a novel classification problem, which addresses the target of the opinion expressed in the textual message. To complement this, we also provide out-of-domain test sets to further obtain insights into the behaviour of the participants’ systems.</p>
          <p>To advance in the study of opinion mining in relation to emotion, and considering both subtasks, EMit establishes a rich annotation schema for considering the effect of this challenge on automated systems. While only one team participated in Subtask B, we believe that the additional perspectives brought by the combined study of emotions and their targets will be the subject of further studies. As an example, an interesting research avenue could study the variation of emotions depending on the target, and how this affects learning systems. Another potential research direction is the inclusion of linguistic knowledge into the commonly used large language models.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The work of Oscar Araque has been partially funded by the Spanish Ministry of Science, Innovation, and Universities through the project COGNOS (PID2019-105484RB-I00) and by “ETSI Telecomunicación” of “Universidad Politécnica de Madrid” through the initiative “Primeros Proyectos” under “AFRICA – Detecting and Analyzing Affective and Moral Factors in Radicalization and ExtremIsm: a MaChine learning Approach”. The work of S. Frenda and V. Patti was partially funded by the Multilingual Perspective-Aware NLU Project in partnership with Amazon Alexa. The work of D. Nozza was partially funded by Fondazione Cariplo (grant No. 2020-4288, MONICA).</p>
      <p>[15] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-Task Sustainable Scaling to Large Language Models at its Extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>