<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>EVALITA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Media Task⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oscar Araque</string-name>
          <email>o.araque@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Frenda</string-name>
          <email>simona.frenda@unito.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachele Sprugnoli</string-name>
          <email>rachele.sprugnoli@unipr.it</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debora Nozza</string-name>
          <email>debora.nozza@unibocconi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <email>viviana.patti@unito.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intelligent Systems Group, Universidad Politécnica de Madrid</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università Bocconi</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Torino</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Parma</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>aequa-tech srl</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>8</volume>
      <abstract>
<p>The Emotions in Italian (EMit) task is the first edition of a shared task on emotion analysis and opinion mining in Italian messages at EVALITA 2023. EMit presents two subtasks: (i) Subtask A, an emotion detection challenge, and (ii) Subtask B, which introduces the novel problem of detecting the target of the expressed emotion. Additionally, EMit challenges systems with a thorough in-domain and out-of-domain evaluation, probing the generalization capabilities of the submitted solutions. Overall, 4 teams participated in Subtask A, achieving macro-averaged F-scores of 0.6028 and 0.4977 on the in-domain and out-of-domain sets, respectively. In Subtask B, one team participated, obtaining macro-averaged F-scores of 0.6459 on the in-domain set and 0.3223 on the out-of-domain set. The obtained results indicate that further work is needed to solve the task, opening new avenues of research.</p>
      </abstract>
      <kwd-group>
        <kwd>Emotion detection</kwd>
        <kwd>Emotion target detection</kwd>
        <kwd>User-generated contents</kwd>
        <kwd>Sentiment Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivations</title>
      <p>0000-0003-3224-0001 (O. Araque); 0000-0002-6215-3374
(S. Frenda); 0000-0001-6861-5595 (R. Sprugnoli);
0000-0002-7998-2267 (D. Nozza); 0000-0001-5991-370X (V. Patti)
and Spanish whereas the Emotion Detection task at TASS
2020 [4] and EmoEvalEs at IberLEF 2021 [5] were only on
Spanish tweets 1. Instead, EmoContext at SemEval 2019
[6] and EmotionX at the SocialNLP workshop in 2018
[7] and 2019 focused on the emotion classification of
dialogues in English. Last year, the Emotion Classification
shared task at WASSA 2022 dealt with a diferent genre
of text proposing the classification of emotions in essays
written in reaction to news articles [8].</p>
      <p>In this context, the EMit (Emotions in Italian) task aims at providing the first evaluation framework for emotion detection in Italian texts at EVALITA [9], offering novel annotated data to the community that will foster future research. EMit adopts a comprehensive emotion model that is complemented with additional annotations regarding the scope of opinions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <sec id="sec-2-1">
        <title>Subtasks</title>
        <p>EMit is organized according to two subtasks, thus offering participants different perspectives on opinion analysis.</p>
        <sec id="sec-2-1-1">
          <title>Subtask A: Emotion Detection (Main Task)</title>
          <p>The main proposed subtask is the detection of emotions in social media messages about TV shows and series broadcast by RAI (Radiotelevisione italiana, the national public broadcasting company of Italy), music videos and advertisements. Given a message, the system decides the emotions expressed in the message or the absence of emotions.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Subtask B: Target Detection</title>
          <p>The second subtask is about the detection of the target addressed by the author of the message: the topic or the direction. In each text, it is indicated whether this refers to what the broadcast is about (the topic) or whether it refers to something that is under the control of the broadcast itself (the direction). When the target of the post is the topic, the text addresses subjects such as events, issues discussed in the TV episode/music video/advertisement, or invited guests of a TV show. On the other hand, the target encoded as direction implies that the message describes the specific directors of the shows/series, the showmen/artists, fixed guests in the TV shows, reporters, or the show/series/music video/advertisement as such. Given a message, the system decides if the target of the message is related to topic, direction, both, or none of the two.</p>
        </sec>
        <p>Both subtasks are designed as multilabel classification problems. In this way, participating systems are required to provide as output the id of the message and all the predicted labels. It is worth mentioning that in Subtask A, the message may be classified as neutral, or as expressing one or more emotions. Thus, the provided labels are: neutral when the message does not express any emotion, the 8 main emotions defined by Plutchik in [10] (anger, anticipation, disgust, fear, joy, sadness, surprise, trust), and the additional label love, one of the primary dyads in Plutchik’s wheel of emotions, being a combination of joy and trust. Therefore, a total of 10 labels are used for Subtask A. In Subtask B the message can be classified as addressing the topic, the direction, or both or neither, thus the provided labels are: topic and direction.</p>
        <p>Considering the specific attention on the entertainment sector, we designed Subtask B particularly around the events and players involved in such contents and in their creation. Indeed, the combination of the two subtasks allows going beyond the simple detection of emotions, identifying also whether the target of the affective comments about TV programs is related to the topic or to issues under the control of the broadcasting company (the direction). Such finer-grained information can be of great importance in real application domains, for artists or broadcasters evaluating the contents delivered, when the analysis of emotions in social media is used as a social signal of the emotional reactions of the Italian television audience. In other words, this would lead to the development of an Auditel of emotions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. Datasets</title>
        <p>In order to evaluate the robustness of the models proposed by participants, in EMit we release two different test sets: (i) the in-domain dataset, which includes tweets of the same textual genre and subjects as the training set, and (ii) an additional out-of-domain set that is composed of social texts of different genres and subjects. It is important to note that user data is not disclosed, since all data has been anonymized by removing all personal information such as @usernames and generating new IDs for the texts coming from Twitter. In this way, we offer participants a cross-domain evaluation setting for both subtasks A and B. Table 1 summarizes the size and distribution of the datasets used in EMit 2023.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Size and distribution of the datasets used in EMit 2023.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Learning Set</th><th>Dataset</th><th>Total (approx.)</th></tr>
            </thead>
            <tbody>
              <tr><td colspan="3">Subtask A</td></tr>
              <tr><td>Train</td><td>In-domain</td><td>5,966</td></tr>
              <tr><td>Test 1</td><td>In-domain</td><td>1,000</td></tr>
              <tr><td>Test 2</td><td>Out-of-domain</td><td>1,000</td></tr>
              <tr><td colspan="3">Subtask B</td></tr>
              <tr><td>Train</td><td>In-domain</td><td/></tr>
              <tr><td>Test 1</td><td>In-domain</td><td/></tr>
              <tr><td>Test 2</td><td>Out-of-domain</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
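        <p>To make the multilabel output format concrete, the following is a minimal Python sketch of how a system could serialize its predictions for either subtask. The label names come from the task description above; the helper function and the tab/comma serialization are illustrative assumptions, not the official submission specification.</p>

```python
# Subtask A labels: Plutchik's 8 basic emotions, plus "love" (the joy+trust
# dyad) and "neutral" for messages expressing no emotion.
EMOTIONS_A = ["anger", "anticipation", "disgust", "fear",
              "joy", "sadness", "surprise", "trust", "love", "neutral"]
# Subtask B labels are non-exclusive: a message can have both, or neither.
TARGETS_B = ["topic", "direction"]

def to_submission_rows(predictions):
    """Turn {message_id: set(labels)} into tab-separated output rows.

    A message with no predicted emotion falls back to "neutral",
    mirroring the Subtask A definition (neutral = absence of emotions).
    """
    rows = []
    for msg_id, labels in sorted(predictions.items()):
        labels = labels or {"neutral"}
        rows.append(f"{msg_id}\t{','.join(sorted(labels))}")
    return rows

rows = to_submission_rows({"001": {"joy", "trust"}, "002": set()})
```

        <p>On this toy input, the second message carries no predicted emotion and is therefore serialized with the neutral label.</p>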
        <sec id="sec-2-2-1">
          <title>Dataset for in-domain evaluation.</title>
          <p>This dataset is obtained from Twitter and is composed of 6,966 tweets that discuss programs of the Italian RAI TV station. The messages have been grouped into 5 sets, each set annotated by three different annotators (for a total of 15 annotators) with a multi-layered annotation scheme. As described, the emotion layer consists of 10 labels: Plutchik’s emotions, love and neutral. These emotion annotations are used for running Subtask A.</p>
          <p>The emotion labels are non-exclusive, thus a given tweet can be annotated with one or more emotions, or even solely as neutral, as shown in the examples in Table 3. The number of tweets that express at least one emotion is 78% of all tweets, which is a fairly high coverage. Also, the number of tweets that express two or more emotions represents 19% of all tweets.</p>
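          <p>Statistics such as the 78% and 19% coverage figures reported above can be computed directly from the per-tweet label sets; the short Python sketch below illustrates the computation on an invented toy sample (the function name and data are illustrative, not part of the EMit tooling).</p>

```python
def emotion_coverage(annotations):
    """annotations: one label list per tweet ("neutral" marks no emotion).

    Returns (share of tweets with at least one emotion,
             share of tweets with two or more emotions)."""
    n = len(annotations)
    # Drop the "neutral" label so only actual emotions are counted.
    emotions = [[l for l in labels if l != "neutral"] for labels in annotations]
    at_least_one = sum(1 for e in emotions if len(e) >= 1) / n
    at_least_two = sum(1 for e in emotions if len(e) >= 2) / n
    return at_least_one, at_least_two

# Invented toy sample of four annotated tweets.
toy = [["joy"], ["neutral"], ["anger", "disgust"], ["trust"]]
one, two = emotion_coverage(toy)
```

          <p>On this toy sample, 3 of 4 tweets carry at least one emotion and 1 of 4 carries two or more.</p>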
          <p>On top of this, the dataset is annotated with the innovative layer concerning the target, including the topic (describing the events of the broadcast) and direction (whether messages are directed to a specific entity related to RAI) labels.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Target annotations for Subtask B.</title>
        <p>These annotations offer a novel perspective on the data, allowing participants and, in general, the EVALITA community, to explore the effectiveness of current models on such a subtask. In total, 84% of the tweets are annotated with the “topic” or “direction” labels, and 8% of the tweets have both labels. These annotations should be used for Subtask B.</p>
        <sec id="sec-2-3-1">
          <title>Dataset for out-of-domain evaluation.</title>
          <p>We provide as a second test set 1,000 out-of-domain instances for both subtasks A and B. This additional dataset is composed of comments on music videos and advertisements posted on YouTube. The selection of the videos followed the same procedure used for the creation of the MultiEmotions-It dataset [11]. Specifically, the videos were manually chosen from the songs of the Sanremo Music Festival 2021 and from the most recent advertisements, covering different types of products and services. The annotation was performed manually using the same approach as for the in-domain dataset. Examples are given in Table 4. In this way, we propose the use of data from a variety of sources that do not directly address RAI contents, but describe other audiovisual media. As a summary, Table 2 shows the arrangement of the proposed datasets for subtasks A and B, with the detail for each class.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Evaluation</title>
      <sec id="sec-3-1">
        <p>In EMit 2023, participants are allowed to submit up to 2 runs for each subtask, with a mandatory run for the main Subtask A. The first run is required to be a constrained submission. That is, the only annotated data to be used for training and tuning the systems are those distributed by the organizers, with the exception of additional resources such as lexicons and word embeddings. On the contrary, the second run of each participant can be unconstrained, thus allowing participants to use additional training data.</p>
        <p>The performance of the systems is evaluated using the macro-averaged F1-score, which aggregates the classification metrics of each of the classes. Thus, in the official ranking, participants’ runs are ordered according to the aforementioned F-score.</p>
        <p>As baselines, we provide the results of three basic models. All these models compute different text representations that are fed to a logistic regression classifier. In this way, the baselines’ text representations are:</p>
        <p>• Baseline_OHE: uni- and bi-grams encoded with a one-hot schema, with a vocabulary of 5,000 tokens.</p>
        <p>• Baseline_TFIDF: uni- and bi-grams represented with the TF-IDF approach, again using a vocabulary of 5,000 tokens.</p>
        <p>Finally, we also consider the results of a simple random baseline, Baseline_random, that outputs the predictions for all classes following a uniform random distribution.</p>
        <sec id="sec-3-1-1">
          <title>5. Task Overview: Systems and Results</title>
          <p>In this first edition of EMit, very few teams participated in the competition. In particular, we received 1 submission by industry (App2Check) and 3 by academic teams (extremITA, ABCD, and EmotionHunters). Despite the few participants, the shared task also collected international interest, with the ABCD team coming from Vietnam. All 4 participating teams submitted at least one run for Subtask A, and just one team sent us predictions for Subtask B.</p>
          <sec id="sec-3-1-2">
            <title>5.1. Systems</title>
            <p>Attending to the various systems employed for the classification of emotions (multilabel) and target (binary), their design is based mainly on the use of Large Language Models (LLMs), confirming the current tendency towards, and success of, transformer-based models. However, they have been included in different architectures. The most used approach is supervised, with a predominance of fine-tuning of LLMs to address the specific classification task. Moreover, two teams also presented semi-supervised systems based specifically on few-shot prompting (extremITA and App2Check). Various LLMs are employed. For instance, some teams experimented with the classic BERT-based models for the Italian language (i.e., bert-base-italian-cased, bert-base-italian-xxl-cased, bert_uncased_L-12_H-768_A-12_italian_alberto, umberto-commoncrawl-cased-v1), others with already fine-tuned versions of BERT (i.e., feel-it-italian-emotion, polibert_sa), and the rest exploited sequence-to-sequence LLMs oriented, in this context, to perform mainly instruction solutions such as ChatGPT (gpt-3.5-turbo-0301), flan-t5-xl, mt5-base, IT5 (it5-efficient-small-el32) and the LLaMA foundational model (llama-7b-hf).</p>
            <p>In particular, EmotionHunters [12] performed a battery of experiments with classic BERT models and already fine-tuned versions of LLMs. The final system, selected on the basis of their experiments, is based on the fine-tuning of the AlBERTo model with, on top, a fully connected layer to provide a multilabel classification for each text. Both the ABCD [13] and App2Check [14] teams employed an ensemble of predictions of different LLMs, based on a soft voting method that considers the confidence score associated with each prediction (ABCD: run 1) and on the best top-performing model for each emotion (App2Check: unsubmitted run, which reported very good scores in both the in-domain setting, F1-score of 0.504, and the out-of-domain setting, F1-score of 0.518), looking at the performance on the development set of the two best implemented systems: A2C-mT5-r1 (App2Check, run 1) and A2C-GPT-r2 (App2Check, run 2). A2C-mT5-r1 is based on the fine-tuning of multilingual T5 employing the Simple Transformers library, while A2C-GPT-r2 is built using a few-shot approach with ChatGPT, prompted to simultaneously identify all emotions for each text input.</p>
            <p>A similar approach is used by extremITA [15], who employed sequence-to-sequence LLMs for Italian to solve instructions related to specific tasks.</p>
          </sec>
        </sec>
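        <p>The random baseline and the macro-averaged F1 ranking described in Section 4 can be sketched in pure Python as follows. This is an illustration, not the official evaluation script: the function names and toy gold/predicted labels are assumptions, and the one-hot and TF-IDF baselines (which additionally feed n-gram vectors into a logistic regression classifier, e.g. via scikit-learn) are omitted.</p>

```python
import random

def random_baseline(n_messages, labels, seed=0):
    """Baseline_random: predict each label independently with probability 0.5."""
    rng = random.Random(seed)
    return [{l for l in labels if rng.random() >= 0.5} for _ in range(n_messages)]

def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over per-class binary decisions.

    Each class gets its own F1 (taken as 0 when the class has no true
    positives); the macro average is their unweighted mean."""
    per_class = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(per_class) / len(per_class)

# Toy Subtask B example: gold vs. predicted target labels for three messages.
gold = [{"topic"}, {"direction"}, {"topic", "direction"}]
pred = [{"topic"}, {"topic"}, {"topic", "direction"}]
score = macro_f1(gold, pred, ["topic", "direction"])
```

        <p>On the toy example, the topic class reaches an F1 of 0.8 and the direction class 2/3, so the macro average rewards systems that do well on every class, regardless of class frequency.</p>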
      </sec>
      <sec id="sec-3-2">
        <p>They developed two systems to solve different shared tasks of EVALITA 2023: extremIT5 (extremITA, run 1) and extremITLLaMA (extremITA, run 2). The former is an Encoder-Decoder model based on IT5, trained by concatenating the task name and an example as input (i.e., “EMit: Quando ci sarà l’espulsione di Claudia #ilcollegio [url]”) and producing as output the sequence of labels; in contrast, the latter is an instruction-tuned Decoder model built upon the LLaMA foundational models, therefore the structured prompt is an instruction in natural language like “Which emotions are expressed in this text? You can choose among joy, fear, ...”. Differently from the previous editions of EVALITA, in the EMit 2023 shared task it is clear that the attention is only on the LLMs’ ability to solve tasks and on their integration into the systems’ architecture, losing focus on linguistic features that can represent or infer the emotions in the text. Also, the preprocessing of the text is limited to very few steps, regarding mainly the transformation of emojis into textual descriptions and the removal of mentions, urls and other symbols.</p>
        <sec id="sec-3-2-1">
          <title>5.2. Results</title>
          <p>Tables 5, 6, 7, and 8 report the official results obtained in EMit 2023 for both subtasks A and B. The ranking is based on the macro-averaged F1-score, and considers both the team and the run of each submission. The highest scores in each column are marked in bold, while the lowest scores are underlined.</p>
          <p>Generally, it is interesting to see that even if the classification problems of Subtasks A and B are very different, the best results for each are similar. Concretely, when considering the in-domain test set, the best submission for Subtask A obtained a macro-averaged score of 0.6028, while for Subtask B it is 0.6459. In the case of the out-of-domain evaluation, the best score obtained by a team is 0.4977 in Subtask A, and 0.4448 in Subtask B. This decrease in classification performance when comparing in-domain and out-of-domain evaluations was expected, given that training was performed only on the in-domain data. Additionally, it is worth noticing that even if Subtask A contains 10 possible labels and Subtask B has only 2, their best scores are not that different (a difference of 0.0431).</p>
          <p>In relation to the overall results achieved by participants, it can be seen that in Subtask A, in both the in-domain and out-of-domain evaluations, the teams’ submissions obtained better results than the baselines. The best baseline in the in-domain evaluation uses the TF-IDF uni- and bi-grams, while for the out-of-domain evaluation the uni- and bi-grams using one-hot encoding achieve the best result. Regarding Subtask B, the only team that submitted a run for it obtained a better score than the baselines in the in-domain evaluation.</p>
          <p>In contrast, when considering the out-of-domain evaluation in Subtask B, we see that the best baseline is the one that randomly predicts the target labels. This decrease in classification performance is seen in the runs but also in the learning-based baselines. This may be explained by considering the distribution of the out-of-domain sets in Subtask B (see Table 2). Indeed, we can observe that in the train and in-domain test sets the prevalent label is Topic but, conversely, in the out-of-domain test set the Direction label is more frequent. Consequently, it is possible to postulate that systems trained with the Subtask B training set would perform fairly well on the in-domain test set, but worse on the out-of-domain data.</p>
          <p>Finally, the detailed results of the evaluation offer interesting insights into the models’ performance. For example, when considering the effect of the number of instances for each class (Table 2), we see that in Subtask A Fear is much less frequent than the other emotions. This has an effect on the performance of the systems: in the out-of-domain evaluation (Table 6) the majority of the models obtained a null score in the Fear category, thus affecting the overall averaged score in a negative way. Similarly, the most common emotions in Subtask A (Trust and Neutral) are generally better predicted by the participants’ systems.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>6. Discussion</title>
          <p>The presence of both in-domain and out-of-domain data in the EMit task provides a valuable experimentation setting, as proved by the different classification performances between the two evaluation settings. Since these two types of datasets have been obtained from different sources (see Sect. 3), they represent a diverse collection of cases. In this way, we can evaluate the participants’ models in relation to their generalization capabilities.</p>
          <p>In fact, we observe a general reduction in the classification metrics when comparing the in-domain and out-of-domain test sets. In Subtask A, with the in-domain set, the average macro F-score of all participants’ systems is 0.4868. In comparison, the average metric drops to 0.4393 in the case of the out-of-domain dataset. We can see a similar trend when considering Subtask B, even if just one team participated. The average score in the in-domain evaluation is 0.6395 and, in the out-of-domain case, 0.3935.</p>
          <p>While participants have achieved promising results in the detection of emotions and opinion targets, there is still room for improvement. The large number of emotions considered in Subtask A is indeed a challenge for automatic systems, increasing the difficulty of the task. In comparison, Subtask B has fewer categories, but still, the proposed systems and baselines obtain rather low metrics in the task. Also, we have seen how the representation of the different emotions greatly impacts classification performance. These observations, along with the generalization difficulties in the out-of-domain set, indicate that the challenge proposed in EMit is not solved. Indeed, future works need to address the shortcomings detected and advance in the generation of systems that are more robust to the frequency of categories in the datasets, as well as in the inclusion of domain-specific knowledge that may improve overall results.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>7. Conclusions</title>
          <p>The first edition of EMit (Emotions in Italian) proposes the assessment of emotions in Italian texts by presenting an interesting challenge that revolves around two subtasks. On the one hand, the main task (Subtask A) presents a comprehensive emotion annotation set using Plutchik’s model, with the addition of the love emotion. On the other hand, Subtask B introduces a novel classification problem, which addresses the target of the opinion expressed in the textual message. To complement this, we also provide out-of-domain test sets to further obtain insights into the behaviour of the participants’ systems.</p>
          <p>To advance in the study of opinion mining in relation to emotion, and considering both subtasks, EMit establishes a rich annotation schema for considering the effect of this challenge on automated systems. While only one team participated in Subtask B, we believe that the additional perspectives brought by the combined study of emotions and their targets will be the subject of further studies. As an example, an interesting research avenue could study the variation of emotions depending on the target, and how this affects learning systems. Another potential research direction is the inclusion of linguistic knowledge into the commonly used large language models.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The work of Oscar Araque has been partially funded by the Spanish Ministry of Science, Innovation, and Universities through the project COGNOS (PID2019-105484RB-I00) and by “ETSI Telecomunicación” of “Universidad Politécnica de Madrid” through the initiative “Primeros Proyectos” under “AFRICA – Detecting and Analyzing Affective and Moral Factors in Radicalization and ExtremIsm: a MaChine learning Approach”. The work of S. Frenda and V. Patti was partially funded by the Multilingual Perspective-Aware NLU Project in partnership with Amazon Alexa. The work of D. Nozza was partially funded by Fondazione Cariplo (grant No. 2020-4288, MONICA).</p>
      <p>[15] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-Task Sustainable Scaling to Large Language Models at its Extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>