<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HODI at EVALITA 2023: Overview of the first Shared Task on Homotransphobia Detection in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debora Nozza</string-name>
          <email>debora.nozza@unibocconi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Teresa Cignarella</string-name>
          <email>alessandrateresa.cignarella@unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Greta Damo</string-name>
          <email>greta.damo@studbocconi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommaso Caselli</string-name>
          <email>t.caselli@rug.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <email>viviana.patti@unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Language and Cognition, University of Groningen</institution>
          ,
          <addr-line>Groningen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Turin</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computing Sciences, Bocconi University</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>aequa-tech</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>HODI is a new shared task for the automatic detection of homotransphobia in Italian, presented at EVALITA 2023. The challenge is organized into two subtasks: Subtask A focuses on the binary classification of tweets as homotransphobic or not, while Subtask B is concerned with the identification of “rationales” for explainability, in the form of spans of text. We received a total of 19 runs for Subtask A and 5 runs for Subtask B from 8 participating teams from 6 different countries. We present here an overview of the HODI shared task, the datasets, the evaluation methodology, the results obtained by the participants, and a discussion of the methodology adopted by the teams.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Hate Speech</kwd>
        <kwd>Homotransphobia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Odio i fr*ci
Morte ai gay torinesi
Divento fr*cio per te</p>
      <p>Gay ed etero, stessi diritti</p>
      <p>[...] European legislation (General Data Protection Regulation, GDPR [12]) has introduced a “right to explanation”. This necessitates a paradigm change from performance-based models to interpretable models [13]. This shared task also contributes towards this need by assessing the models’ ability to explain their decisions, i.e., to recognize the terms relevant for hate speech. In the future, this will make it possible to control for possible biases of models overfitting to specific terms (e.g., gay) [14, 6], as well as to use the explanations to generate counternarratives.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>HODI is structured into two subtasks (see examples in
Table 1):
• Subtask A - Homotransphobia detection: this
is a binary classification task where systems
must classify a message as hateful or not against
the LGBTQIA+ community.
• Subtask B - Explainability: once a message is
classified as hateful, the objective is to identify the
rationales of the classification model, i.e., those
tokens in the sequence that contributed to the
flagging of the message.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th rowspan="2">Split</th>
              <th colspan="2">Subtask A</th>
              <th colspan="2">Subtask B</th>
            </tr>
            <tr>
              <th>Hate</th>
              <th>Not</th>
              <th>Single Token</th>
              <th>Multi Token</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Train</td>
              <td>2,008</td>
              <td>2,992</td>
              <td>48</td>
              <td></td>
            </tr>
            <tr>
              <td>Test</td>
              <td>511</td>
              <td>489</td>
              <td>16</td>
              <td></td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-2-6">
        <title>3. Training and Testing Data</title>
        <p>The two tasks are strictly interconnected, but they have been run independently.</p>
        <p>Data Collection Data have been collected from Twitter using a keyword-based approach from May 1st, 2022 until August 31st, 2022. The choice of this period is motivated by the observation that the summer months coincide with the Pride celebrations, leading to increased discussions and engagement on social media regarding the subjects relevant to our objective. Additionally, May 17th is recognized globally as the International Day Against Homophobia, Biphobia, and Transphobia, further emphasizing the significance of this time frame for our task. We focused both on keywords that are commonly used in hateful contexts (e.g., fr*cio) and on others related to specific events that directly involve or affect the LGBTQIA+ community (e.g., Pride, DDL Zan). The complete list of keywords can be found in Appendix A. The decision to use keywords identifying events was made because homotransphobic messages tend to surge around such events. In this way, we limited the reliance on explicit, profanity-driven keywords alone, which may introduce biases in the data and, consequently, in the trained models. As a result, the final dataset does not correspond to the natural distribution of hate on social media, which is lower.</p>
        <p>Data Annotation Our annotation guidelines (available at https://github.com/HODI-EVALITA/HODI_2023) have been developed by re-using previous guidelines for similar shared tasks, namely HatEval [15] and AMI [16]. In particular, we define a message as being hateful by applying the following definition: “any communication that disparages a person or a group on the basis of some characteristics, such as color, race, ethnicity, gender, sexual orientation, religion, nationality, or other aspects.”</p>
        <p>Following the proposals in [17], our definition of hate speech and annotation guidelines have benefited from a series of interactions with some members of the Italian LGBTQIA+ community. In addition to this, we managed to have the data manually labeled by three members of the Italian LGBTQIA+ community (two males and one female). Each message has been annotated in parallel by each annotator for both subtasks. The annotators labeled whether the text is hateful or not and targets the LGBTQIA+ community. Then, the annotation for Subtask B, targeting explainability, was performed following the approach in [13]. In particular, our annotators have been asked to highlight the span of text that could support their labeling decision, the so-called rationales. We asked annotators to provide rationales only for the tweets considered hateful. These span annotations help us to investigate more deeply the manifestations of hateful speech.</p>
        <p>The annotation campaign has been conducted in three different steps, giving each annotator 2,000 tweets per step. The inter-annotator agreement (IAA) has been calculated at the end of every step. In Table 3, we display the IAA on both subtasks, calculated with Fleiss’ kappa coefficient (Subtask A) and percentage of observed agreement (Subtask B). The average IAA obtained in both subtasks is substantial according to the interpretation of [18]. It is particularly notable that the three annotators reached an IAA of 0.648 on the selection of homotransphobic spans of text, considering the difficulty and subjectivity of the task.</p>
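        <p>As an illustration, the Fleiss’ kappa computation for Subtask A can be sketched as follows. This is a minimal sketch and not the script used by the organizers: the function name fleiss_kappa and the toy annotations are assumptions introduced for the example, and it presumes that every item was labeled by the same number of annotators.</p>
        <preformat>
```python
from collections import Counter


def fleiss_kappa(ratings, categories=(0, 1)):
    """Fleiss' kappa for items that were all rated by the same number of annotators.

    ratings: one list of labels per item, e.g. [[1, 1, 0], [0, 0, 0]].
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of annotators who assigned category c to item i
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement, derived from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)

    # Note: undefined (division by zero) if every annotation uses the same label.
    return (p_observed - p_expected) / (1 - p_expected)


# Toy annotations (hypothetical): four tweets, three annotators, labels 0/1.
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print(round(fleiss_kappa(toy), 3))  # 0.333
```
</preformat>
        <p>In the toy example, two tweets are labeled unanimously and two with a 2-to-1 split, which gives an observed agreement of 2/3 against a chance agreement of 1/2.</p>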
        <p>Extracting Gold Labels In this shared task, we decided to provide the participants with aggregated gold labels for both subtasks rather than releasing the individual annotations separately. The aggregation process has been implemented as follows. For Subtask A, the gold label was chosen through a majority voting strategy: since there were three annotators and they could select only between two labels (0/1), one label always prevailed. For Subtask B, the gold span of text has been established by merging the three spans selected by the three annotators. Finally, in the fashion proposed in the SemEval 2021 shared task on toxic spans detection [25], we released the span annotations as a list of indices referring to the positions of characters in the text (see Table 1).</p>
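        <p>The aggregation described above can be sketched as follows. This is a minimal sketch under stated assumptions: “merging” is interpreted here as the union of the character indices selected by the annotators, and the function names and toy offsets are hypothetical, not taken from the HODI repository.</p>
        <preformat>
```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by most annotators (odd number of annotators, binary labels)."""
    return Counter(labels).most_common(1)[0][0]


def merge_spans(annotator_spans):
    """Return the sorted union of the character indices selected by each annotator."""
    merged = set()
    for indices in annotator_spans:
        merged.update(indices)
    return sorted(merged)


# Toy example (hypothetical): three annotators label one tweet.
labels = [1, 1, 0]                       # two annotators consider the tweet hateful
spans = [[5, 6, 7], [6, 7, 8, 9], []]    # character offsets highlighted by each annotator

print(majority_label(labels))  # 1
print(merge_spans(spans))      # [5, 6, 7, 8, 9]
```
</preformat>
        <p>With three annotators and two labels, the majority vote can never tie, which is why no tie-breaking rule is needed in this sketch.</p>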
        <p>Data Statistics Table 2 presents a summary of the annotated data for both subtasks. We provided 5,000 training and 1,000 testing tweets. The data are roughly balanced (40% hateful tweets in the training set and 51% in the test set). For Subtask B, we report the number of messages with a single-token rationale and those with multi-token rationales. In both the training and test sets, the majority of spans containing homophobic expressions are composed of more than one token: in the training set, there are only 48 tweets where the hateful span contains a single word, and in the test set those cases are even fewer, i.e., only 16. Table 1 shows examples of data annotations for both Subtask A and B, with the rationales highlighted in yellow for better understanding.</p>
        <p>Systems have been evaluated using the following metrics per task:</p>
        <p>Subtask A. We use standard evaluation metrics for text classification, namely Precision, Recall, and F1-score per class. The ranking of the systems is based on the macro-averaged F1-score of the hateful and non-hateful messages.</p>
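        <p>For illustration, the macro-averaged F1-score over the two classes can be computed as in the following minimal sketch. It is not the official HODI scorer; the function names and the toy labels are assumptions, with 1 standing for hateful and 0 for non-hateful.</p>
        <preformat>
```python
def f1_for_class(gold, pred, positive):
    """Precision/recall-based F1 for one class treated as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


# Toy example (hypothetical labels).
gold = [1, 1, 0, 0, 1]
pred = [1, 0, 0, 1, 1]
print(round(macro_f1(gold, pred), 4))  # 0.5833
```
</preformat>
        <p>Because the average is unweighted, both classes contribute equally to the ranking score regardless of how many messages each contains.</p>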
        <p>Subtask B. Systems are evaluated using Intersection-Over-Union (IOU) [26], an agreement metric. Token-level IOU is the size of the overlap between the characters of the tokens covered by two spans, divided by the size of their union. We count a prediction as a match if it overlaps with any of the ground truth rationales by more than some threshold. We use these partial matches to calculate an F1 score and subsequently rank the systems.</p>
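        <p>The matching procedure can be sketched as follows. This is a minimal sketch rather than the official scorer: spans are assumed to be lists of character indices, the threshold value of 0.5 is an assumption (the text only speaks of “some threshold”), and the function names are hypothetical.</p>
        <preformat>
```python
def iou(span_a, span_b):
    """Intersection-over-union of two spans given as collections of character indices."""
    a, b = set(span_a), set(span_b)
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def iou_f1(gold_spans, pred_spans, threshold=0.5):
    """F1 over partial matches: a span matches if its IOU with any span on the other side exceeds the threshold."""
    matched_pred = sum(
        1 for p in pred_spans if any(iou(p, g) > threshold for g in gold_spans)
    )
    matched_gold = sum(
        1 for g in gold_spans if any(iou(p, g) > threshold for p in pred_spans)
    )
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical offsets): one gold rationale, two predicted spans.
gold_spans = [list(range(5, 10))]
pred_spans = [list(range(6, 10)), list(range(20, 23))]
print(round(iou_f1(gold_spans, pred_spans), 3))  # 0.667
```
</preformat>
        <p>In the toy example, the first prediction overlaps the gold rationale with an IOU of 0.8 and counts as a match, while the second does not overlap at all, giving a precision of 0.5 and a recall of 1.</p>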
        <p>Two different baselines have been implemented for comparison with the participating systems:</p>
        <p>Subtask A. A Logistic Regression classifier based on TF-IDF features using unigrams and bigrams only.</p>
        <p>Subtask B. A random classifier following the implementation of the organizers of SemEval-2021 Task 5, Toxic Spans Detection [25].</p>
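        <p>A random span baseline of this kind can be sketched as follows. The sketch assumes that each character offset is selected independently with a fixed probability, which is our reading of the SemEval-2021 Task 5 baseline and not a verified copy of it; the function name, the keep_probability parameter, and the example text are hypothetical.</p>
        <preformat>
```python
import random


def random_span_baseline(text, rng=None, keep_probability=0.5):
    """Return character offsets of `text`, each selected independently at random."""
    rng = rng or random.Random()
    return [i for i in range(len(text)) if keep_probability > rng.random()]


# A seeded generator makes the toy run reproducible.
predicted = random_span_baseline("esempio di testo", rng=random.Random(0))
print(predicted)
```
</preformat>
        <p>The output follows the same format as the gold annotations, i.e., a list of character indices, so it can be scored with the same metric as the submitted systems.</p>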
        <p>The HODI GitHub repository (https://github.com/HODI-EVALITA/HODI_2023) contains the code for calculating evaluation metrics and producing predictions using the baselines.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Participants and Results</title>
      <p>We have received submissions from eight teams, for a total of 18 runs for Subtask A and four for Subtask B. Only two teams participated in Subtask B. Two teams used the same approach and system architecture to participate in other EVALITA 2023 tasks, namely O-Dang for HaSpeeDe and extremITA for all tasks. The majority of the teams were from academia, with only one industrial participant.</p>
      <p>Participants were allowed to submit a maximum of three runs for each subtask. Note that, in the case of submissions for both subtasks, participants were asked to submit their predictions for Subtask A and Subtask B at the same time, i.e., in the same evaluation window. Table 4 provides a summary of the teams, illustrating their country and the subtasks they addressed.</p>
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
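          <table-wrap>
            <label>Table 4</label>
            <table>
              <thead>
                <tr>
                  <th>Team</th>
                  <th>Country</th>
                  <th>Task</th>
                </tr>
              </thead>
              <tbody>
                <tr>
                  <td>DH-FBK [19]</td>
                  <td>IT</td>
                  <td>A, B</td>
                </tr>
                <tr>
                  <td>CHILab [20]</td>
                  <td>IT</td>
                  <td>A</td>
                </tr>
                <tr>
                  <td>extremITA [21]</td>
                  <td>IT</td>
                  <td>A, B</td>
                </tr>
                <tr>
                  <td>O-Dang [22]</td>
                  <td>IT, UK</td>
                  <td>A</td>
                </tr>
                <tr>
                  <td>LCTs [23]</td>
                  <td>ES, NL</td>
                  <td>A</td>
                </tr>
                <tr>
                  <td>Team_Tamil [24]</td>
                  <td>IE, IN</td>
                  <td>A</td>
                </tr>
              </tbody>
            </table>
          </table-wrap>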
          <p>Subtask A - Homotransphobia detection The homotransphobia detection task received 19 submissions from 8 teams, as shown in Table 5. The best result has been obtained by LCTs, where the team fine-tuned an Italian pretrained RoBERTa model named UmBERTo (https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1) for 10 epochs. This underscores the fact that relying solely on domain-specific approaches is still insufficient when it comes to effectively utilizing large models and extensive training. Six out of 8 teams provide better results than the baseline. Due to a code error in the official submission that was not ranked in the shared task’s official results, the team CHILab resubmitted amended runs (**) after the deadline.</p>
          <p>Subtask B - Explainability The subtask related to the identification of the rationales behind prediction decisions received 5 runs from 2 teams. Table 6 shows [...] random baseline. The best performing submission, by extremITA, obtained the homophobic rationales by integrating an instruction-tuned decoder-only model (i.e., LLaMA) with the natural language instruction “Con quali parole l’autore del testo precedente esprime odio omotransfobico? Separa le sequenze di parole con [gap]” (en: In what words does the author of the previous text express homotransphobic hatred? Separate the word sequences with [gap].). While the ability to prompt such models has already been demonstrated to be effective by [27], the Subtask B results further highlight the power of large language models to perform even more difficult subjective tasks, such as explaining homophobic hatred.</p>
          <p>Table 5 (Team, Macro F1, Rank). Runs as listed: LCTs3, LCTs2, O-Dang1, DH-FBK1, extremITA2, O-Dang2, DH-FBK2, O-Dang3, LCTs1, CHILab2*, CHILab3*, extremITA1, CHILab1*, INGEOTEC1, Team_Tamil1, Baseline, SOVRAG3, SOVRAG2, SOVRAG1, CHILab3, CHILab1, CHILab2.</p>
          <p>Macro F1 values as listed: 0.8108, 0.7228, 0.7051, 0.7008, 0.6598, 0.2050.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>6. Discussion</title>
        <p>In Table 4, we present an overview of the participating systems for which we have received a system description paper. This section delves into the teams’ varied approaches from different perspectives.</p>
        <p>Language Models Following a trend already seen in other evaluation campaigns [15, 16], all of the proposed systems make use of pre-trained language models (PTLMs) based on encoders only [...]. Two teams used multilingual models (Twitter-XLM-R-sentiment and OpenAI Davinci), while all the others used Italian monolingual PTLMs. Among the Italian PTLMs, only AlBERTo [28] has been trained with a language variety compatible with the task’s data, i.e., social media data. It is remarkable that pure fine-tuning of PTLMs has been done only by one team (LCTs). Another team, Team_Tamil, proposes zero and few-shot learning of fine-tuned classification language models aiming at solving hate speech detection (e.g., [31]) or emotion-related tasks (e.g., [32]) in Italian and multilingual settings. For all other participants, fine-tuning represents just one component of other architectures and solutions.</p>
        <p>7 https://huggingface.co/dbmdz/bert-base-italian-cased; 8 https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1; 9 https://huggingface.co/citizenlab/twitter-xlm-roberta-base-sentiment-finetunned; 10 https://github.com/teelinsan/camoscio</p>
        <p>Prompting Following recent advancements in generative language models, two teams, O-Dang and extremITA, made use of prompt engineering techniques. In the case of O-Dang, prompts have been used to query the OpenAI Davinci model to extract additional data concerning the names of entities of type “PERSON” that are present in the training set. The information thus obtained is concatenated to the original message as a form of knowledge injection. The extremITA team took a more radical path by addressing all EVALITA 2023 tasks by means of prompting. They apply two different prompting approaches, compliant with the models they use (IT5 and Camoscio). The authors exploited zero-shot prompting, which means they did not give the models any examples from the training data. They only specialized the natural language instruction for the different tasks.</p>
        <p>Interaction between Subtask A and Subtask B The only team that exploited as much as possible the interaction between the two subtasks in the design of their system is DH-FBK. The authors developed a multi-task learning architecture using the MaChAmp v2.0 toolkit [33].</p>
        <p>Features and Additional Data No system has used external features from specialized lexical resources. Only one participant, DH-FBK, has extended the available training materials for both subtasks using synthetic data obtained with IT5. The authors have retained only the top 2,000 examples for each class as a strategy to double the size of the HODI training set per class as well as to mitigate class imbalance.</p>
        <sec id="sec-3-2-2">
          <title>7. Conclusion and Future Work</title>
          <p>This paper introduces HODI, the first shared task on homotransphobia detection in Italian. The task aims to not only identify homotransphobic messages but also investigate the underlying reasons behind them. We have analyzed the submissions from participating teams and concluded that satisfactory results have been achieved in detecting homotransphobia in Italian. Furthermore, notable progress has been made in the explainability task, although further work is required in this area. To continue advancing in this field, future efforts should focus on constructing larger and more diverse datasets. Additionally, there is a need to enhance the detection models and improve their ability to explain the specific words or features that contribute to a hateful classification.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The work of A.T. Cignarella and V. Patti was partially funded by the International project STERHEOTYPES - Studying European Racial Hoaxes and sterEOTYPES, funded by the Compagnia di San Paolo and VolksWagen Stiftung under the ‘Challenges for Europe’ Call for Projects (CUP: B99C20000640007). The work of D. Nozza was partially funded by Fondazione Cariplo (grant No. 2020-4288, MONICA). Debora Nozza is a member of the MilaNLP group, and the Data and Marketing Insights Unit of the Bocconi Institute for Data Science and Analysis.</p>
      <p>A special mention also goes to the people who helped us with the annotation of the dataset and the assessment of guidelines: Davide, Greta, and Mauro, thank you very much for your great help.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>15</label>
        <mixed-citation>[...] pp. 54-63. URL: https://aclanthology.org/S19-2007. doi:10.18653/v1/S19-2007.</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>16</label>
        <mixed-citation>E. Fersini, D. Nozza, P. Rosso, AMI @ EVALITA2020: [...] D. Croce, M. Di Maro, L. C. Passaro (Eds.), Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), CEUR.org, Online, 2020.</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>17</label>
        <mixed-citation>T. Caselli, R. Cibin, C. Conforti, E. Encinas, M. Teli, Guiding principles for participatory design-inspired natural language processing, in: Proceedings of the 1st Workshop on NLP for Positive Impact, Association for Computational Linguistics, Online, 2021, pp. 27-35. URL: https://aclanthology.org/2021.nlp4posimpact-1.4. doi:10.18653/v1/2021.nlp4posimpact-1.4.</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>18</label>
        <mixed-citation>J. R. Landis, G. G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363-374.</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>19</label>
        <mixed-citation>E. Leonardelli, C. Casula, DH-FBK at HODI: Multi-[...] Oversampling and Synthetic Data, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>20</label>
        <mixed-citation>I. Siragusa, R. Pirrone, CHILab at HODI: A min[...], in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>21</label>
        <mixed-citation>C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-Task Sustainable Scaling to Large Language Models at its Extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>22</label>
        <mixed-citation>C. Di Bonaventura, A. Muti, M. A. Stranisci, O-Dang [...]</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>24</label>
        <mixed-citation>[...] Few-Shot Learning for Detecting Homotransphobia in Italian Language, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</mixed-citation>
      </ref>
      <ref id="ref10">
        <label>25</label>
        <mixed-citation>J. Pavlopoulos, J. Sorensen, L. Laugier, I. Androutsopoulos, SemEval-2021 task 5: Toxic spans detection, in: Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), 2021, pp. 59-69.</mixed-citation>
      </ref>
      <ref id="ref11">
        <label>26</label>
        <mixed-citation>J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, ERASER: A benchmark to evaluate rationalized NLP models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4443-4458. URL: https://aclanthology.org/2020.acl-main.408. doi:10.18653/v1/2020.acl-main.408.</mixed-citation>
      </ref>
      <ref id="ref12">
        <label>27</label>
        <mixed-citation>F. M. Plaza-del-Arco, D. Nozza, D. Hovy, Respectful [...] models to detect hate speech, in: The 7th Workshop [...], 2023, pp. 60-68. URL: https://aclanthology.org/2023.woah-1.6.</mixed-citation>
      </ref>
      <ref id="ref13">
        <label>28</label>
        <mixed-citation>M. Polignano, P. Basile, M. de Gemmis, G. Semer[...] based on tweets, in: Proceedings of the 6th Italian Conference on Computational Linguistics, CLiC-it 2019, volume 2481, CEUR Workshop Proceedings (CEUR-WS.org), CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2481/paper57.pdf.</mixed-citation>
      </ref>
      <ref id="ref14">
        <label>29</label>
        <mixed-citation>L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wain[...] A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) [...]</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>at</surname>
            <given-names>HODI</given-names>
          </string-name>
          and
          <article-title>HaSpeeDe3: A Knowledge-Enhanced 27730-27744</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Approach</surname>
            to Homotransphobia and Hate Speech [30]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sarti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Nissim, IT5: Large-scale text-to-text</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>Evaluation Campaign of Natural Language Process- generation</article-title>
          ,
          <source>ArXiv preprint 2203.03759</source>
          (
          <year>2022</year>
          ). URL:
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <article-title>ing and Speech Tools for Italian</article-title>
          . Final Workshop https://arxiv.org/abs/2203.03759.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>(EVALITA</source>
          <year>2023</year>
          ), CEUR.org, Parma, Italy,
          <year>2023</year>
          . [31]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          , G. Attanasio, HATE-ITA: [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Locatelli</surname>
          </string-name>
          , L. Locatelli, LCTs at HODI:
          <article-title>Homo- Hate speech detection in Italian social media text,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          . Final Workshop (EVALITA
          <year>2023</year>
          ), CEUR.org,
          <year>2022</year>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>260</lpage>
          . URL: https://aclanthology.org/2
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Parma</surname>
          </string-name>
          , Italy,
          <year>2023</year>
          . 022.woah-
          <volume>1</volume>
          .24. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .woah-
          <volume>1</volume>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ponnusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Kumaresan</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. K. Pon-</surname>
          </string-name>
          .
          <volume>24</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>nusamy</surname>
            , C. Rajkumar,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Priyadharshini</surname>
            , [32]
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Bianchi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Nozza</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hovy</surname>
          </string-name>
          , FEEL-IT: Emotion
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <surname>tational Linguistics</surname>
          </string-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>83</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          https://aclanthology.org/
          <year>2021</year>
          .wassa-
          <volume>1</volume>
          .
          <fpage>8</fpage>
          . [33]
          <string-name>
            <surname>R. van der Goot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Üstün</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          , I. Sharaf,
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <source>ceedings of the 16th Conference of the European</source>
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>Computational</given-names>
            <surname>Linguistics</surname>
          </string-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>176</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          197. URL: https://aclanthology.org/
          <year>2021</year>
          .eacl-demos
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          .22. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .eacl-demos.
          <volume>22</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>