
HODI at EVALITA 2023: Overview of the first Shared Task on Homotransphobia Detection in Italian

Debora Nozza (2), debora.nozza@unibocconi.it
Alessandra Teresa Cignarella (1, 3), alessandrateresa.cignarella@unito.it
Greta Damo (2), greta.damo@studbocconi.it
Tommaso Caselli (0), t.caselli@rug.nl
Viviana Patti (1), viviana.patti@unito.it

0 Center for Language and Cognition, University of Groningen, Groningen, The Netherlands
1 Department of Computer Science, University of Turin, Turin, Italy
2 Department of Computing Sciences, Bocconi University, Milan, Italy
3 aequa-tech, Turin, Italy

HODI is a new shared task for the automatic detection of homotransphobia in Italian presented at EVALITA 2023. The challenge is organized into two subtasks: Subtask A focuses on the binary textual classification of homotransphobic tweets, while Subtask B is concerned with the identification of "rationales" for explainability in the form of spans of text. We have received a total of 19 runs for Subtask A and 5 runs for Subtask B from a total of 8 participating teams from 6 different countries. We present here an overview of the HODI shared task, the datasets, the evaluation methodology, the results obtained by the participants, and a discussion of the methodology adopted by the teams.

Keywords: Natural Language Processing, Hate Speech, Homotransphobia

1. Introduction

Table 1: Examples of data annotations for Subtask A and Subtask B. Example tweets: "Odio i fr*ci"; "Morte ai gay torinesi"; "Divento fr*cio per te"; "Gay ed etero, stessi diritti".

[...] European legislation (General Data Protection Regulation, GDPR [12]) has introduced a "right to explanation". This necessitates a paradigm change from performance-based models to interpretable models [13]. This shared task will also contribute towards this need by assessing the models' explanation abilities to recognize the terms relevant for hate speech. In the future, this will make it possible to control for possible biases of models overfitting to specific terms (e.g., gay) [14, 6], as well as to use the explanations to generate counternarratives.

2. Task Description

HODI is structured on two subtasks (see examples in Table 1):
• Subtask A - Homotransphobia detection: this is a binary classification task where systems must classify a message as hateful or not against the LGBTQIA+ community.
• Subtask B - Explainability: once a message is classified as hateful, the objective is to identify the rationales of the classification model, i.e., those tokens in the sequence that contributed to the flagging of the message.

Table 2: Summary of the annotated data for both subtasks.

         Subtask A          Subtask B
Split    Hate     Not       Single Token   Multi Token
Train    2,008    2,992     [...]          [...]
Test     511      489       [...]          [...]

The two tasks are strictly interconnected, but they have been run independently.

3. Training and Testing Data

Data Collection. Data have been collected from Twitter using a keyword-based approach from May 1st, 2022 until August 31st, 2022. The selection is influenced by the observation that the summer months coincide with the pride celebrations, leading to increased discussions and engagement on social media regarding the subjects relevant to our objective. Additionally, May 17th is recognized globally as the International Day Against Homophobia, Biphobia, and Transphobia, further emphasizing the significance of this time frame for our task. We focused both on keywords that are commonly used in hateful contexts (e.g., fr*cio) and on others related to specific events that directly involve or affect the LGBTQIA+ community (e.g., Pride, DDL Zan). The complete list of keywords can be found in Appendix A. The decision to use keywords identifying events was made because of a tendency to observe a surge in homotransphobic messages around them. In this way, we limited the presence of only explicit profanity-driven keywords that may introduce biases in the data and, consequently, in the trained models. As a result, the final dataset does not correspond to the natural distribution of hate on social media, which is lower.

Data Annotation. Our annotation guidelines⁴ have been developed by re-using previous guidelines for similar shared tasks, namely HatEval [15] and AMI [16]. In particular, we define a message as being hateful by applying the following definition: any communication that disparages a person or a group on the basis of some characteristics, such as color, race, ethnicity, gender, sexual orientation, religion, nationality, or other aspects.

Following the proposals in [17], our definition of hate speech and annotation guidelines have benefited from a series of interactions with some members of the Italian LGBTQIA+ community. In addition to this, we managed to have the data manually labeled by three members of the Italian LGBTQIA+ community (two males and one female). Each message has been annotated in parallel by each annotator for both subtasks. The annotators labeled whether the text is hateful or not and targets the LGBTQIA+ community. Then, the annotation for Subtask B targeting explainability is performed following the approach in [13]. In particular, our annotators have been asked to highlight the span of text that could support their labeling decision, the so-called rationales. We asked annotators to provide rationales only for the tweets considered hateful. These span annotations help us to investigate deeper the manifestations of hateful speech.

⁴ Available for consultation here: https://github.com/HODI-EVALITA/HODI_2023

Table 3: Inter-annotator agreement for Subtask A and Subtask B at each annotation step.

The annotation campaign has been conducted in three different steps, giving the annotators 2,000 tweets each for each step. The inter-annotator agreement (IAA) has been calculated at the end of every step. In Table 3, we display the measures of the IAA on both subtasks, calculated with Fleiss' kappa coefficient (Subtask A) and % observed agreement (Subtask B). The average IAA obtained in both subtasks is substantial according to the interpretation of [18]. It is particularly notable that the three annotators reached an IAA of 0.648 on the selection of homotransphobic spans of text, considering the difficulty and subjectivity of the task.
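As an illustration of the Subtask A agreement measure, the following is a minimal sketch of Fleiss' kappa for a fixed number of annotators per item. The function name and the toy labels are hypothetical and are not taken from the HODI data or code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items, each a list with one label per annotator.
    Every item must be rated by the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][c] = number of annotators who assigned category c to item i
    counts = [Counter(item) for item in ratings]
    # Observed agreement for each item
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_exp = sum(p * p for p in p_cats)
    if p_exp == 1.0:
        # Only one category was ever used: return 1.0 by convention
        return 1.0
    return (p_bar - p_exp) / (1 - p_exp)


# Toy example: four messages, three annotators, labels 0/1 (illustrative only)
toy = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]
print(round(fleiss_kappa(toy), 3))
```

Under the interpretation in [18], values between 0.61 and 0.80 are read as substantial agreement.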

Extracting Gold Labels. In this shared task, we decided to provide the participants with aggregated gold labels for both tasks rather than releasing the annotations separately. The aggregation process has been implemented as follows: for Subtask A, the gold label was chosen through a majority voting strategy. Since the annotators were three, and they could select only between two labels (0/1), there was always a clear prevalence for one or the other. On the other hand, for Subtask B, the gold span of text has been established by merging the three spans selected by the three annotators. Finally, in the fashion proposed in the SemEval 2021 shared task of toxic spans detection [25], we released the annotation of spans as a list of indices referring to the position of characters in the text (see Table 1).
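The aggregation step can be sketched as follows. This is an illustrative sketch only: it assumes that "merging" means taking the union of the characters selected by any annotator, and that spans are given as (start, end) character offsets with an exclusive end. The function names are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label.

    With three annotators and two possible labels there is never a tie.
    """
    return Counter(labels).most_common(1)[0][0]


def merge_spans(spans):
    """Merge annotator spans into a sorted list of character indices.

    spans: iterable of (start, end) character offsets, end exclusive.
    Assumption: merging is the union of all selected characters.
    """
    indices = set()
    for start, end in spans:
        indices.update(range(start, end))
    return sorted(indices)


# Toy example: three annotators label one message and mark spans in it
print(majority_label([1, 0, 1]))
print(merge_spans([(0, 3), (2, 7), (0, 3)]))
```

Releasing spans as character indices keeps the gold annotation independent of any particular tokenizer.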

Data Statistics. Table 2 presents a summary of the annotated data for both subtasks. We provided 5,000 training and 1,000 testing tweets. The data we provided are roughly balanced (40% hateful tweets in training and 51% in the test set). For Subtask B, we report the number of messages with a single-token rationale and those with multi-token rationales. It can be seen how in both train and test, the majority of spans containing homophobic expressions are composed of more than one token. On the other hand, in the train set, there are 48 tweets where the hateful span contains only one word. In the test set, those cases are even fewer, i.e., only 16. Table 1 shows examples of data annotations for both Subtask A and B, with the rationales highlighted in yellow for better understanding.

Systems have been evaluated using the following metrics per task:

Subtask A. We use standard evaluation metrics for text classification, namely Precision, Recall, and F1-score per class. The ranking of the systems is based on the macro-averaged F1-score of the hateful and non-hateful messages.
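For reference, the Subtask A ranking metric can be sketched as below: per-class Precision, Recall, and F1, followed by the unweighted mean of the two class F1 scores. This is a plain-Python illustration with hypothetical function names, not the official scorer from the HODI repository.

```python
def precision_recall_f1(gold, pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(gold, pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = [precision_recall_f1(gold, pred, label)[2] for label in labels]
    return sum(scores) / len(scores)


# Toy example: 1 = hateful, 0 = not hateful (illustrative labels only)
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(round(macro_f1(gold, pred), 4))
```

Macro-averaging weights both classes equally, so a system cannot rank well by favouring only the majority class.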

Subtask B. Systems are evaluated using Intersection-Over-Union (IOU) [26], an agreement metric. Token-level IOU is the size of the overlap of the characters covered by two spans divided by the size of their union. We count a prediction as a match if it overlaps with any of the ground truth rationales by more than some threshold. We use these partial matches to calculate an F1 score and subsequently rank the systems.
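A minimal sketch of this IOU-based F1 for a single message follows. The 0.5 threshold is an assumption (the text only says "some threshold"), spans are represented as collections of character indices, and the function names are hypothetical. How scores are aggregated over messages is left to the official scorer.

```python
def iou(span_a, span_b):
    """Intersection-over-union of two spans given as character indices."""
    a, b = set(span_a), set(span_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def iou_f1(pred_spans, gold_spans, threshold=0.5):
    """F1 over partial matches for one message.

    A predicted span is a match if its IOU with any gold span exceeds
    `threshold` (0.5 here is an assumed value).
    """
    matched_pred = sum(
        1 for p in pred_spans if any(iou(p, g) > threshold for g in gold_spans)
    )
    matched_gold = sum(
        1 for g in gold_spans if any(iou(p, g) > threshold for p in pred_spans)
    )
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy example: two gold rationales, two predicted spans
gold = [range(0, 4), range(10, 15)]
pred = [range(0, 3), range(20, 25)]
print(iou_f1(pred, gold))
```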

Two baselines have been implemented for comparison with the participating systems: Subtask A. A Logistic Regression classifier based on TF-IDF features using unigrams and bigrams only.

Subtask B. A random classifier following the implementation of the organizers of the SemEval-2021 Task 5, Toxic Spans Detection [25].
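A random span baseline of this kind can be sketched as below. The sketch assumes that every character offset is selected independently with probability 0.5. Both the sampling scheme and the function name are assumptions made for illustration, and the HODI repository remains the reference for the actual baseline.

```python
import random


def random_span_baseline(text, p=0.5, rng=None):
    """Select each character offset of `text` independently with probability p.

    Returns a sorted list of character indices, matching the span format
    used for the gold annotations.
    """
    rng = rng or random.Random()
    return [i for i in range(len(text)) if rng.random() < p]


# Toy example with a fixed seed so that runs are repeatable
prediction = random_span_baseline("esempio di messaggio", rng=random.Random(0))
print(prediction)
```

Because the selection ignores the content of the message, this baseline gives a floor against which any learned span extractor should improve.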

The HODI GitHub repository⁵ contains the code for calculating evaluation metrics and producing predictions using the baselines.

5. Participants and Results

We have received submissions from eight teams, for a total of 18 runs for Subtask A and four for Subtask B. Only two teams participated in Subtask B. Two teams used the same approach and system architecture for participating in other EVALITA 2023 tasks, namely O-Dang for HaSpeeDe and extremITA for all tasks. The majority of the teams were from academia, with only one industrial participant.

Participants were allowed to submit a maximum number of three runs for each subtask. Note that, in the case of submissions for both tasks, participants were asked to submit their predictions for Subtask A and Subtask B at the same time, i.e., in the same evaluation window. Table 4 provides a summary of the teams, illustrating their country and the subtasks they addressed.

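⁵ https://github.com/HODI-EVALITA/HODI_2023

Table 4: Participating teams with a system description paper, their country, and the subtasks they addressed. The original table also marks, per team, the language models used (including dbmdz-BERT-Italian, AlBERTO, UmBERTO, Twitter-XLM-R-sentiment, IT5, Camoscio, and OpenAI Davinci) and the techniques adopted (Fine-tuning, Knowledge injection, Data augmentation, Multi-task Learning, Few-shot Learning, Feature Extraction, Prompting); these per-team marks are not recoverable here.

Team               Country   Task
DH-FBK [19]        IT        A, B
CHILab [20]        IT        A
extremITA [21]     IT        A, B
O-Dang [22]        IT, UK    A
LCTs [23]          ES, NL    A
Team_Tamil [24]    IE, IN    A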

Subtask A - Homotransphobia detection. The homotransphobia detection task received 19 submissions from 8 teams, as shown in Table 5. The best result has been obtained by LCTs, where the team fine-tuned an Italian pretrained RoBERTa model named UmBERTo⁶ for 10 epochs. This underscores the fact that relying solely on domain-specific approaches is still insufficient when it comes to effectively utilizing large models and extensive training. 6 out of 8 teams provide better results than the baseline. Due to a code error in the official submission that was not ranked in the shared task's official results, the team CHILab resubmitted amended runs (**) after the deadline.

⁶ https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1

Table 5: Results for Subtask A (Macro F1). Runs in the order listed in the table: LCTs3, LCTs2, O-Dang1, DH-FBK1, extremITA2, O-Dang2, DH-FBK2, O-Dang3, LCTs1, CHILab2*, CHILab3*, extremITA1, CHILab1*, INGEOTEC1, Team_Tamil1, Baseline, SOVRAG3, SOVRAG2, SOVRAG1, CHILab3, CHILab1, CHILab2.

Subtask B - Explainability. The subtask related to the identification of the rationales behind prediction decisions received 5 runs from 2 teams. Table 6 shows [...] random baseline. The best performing submission by extremITA obtained the homophobic rationales integrating an instruction-tuned decoder-only model (i.e., LLaMA) with the natural language instruction "Con quali parole l'autore del testo precedente esprime odio omotransfobico? Separa le sequenze di parole con [gap]" (en: In what words does the author of the previous text express homotransphobic hatred? Separate the word sequences with [gap].). While the ability to prompt such models has already been demonstrated to be effective by [27], the Subtask B results further highlight the power of large language models to perform even more difficult subjective tasks, such as explaining homophobic hatred.

Table 6: Results for Subtask B.

Score values preserved from the results tables: 0.8108 (under the header Macro F1), 0.7228, 0.7051, 0.7008, 0.6598, 0.2050. Their assignment to individual runs is not recoverable.

6. Discussion

In Table 4, we present an overview of the participating systems for which we have received a system description paper. This section delves into the teams' varied approaches from different perspectives.

Language Models. Following a trend already seen in other evaluation campaigns [15, 16], all of the proposed systems make use of pre-trained language models (PTLMs) based on encoders only [...] models (Twitter-XLM-R-sentiment and Open AI Davinci), while all the others used Italian monolingual PTLMs. For the Italian PTLMs, only AlBERTO [28] has been trained with a language variety compatible with the task's data, i.e., social media data. It is remarkable that pure fine-tuning of PTLMs has been done only by one team (LCTs). Another team, Team_Tamil, proposes zero and few-shot learning of fine-tuned classification language models aiming at solving hate speech detection (e.g., [31]) or emotion-related tasks (e.g., [32]) in Italian and multilingual settings. For all other participants, fine-tuning represents just one component of other architectures and solutions.

⁷ https://huggingface.co/dbmdz/bert-base-italian-cased
⁸ https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1
⁹ https://huggingface.co/citizenlab/twitter-xlm-roberta-base-sentiment-finetunned
¹⁰ https://github.com/teelinsan/camoscio

Prompting. Following recent advancements in generative language models, two teams, O-Dang and extremITA, made use of prompt engineering techniques. In the case of O-Dang, prompts have been used to query the Open AI Davinci model to extract additional data concerning the names of entities of type "PERSON" that are present in the training set. The information thus obtained is concatenated to the original message as a form of knowledge injection. The extremITA team took a more radical path by addressing all EVALITA 2023 tasks by means of prompting. They apply two different prompting approaches, compliant with the models they use (IT5 and Camoscio). The authors exploited zero-shot prompting, which means they did not give the models any examples from the training data. They only specialized the natural language instruction for the different tasks.

Interaction between Subtask A and Subtask B. The only team that exploited as much as possible the interaction between the two subtasks in the design of their system is DH-FBK. The authors developed a multi-task learning architecture using the MaChAmp v2.0 toolkit [33].

Features and Additional Data. No system has used external features from specialized lexical resources. Only one participant, DH-FBK, has extended the available training materials for both subtasks using synthetic data obtained with IT5. The authors have retained only the top 2,000 examples for each class as a strategy to double the size of the HODI training set per class as well as to mitigate class imbalance.

7. Conclusion and Future Work

This paper introduces HODI, the first shared task on homotransphobia detection in Italian. The task aims to not only identify homotransphobic messages but also investigate the underlying reasons behind them. We have analyzed the submissions from participating teams and concluded that satisfactory results have been achieved in detecting homotransphobia in Italian. Furthermore, notable progress has been made in the explainability task, although further work is required in this area. To continue advancing in this field, future efforts should focus on constructing larger and more diverse datasets. Additionally, there is a need to enhance the detection models and improve their ability to explain the specific words or features that contribute to a hateful classification.

Acknowledgments

The work of A.T. Cignarella and V. Patti was partially funded by the International project STERHEOTYPES - Studying European Racial Hoaxes and sterEOTYPES, funded by the Compagnia di San Paolo and VolksWagen Stiftung under the 'Challenges for Europe' Call for Projects (CUP: B99C20000640007). The work of D. Nozza was partially funded by Fondazione Cariplo (grant No. 2020-4288, MONICA). Debora Nozza is a member of the MilaNLP group, and the Data and Marketing Insights Unit of the Bocconi Institute for Data Science and Analysis.

A special mention also to the people who helped us with the annotation of the dataset and the assessment of guidelines: Davide, Greta, and Mauro, thank you very much for your great help.

References

[15] [...] pp. 54-63. URL: https://aclanthology.org/S19-2007. doi:10.18653/v1/S19-2007.

[16] E. Fersini, D. Nozza, P. Rosso, AMI @ EVALITA2020: Automatic misogyny identification, in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.), Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), CEUR.org, Online, 2020.

[17] T. Caselli, R. Cibin, C. Conforti, E. Encinas, M. Teli, Guiding principles for participatory design-inspired natural language processing, in: Proceedings of the 1st Workshop on NLP for Positive Impact, Association for Computational Linguistics, Online, 2021, pp. 27-35. URL: https://aclanthology.org/2021.nlp4posimpact-1.4. doi:10.18653/v1/2021.nlp4posimpact-1.4.

[18] J. R. Landis, G. G. Koch, An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers, Biometrics (1977) 363-374.

[19] E. Leonardelli, C. Casula, DH-FBK at HODI: Multi-[...] Oversampling and Synthetic Data, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[20] I. Siragusa, R. Pirrone, CHILab at HODI: A min[...], in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[21] C. D. Hromei, D. Croce, V. Basile, R. Basili, ExtremITA at EVALITA 2023: Multi-Task Sustainable Scaling to Large Language Models at its Extreme, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[22] C. Di Bonaventura, A. Muti, M. A. Stranisci, O-Dang at HODI and HaSpeeDe3: A Knowledge-Enhanced Approach to Homotransphobia and Hate Speech [...], in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[23] [...] Locatelli, L. Locatelli, LCTs at HODI: Homo[...], in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[24] [...] Ponnusamy, P. K. Kumaresan, K. K. Ponnusamy, C. Rajkumar, R. Priyadharshini, [...] Few-Shot Learning for Detecting Homotransphobia in Italian Language, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.

[25] J. Pavlopoulos, J. Sorensen, L. Laugier, I. Androutsopoulos, SemEval-2021 task 5: Toxic spans detection, in: Proceedings of the 15th international workshop on semantic evaluation (SemEval-2021), ACL, 2021, pp. 59-69.

[26] J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, B. C. Wallace, ERASER: A benchmark to evaluate rationalized NLP models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4443-4458. URL: https://aclanthology.org/2020.acl-main.408. doi:10.18653/v1/2020.acl-main.408.

[27] F. M. Plaza-del-Arco, D. Nozza, D. Hovy, Respectful or toxic? Using zero-shot learning with language models to detect hate speech, in: The 7th Workshop on Online Abuse and Harms (WOAH), [...] 2023, pp. 60-68. URL: https://aclanthology.org/2023.woah-1.6.

[28] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, V. Basile, AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets, in: Proceedings of the 6th Italian Conference on Computational Linguistics, CLiC-it 2019, volume 2481, CEUR Workshop Proceedings (CEUR-WS.org), CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2481/paper57.pdf.

[29] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, [...] Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730-27744.

[30] G. Sarti, M. Nissim, IT5: Large-scale text-to-text pretraining for Italian language understanding and generation, ArXiv preprint 2203.03759 (2022). URL: https://arxiv.org/abs/2203.03759.

[31] D. Nozza, F. Bianchi, G. Attanasio, HATE-ITA: Hate speech detection in Italian social media text, [...] 2022, pp. 252-260. URL: https://aclanthology.org/2022.woah-1.24. doi:10.18653/v1/2022.woah-1.24.

[32] F. Bianchi, D. Nozza, D. Hovy, FEEL-IT: Emotion and sentiment classification for the Italian language, [...] Association for Computational Linguistics, Online, 2021, pp. 76-83. URL: https://aclanthology.org/2021.wassa-1.8.

[33] R. van der Goot, A. Üstün, A. Ramponi, I. Sharaf, [...] in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2021, pp. 176-197. URL: https://aclanthology.org/2021.eacl-demos.22. doi:10.18653/v1/2021.eacl-demos.22.