<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HaSpeeDe3 at EVALITA 2023: Overview of the Political and Religious Hate Speech Detection task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mirko Lai</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Celli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan Ramponi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Tonelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Bosco</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maggioli s.p.a., University of Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli Studi di Torino</institution>
          ,
          <addr-line>Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Hate Speech Detection (HaSpeeDe3) task is the third edition of a shared task on the detection of hateful content in Italian tweets. It differs from the previous editions while maintaining continuity in analysing and countering hate speech (HS) on social media. While HaSpeeDe and HaSpeeDe2 focused on HS against immigrants, Muslims and Roma, HaSpeeDe3 explores hate speech in strongly polarised debates, concerning in particular politics and religion. It is articulated in two different tasks: A) in-domain political hate speech detection, and B) cross-domain hate speech detection on political and religious tweets. Task A consists of two different subtasks, for which participants i) can only use the provided textual content of the tweet, or ii) can additionally employ contextual information about the tweet and its author. In Task B, which also consists of two subtasks, participants are allowed to use any kind of external data for detecting hate speech in tweets about i) politics and ii) religion. Six teams from both academia and industry participated in the evaluation, with a total of 13 submitted runs for Task A and 16 for Task B.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate speech detection</kwd>
        <kwd>social media analysis</kwd>
        <kwd>polarised debates</kwd>
        <kwd>political hate speech</kwd>
        <kwd>religious hate speech</kwd>
        <kwd>shared task</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>Social media play an important role in public debates,
especially concerning politics. On the one hand, political
leaders use social media as a vehicle for political and electoral
propaganda. On the other hand, they provide news
to a significant part of the population that takes part
in the discussion, supporting or criticising political decisions [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Social media are also the place where debates
on sensitive topics, such as religious beliefs and practices,
are rather common and are sometimes intertwined with
public discussions on political matters.</p>
      <p>Unfortunately, such discussions often trigger verbal
aggressions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], especially after polarising events
in Europe and beyond such as Brexit [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the Covid-19
pandemic [5] and the Russo-Ukrainian conflict [6]. Aggressions
and online hate are exacerbated by the
ideological segregation present on social media, where social
homophily, as well as personalising and recommending
algorithms, facilitate the creation of echo chambers and
filter bubbles [7, 8]. The “others” are frequently targeted
because of characteristics such as gender, sexual
orientation, ethnicity, and religion [9, 10, 11].</p>
      <p>In the last years, to address the problems posed
by the widespread use of abusive language online, the
NLP community has focused on the detection of hate
speech [12] and the analysis of online debates [13, 14].
In particular, many researchers have worked on systems
to detect offensive language against specific vulnerable
groups, e.g., women, immigrants, and the LGBTQ+ community,
among others [11, 15, 16, 17]. An under-researched – yet
important – area of investigation is anti-politics hate, i.e.,
hate speech against politicians, policy makers and laws
at any level (national, regional and local). While
anti-policy hate speech has been addressed in Arabic [18] and
German [19], most European languages remain
under-researched. As regards religious hate, instead, annotated
corpora have been created for English, Arabic, Bengali,
French, Portuguese, and Italian, among others (for an
overview of works, see [15] and [20]). However, none
of them shares contextual information about the authors
of the tweets, nor about their social media network,
although religious self-identification may lead to hard
conflict with the members of other faiths.</p>
      <p>For this shared task, organised within EVALITA 2023
[21], we introduce a new corpus, called PolicyCorpusXL,
containing Italian tweets related to political topics, where
hateful messages have been manually annotated.</p>
      <p>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Sep 7 – 8, Parma, IT.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Dataset and Format</title>
      <sec id="sec-2-1">
        <p>In this section, we describe the dataset creation process (Section 3.1), including data collection, annotation, enrichment, and label distribution. Then, we outline the format used for sharing data with participants (Section 3.2).</p>
        <sec id="sec-2-1-1">
          <title>3.1. Dataset Creation</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.1.1. Data Collection</title>
        <p>We collected data from Twitter, which we selected among
existing social media platforms where hateful content
may be present. There are two main reasons for this
choice. On the one hand, Twitter easily allows the
retrieval of a high volume of textual content through its APIs.
On the other hand, additional metadata about the tweets
themselves and their authors can be collected.
Furthermore, Twitter users can perform asynchronous actions
such as retweeting, replying, and following. This latter
aspect allows us to share with HaSpeeDe3 participants
not only the text of the tweets and their metadata, but
also contextual information about the network in which the
participants of the online debate are situated.</p>
        <p>PolicyCorpusXL The corpus, in which hateful messages
have been manually annotated, is an extension of PolicyCorpus [22]. We selected
Twitter as the source of data and Italian as the target
language because Italy has had, at least since the
2018 elections, a large audience that pays attention to
hyperpartisan sources on Twitter. These users are prone to
produce and retweet messages of hate against
policymaking [23]. We also provide the Italian portion of the
ReligiousHate dataset [20] as a test set, in which hateful
tweets concerning Christianity, Islam and Judaism have
been manually labeled. Our goal is to test the in-domain
performance of systems for political hate speech
detection, as well as the out-of-domain performance on a test
set about religion.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Definition of the Task</title>
      <p>HaSpeeDe3 focuses on detecting hate speech in strongly
polarised debates on social media, in particular debates
on Twitter about political and religious topics. With this
task, we invite participants to explore not only features
based on the textual content of the tweet, but also features
based on contextual information, such as metadata
describing both the tweet and the author, or information
about the social media community of the participants in
the debate.</p>
      <p>We propose two tasks, A and B, that in the rest of the
paper will also be referred to as the in-domain and cross-domain
tracks. Both tasks are binary classification
problems: participants’ systems have to
predict whether a tweet contains hatred or not. Each task
consists of two subtasks:</p>
      <p>• Task A – (In-domain) political hate speech
detection: a binary classification task aimed at
determining whether a message contains hate
speech or not. The task is based on the
PolicyCorpusXL dataset (Section 3) and comprises the
following subtasks:
– Textual: participants can only use the
provided textual content of the tweet;
– Contextual: participants can additionally
employ contextual information about the
tweet and its author.
• Task B – Cross-domain hate speech detection:
a binary classification task on tweets about
political and religious topics, comprising the
following subtasks:
– XPoliticalHate: the test set consists of
tweets from PolicyCorpusXL (as in both
the in-domain subtasks above);
– XReligiousHate: the test set consists of
tweets from the ReligiousHate corpus
(Section 3), for which no development data is
provided to participants.
Moreover, in Task B participants are allowed to use
any kind of external data (e.g., datasets for other
hate domains) as well as the textual and contextual
PolicyCorpusXL development data.</p>
      <sec id="sec-3-1">
        <p>ReligiousHate We use the Italian portion of the religious
hate speech corpus introduced in [20]. The dataset
is composed of 3,000 tweets collected between December
2020 and August 2021 with keywords that refer to the
three main monotheistic religions, namely Christianity,
Islam and Judaism.</p>
      </sec>
      <sec id="sec-3-2">
        <p>Due to the different nature of the political and religious
topics, the protocols used for data collection are not the
same; however, in both cases, offensive words have not
been used as query terms, in order to minimise biased dataset
composition and potential learning shortcuts [25, 26].</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.2. Data Annotation</title>
        <p>We summarise the annotation procedure followed for
PolicyCorpusXL and ReligiousHate below.</p>
        <p>PolicyCorpusXL Two Italian experts in
communication annotated the entire dataset. In case of
disagreement, a third expert additionally annotated the
training set. 1,000 tweets were finally discarded
in order to artificially augment the portion of hateful tweets
and provide more information to the classifiers. With
this strategy, the proportion of tweets containing hate
increased from 11.8% (a typical percentage obtained with
random sampling) to 40.6%.</p>
      </sec>
      <sec id="sec-3-4">
        <p>ReligiousHate Three native speakers of Italian with
a background in linguistics and computer science
annotated the 3,000 tweets about religion collected
as described in Section 3. Annotation was performed
following a protocol for experts that foresaw in-person
discussion rounds and adjudication sessions.</p>
      </sec>
      <sec id="sec-3-5">
        <p>The inter-annotator agreement is similar for both
the PolicyCorpusXL (Fleiss’ κ = 0.53) and ReligiousHate
(Cohen’s κ = 0.57) datasets.</p>
      </sec>
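      <p>As an aside, agreement coefficients like the ones above can be computed directly from the raw annotations. The following sketch (the toy labels and function name are ours, not the task’s actual annotations) shows Cohen’s kappa for two annotators over binary hate/not-hate labels:</p>

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators over binary labels:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance."""
    n = len(a)
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # chance agreement from each annotator's marginal label distribution
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)

# toy annotations: 8 tweets, two annotators, 1 = hate / 0 = not hate
ann1 = [1, 1, 0, 0, 1, 0, 1, 0]
ann2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(ann1, ann2)  # 6/8 observed agreement, 0.5 by chance
```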
      <sec id="sec-3-6">
        <title>3.1.3. Data Enrichment</title>
        <p>Using the Twitter stream APIs, we retrieve tweets but
miss their subsequent history on the micro-blogging platform.
Indeed, since tweets are retrieved at posting time, we
cannot know what happens to them afterwards.
In order to follow up on the impact of a tweet on the user
community after posting time, we therefore also use
Twitter’s APIs to retrieve information about each tweet
a posteriori. This makes it possible to check, for example,
the number of times a tweet has been retweeted
or liked in the weeks after its posting time. We also
collected a variety of additional information about the
authors, such as the list of friends and the users that each
author has retweeted and replied to since about 2018.</p>
      </sec>
      <sec id="sec-3-7">
        <p>Statistics of the two HaSpeeDe3 datasets are
summarised in Table 1. PolicyCorpusXL consists of 7,000
tweets about political debates (5,600 in the development
set and 1,400 in the test set), whereas ReligiousHate
comprises 3,000 tweets, all belonging to the test set.</p>
        <p>• anonymized_tweet_id: A pseudo-random integer
that identifies the specific tweet and replaces the
original tweet id.
• created_at: The posting date of the tweet.
• retweet_count: The number of times the tweet
has been retweeted.
• favorite_count: It indicates approximately how
many times this tweet has been liked by Twitter
users1.
• source: The source used for posting the tweet
(e.g., Android, iOS, web).
• is_reply: 1 if the tweet is a reply, 0 otherwise.
• is_retweet: 1 if the tweet is a retweet, 0
otherwise.
• is_quote: 1 if the tweet is a quote, 0 otherwise.
• anonymized_user_id: The original author id (if
known), replaced by a pseudo-random integer.
• user_created_at: The date when the author
created the account.
• statuses_count: The number of tweets posted by
the author.
• followers_count: The number of Twitter users
that follow the author.
• friends_count: The number of Twitter users that
the author follows.
• anonymized_description: The self-description
of the author of the tweet. We applied the
same anonymisation strategy used for the field
anonymized_text of the file train_textual.csv
described above.</p>
        <p>The value of some fields could be unavailable or set
to 0 if we were unable to recover the metadata of the
tweet in 2022 (many months after the posting date), for
example, because the tweet has been removed by Twitter,
deleted, or made unavailable by the author.
training|test_contextual_friends.csv
• source: A user, identified by anonymized_user_id,
that follows the target.
• target: A user, identified by anonymized_user_id,
that is followed by the source.
training|test_contextual_retweet|reply.csv
• source: A user, identified by anonymized_user_id,
that retweeted the target.
• target: A user, identified by anonymized_user_id,
that has been retweeted by the source.
• date: The day when the source retweeted the target.
• count: The number of times the source retweeted
the target that day.
1Twitter released a number that “indicates approximately
how many times th[e] Tweet has been liked by Twitter
users”: https://developer.twitter.com/en/docs/twitter-api/v1/
data-dictionary/object-model/tweet</p>
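      <p>To illustrate how these edge-list files can be turned into network features (degree is one of the features used by the contextual baseline in Section 4.1), the following sketch (function name and toy edges are ours; the column layout follows the description above) computes each user’s degree from a friends edge list:</p>

```python
import csv
import io
from collections import Counter

def author_degrees(csv_text):
    """Compute the degree of each user in a friends edge list, i.e.,
    the number of follow relations in which the user appears.
    Expects a header with 'source' and 'target' columns, as in the
    training|test_contextual_friends.csv layout described above."""
    degree = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        degree[row["source"]] += 1
        degree[row["target"]] += 1
    return degree

# toy edge list with three users and three follow relations
edges = "source,target\n1,2\n1,3\n2,3\n"
degree = author_degrees(edges)  # every user appears in two relations
```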
      </sec>
      <sec id="sec-3-8">
        <p>All sources are authors of at least one tweet in the training corpus, but some authors are missing in this file since it was not possible to recover their friend list.</p>
      </sec>
      <sec id="sec-3-9">
        <p>All files described above are available at the official
GitHub page of the task: https://github.com/mirkolai/EVALITA2023-HaSpeeDe3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Measures</title>
      <sec id="sec-4-1">
        <p>We provide four separate official rankings, one for each
subtask. Participants can submit two runs for each
subtask. However, participants are not required to
participate in all subtasks or to submit two runs for each of them.</p>
        <p>Systems are evaluated using the F1-score computed over
the two binary classes, i.e., hate speech (HS) and
non-hate speech (¬HS). Submissions are ranked by the
F1-score averaged over the two classes, according to the
following equation:</p>
        <p>F1(sys) = (F1_HS + F1_¬HS) / 2</p>
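      <p>For concreteness, the ranking measure can be sketched as follows (a minimal implementation with toy labels; function names are ours), averaging the per-class F1-scores over the HS (1) and ¬HS (0) classes:</p>

```python
def f1(gold, pred, cls):
    """F1-score of a single class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(gold, pred):
    """Ranking measure: F1 averaged over the HS (1) and not-HS (0) classes."""
    return (f1(gold, pred, 1) + f1(gold, pred, 0)) / 2

# toy gold and predicted labels for six tweets
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1]
score = macro_f1(gold, pred)  # both classes reach F1 = 2/3 here
```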
        <sec id="sec-4-1-1">
          <title>4.1. Baselines</title>
          <p>We computed baselines using a simple machine learning
model. For Task A - Textual, we employed a Support
Vector Classifier trained on a unigram representation of the
textual content of the tweet. For Task A - Contextual, we
devised a baseline using the same classifier as above, based
on a unigram representation of the textual content of the
tweet, plus the number of retweets and likes
received by the tweet (retweet_count and favorite_count,
see Section 3.2), the author degree computed from the
friends network, and the author eigenvector centrality
computed from the friends network. A final baseline, for
both cross-domain hate speech subtasks, employs a
Support Vector Classifier with a unigram representation
of the textual content of the tweet, trained on the
XPoliticalHate and HaSpeeDe2 training sets [27].</p>
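          <p>A dependency-free sketch of such a unigram baseline follows; to keep the example self-contained, a simple perceptron stands in for the Support Vector Classifier, and the toy tweets and labels are ours, not PolicyCorpusXL data:</p>

```python
from collections import defaultdict

def unigrams(text):
    # unigram (bag-of-words) representation of the tweet text
    return text.lower().split()

def train_perceptron(texts, labels, epochs=10):
    """Learn one weight per unigram; labels are 1 (hate) / 0 (not)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for text, y in zip(texts, labels):
            score = sum(w[t] for t in unigrams(text))
            pred = 1 if score > 0 else 0
            if pred != y:  # mistake-driven update
                for t in unigrams(text):
                    w[t] += 1.0 if y == 1 else -1.0
    return w

def predict(w, text):
    return 1 if sum(w[t] for t in unigrams(text)) > 0 else 0

# toy stand-ins for PolicyCorpusXL tweets (1 = hate speech, 0 = not)
texts = ["odio i politici", "bella giornata a roma",
         "politici tutti da cacciare", "oggi piove ma va bene"]
labels = [1, 0, 1, 0]
w = train_perceptron(texts, labels)
```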
          <p>In Table 2 we present the results obtained by the
baselines on the four subtasks.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Task Overview: Participation and Results</title>
      <sec id="sec-5-1">
        <p>A total of six teams participated in the HaSpeeDe3 task.
We summarise their contributions below.</p>
      </sec>
      <sec id="sec-5-3">
        <p>BERTicelli [28] The team submitted results for all the
tasks and used all the provided sets of information. They
exploited two pre-trained cased LLMs for Italian, namely
UmBERTo and Italian BERT. In the pre-processing phase,
they turned hashtags into words to reduce noise; they
performed fine-tuning and used 5-fold cross-validation
for the Textual subtask, obtaining high scores. For the
Contextual subtask, the team adopted an ensemble
approach, wherein additional features were added to the
fine-tuned models through a GradientBoosterClassifier
algorithm. UmBERTo performed competitively in both
the Textual and Contextual subtasks, but the model did not
benefit from the addition of contextual features. Italian
BERT, on the other hand, performed above the baselines
but significantly lower than the task average. Overall,
the team performed above the average in the political
hate domain and below the average in the religious hate
domain.</p>
      </sec>
      </sec>
      <sec id="sec-5-4">
        <p>CHILab [29] The team participated only in Task
A - Textual, i.e., addressing only in-domain political hate
speech detection using the provided textual content of
the tweets from PolicyCorpusXL for development. They
submitted two runs that employ two different models
based on BiLSTMs. The first one generates
768-dimensional token embeddings from AlBERTo,
and the second one employs fastText to generate
300-dimensional token embeddings. Particular attention
was paid to pre-processing: the [URL] tag, mention
references, and retweet notes were removed, since
they were not considered relevant. Case sensitivity
was preserved, as were emojis, due to the fact that
they convey a specific meaning in social media
communication in terms of prosody and emotions.</p>
        <p>extremITA [30] The team addressed all the tasks,
using all the provided sets of information made available by
the organisers. They also made use of data from all the
EVALITA 2023 challenges to build monolithic
architectures to tackle all the tasks. Their approaches are based on
i) the IT5 encoder-decoder model, and ii) an
instruction-tuned large language model built upon LLaMA. To this
end, for both models, they devised natural language
instructions and output templates for each EVALITA task,
including HaSpeeDe3. Among their submissions, we
observe that the LLaMA-based model achieved better
results than the IT5 one on Task A - Contextual, whereas
the IT5 model achieved better results on the remaining
subtasks.</p>
      </sec>
      <sec id="sec-5-5">
        <p>INGEOTEC The team did not submit a system description
report; therefore, we are unable to discuss and
analyse their approach. They participated in Task A -
Textual and in Task B - XReligiousHate, considering
both the evaluation settings.</p>
      </sec>
      <sec id="sec-5-6">
        <p>LMU [31] The team participated only in Task B -
XReligiousHate, considering both the evaluation settings,
with multitask prompt-training systems. Their systems
consist of two steps in which models are i) pre-finetuned
on external datasets in Italian and English from various
domains, and ii) fine-tuned on the target domain (only
applicable to PolicyCorpusXL). As a backbone of their systems,
they experimented with both Italian and multilingual
pre-trained language models (PLMs). They showed that
Italian datasets are more beneficial than the combination
of Italian and English ones, and that systems based on
Italian and multilingual PLMs achieved similar
performance. Their best runs for the political and religious
domains are ensembles of prompt-training systems based
on Italian and multilingual PLMs.</p>
        <p>odang4 [32] The team participated in both Tasks A
and B, using only textual information in the former. They
based their approach on the assumption that a relation
between named entities and abusive language exists.
They submitted two different runs. The first one
employs enhanced-ALBERTo with triple verbalisation from
the Ontology of Dangerous Speech [33] and with prompting
of the Davinci model. The second one applies a majority
voting criterion among ALBERTo, the enhanced-ALBERTo
with triple verbalisation from the Ontology of
Dangerous Speech, and the enhanced-ALBERTo with prompting of
Davinci. As for Task B - XReligiousHate, the
multilingual expert-based hate speech/counter-narrative
pairs dataset on Islamophobia (CONAN) [34] has been
employed too.</p>
        <sec id="sec-5-6-1">
          <title>5.1. Final Ranking</title>
          <p>Table 3 shows the results obtained by the participants for
each of the four subtasks. The runs submitted by each
participant are highlighted in green. However, when a
team submits a run to Task A - Textual, the submission
also satisfies the Task A - Contextual and Task B -
XPoliticalHate requirements, and it is therefore included
in those final rankings. Likewise, when a team submits a
run to Task A - Contextual, the submission satisfies the
Task B - XPoliticalHate requirements too. The best results
in Task A - Textual, Task A - Contextual, and Task B -
XPoliticalHate are achieved by the odang4 team with
F1(sys) = 0.912, employing the same model without taking
advantage of contextual information nor using external
data sources. Only extremITA and LMU (the latter
exclusively participated in Task B - XPoliticalHate)
reached F1(sys) &gt; 0.9 with at least one of their runs.
extremITA and LMU are the only two teams that
reached F1(sys) &gt; 0.6 in Task B - XReligiousHate. In
particular, extremITA obtained F1(sys) = 0.6525, with a
remarkable improvement with respect to the other teams.</p>
          <p>All participating systems showed an improvement over
the baselines employed for the in-domain political hate
speech detection tasks, whereas only two teams
outperformed the baseline for Task B - XReligiousHate, proving
the complexity of the cross-domain task (Section 5.1).</p>
        </sec>
      </sec>
      <sec id="sec-5-7">
        <title>Acknowledgments</title>
        <p>This work has received financial support from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070190 (AI4Trust).</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
      <sec id="sec-6-1">
        <p>Results show that run #1 submitted by the odang4
team achieves the best scores across all in-domain tasks.
In particular, their approach combining prompting, the
Ontology of Dangerous Speech, and the ALBERTo model
proved particularly effective in the political domain.</p>
        <p>However, none of the participants seems to have found a
way to exploit contextual information effectively, i.e.,
yielding an improvement over textual-only models. This is
in line with past studies showing the challenges of
embedding contextual information in hate speech detection
systems [35].</p>
        <p>While the best performance for the in-domain task
confirms the state-of-the-art results obtained in
similar settings [36], we observe a significant drop in
performance (around −0.30 F1-score on average) for the
out-of-domain task. Among the systems, extremITA
shows a better generalisation capability and yields the
best results in this setting. We hypothesise that this
happens because their system was built to address all
EVALITA challenges, and the only task-specific
adaptation is the use of instructions for HaSpeeDe3. Overall,
out-of-domain settings still challenge hate speech
detection capabilities and still represent a research direction
to investigate. Furthermore, approaches that tackle
in-domain hate well do not seem to suit the out-of-domain
setting, for which different strategies should be pursued.</p>
        <p>[5] N. Oliver, B. Lepri, H. Sterly, R. Lambiotte, S. Deletaille, M. De Nadai, E. Letouzé, A. A. Salah, R. Benjamins, C. Cattuto, et al., Mobile phone data for informing public health actions across the covid-19 pandemic life cycle, 2020.
[6] M. Caprolu, A. Sadighian, R. Di Pietro, Characterizing the 2022 russo-ukrainian conflict through the lenses of aspect-based sentiment analysis: Dataset, methodology, and preliminary findings, 2022. URL: https://arxiv.org/abs/2208.04903. doi:10.48550/ARXIV.2208.04903.
[7] E. Elejalde, L. Ferres, E. Herder, The nature of real and perceived bias in chilean media, in: Proceedings of the 28th ACM Conference on Hypertext and Social Media, HT, Association for Computing Machinery, New York, NY, USA, 2017, pp. 95–104. URL: http://doi.acm.org/10.1145/3078714.3078724. doi:10.1145/3078714.3078724.
[8] Y. Theocharis, W. Lowe, Does Facebook increase political participation? Evidence from a field experiment, Information, Communication &amp; Society 19 (2016) 1465–1486.
[9] O. Ștefăniță, D.-M. Buf, Hate speech in social media and its effects on the lgbt community: A review of the current research, Romanian Journal of Communication and Public Relations 23 (2021).
[10] E. Fersini, D. Nozza, P. Rosso, Overview of the evalita 2018 task on automatic misogyny identification (ami), in: Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2018, volume 2263, CEUR, 2018.
[11] F. Poletto, M. Stranisci, M. Sanguinetti, V. Patti, C. Bosco, Hate speech annotation: Analysis of an italian twitter corpus, in: 4th Italian Conference on Computational Linguistics, CLiC-it 2017, volume 2006, CEUR-WS, 2017, pp. 1–6.
[12] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep learning for hate speech detection in tweets, in: Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 759–760.
[13] F. Celli, G. Riccardi, A. Ghosh, Corea: Italian news corpus with emotions and agreement, in: Proceedings of CLiC-it 2014, 2014, pp. 98–102.
[14] M. Lai, M. Tambuscio, V. Patti, G. Ruffo, P. Rosso, Stance polarity in political debates: A diachronic perspective of network homophily and conversations on twitter, Data &amp; Knowledge Engineering 124 (2019) 101738. URL: https://www.sciencedirect.com/science/article/pii/S0169023X19300187. doi:10.1016/j.datak.2019.101738.
[15] F. Poletto, V. Basile, M. Sanguinetti, C. Bosco, V. Patti, Resources and benchmark corpora for hate speech detection: a systematic review, Language Resources &amp; Evaluation 55 (2021) 477–523.</p>
        <p>[16] P. Saha, B. Mathew, P. Goyal, A. Mukherjee, Hateminers: detecting hate speech against women, arXiv preprint arXiv:1812.06700 (2018).
[17] E. W. Pamungkas, V. Basile, V. Patti, Misogyny detection in twitter: a multilingual and cross-domain study, Inf. Process. Manag. 57 (2020) 102360. URL: https://doi.org/10.1016/j.ipm.2020.102360. doi:10.1016/j.ipm.2020.102360.
[18] I. Guellil, A. Adeel, F. Azouaou, S. Chennoufi, H. Maafi, T. Hamitouche, Detecting hate speech against politicians in arabic community on social media, International Journal of Web Information Systems (2020).
[19] S. Jaki, T. De Smedt, Right-wing german hate speech on twitter: Analysis and automatic detection, arXiv preprint arXiv:1910.07518 (2019).
[20] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Addressing religious hate online: from taxonomy creation to automated detection, PeerJ Computer Science 8 (2022) e1128. URL: https://doi.org/10.7717/peerj-cs.1128. doi:10.7717/peerj-cs.1128.
[21] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.
[22] A. Duzha, C. Casadei, M. Tosi, F. Celli, Hate versus politics: detection of hate against policy makers in italian tweets, SN Social Sciences 1 (2021) 1–15.
[23] F. Giglietto, N. Righetti, G. Marino, L. Rossi, Multi-party media partisanship attention score. Estimating partisan attention of news media sources using twitter data in the lead-up to 2018 italian election, Comunicazione politica 20 (2019) 85–108.
[24] F. Celli, M. Lai, A. Duzha, C. Bosco, V. Patti, Policycorpus xl: An italian corpus for the detection of hate speech against politics, in: Proceedings of the Eighth Italian Conference on Computational Linguistics (CLiC-it 2021), volume 3033 of CEUR Workshop Proceedings, CEUR-WS.org, Aachen, Germany, 2022. URL: http://ceur-ws.org/Vol-3033/paper38.pdf.
[25] M. Wiegand, J. Ruppenhofer, T. Kleinbauer, Detection of Abusive Language: the Problem of Biased Datasets, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 602–608. URL: https://aclanthology.org/N19-1060. doi:10.18653/v1/N19-1060.
[26] A. Ramponi, S. Tonelli, Features or spurious artifacts? data-centric baselines for fair and robust hate speech detection, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 3027–3040. URL: https://aclanthology.org/2022.naacl-main.221. doi:10.18653/v1/2022.naacl-main.221.
[27] M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, I. Russo, Overview of the evalita 2020 second hate speech detection task (haspeede 2), in: V. Basile, D. Croce, M. Di Maro, L. C. Passaro (Eds.), Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), CEUR.org, Online, 2020.</p>
        <p>[28] L. Grotti, P. Quick, Berticelli at haspeede3: Fine-tuning and cross-validating large language models for hate speech detection, EVALITA 2023 Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (2023).
[29] I. Siragusa, R. Pirrone, Chilab at evalita 2023: Overview of the task a textual, EVALITA 2023 Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (2023).</p>
        <p>[34] Y.-L. Chung, E. Kuzmenko, S. S. Tekiroglu, M. Guerini, CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 2819–2829. URL: https://aclanthology.org/P19-1271. doi:10.18653/v1/P19-1271.
[35] S. Menini, A. P. Aprosio, S. Tonelli, Abuse is contextual, what about nlp? the role of context in abusive language annotation and detection, arXiv preprint arXiv:2103.14916 (2021).
[36] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020), in: Proceedings of SemEval, 2020.</p>
        <p>[30] C. D. Hromei, D. Croce, V. Basile, R. Basili, Extremita at evalita 2023: Multi-task sustainable scaling
to large language models at its extreme, EVALITA
2023 Eigth Evaluation Campaign of Natural
Language Processing and Speech Tools for Italian (2023)
–.
[31] V. Hangya, A. Fraserl, Lmu at haspeede3:
Multidataset training for cross-domain hate speech
detection, EVALITA 2023 Eigth Evaluation Campaign
of Natural Language Processing and Speech Tools
for Italian (2023) –.
[32] C. Di Bonaventura, A. Muti, M. A. Stranisci,</p>
        <p>B. McGillivray, A. Meroño-Peñuela, O-dang4 at
hodi and haspeede3: A knowledge-enhanced
approach to homotransphobia and hate speech
detection in italian, EVALITA 2023 Eigth
Evaluation Campaign of Natural Language Processing and</p>
        <p>Speech Tools for Italian (2023) –.
[33] M. A. Stranisci, S. Frenda, M. Lai, O. Araque, A. T.</p>
        <p>Cignarella, V. Basile, C. Bosco, V. Patti, O-dang! the
ontology of dangerous speech messages, in:
Proceedings of the 2nd Workshop on Sentiment
Analysis and Linguistic Linked Data, European Language
Resources Association, Marseille, France, 2022, pp.</p>
        <p>2–8. URL: https://aclanthology.org/2022.salld-1.2.
[34] Y.-L. Chung, E. Kuzmenko, S. S. Tekiroglu,</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
[1] CENSIS,
          <source>50º rapporto sulla situazione sociale del paese 2016</source>
          , FrancoAngeli,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Conover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ratkiewicz</surname>
          </string-name>
          , M. Francisco,
          <string-name>
            <given-names>B.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Menczer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flammini</surname>
          </string-name>
          , Political polarization on Twitter, in:
          <source>International AAAI Conference on Web and Social Media</source>
          ,
ICWSM
          ,
          <source>Association for the Advancement of Artificial Intelligence</source>
          , Palo Alto, CA, USA,
          <year>2011</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouazizi</surname>
          </string-name>
          , T. Ohtsuki,
<article-title>Hate speech on twitter: A pragmatic approach to collect hateful and offensive expressions and perform hate speech detection</article-title>
          ,
          <source>IEEE access 6</source>
          (
          <year>2018</year>
          )
          <fpage>13825</fpage>
          -
          <lpage>13835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Celli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Stepanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          , G. Riccardi,
          <article-title>Predicting brexit: Classifying agreement is better than sentiment and pollsters</article-title>
          , in:
          <source>PEOPLES@COLING</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>