1. Introduction

On the Impact of Hate Speech Journal of Pragmatics 205 (2023) 63-77. URL: Synthetic Data on Model Fairness

10.5334/johd.32

MuLTa-Telegram: A Fine-Grained Italian and Polish Dataset for Hate Speech and Target Detection

Elisa Leonardelli

Camilla Casula

Sebastiano Vecellio Salto

Joanna Ewa Bak

Elisa Muratore

Anna Kolos

Thomas Louf

Sara Tonelli

0 0 Fondazione Bruno Kessler (FBK) , Via Sommarive 18, 38123 Trento , Italy 1 NASK National Research Institute , ul. Kolska 12, 01-045 Warsaw , Poland

2024

14 16 24

This paper introduces the MuLTa-Telegram dataset, a Multi- Lingual and multi-Target dataset specifically developed to detect hate speech on Telegram, an understudied yet influential platform in which extremist and fringe content can be found. The dataset contains about 4,000 Telegram messages in Italian and Polish, annotated for the presence of hate speech and its targets, including also target identity group mentions even when no hate is expressed. Unlike most existing hate speech datasets, which focus on a single target group, our dataset is explicitly designed to capture a diverse range of targets, ensuring a broad and representative sample of hateful (and non-hateful) content. Our work addresses the growing need for updated hate speech datasets, as many existing resources are based on platforms that no longer provide research-friendly data access, such as Twitter (X ). Crucially, we show that training on existing out-of-domain data leads to poor results on Telegram data, underscoring the necessity of in-domain datasets for efective hate speech detection. We evaluate hate speech classification setups in an extensive series of experiments in both languages, including multilingual, multi-task, and LLM-based approaches. We find that incorporating target information leads to the best performances, enabling multilingual generalization. On the contrary, classification of specific targets shows much room for improvement across setups. . Warning: this paper contains examples that may be ofensive or upsetting.

eol>Telegram Hate speech Targets Polish Italian

1. Introduction

content moderation [8].

We present the MuLTA-Telegramdataset, a MultiWhile a large body of research has focused on hate speech Lingual and multi-Target dataset developed for the dedetection in recent years, a significant part of it has been tection of hate speech and its targets on Telegram. It centered on English, especially work that considers dif- consists of 2,000 messages in Italian and around 2,000 in ferent possible targets of hate [1, 2]. Furthermore, while Polish, annotated for hate speech and its targets, as well some datasets containing target annotations exist, many as for target identity group mentions. of them only focus on one specific kind of hate speech Crucially, the dataset ensures broad target coverage, as target (e.g., Sanguinetti et al. [3], Bhattacharya et al. [4]). we employed a matrix of keywords to pre-select messages

The most widely used data source in past research for from a large pool of Telegram data and included content this kind of data has been Twitter (now X ). However, representative of 9 minorities target-categories of interhate speech detection systems have been found to be est. To ensure that each category is represented across subject to performance deterioration when applied to the dataset as a whole and not only within the subset a diferent domain from the one they were trained on, of hateful messages, we annotate the target group mene.g., a diferent social network [ 5, 6] or a diferent time tions, i.e. each message is further assessed on whether period [7]. It is therefore important to study diferent its content addresses one or more targets, regardless of platforms and to develop datasets that can be applied to whether the message is hateful or not.1 diferent use cases. Telegram is an understudied platform Moreover, studying Polish-language content fills a critcompared to Twitter or Facebook, yet it plays a significant ical gap, given the scarcity of hate speech datasets availrole in fringe and extremist communication, especially in able and especially given the growing disinformation light of its anonymity preservation features and reduced activity in Central and Eastern Europe [9]. Our aim is that of creating a resource that can be used CLiC-it 2025: Eleventh Italian Conference on Computational Linguis- to train eficient hate speech detection models for textual tics, September 24 — 26, 2025, Cagliari, Italy data, in particular in Italian and Polish, from Telegram, *$Coerlereosnpaornddelilni@gafbukth.eour.(E. Leonardelli); ccasula@fbk.eu and in the presence of content related to targeted identity (C. Casula); svecelliosalto@fbk.eu (S. Vecellio Salto); jebak@fbk.eu groups. After presenting the dataset and its construction, (J. E. Bak); emuratore@fbk.eu (E. Muratore); anna.kolos@nask.pl we run a series of experiments under a variety of setups, (A. Kolos); tlouf@fbk.eu (T. Louf); satonelli@fbk.eu (S. Tonelli) © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License 1Target mentions and target of hate might not coincide.

Attribution 4.0 International (CC BY 4.0). including using existing datasets for this task from other social media and LLM annotations, in order to assess the performance of models that are commonly used for this task on our Polish and Italian expert-annotated Telegram data.

The full data and annotations can be obtained at this link: github.com/dhfbk/MuLTa-Telegram.

API allows large-scale data collection from public channels. Channels are pages that broadcast self-contained streams of public messages, with posting typically limited to page administrators. Beyond the main chat, channels commonly include additional discussion sections where users can interact with both administrators and one another. We collected data from all these sections.

2. Background Most existing labeled datasets for abusive language de

tection are created starting from Twitter (X ) data, mostly because Twitter data collection APIs were for a long time the easiest to access compared to other platforms [10]. Other less widely used sources for data include Facebook [11, 12] and Instagram [13, 14], while Telegram has been generally overlooked in past work on this topic. Indeed, the only existing resource including hate speech data from Telegram contains automatically-annotated English data from only one Telegram source channel [15], in spite of Telegram having been found to harbor communities that exhibit high levels of toxicity and disinformation across diferent countries due to its loose data moderation policies [8, 16].

English is the main language represented in existing abusive language datasets [10]. While a number of datasets for detecting abusive language and hate speech in Italian exist, a large number of them consider specific targets or hate-related phenomena, such as racism and xenophobia [17, 3], misogyny [18, 19], religious hate [20], and homotransphobia [21, 22, 23], with some other types of targets often being underrepresented in existing data even for English [24]. Conversely, the available resources for abusive language detection in Polish are rather scarce. The first dataset we could find is described only in a manuscript in Polish from 2017 [25] and it has been publicly available on HuggingFace since 2021.2 This dataset, however, lacks a detailed description in English. The other available datasets contain posts from Twitter annotated for cyberbullying [26] or ofensive comments sourced from a social networking service [27]. We therefore aim at creating a hate speech dataset specifically for Telegram data in Polish and Italian, including expert annotations over 9 categories of identity groups that can be the target of hate.

3.1. Data Collection Strategy

We start from an initial seed set of public Telegram channels known to spread disinformation or hate, curated by a panel of international domain experts in the consortium of the Hatedemics European project.3 As Telegram has a very limited keyword-based search feature, matching only channel titles, we expand these seed channel names using a snowballing approach [28]. This kind of approach consists in first searching for the titles of the seed set channels, and then leveraging Telegram’s own user-overlap-based recommendations feature4 to grow the initial set of channels.

Due to processing constraints, we aim at focusing message retrieval on the most potentially relevant channels for our purpose, identified by the total number of channel recommendations they receive and their distance from seed channels. This distance is defined as the minimum number of recommendation steps required to reach a given channel from a seed. From the top 150 channels in terms of distance from the channels in the seed set and the number of times they were recommended, we retrieve all publicly available messages and associated chat conversations from Jan 1, 2022 to Jan 1, 2023, totaling around 2.5 million messages for Italian and 1.1 million messages for Polish.

3.2. Data Anonymization With the aim of preserving privacy as much as possible, sensitive information in messages (emails, phone numbers, mentions, etc.) is detected via regular expressions and replaced with placeholders.

Aside from text content, all other information on messages and channels, including channel titles and descriptions, is deleted. This step is carried out to prevent direct identification of the chats in Telegram and to comply with applicable privacy protection regulations.

3. Data Selection and Annotation 3.3. Data Pre-Selection for Annotation

In this section, we detail the construction process of our Since we aim to detect hateful language in particular dataset. Public Telegram channels are accessible through across multiple vulnerable social groups, in collaboraa freely available API, originally designed for bot devel- tion with civil society domain experts from NGOs and opment. While not initially intended for research, this

2https://huggingface.co/datasets/community-datasets/hate_

speech_pl

3https://hatedemics.eu/

4Via GetFullChannelRequest and GetChannelRecommendation in Telethon: https://github.com/LonamiWebs/Telethon research institutions, we have defined a set of common 3.4. Data Annotation targets of hate in the countries and contexts we take under consideration, including People with Disabilities; We employ expert Polish and Italian annotators, two ItalLGBTQ+ Individuals; Religion: Jews, Muslims, Chris- ian native speakers (one male, age 26, and one female, tians; Ethnicity/Origin: People of Color, Romani people, age 41) and two Polish native speakers (one female, age Other (including Migrants); Women. These target iden- 22, and one female, age 37). Annotators were asked to tity groups have been partially adapted from the ones indicate whether a message contained hate speech. If used in the Measuring Hate Speech corpus [1, 2], which hate speech was present, annotators were required to uses US-centric identity categories, adjusting them to our specify the target of the hate speech from our predefined European context. list of categories. To gain a deeper understanding of the

We then developed a keyword matrix consisting of dataset’s content and to ensure that the dataset covered a 145 group-specific terms. 5 These keywords have been broad range of target identity categories not only in the selected based on prior domain expertise and preliminary hateful part of the dataset, annotators were also asked to corpus exploration. label the target mentions of each message among a set of

Aiming at obtaining a high representation of content predefined categories. 6 An overview of the annotation related to the target identity groups we identified, we scheme that was used for annotating both the Italian and then carried out a pre-selection step. From the entire the Polish sections of the dataset is provided in Figure Telegram data collection, we pre-selected for manual an- 1, while the full annotation guidelines are reported in notation about 1,500 posts (75% of the entire dataset) Appendix 8.1. This process resulted in two comprehencontaining at least two distinct keywords (from our ma- sive databases containing messages annotated for both trix) associated with the same target group. This is done hateful and non-hateful content, targeting various idenusing a string-matching filter. We then construct the re- tity groups. A numerical breakdown of their content is maining 25% of the dataset by randomly selecting posts provided in Table 1, 2 and in Figure 2. to manually annotate, in order to create a more represen- The databases mainly contain non-hateful messages, with tative overall sample of random messages on Telegram, the Italian one featuring almost as many hate messages which of course might not contain target-related words. as the Polish one. This may be due to diferent use of Telegram or to a greater number of controversial, yet not explicitly hateful, messages in the Polish database, which includes many discussions related to the Russia-Ukraine

5The keyword matrix is available on github: https://github.com/

dhfbk/MuLTa-Telegram.

6Target mentions assignation and target of hate might not coincide.

war. When analyzing the targets of hate speech, most messages are directed at ethnic groups, with a prevalence of attacks against people of color in Italian and against Ukrainian refugees in Polish, followed by those targeting LGBTQ+ identities. While a significant portion of hateful messages targets either groups not represented in the selected taxonomy (Other) or expresses hate without a specific target ( No Target), there is little representation of hate toward the remaining identity categories. Inter-annotator agreement was calculated for each language on a sub-sample of 200 posts using Krippendorf’s alpha, annotated each by two expert annotators who are native speakers of Italian or Polish. The Polish portion of the dataset showed an IAA of 0.41, while the Italian one 0.68. These numbers, while low, are in line with previous work on similar topics, especially considering that our annotators had no chance to discuss and revise their annotations together, as they worked asynchronously. For instance, Basile et al. [29] showed an inter-rater agreement for aggressiveness in Spanish of 0.47.

4. Classification Experiments As a way to benchmark our newly-created dataset, and

to explore diferent strategies for classification of hate speech in Italian and Polish on Telegram data, we devise a series of experiments using diferent experimental setups. These experiments include fine-tuning BERT-base classifiers (Sec. 4.1), multi-task models (Sec. 4.2), and LLM prompting (Sec. 4.3). To evaluate approaches across diferent experiments in a comparable way, 35% of the manually annotated dataset was withheld and used as test set for each language. The remaining 1,300 manually annotated items (65%) were used to fine-tune models where necessary (i.e., Experiments 2 and 4). Each experiment was replicated with a consistent setup across both languages.

4.1. Supervised Hate Speech Detection via BERT Fine-Tuning In this set of experiments we fine-tune existing monolin

gual (Exp. 1,2,3) and a multilingual (Exp.4) BERT-based language models [30].

Regarding monolingual models, for Polish we conducted a series of experiments using three distinct BERT-based models for the Polish language: we used a generalpurpose Polish BERT-model (BERT-base-pl)7 and two models trained for identifying specific types of ofensiveness, namely cyberbullying (BERT-cb-pl)8 and hate speech (BERT-hs-pl).9

For Italian, we fine-tuned a general-purpose Italian BERT-based model (BERT-base-it),10 a BERT-based model pre-trained on Italian data from Twitter (AlBERTo) [31],11 and a binary hate speech classification model for Italian social media text (Hate-ita) [32].12

For fine-tuning the models we employed the MaChAmp library [33], an open-source tool designed to simplify flexible tasks configuration, multitask and multilingual fine-tuning of transformer-based language models. All the evaluated models were fine-tuned for 5 epochs using a single GPU, applying the default hyperparameters provided by MaChAmp (see Appendix 8.2). To address class imbalance, we assign equal weight to each class during training, ensuring that minority classes are not underrepresented.

Experiment 1: Training on Existing Datasets Our ifrst experiment aims to evaluate the performance of models fine-tuned on other publicly available datasets on our manually-annotated Telegram test data. They serve as a baseline.

For Italian, we use 2,000 examples from 4 existing datasets that represent some of the targets we consider in our work: the AMI dataset [34], focused on misogyny; the Haspeede dataset [35], focused on hateful content against Muslims, immigrants and Roma people; the HODI dataset [23], a dataset for detection of homotransphobia in Italian; and the Religous Hate dataset [36], an Italian dataset that includes Anti-Judaism, Anti-Christianity and anti-Islam social media posts.13

For Polish, we could find 3 datasets total related to online abusive content. We decided to discard the oldest one [25] due to lack of available information on its construction (data collection, annotation, content) and 7dkleczek/bert-base-polish-uncased-v1 8ptaszynski/bert-base-polish-cyberbullying 9dkleczek/Polish-Hate-Speech-Detection-Herbert-Large 10dbmdz/bert-base-italian-cased 11m-polignano-uniba/bert_uncased_L-12_H-768_A

12_italian_alb3rt0 12MilaNLProc/hate-ita 13Given that this dataset contains several targets in addition to religion-focused ones, we filtered it to retain only religious targets. because after a preliminary manual inspection our annotators found the data to be noisy (e.g., HTML code was found in the middle of the texts).

This left us with two datasets for hate speech, which we use in combination in our experiments: the Cyberbullying dataset [26] and the BAN-PL dataset [27]. These datasets difer significantly in both their definitions of hate and their annotation procedures. For instance, the Cyberbullying dataset contains generally milder or less severe phenomena in its annotations, as it is focused on the somewhat broader phenomenon of cyberbullying compared to hate speech. In contrast, BAN-PL considers a message as Not Hateful if it remained online for more than two days without being removed by a platform moderator. Only a small subset of the removed comments was then manually annotated as Hateful.

Given these diferences, we opted to use only the manually annotated hateful samples from BAN-PL, which are more aligned with our definition of hate speech. For the neutral (non-hateful) class, we combined equal portions of BAN-PL and Cyberbullying data, ensuring a balanced yet representative dataset composition.

Experiment 2: Fine-tuning on Manually Annotated Data This is the main experiment in which we evaluate the potential usefulness of our dataset for training hate speech detection models. We fine-tune the models on 1,300 manually annotated items from our dataset for each language. The task setup is single-task, focusing exclusively on the hate speech task. Since the annotated data is in-domain, we expect this setup to yield better performance on our Telegram test data compared to Experiment 1, which used out-of-domain data (i.e., data from diferent platforms).

Experiment 3: Fine-tuning on LLM-Annotated Data (LLaMA) To investigate whether LLMs can serve as a viable alternative to manual annotation in hate speech detection tasks on Telegram, we devise an experiment in which we use LLaMA 3.1 70B Instruct as an automated annotator. We ask the model to annotate the same train split of our dataset as in Experiment 2, by prompting the model with a summary of our hate speech annotation guidelines. For both languages, we then fine-tune the same BERT-based models as in Experiment 2, but this time on the LLM-annotated data. We then evaluate the trained models on the test sets.

Experiment 4: Multilingual BERT A multilingual approach can leverage shared representations across languages. In this context, a model is required to generalize patterns that may be strongly language- and contextdependent, a non-trivial task. Nonetheless, this strategy ofers several advantages: it can boost performance in low-resource settings through cross-lingual transfer, and it can improve robustness by exposing the model to more diverse inputs during training.

To test the viability of this approach, we merge the two manually annotated train splits of the Polish and Italian datasets to fine-tune a multilingual BERT base model. 14 The performance of the model for classification of hate speech is then evaluated on the Italian and Polish test sets separately.

4.2. Multi-task Setup for Hate Speech and Target Detection

16https://huggingface.co/meta-llama/Llama-Guard-3-8B Experiment 5 Given the hierarchical relationship between hate speech detection and target identification, we adopt a multi-task learning approach to jointly model Figure 3: The design of the multitask setup used for experithese tasks, under the assumption that each task can help ment 5. generalization on the other. In this multi-task learning paradigm, schematically illustrated in Table 3, the model can jointly optimize for diferent tasks, allowing all tasks Instruct (Exp. 6) with our annotation guidelines and ask to benefit from shared signals captured through a com- it to label each test example as hateful or not. We then mon representation, which is jointly fine-tuned during also evaluate LLaMA Guard (Exp. 7), using no prompt as training. This approach is motivated by prior work show- it is a model explicitly made to detect inappropriate or ing that training models on related tasks simultaneously toxic content.16 can lead to better performance than training them in While this kind of experimental setup is useful for isolation [37]. This setup should allow to improve gener- comparison purposes, it should be noted that it is highly alization and stability of the hate speech task, but also to ineficient, and unlikely to be feasible and scalable when automatically predict the targets of hate speech, a task large amounts of data need to be processed at once, as that as a single task would be extremely dificult to ad- its computational speed and eficiency is much lower dress with the currently available data, given the scarcity than that of a BERT-based model fine-tuned on taskof targets (see Table 3.4). specific data. Such models are particularly well-suited for

In this setting, hate speech detection serves as the social science research, where cost-efective processing primary task, since the presence of a target group in of millions of messages is often required to study trends a message depends on the detection of hate speech in in online hate and its societal impact. Given that our goal the first place, while target identification is treated as is the development of hate speech classification models a secondary task. Specifically, we used our pre-trained that can be employed in real-life scenarios, we consider models as the shared encoder for both tasks, while a LLM-based classification out of this scope. separate decoder is utilized by each task. We incorporate diferent loss weighting to the two tasks, in order to represent the hierarchy of primary and auxiliary.15

5. Experimental Results and Discussion 4.3. Prompt-Based Hate Speech Detection via LLMs In this section, we present the results obtained in our

experiments. A summary of the results across all exExperiments 6 and 7: Llama We then aim at evaluat- perimental setups is shown in Tables 3 and 4. For the ing the performance of LLMs on our Telegram annotated experiments using multiple models (Exp. 1, 2, 3, and 5), data in Italian and Polish. For this, we use LLaMA [38], we report average macro-F1 scores, while the detailed since it possesses some multilingual capabilities, espe- results are in Appendix 8.3. As a first general observacially in Italian. In particular, we prompt LLaMA 3.1 70B tion, Polish and Italian show consistent results patterns across experiments, which allows us to derive meaningful observations across both languages. 14google-bert/bert-base-multilingual-cased 15The multi-task learning loss is computed as = ∑︀ , where is the loss for task and the corresponding weighting parameter, and we provide a diferent loss weight for the auxiliary tasks.

For the main task, we empirically set = 0.7, and = 0.3 for the auxiliary task.

5.1. Hate Speech Detection

The results of the binary classification of hate speech are reported in Table 3. In-domain training (Exp. 2) consistently outperforms the models trained on out-of-domain data (Exp. 1) across both languages, underscoring the necessity of domain-specific data. Notably, out-of-domain training results in the worse classification performance for Polish and the second worse for Italian.

Conversely, the training of multilingual BERT (Exp. 4) resulted in very low performance overall, suggesting that models trained across multiple languages can struggle to generalize efectively for this task. Regarding specific model performances, for both languages, fine-tuning a model already fine-tuned for hate speech ( Hate-ita and BERT-hs-pl, for Italian and Polish respectively) leads to the best results within models across all experiments.17

The Llama-based experiments, including Exp. 3, in which Llama was used to annotated data for training a BERT-based classifier, and Exps. 6 and 7, in which Llama (70B Instruct and Llama Guard) predicted test set labels through prompting, yielded intermediate performance.

While generally better than out-of-domain approaches, they consistently fell short of models trained on expert human annotations. Llama-based predictions performed consistently worse in the case of Polish, possibly due to the model lacking oficial support for the Polish language.

The multi-task setup (Exp. 5), on the other hand, improved hate speech detection performance, achieving the highest macro-F1 scores for both languages.

5.2. Target Identification Regarding the parallel task of target of hate identifica

tion, while overall performances appear high in both languages (Accuracy: Polish 87%, Italian 82%), this result is driven primarily by the model’s strong performance on the majority class, i.e., samples in the non-hate class, therefore without target, which heavily skews the results. Macro-averaged F1 scores on each target are very low, as shown in Table 4, indicating very poor performance on 17For more detailed results see Appendix 8.3. minority classes prediction (hateful and targeted examples). Notably, for Italian the most frequent target class Ethnicity/Origin: Person of Color is consistently recognized (with an F1-score of almost 0.70), and performance on the moderately frequent class LGBTQ+ depends on the model (F1 scores range from 0.00 to 0.41), while the other target groups are entirely or almost entirely disregarded. For Polish, the target LGBTQ+ is classified more accurately than the others (F1 0.29 up to 0.69).

5.3. Additional Multilingual Experiments

Given the very low performance of the multilingual model (Italian: 0.589, Polish: 0.564 F1), we sought to investigate potential causes for this. Although diferent languages might express hate diferently, and context can vary, one possible factor that could explain the low performance of multilingual models is annotation inconsistencies between the Italian and Polish datasets, especially given the dificulty and subjectivity of the type of annotation.

To investigate this, we repeated Experiment 4 by finetuning multilingual BERT, this time using the data from Experiment 3, which was annotated via LLM. These LLMgenerated annotations should in principle be more homo- speech, especially in the absence of profanity. geneous across languages, assuming the system is using In the second case, models overestimated hatred in the same criteria given the same prompt instructions. In messages expressing controversial opinions (“non c’è nesthis setup, performance improved notably (Italian: 0.674, sun isolamento perché non esistono i virus” [“There’s no Polish: 0.657 F1), supporting our hypothesis. isolation because viruses don’t exist”], “Lepiej dla Rus

Nonetheless, we were interested in performance of kich, kto lubi ten shit?” [“Better for the Russians, who our the best performing scenario, i.e. on high-quality, likes that shit?”]) or sensitive topics (“Una pacca sul manually annotated data and multitask setup. We re-ran sedere non autorizzata è una molestia sessuale” [“An unthe experiment using multitask learning (i.e., jointly pre- warranted slap on the butt is sexual harassment”]). Addicting hate speech and its target) on the human-labeled ditionally, models struggled with relatively mild insults datasets. This yielded the best results for both languages containing no targets in the given context (“Nikt nie po(Italian: 0.706, Polish: 0.726 F1). może.. Bandyci bezkarni..” [“No one will help.. Bandits

These findings suggest that inconsistencies among an- unpunished..”]), idiomatic use of expressions related to notators across languages can hamper results of multilin- disabilities, which are lexicalized in spoken Italian, albeit gual models, but learning on richer data can help, since unkind (“purtroppo non c’è peggior sordo di chi non vuol an auxiliary task can help generalization by providing sentire e peggior cieco di chi non vuol vedere” [“Unfortumore training signal and regularizing the model. nately, there’s no one more deaf than those who don’t want to hear, and no one more blind than those who don’t want to see”]), or critiquing hateful messages (“tipico cristiano 6. Manual Qualitative Analysis ipocrita...va in chiesa però vorrebbe sterminare chi crede nel Islam” [“Typical hypocritical Christian...goes to church but would like to exterminate those who believe in Islam”]).

Finally, some cases appear to be simply annotation errors (“@<user> finalmente Instagram mi dà le pubblicità giuste” [“@<user> finally Instagram shows me the right ads”]).

To understand the diferences between the Italian and

Polish data, we conducted a manual qualitative analysis.

First, we noticed a disparity in the distribution of hateful messages targeting Ethnicity/Origin. While the Italian dataset shows a predominance of messages directed at people of color (99 instances, compared to 9 in the Polish dataset), the subcategory Other (Migrants) appears less frequently in Italian (39 instances) than in Polish (82). 7. Conclusions These patterns likely reflect the socio-political context at the time of data collection, with immigration by people of In this paper, we introduced MuLTa-Telegram, a novel color being a prominent issue in Italy and the presence of multilingual dataset for hate speech and target detection, Ukrainian refugees being central in Poland. This under- containing data from Telegram in both Italian and Polish. scores the importance of collecting context-sensitive data, The dataset includes anotations across 9 hate speech particularly at the socio-cultural level, as each context target categories, in contrast with the majority of availcan exhibit diferent patterns and phenomena. able datasets, which are often limited to single targets.

We also investigated the discrepancies between auto- Moreover, we ensured the presence of target-related conmatic prediction and human annotation. We identified 29 tent also in the non-hateful part of the dataset, with about Italian messages and 30 Polish ones which the annotators 75% of the messages containing target-relevant content deemed hateful and the models classified otherwise. For (see Figure 2). Furthermore, while the vast majority of the opposite case, there were 70 messages in Italian and hate speech research has been conducted on English, we only 3 messages in Polish. focused on underrepresented languages. In the first case, models seem unable to detect hateful We conducted an extensive set of experiments, showcontent when not presented in a standard explicitly ofen- ing that the fine-tuning of BERT-based models on out-ofsive form. Performance tends to be low when examples domain hate speech classification data leads to poor perinclude hashtags (“Islam, in Afghanistan torna la sharia formance on Telegram data, while training on in-domain [...] #religionedipace...” [“Islam, in Afghanistan sharia resources consistently outperforms it. This draws atis back [...] #religionofpeace..”]); dehumanization being tention to the limitations of relying on datasets from implied (“i roma [...] non sono veri esseri umani, punto” platforms like Twitter, which are no longer reliably ac[“the Roma [...] are not real human beings, period”], “I cessible for academic research, reinforcing the need for kulka we własny łeb” [“And a bullet to your own head”]); updated and diversified resources like MuLTa-Telegram. slurs in non-standard language varieties (“Na Zengara However, results on the detection of individual targets rein pratica” [“ Basically about a gypsy”]); and occasionally mained poor, particularly for more scarcely represented established slurs (e.g., Italian n-word). Models appear groups. This underscores the persistent dificulty of deless proficient than humans in detecting implied hate tecting hate directed at less-represented communities.

Acknowledgments The work of C. Casula, A. Kołos, E. Leonardelli, E. Mu

ratore and S. Tonelli has been supported by the European Union’s CERV fund under grant agreement No. 101143249 (HATEDEMICS). This research was also partially supported by the European Union under the Horizon Europe project AI-CODE, GA No. 101135437.

Furthermore, in a multilingual setup, we showed how the addition of a parallel task predicting targets greatly improves performances for hate speech classification, enabling the model to generalize across languages. We included both LLaMA and LLaMA Guard in our evaluation to explore how general-purpose and safety-focused systems perform on our task. LLaMA Guard, despite its safety orientation, performs poorly in this out-of-domain context, while LLaMA shows strong performance on Italian, but its accuracy drops on Polish data, likely due to limited language coverage during pretraining. These results emphasize the need for both domain- and languagespecific adaptation.

While we fine-tuned transformer-based models directly on classification tasks using Telegram data, future work could explore domain-adaptive pretraining via Masked Language Modeling on unlabeled Telegram messages. This step could improve the encoder’s alignment with the linguistic characteristics of the platform, potentially enhancing classification performance.

We hope this dataset will help foster research into hate speech detection for underrepresented languages and platforms. Future work will explore expanding the dataset to more languages and domains, as well as improving the detection of fine-grained targets of hate. tweets, in: CEUR workshop proceedings, volume • Text can be hateful even if the target is implicit, 2481, CEUR, 2019, pp. 1–6. as long as it’s implied by the context. [32] D. Nozza, F. Bianchi, G. Attanasio, Hate-ita: Hate • It is not hateful if the target is an organization speech detection in italian social media text, in: Pro- and not its members. ceedings of the Sixth Workshop on Online Abuse • Profanities alone do not imply hatefulness, unless and Harms (WOAH), 2022, pp. 252–260. the tone is aggressive or the message is clearly [33] R. Van Der Goot, A. Üstün, A. Ramponi, I. Sharaf, directed toward someone (e.g., “Aspetta che li miB. Plank, Massive choice, ample tasks (machamp): A nacciano per bene e poi vedi se accettano...” ). toolkit for multi-task learning in nlp, arXiv preprint • False or debatable statements do not imply hatearXiv:2005.14672 (2020). fulness, but messages that erase identities (e.g., [34] E. Fersini, D. Nozza, P. Rosso, et al., Ami@ “esistono solo due sessi” ) are hate speech. evalita2020: Automatic misogyny identification, in: • References to individuals or citizens (excluding Proceedings of the 7th evaluation campaign of Natu- military groups) as nazis in the context of the ral Language Processing and Speech tools for Italian Russia-Ukraine war are to be considered hate (EVALITA 2020), (seleziona...), 2020. speech. [35] M. Sanguinetti, G. Comandini, E. Di Nuovo, • In Polish:

S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, I. Russo, Haspeede 2@ evalita2020: Overview of – If “Banderowiec” refers to supporters of the evalita 2020 hate speech detection task, Eval- Stepan Bandera (OUN), it is not hate uation Campaign of Natural Language Processing speech.

and Speech Tools for Italian (2020). – If “Banderowiec” is used to refer to the [36] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Address- entire Ukrainian nation or other social ing religious hate online: from taxonomy creation groups in a hateful or ofensive way, it is to automated detection, PeerJ Computer Science 8 hate speech.

(2022) e1128. [37] R. Caruana, Multitask learning, Machine learning Target Detection

28 (1997) 41–75. [38] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Ka- When text contains hate speech, specify its target. Possidian, A. Al-Dahle, A. Letman, A. Mathur, A. Schel- ble categories include: ten, A. Vaughan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024). [39] R. van der Goot, A. Üstün, A. Ramponi,

I. Sharaf, B. Plank, Massive choice, ample tasks (machamp): A toolkit for multi-task learning in nlp, 2021. URL: https://arxiv.org/abs/2005.14672. arXiv:2005.14672. • Ethnicity/Origin: People of Color, Romani, or Other

(Migrants) • LGBTQ+ • People with Disability • Religion: Jewish, Christians, Muslims, Other • Women • Other • No Target

8. Appendices 8.1. Annotation Guidelines In this section we report the annotation guidelines. Hate Speech Detection Assess whether the message contains hateful language.

Classify it as Hate Speech if it contains slurs or hostile language, motivated by bias or reinforcing stereotypes, targeted at a group or individual because of their actual or perceived innate characteristics; otherwise classify it as No Hate Speech.

• Reported speech is not hate speech.

Choose the most appropriate category. Select other for any specific target not included in any other category. Select No Target for occurrences of hate speech not directed at any specific group.

• In cases where multiple labels apply, prioritize the identity that is most harmed. • The target must be explicitly addressed, not implied (e.g., by referring to stereotypical associations): – Talking about Arabic/Muslim countries or

Islam does not imply the Muslim target. – Talking about Africa or African migration

does not imply the People of Color target. – Mentions of disability imply the People with Disability target.

Mention of Target Group Detection Annotate if one or more of the following groups are addressed in the text. Assign the corresponding label(s). Multiple groups may be annotated for a single message. Possible target groups include:

• Ethnicity/Origin: People of color, Romani, Other (Migrants) • LGBTQ+ • People with Disability • Religion: Jews, Muslims, Christians, Other • Women • None

If none of these target groups are addressed, assign the

label None. A group should be annotated if it is explicitly mentioned or implicitly clear from the context. Annotate a group even if it is not the main focus of the message.

8.2. Hyperparameters In this section we described the parameters used for BERT-based experiments. In this section, in Tables 6 and 7 for Italian and Polish respectively, we report the full detailed results for Experiments 1,2,3 and 5.

Non-Hate

avg Hate

avg F1

Experiments Exp1 - out of domain data Exp2 - manually annotated data Exp3 - Llama as annotator Exp 4 - multilingual Exp5 - multitask setup Exp 6 - Llama Exp 7 - Llama Guard Model Declaration on Generative AI F1