<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>On the Impact of Hate Speech
Journal of Pragmatics 205 (2023) 63-77. URL: Synthetic Data on Model Fairness</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5334/johd.32</article-id>
      <title-group>
        <article-title>MuLTa-Telegram: A Fine-Grained Italian and Polish Dataset for Hate Speech and Target Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Leonardelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Casula</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastiano Vecellio Salto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joanna Ewa Bak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisa Muratore</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kolos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Louf</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Tonelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Via Sommarive 18, 38123 Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NASK National Research Institute</institution>
          ,
          <addr-line>ul. Kolska 12, 01-045 Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>14</volume>
      <fpage>16</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>This paper introduces the MuLTa-Telegram dataset, a Multi- Lingual and multi-Target dataset specifically developed to detect hate speech on Telegram, an understudied yet influential platform in which extremist and fringe content can be found. The dataset contains about 4,000 Telegram messages in Italian and Polish, annotated for the presence of hate speech and its targets, including also target identity group mentions even when no hate is expressed. Unlike most existing hate speech datasets, which focus on a single target group, our dataset is explicitly designed to capture a diverse range of targets, ensuring a broad and representative sample of hateful (and non-hateful) content. Our work addresses the growing need for updated hate speech datasets, as many existing resources are based on platforms that no longer provide research-friendly data access, such as Twitter (X ). Crucially, we show that training on existing out-of-domain data leads to poor results on Telegram data, underscoring the necessity of in-domain datasets for efective hate speech detection. We evaluate hate speech classification setups in an extensive series of experiments in both languages, including multilingual, multi-task, and LLM-based approaches. We find that incorporating target information leads to the best performances, enabling multilingual generalization. On the contrary, classification of specific targets shows much room for improvement across setups. . Warning: this paper contains examples that may be ofensive or upsetting.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Telegram</kwd>
        <kwd>Hate speech</kwd>
        <kwd>Targets</kwd>
        <kwd>Polish</kwd>
        <kwd>Italian</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>content moderation [8].</p>
      <p>We present the MuLTA-Telegramdataset, a
MultiWhile a large body of research has focused on hate speech Lingual and multi-Target dataset developed for the
dedetection in recent years, a significant part of it has been tection of hate speech and its targets on Telegram. It
centered on English, especially work that considers dif- consists of 2,000 messages in Italian and around 2,000 in
ferent possible targets of hate [1, 2]. Furthermore, while Polish, annotated for hate speech and its targets, as well
some datasets containing target annotations exist, many as for target identity group mentions.
of them only focus on one specific kind of hate speech Crucially, the dataset ensures broad target coverage, as
target (e.g., Sanguinetti et al. [3], Bhattacharya et al. [4]). we employed a matrix of keywords to pre-select messages</p>
      <p>The most widely used data source in past research for from a large pool of Telegram data and included content
this kind of data has been Twitter (now X ). However, representative of 9 minorities target-categories of
interhate speech detection systems have been found to be est. To ensure that each category is represented across
subject to performance deterioration when applied to the dataset as a whole and not only within the subset
a diferent domain from the one they were trained on, of hateful messages, we annotate the target group
mene.g., a diferent social network [ 5, 6] or a diferent time tions, i.e. each message is further assessed on whether
period [7]. It is therefore important to study diferent its content addresses one or more targets, regardless of
platforms and to develop datasets that can be applied to whether the message is hateful or not.1
diferent use cases. Telegram is an understudied platform Moreover, studying Polish-language content fills a
critcompared to Twitter or Facebook, yet it plays a significant ical gap, given the scarcity of hate speech datasets
availrole in fringe and extremist communication, especially in able and especially given the growing disinformation
light of its anonymity preservation features and reduced activity in Central and Eastern Europe [9].
Our aim is that of creating a resource that can be used
CLiC-it 2025: Eleventh Italian Conference on Computational Linguis- to train eficient hate speech detection models for textual
tics, September 24 — 26, 2025, Cagliari, Italy data, in particular in Italian and Polish, from Telegram,
*$Coerlereosnpaornddelilni@gafbukth.eour.(E. Leonardelli); ccasula@fbk.eu and in the presence of content related to targeted identity
(C. Casula); svecelliosalto@fbk.eu (S. Vecellio Salto); jebak@fbk.eu groups. After presenting the dataset and its construction,
(J. E. Bak); emuratore@fbk.eu (E. Muratore); anna.kolos@nask.pl we run a series of experiments under a variety of setups,
(A. Kolos); tlouf@fbk.eu (T. Louf); satonelli@fbk.eu (S. Tonelli)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License 1Target mentions and target of hate might not coincide.</p>
      <p>Attribution 4.0 International (CC BY 4.0).
including using existing datasets for this task from other
social media and LLM annotations, in order to assess the
performance of models that are commonly used for this
task on our Polish and Italian expert-annotated Telegram
data.</p>
      <p>The full data and annotations can be obtained at this
link: github.com/dhfbk/MuLTa-Telegram.</p>
      <p>API allows large-scale data collection from public
channels. Channels are pages that broadcast self-contained
streams of public messages, with posting typically limited
to page administrators. Beyond the main chat, channels
commonly include additional discussion sections where
users can interact with both administrators and one
another. We collected data from all these sections.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>Most existing labeled datasets for abusive language de</title>
        <p>tection are created starting from Twitter (X ) data, mostly
because Twitter data collection APIs were for a long time
the easiest to access compared to other platforms [10].
Other less widely used sources for data include Facebook
[11, 12] and Instagram [13, 14], while Telegram has been
generally overlooked in past work on this topic. Indeed,
the only existing resource including hate speech data
from Telegram contains automatically-annotated English
data from only one Telegram source channel [15], in spite
of Telegram having been found to harbor communities
that exhibit high levels of toxicity and disinformation
across diferent countries due to its loose data
moderation policies [8, 16].</p>
        <p>English is the main language represented in
existing abusive language datasets [10]. While a number of
datasets for detecting abusive language and hate speech
in Italian exist, a large number of them consider
specific targets or hate-related phenomena, such as racism
and xenophobia [17, 3], misogyny [18, 19], religious hate
[20], and homotransphobia [21, 22, 23], with some other
types of targets often being underrepresented in
existing data even for English [24]. Conversely, the available
resources for abusive language detection in Polish are
rather scarce. The first dataset we could find is described
only in a manuscript in Polish from 2017 [25] and it has
been publicly available on HuggingFace since 2021.2 This
dataset, however, lacks a detailed description in English.
The other available datasets contain posts from Twitter
annotated for cyberbullying [26] or ofensive comments
sourced from a social networking service [27]. We
therefore aim at creating a hate speech dataset specifically
for Telegram data in Polish and Italian, including expert
annotations over 9 categories of identity groups that can
be the target of hate.</p>
        <sec id="sec-2-1-1">
          <title>3.1. Data Collection Strategy</title>
          <p>We start from an initial seed set of public Telegram
channels known to spread disinformation or hate, curated by
a panel of international domain experts in the
consortium of the Hatedemics European project.3 As Telegram
has a very limited keyword-based search feature,
matching only channel titles, we expand these seed channel
names using a snowballing approach [28]. This kind of
approach consists in first searching for the titles of the
seed set channels, and then leveraging Telegram’s own
user-overlap-based recommendations feature4 to grow
the initial set of channels.</p>
          <p>Due to processing constraints, we aim at focusing
message retrieval on the most potentially relevant channels
for our purpose, identified by the total number of channel
recommendations they receive and their distance from
seed channels. This distance is defined as the minimum
number of recommendation steps required to reach a
given channel from a seed. From the top 150 channels
in terms of distance from the channels in the seed set
and the number of times they were recommended, we
retrieve all publicly available messages and associated chat
conversations from Jan 1, 2022 to Jan 1, 2023, totaling
around 2.5 million messages for Italian and 1.1 million
messages for Polish.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.2. Data Anonymization</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>With the aim of preserving privacy as much as possible, sensitive information in messages (emails, phone numbers, mentions, etc.) is detected via regular expressions and replaced with placeholders.</title>
        <p>Aside from text content, all other information on
messages and channels, including channel titles and
descriptions, is deleted. This step is carried out to prevent direct
identification of the chats in Telegram and to comply
with applicable privacy protection regulations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data Selection and Annotation</title>
      <sec id="sec-3-1">
        <title>3.3. Data Pre-Selection for Annotation</title>
        <p>In this section, we detail the construction process of our Since we aim to detect hateful language in particular
dataset. Public Telegram channels are accessible through across multiple vulnerable social groups, in
collaboraa freely available API, originally designed for bot devel- tion with civil society domain experts from NGOs and
opment. While not initially intended for research, this</p>
        <sec id="sec-3-1-1">
          <title>2https://huggingface.co/datasets/community-datasets/hate_</title>
          <p>speech_pl</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3https://hatedemics.eu/</title>
          <p>4Via GetFullChannelRequest and GetChannelRecommendation in
Telethon: https://github.com/LonamiWebs/Telethon
research institutions, we have defined a set of common 3.4. Data Annotation
targets of hate in the countries and contexts we take
under consideration, including People with Disabilities; We employ expert Polish and Italian annotators, two
ItalLGBTQ+ Individuals; Religion: Jews, Muslims, Chris- ian native speakers (one male, age 26, and one female,
tians; Ethnicity/Origin: People of Color, Romani people, age 41) and two Polish native speakers (one female, age
Other (including Migrants); Women. These target iden- 22, and one female, age 37). Annotators were asked to
tity groups have been partially adapted from the ones indicate whether a message contained hate speech. If
used in the Measuring Hate Speech corpus [1, 2], which hate speech was present, annotators were required to
uses US-centric identity categories, adjusting them to our specify the target of the hate speech from our predefined
European context. list of categories. To gain a deeper understanding of the</p>
          <p>We then developed a keyword matrix consisting of dataset’s content and to ensure that the dataset covered a
145 group-specific terms. 5 These keywords have been broad range of target identity categories not only in the
selected based on prior domain expertise and preliminary hateful part of the dataset, annotators were also asked to
corpus exploration. label the target mentions of each message among a set of</p>
          <p>Aiming at obtaining a high representation of content predefined categories. 6 An overview of the annotation
related to the target identity groups we identified, we scheme that was used for annotating both the Italian and
then carried out a pre-selection step. From the entire the Polish sections of the dataset is provided in Figure
Telegram data collection, we pre-selected for manual an- 1, while the full annotation guidelines are reported in
notation about 1,500 posts (75% of the entire dataset) Appendix 8.1. This process resulted in two
comprehencontaining at least two distinct keywords (from our ma- sive databases containing messages annotated for both
trix) associated with the same target group. This is done hateful and non-hateful content, targeting various
idenusing a string-matching filter. We then construct the re- tity groups. A numerical breakdown of their content is
maining 25% of the dataset by randomly selecting posts provided in Table 1, 2 and in Figure 2.
to manually annotate, in order to create a more represen- The databases mainly contain non-hateful messages, with
tative overall sample of random messages on Telegram, the Italian one featuring almost as many hate messages
which of course might not contain target-related words. as the Polish one. This may be due to diferent use of
Telegram or to a greater number of controversial, yet not
explicitly hateful, messages in the Polish database, which
includes many discussions related to the Russia-Ukraine</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>5The keyword matrix is available on github: https://github.com/</title>
          <p>dhfbk/MuLTa-Telegram.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>6Target mentions assignation and target of hate might not coincide.</title>
          <p>war. When analyzing the targets of hate speech, most
messages are directed at ethnic groups, with a prevalence
of attacks against people of color in Italian and against
Ukrainian refugees in Polish, followed by those targeting
LGBTQ+ identities. While a significant portion of hateful
messages targets either groups not represented in the
selected taxonomy (Other) or expresses hate without a
specific target ( No Target), there is little representation
of hate toward the remaining identity categories.
Inter-annotator agreement was calculated for each
language on a sub-sample of 200 posts using Krippendorf’s
alpha, annotated each by two expert annotators who are
native speakers of Italian or Polish. The Polish portion of
the dataset showed an IAA of 0.41, while the Italian one
0.68. These numbers, while low, are in line with previous
work on similar topics, especially considering that our
annotators had no chance to discuss and revise their
annotations together, as they worked asynchronously. For
instance, Basile et al. [29] showed an inter-rater
agreement for aggressiveness in Spanish of 0.47.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Classification Experiments</title>
      <sec id="sec-4-1">
        <title>As a way to benchmark our newly-created dataset, and</title>
        <p>to explore diferent strategies for classification of hate
speech in Italian and Polish on Telegram data, we devise
a series of experiments using diferent experimental
setups. These experiments include fine-tuning BERT-base
classifiers (Sec. 4.1), multi-task models (Sec. 4.2), and
LLM prompting (Sec. 4.3). To evaluate approaches across
diferent experiments in a comparable way, 35% of the
manually annotated dataset was withheld and used as
test set for each language. The remaining 1,300
manually annotated items (65%) were used to fine-tune models
where necessary (i.e., Experiments 2 and 4). Each
experiment was replicated with a consistent setup across both
languages.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Supervised Hate Speech Detection via</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>BERT Fine-Tuning</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>In this set of experiments we fine-tune existing monolin</title>
        <p>gual (Exp. 1,2,3) and a multilingual (Exp.4) BERT-based
language models [30].</p>
        <p>Regarding monolingual models, for Polish we conducted
a series of experiments using three distinct BERT-based
models for the Polish language: we used a
generalpurpose Polish BERT-model (BERT-base-pl)7 and two
models trained for identifying specific types of
ofensiveness, namely cyberbullying (BERT-cb-pl)8 and hate
speech (BERT-hs-pl).9</p>
        <p>For Italian, we fine-tuned a general-purpose Italian
BERT-based model (BERT-base-it),10 a BERT-based model
pre-trained on Italian data from Twitter (AlBERTo) [31],11
and a binary hate speech classification model for Italian
social media text (Hate-ita) [32].12</p>
        <p>For fine-tuning the models we employed the
MaChAmp library [33], an open-source tool designed
to simplify flexible tasks configuration, multitask and
multilingual fine-tuning of transformer-based language
models. All the evaluated models were fine-tuned for 5
epochs using a single GPU, applying the default
hyperparameters provided by MaChAmp (see Appendix 8.2). To
address class imbalance, we assign equal weight to each
class during training, ensuring that minority classes are
not underrepresented.</p>
        <p>Experiment 1: Training on Existing Datasets Our
ifrst experiment aims to evaluate the performance of
models fine-tuned on other publicly available datasets on
our manually-annotated Telegram test data. They serve
as a baseline.</p>
        <p>For Italian, we use 2,000 examples from 4 existing
datasets that represent some of the targets we consider
in our work: the AMI dataset [34], focused on misogyny;
the Haspeede dataset [35], focused on hateful content
against Muslims, immigrants and Roma people; the HODI
dataset [23], a dataset for detection of homotransphobia
in Italian; and the Religous Hate dataset [36], an Italian
dataset that includes Anti-Judaism, Anti-Christianity and
anti-Islam social media posts.13</p>
        <p>For Polish, we could find 3 datasets total related to
online abusive content. We decided to discard the
oldest one [25] due to lack of available information on its
construction (data collection, annotation, content) and
7dkleczek/bert-base-polish-uncased-v1
8ptaszynski/bert-base-polish-cyberbullying
9dkleczek/Polish-Hate-Speech-Detection-Herbert-Large
10dbmdz/bert-base-italian-cased
11m-polignano-uniba/bert_uncased_L-12_H-768_A</p>
        <p>12_italian_alb3rt0
12MilaNLProc/hate-ita
13Given that this dataset contains several targets in addition to
religion-focused ones, we filtered it to retain only religious targets.
because after a preliminary manual inspection our
annotators found the data to be noisy (e.g., HTML code was
found in the middle of the texts).</p>
        <p>This left us with two datasets for hate speech, which
we use in combination in our experiments: the
Cyberbullying dataset [26] and the BAN-PL dataset [27]. These
datasets difer significantly in both their definitions of
hate and their annotation procedures. For instance, the
Cyberbullying dataset contains generally milder or less
severe phenomena in its annotations, as it is focused
on the somewhat broader phenomenon of cyberbullying
compared to hate speech. In contrast, BAN-PL considers
a message as Not Hateful if it remained online for more
than two days without being removed by a platform
moderator. Only a small subset of the removed comments
was then manually annotated as Hateful.</p>
        <p>Given these diferences, we opted to use only the
manually annotated hateful samples from BAN-PL, which are
more aligned with our definition of hate speech. For the
neutral (non-hateful) class, we combined equal portions
of BAN-PL and Cyberbullying data, ensuring a balanced
yet representative dataset composition.</p>
        <p>Experiment 2: Fine-tuning on Manually Annotated
Data This is the main experiment in which we
evaluate the potential usefulness of our dataset for training
hate speech detection models. We fine-tune the models
on 1,300 manually annotated items from our dataset for
each language. The task setup is single-task, focusing
exclusively on the hate speech task. Since the annotated
data is in-domain, we expect this setup to yield better
performance on our Telegram test data compared to
Experiment 1, which used out-of-domain data (i.e., data
from diferent platforms).</p>
        <p>Experiment 3: Fine-tuning on LLM-Annotated Data
(LLaMA) To investigate whether LLMs can serve as a
viable alternative to manual annotation in hate speech
detection tasks on Telegram, we devise an experiment in
which we use LLaMA 3.1 70B Instruct as an automated
annotator. We ask the model to annotate the same train
split of our dataset as in Experiment 2, by prompting the
model with a summary of our hate speech annotation
guidelines. For both languages, we then fine-tune the
same BERT-based models as in Experiment 2, but this
time on the LLM-annotated data. We then evaluate the
trained models on the test sets.</p>
        <p>Experiment 4: Multilingual BERT A multilingual
approach can leverage shared representations across
languages. In this context, a model is required to generalize
patterns that may be strongly language- and
contextdependent, a non-trivial task. Nonetheless, this strategy
ofers several advantages: it can boost performance in
low-resource settings through cross-lingual transfer, and
it can improve robustness by exposing the model to more
diverse inputs during training.</p>
        <p>To test the viability of this approach, we merge the two
manually annotated train splits of the Polish and Italian
datasets to fine-tune a multilingual BERT base model. 14
The performance of the model for classification of hate
speech is then evaluated on the Italian and Polish test
sets separately.</p>
        <sec id="sec-4-2-1">
          <title>4.2. Multi-task Setup for Hate Speech and</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>Target Detection</title>
          <p>16https://huggingface.co/meta-llama/Llama-Guard-3-8B
Experiment 5 Given the hierarchical relationship
between hate speech detection and target identification, we
adopt a multi-task learning approach to jointly model Figure 3: The design of the multitask setup used for
experithese tasks, under the assumption that each task can help ment 5.
generalization on the other. In this multi-task learning
paradigm, schematically illustrated in Table 3, the model
can jointly optimize for diferent tasks, allowing all tasks Instruct (Exp. 6) with our annotation guidelines and ask
to benefit from shared signals captured through a com- it to label each test example as hateful or not. We then
mon representation, which is jointly fine-tuned during also evaluate LLaMA Guard (Exp. 7), using no prompt as
training. This approach is motivated by prior work show- it is a model explicitly made to detect inappropriate or
ing that training models on related tasks simultaneously toxic content.16
can lead to better performance than training them in While this kind of experimental setup is useful for
isolation [37]. This setup should allow to improve gener- comparison purposes, it should be noted that it is highly
alization and stability of the hate speech task, but also to ineficient, and unlikely to be feasible and scalable when
automatically predict the targets of hate speech, a task large amounts of data need to be processed at once, as
that as a single task would be extremely dificult to ad- its computational speed and eficiency is much lower
dress with the currently available data, given the scarcity than that of a BERT-based model fine-tuned on
taskof targets (see Table 3.4). specific data. Such models are particularly well-suited for</p>
          <p>In this setting, hate speech detection serves as the social science research, where cost-efective processing
primary task, since the presence of a target group in of millions of messages is often required to study trends
a message depends on the detection of hate speech in in online hate and its societal impact. Given that our goal
the first place, while target identification is treated as is the development of hate speech classification models
a secondary task. Specifically, we used our pre-trained that can be employed in real-life scenarios, we consider
models as the shared encoder for both tasks, while a LLM-based classification out of this scope.
separate decoder is utilized by each task. We incorporate
diferent loss weighting to the two tasks, in order to
represent the hierarchy of primary and auxiliary.15</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results and</title>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <sec id="sec-6-1">
        <title>4.3. Prompt-Based Hate Speech Detection via LLMs</title>
        <sec id="sec-6-1-1">
          <title>In this section, we present the results obtained in our</title>
          <p>experiments. A summary of the results across all
exExperiments 6 and 7: Llama We then aim at evaluat- perimental setups is shown in Tables 3 and 4. For the
ing the performance of LLMs on our Telegram annotated experiments using multiple models (Exp. 1, 2, 3, and 5),
data in Italian and Polish. For this, we use LLaMA [38], we report average macro-F1 scores, while the detailed
since it possesses some multilingual capabilities, espe- results are in Appendix 8.3. As a first general
observacially in Italian. In particular, we prompt LLaMA 3.1 70B tion, Polish and Italian show consistent results patterns
across experiments, which allows us to derive meaningful
observations across both languages.
14google-bert/bert-base-multilingual-cased
15The multi-task learning loss is computed as  = ∑︀  , where
 is the loss for task  and   the corresponding weighting
parameter, and we provide a diferent loss weight for the auxiliary tasks.</p>
          <p>For the main task, we empirically set   = 0.7, and   = 0.3
for the auxiliary task.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>5.1. Hate Speech Detection</title>
        <p>The results of the binary classification of hate speech are
reported in Table 3. In-domain training (Exp. 2)
consistently outperforms the models trained on out-of-domain
data (Exp. 1) across both languages, underscoring the
necessity of domain-specific data. Notably, out-of-domain
training results in the worse classification performance
for Polish and the second worse for Italian.</p>
        <p>Conversely, the training of multilingual BERT (Exp. 4)
resulted in very low performance overall, suggesting that
models trained across multiple languages can struggle
to generalize efectively for this task. Regarding specific
model performances, for both languages, fine-tuning a
model already fine-tuned for hate speech ( Hate-ita and
BERT-hs-pl, for Italian and Polish respectively) leads to
the best results within models across all experiments.17</p>
        <p>The Llama-based experiments, including Exp. 3, in
which Llama was used to annotated data for training a
BERT-based classifier, and Exps. 6 and 7, in which Llama
(70B Instruct and Llama Guard) predicted test set labels
through prompting, yielded intermediate performance.</p>
        <p>While generally better than out-of-domain approaches,
they consistently fell short of models trained on expert
human annotations. Llama-based predictions performed
consistently worse in the case of Polish, possibly due to
the model lacking oficial support for the Polish language.</p>
        <p>The multi-task setup (Exp. 5), on the other hand,
improved hate speech detection performance, achieving the
highest macro-F1 scores for both languages.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.2. Target Identification</title>
        <sec id="sec-6-3-1">
          <title>Regarding the parallel task of target of hate identifica</title>
          <p>tion, while overall performances appear high in both
languages (Accuracy: Polish 87%, Italian 82%), this result
is driven primarily by the model’s strong performance
on the majority class, i.e., samples in the non-hate class,
therefore without target, which heavily skews the results.
Macro-averaged F1 scores on each target are very low, as
shown in Table 4, indicating very poor performance on
17For more detailed results see Appendix 8.3.
minority classes prediction (hateful and targeted
examples). Notably, for Italian the most frequent target class
Ethnicity/Origin: Person of Color is consistently
recognized (with an F1-score of almost 0.70), and performance
on the moderately frequent class LGBTQ+ depends on
the model (F1 scores range from 0.00 to 0.41), while the
other target groups are entirely or almost entirely
disregarded. For Polish, the target LGBTQ+ is classified more
accurately than the others (F1 0.29 up to 0.69).</p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>5.3. Additional Multilingual Experiments</title>
        <p>Given the very low performance of the multilingual
model (Italian: 0.589, Polish: 0.564 F1), we sought to
investigate potential causes for this. Although diferent
languages might express hate diferently, and context
can vary, one possible factor that could explain the low
performance of multilingual models is annotation
inconsistencies between the Italian and Polish datasets,
especially given the dificulty and subjectivity of the type of
annotation.</p>
        <p>To investigate this, we repeated Experiment 4 by
finetuning multilingual BERT, this time using the data from
Experiment 3, which was annotated via LLM. These
LLMgenerated annotations should in principle be more homo- speech, especially in the absence of profanity.
geneous across languages, assuming the system is using In the second case, models overestimated hatred in
the same criteria given the same prompt instructions. In messages expressing controversial opinions (“non c’è
nesthis setup, performance improved notably (Italian: 0.674, sun isolamento perché non esistono i virus” [“There’s no
Polish: 0.657 F1), supporting our hypothesis. isolation because viruses don’t exist”], “Lepiej dla
Rus</p>
        <p>Nonetheless, we were interested in performance of kich, kto lubi ten shit?” [“Better for the Russians, who
our the best performing scenario, i.e. on high-quality, likes that shit?”]) or sensitive topics (“Una pacca sul
manually annotated data and multitask setup. We re-ran sedere non autorizzata è una molestia sessuale” [“An
unthe experiment using multitask learning (i.e., jointly pre- warranted slap on the butt is sexual harassment”]).
Addicting hate speech and its target) on the human-labeled ditionally, models struggled with relatively mild insults
datasets. This yielded the best results for both languages containing no targets in the given context (“Nikt nie
po(Italian: 0.706, Polish: 0.726 F1). może.. Bandyci bezkarni..” [“No one will help.. Bandits</p>
        <p>These findings suggest that inconsistencies among an- unpunished..”]), idiomatic use of expressions related to
notators across languages can hamper results of multilin- disabilities, which are lexicalized in spoken Italian, albeit
gual models, but learning on richer data can help, since unkind (“purtroppo non c’è peggior sordo di chi non vuol
an auxiliary task can help generalization by providing sentire e peggior cieco di chi non vuol vedere”
[“Unfortumore training signal and regularizing the model. nately, there’s no one more deaf than those who don’t want
to hear, and no one more blind than those who don’t want
to see”]), or critiquing hateful messages (“tipico cristiano
6. Manual Qualitative Analysis ipocrita...va in chiesa però vorrebbe sterminare chi crede
nel Islam” [“Typical hypocritical Christian...goes to church
but would like to exterminate those who believe in Islam”]).</p>
        <p>Finally, some cases appear to be simply annotation
errors (“@&lt;user&gt; finalmente Instagram mi dà le pubblicità
giuste” [“@&lt;user&gt; finally Instagram shows me the right
ads”]).</p>
        <sec id="sec-6-4-1">
          <title>To understand the diferences between the Italian and</title>
          <p>Polish data, we conducted a manual qualitative analysis.</p>
          <p>First, we noticed a disparity in the distribution of hateful
messages targeting Ethnicity/Origin. While the Italian
dataset shows a predominance of messages directed at
people of color (99 instances, compared to 9 in the Polish
dataset), the subcategory Other (Migrants) appears less
frequently in Italian (39 instances) than in Polish (82). 7. Conclusions
These patterns likely reflect the socio-political context at
the time of data collection, with immigration by people of In this paper, we introduced MuLTa-Telegram, a novel
color being a prominent issue in Italy and the presence of multilingual dataset for hate speech and target detection,
Ukrainian refugees being central in Poland. This under- containing data from Telegram in both Italian and Polish.
scores the importance of collecting context-sensitive data, The dataset includes anotations across 9 hate speech
particularly at the socio-cultural level, as each context target categories, in contrast with the majority of
availcan exhibit diferent patterns and phenomena. able datasets, which are often limited to single targets.</p>
          <p>We also investigated the discrepancies between auto- Moreover, we ensured the presence of target-related
conmatic prediction and human annotation. We identified 29 tent also in the non-hateful part of the dataset, with about
Italian messages and 30 Polish ones which the annotators 75% of the messages containing target-relevant content
deemed hateful and the models classified otherwise. For (see Figure 2). Furthermore, while the vast majority of
the opposite case, there were 70 messages in Italian and hate speech research has been conducted on English, we
only 3 messages in Polish. focused on underrepresented languages.
In the first case, models seem unable to detect hateful We conducted an extensive set of experiments,
showcontent when not presented in a standard explicitly ofen- ing that the fine-tuning of BERT-based models on
out-ofsive form. Performance tends to be low when examples domain hate speech classification data leads to poor
perinclude hashtags (“Islam, in Afghanistan torna la sharia formance on Telegram data, while training on in-domain
[...] #religionedipace...” [“Islam, in Afghanistan sharia resources consistently outperforms it. This draws
atis back [...] #religionofpeace..”]); dehumanization being tention to the limitations of relying on datasets from
implied (“i roma [...] non sono veri esseri umani, punto” platforms like Twitter, which are no longer reliably
ac[“the Roma [...] are not real human beings, period”], “I cessible for academic research, reinforcing the need for
kulka we własny łeb” [“And a bullet to your own head”]); updated and diversified resources like MuLTa-Telegram.
slurs in non-standard language varieties (“Na Zengara However, results on the detection of individual targets
rein pratica” [“ Basically about a gypsy”]); and occasionally mained poor, particularly for more scarcely represented
established slurs (e.g., Italian n-word). Models appear groups. This underscores the persistent dificulty of
deless proficient than humans in detecting implied hate tecting hate directed at less-represented communities.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <title>The work of C. Casula, A. Kołos, E. Leonardelli, E. Mu</title>
        <p>ratore and S. Tonelli has been supported by the
European Union’s CERV fund under grant agreement No.
101143249 (HATEDEMICS). This research was also
partially supported by the European Union under the
Horizon Europe project AI-CODE, GA No. 101135437.</p>
        <p>Furthermore, in a multilingual setup, we showed how
the addition of a parallel task predicting targets greatly
improves performances for hate speech classification,
enabling the model to generalize across languages. We
included both LLaMA and LLaMA Guard in our
evaluation to explore how general-purpose and safety-focused
systems perform on our task. LLaMA Guard, despite its
safety orientation, performs poorly in this out-of-domain
context, while LLaMA shows strong performance on
Italian, but its accuracy drops on Polish data, likely due to
limited language coverage during pretraining. These
results emphasize the need for both domain- and
languagespecific adaptation.</p>
        <p>While we fine-tuned transformer-based models
directly on classification tasks using Telegram data,
future work could explore domain-adaptive pretraining
via Masked Language Modeling on unlabeled Telegram
messages. This step could improve the encoder’s
alignment with the linguistic characteristics of the platform,
potentially enhancing classification performance.</p>
        <p>We hope this dataset will help foster research into
hate speech detection for underrepresented languages
and platforms. Future work will explore expanding the
dataset to more languages and domains, as well as
improving the detection of fine-grained targets of hate.
tweets, in: CEUR workshop proceedings, volume • Text can be hateful even if the target is implicit,
2481, CEUR, 2019, pp. 1–6. as long as it’s implied by the context.
[32] D. Nozza, F. Bianchi, G. Attanasio, Hate-ita: Hate • It is not hateful if the target is an organization
speech detection in italian social media text, in: Pro- and not its members.
ceedings of the Sixth Workshop on Online Abuse • Profanities alone do not imply hatefulness, unless
and Harms (WOAH), 2022, pp. 252–260. the tone is aggressive or the message is clearly
[33] R. Van Der Goot, A. Üstün, A. Ramponi, I. Sharaf, directed toward someone (e.g., “Aspetta che li
miB. Plank, Massive choice, ample tasks (machamp): A nacciano per bene e poi vedi se accettano...” ).
toolkit for multi-task learning in nlp, arXiv preprint • False or debatable statements do not imply
hatearXiv:2005.14672 (2020). fulness, but messages that erase identities (e.g.,
[34] E. Fersini, D. Nozza, P. Rosso, et al., Ami@ “esistono solo due sessi” ) are hate speech.
evalita2020: Automatic misogyny identification, in: • References to individuals or citizens (excluding
Proceedings of the 7th evaluation campaign of Natu- military groups) as nazis in the context of the
ral Language Processing and Speech tools for Italian Russia-Ukraine war are to be considered hate
(EVALITA 2020), (seleziona...), 2020. speech.
[35] M. Sanguinetti, G. Comandini, E. Di Nuovo, • In Polish:</p>
        <p>S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti,
I. Russo, Haspeede 2@ evalita2020: Overview of – If “Banderowiec” refers to supporters of
the evalita 2020 hate speech detection task, Eval- Stepan Bandera (OUN), it is not hate
uation Campaign of Natural Language Processing speech.</p>
        <p>and Speech Tools for Italian (2020). – If “Banderowiec” is used to refer to the
[36] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Address- entire Ukrainian nation or other social
ing religious hate online: from taxonomy creation groups in a hateful or ofensive way, it is
to automated detection, PeerJ Computer Science 8 hate speech.</p>
        <p>(2022) e1128.
[37] R. Caruana, Multitask learning, Machine learning Target Detection</p>
        <p>28 (1997) 41–75.
[38] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Ka- When text contains hate speech, specify its target.
Possidian, A. Al-Dahle, A. Letman, A. Mathur, A. Schel- ble categories include:
ten, A. Vaughan, et al., The llama 3 herd of models,
arXiv preprint arXiv:2407.21783 (2024).
[39] R. van der Goot, A. Üstün, A. Ramponi,</p>
        <p>I. Sharaf, B. Plank, Massive choice, ample
tasks (machamp): A toolkit for multi-task learning
in nlp, 2021. URL: https://arxiv.org/abs/2005.14672.
arXiv:2005.14672.
• Ethnicity/Origin: People of Color, Romani, or Other</p>
        <p>(Migrants)
• LGBTQ+
• People with Disability
• Religion: Jewish, Christians, Muslims, Other
• Women
• Other
• No Target</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Appendices</title>
      <sec id="sec-8-1">
        <title>8.1. Annotation Guidelines</title>
        <sec id="sec-8-1-1">
          <title>In this section we report the annotation guidelines.</title>
        </sec>
      </sec>
      <sec id="sec-8-2">
        <title>Hate Speech Detection</title>
        <sec id="sec-8-2-1">
          <title>Assess whether the message contains hateful language.</title>
          <p>Classify it as Hate Speech if it contains slurs or hostile
language, motivated by bias or reinforcing stereotypes,
targeted at a group or individual because of their actual
or perceived innate characteristics; otherwise classify it
as No Hate Speech.</p>
          <p>• Reported speech is not hate speech.</p>
          <p>Choose the most appropriate category. Select other for
any specific target not included in any other category.
Select No Target for occurrences of hate speech not directed
at any specific group.</p>
          <p>• In cases where multiple labels apply, prioritize
the identity that is most harmed.
• The target must be explicitly addressed, not
implied (e.g., by referring to stereotypical
associations):
– Talking about Arabic/Muslim countries or</p>
          <p>Islam does not imply the Muslim target.
– Talking about Africa or African migration</p>
          <p>does not imply the People of Color target.
– Mentions of disability imply the People
with Disability target.</p>
        </sec>
      </sec>
      <sec id="sec-8-3">
        <title>Mention of Target Group Detection</title>
        <sec id="sec-8-3-1">
          <title>Annotate if one or more of the following groups are addressed in the text. Assign the corresponding label(s). Multiple groups may be annotated for a single message. Possible target groups include:</title>
          <p>• Ethnicity/Origin: People of color, Romani, Other
(Migrants)
• LGBTQ+
• People with Disability
• Religion: Jews, Muslims, Christians, Other
• Women
• None</p>
        </sec>
        <sec id="sec-8-3-2">
          <title>If none of these target groups are addressed, assign the</title>
          <p>label None. A group should be annotated if it is explicitly
mentioned or implicitly clear from the context. Annotate
a group even if it is not the main focus of the message.</p>
        </sec>
      </sec>
      <sec id="sec-8-4">
        <title>8.2. Hyperparameters</title>
        <sec id="sec-8-4-1">
          <title>In this section we described the parameters used for BERT-based experiments.</title>
        </sec>
        <sec id="sec-8-4-2">
          <title>In this section, in Tables 6 and 7 for Italian and Polish respectively, we report the full detailed results for Experiments 1,2,3 and 5.</title>
          <p>Non-Hate</p>
          <p>avg
Hate</p>
          <p>avg
F1</p>
          <p>Experiments
Exp1 - out of domain data
Exp2 - manually annotated data
Exp3 - Llama as annotator
Exp 4 - multilingual
Exp5 - multitask setup
Exp 6 - Llama
Exp 7 - Llama Guard
Model
Declaration on Generative AI
F1</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>