<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Hate Speech Index with Attention-based LSTMs and XLM-RoBERTa</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mauro Bruno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Catanese</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Ortame</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istat – Italian National Institute of Statistics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The diffusion of hate speech on social media requires robust detection mechanisms to measure its harmful impact. However, detecting hate speech, particularly in the complex linguistic environments of social media, presents significant challenges due to slang, sarcasm, and neologisms. State-of-the-art methods like Large Language Models (LLMs) demonstrate strong contextual understanding, but they often require prohibitive computational resources (the number of parameters in LLMs ranges between a few billion and hundreds of billions, while the large version of XLM-RoBERTa “only” has 561 million parameters). To address this, we propose two solutions: (1) a bidirectional long short-term memory network with an attention mechanism (AT-BiLSTM) to enhance the model's interpretability and natural language understanding, and (2) fine-tuned multilingual robustly optimized BERT (XLM-RoBERTa) models. Building on the promising results from EVALITA campaigns in hate speech detection, we develop robust classifiers to analyse 20.4 million Tweets related to migrants and ethnic minorities. Further, we utilise an additional custom labeled dataset (IstatHate) for benchmarking and training, and we show how its inclusion can improve classification performance. Our best model outperforms top entries from previous EVALITA campaigns. Finally, we introduce Hate Speech Indices (HSI), which capture the dynamics of hate speech over time, and assess whether their main peaks correlate with major events.</p>
      </abstract>
      <kwd-group>
        <kwd>hate speech detection</kwd>
        <kwd>deep learning</kwd>
        <kwd>attention mechanism</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>artificial intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>XLM-RoBERTa (large) model, benchmarked against its base, smaller version. We use two labeled training sets: (a) the EVALITA 2020 HaSpeeDe 2 task dataset, and (b) a custom, smaller labeled dataset, which we refer to as IstatHate. Our study explores the impact of training models on both the EVALITA dataset alone and a combined dataset that includes EVALITA and IstatHate, evaluating their performance across multiple test sets.</p>
      <p>Finally, we present a preliminary version of the Hate Speech Index (HSI), designed to quantify the proportion of hate speech by classifying 20.4 million Italian Tweets related to migrants and ethnic minorities from January 2018 to February 2023.</p>
      <sec id="sec-1-1">
        <title>2. Data</title>
        <p>This section describes the data used for training, validating, and testing the models, and the corpus of Tweets on which we compute the hate speech index (HSI).</p>
        <p>2.1. Corpus: The prediction corpus consists of 20.4 million unlabeled Tweets from January 2018 to February 2023. The Tweets are obtained through a two-step filtering procedure: first, a general 250-keyword filter gathers Tweets directly from X’s API; second, a smaller, immigration-related keyword filter retrieves the relevant Tweets from the database. Thematic experts, borrowing the contents of discrimination survey questionnaires, have derived a preliminary filter. These regular or stemmed expressions have been validated by means of topic modelling analysis and word embeddings. For instance, the word cinese (“Chinese”) was almost always related to markets or products and has therefore been removed. We also noticed that the generic term stranieri (“foreigners”) gives rise to some residual out-of-scope and irrelevant conversations. These issues only affect around 5% of the total texts. The final filter consists of 21 stemmed expressions (e.g., immigrat-) or complete words.</p>
        <sec id="sec-1-1-1">
          <title>2.2. Training data</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>EVALITA</title>
        <p>Most of the labeled training data comes from the EVALITA 2020 HaSpeeDe 2 task. The distribution of the labels in the training dataset is shown in Table 1.</p>
        <p>IstatHate: Additionally, we use a custom-labeled dataset, i.e., IstatHate, derived from our corpus in the following way: (<xref ref-type="bibr" rid="ref1">1</xref>) we fit a Latent Dirichlet Allocation (LDA) model [6] on the entire corpus; (2) we identify clusters likely to contain hateful Tweets, i.e., those with offensive language, such as “fate schifo” (“you suck”), “avete rotto i c****oni” (“you pi**ed us off”), and a few others; (3) we retrieve Tweets from these clusters, identifying the expressions with a probability of 1 of belonging to the clusters. This approach isolates 242,000 Tweets, of which 67,000 are unique. It is worth noticing that viral Tweets (the ones that are repeated/retweeted several times) need to be annotated with a higher probability. A common practice to draw a much more efficient sample than simple random sampling is to use stratified sampling, an effective method for handling skewed distributions; in particular, we adopted [7]. (4) We employ stratified sampling using the total number of Tweets as the target variable, and we divided that variable into five classes, using them as stratification criteria. (5) The Tweets are then stratified into the classes based on the number of retweets, with the final class being a take-all stratum, resulting in 681 sampled texts and ensuring a coefficient of variation of 5%. (6) These 681 Tweets are then manually labeled by Istat researchers adopting the following criteria: if the language is vulgar/aggressive but generic, it is not labeled as hateful; if, on the contrary, it is related to migrants and/or ethnic minorities and the hate/prejudice is clearly directed towards them, it is labeled as hateful. The weighted estimate indicates that 34% of the Tweets contain hateful language, serving as a rough upper bound of the hate proportion within our prediction corpus. Even if our sample dataset likely over-represents hateful content, we disregard the weighting at this preliminary phase, simply adding IstatHate to the EVALITA dataset.</p>
      </sec>
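<p>Steps (4) and (5) of the IstatHate construction, five retweet-count classes with a take-all top stratum, can be sketched as follows. This is a minimal illustration: the function name, the class boundaries, and the per-stratum sample sizes are invented for the example; the paper's actual allocation follows [7].</p>

```python
import random

def stratified_sample(tweets, boundaries, sizes, seed=42):
    """Stratified sample over retweet-count classes with a take-all top stratum.

    tweets: list of (text, n_retweets); boundaries: upper bounds of the first
    strata (anything above the last bound falls in the take-all stratum);
    sizes: per-stratum sample sizes for the non-take-all strata.
    All names and values are illustrative.
    """
    rng = random.Random(seed)
    strata = [[] for _ in range(len(boundaries) + 1)]
    for tweet in tweets:
        for i, bound in enumerate(boundaries):
            if tweet[1] <= bound:
                strata[i].append(tweet)
                break
        else:
            strata[-1].append(tweet)  # most-retweeted class: take-all stratum
    sample = []
    for stratum, n in zip(strata[:-1], sizes):
        sample.extend(rng.sample(stratum, min(n, len(stratum))))
    sample.extend(strata[-1])  # every viral Tweet enters the sample
    return sample
```

A take-all stratum for viral Tweets mirrors the paper's observation that heavily retweeted texts must be annotated with a higher probability.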
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <p>In this section, we present the methodology adopted in our study and outline the experimental design. We begin by introducing the model architectures, followed by a detailed description of the training procedure.</p>
        <sec id="sec-2-1-1">
          <title>3.1. AT-BiLSTM model architecture</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>The architecture of our attention-based bidirectional LSTM (AT-BiLSTM) model comprises four main components: an embedding layer, a bidirectional LSTM layer, an attention layer, and an output layer. We will detail each component sequentially.</p>
        <p>Embedding layer: We pre-train a FastText [8]
embedding model on the prediction corpus and extract the word
vectors to initialise the weights of the embedding
matrix. Table 2 presents the main training parameters of our
model: each word is represented by a 300-dimensional
vector, the training considers a distance window between
words of up to 8 positions, and the model is trained for
25 epochs using a continuous bag-of-words algorithm.</p>
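<p>A minimal sketch of how pretrained FastText word vectors can seed the embedding-layer weights. The helper name, the padding-row convention, and the random initialisation for out-of-vocabulary words are our illustrative assumptions, not details from the paper.</p>

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim=300, seed=0):
    """Initialise embedding-layer weights from pretrained word vectors.

    vocab: {word: row index, starting at 1}; vectors: {word: ndarray(dim)}
    from a pretrained model (here, FastText trained with dim=300, window=8,
    25 epochs, CBOW, as in Table 2). Row 0 is reserved for padding and
    stays zero; out-of-vocabulary words get small random vectors.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim))      # +1 for the padding row
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]           # copy the pretrained vector
        else:
            matrix[idx] = rng.normal(0, 0.1, dim) # random init for OOV words
    return matrix
```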
      </sec>
      <sec id="sec-2-3">
        <title>Attention mechanism</title>
        <p>In deep learning, attention mechanisms can improve model performance by focusing on important features of input sequences.</p>
        <p>In our model, the attention mechanism is implemented on top of the LSTM layer to focus on the most relevant parts of the input sequence for predictions [9]. Our attention mechanism works as follows:</p>
        <list list-type="bullet">
          <list-item><p>Transform the LSTM output using a fully connected layer to get attention scores for each word.</p></list-item>
          <list-item><p>Normalise these scores into attention weights with a softmax function, creating a pseudo-probability distribution.</p></list-item>
          <list-item><p>Compute a context vector by taking a weighted sum of the LSTM outputs using the attention weights. This context vector emphasizes the most important parts of the input sequence for the classification task. (We also experimented with attention masking; however, this negatively impacted accuracy. Upon inspecting the attention scores, we observed that the model naturally assigns negligible weights to padding tokens.)</p></list-item>
        </list>
        <p>The attention mechanism allows our model to dynamically focus on different parts of the input for different examples.</p>
      </sec>
      <sec id="sec-2-4">
        <title>LSTM layer</title>
        <p>The core of our model is a bidirectional Long Short-Term Memory (LSTM) network. LSTMs are a specialized type of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data [10]. The bidirectional aspect of our LSTM processes the input sequence in both forward and backward directions. This bidirectionality provides the network with context from both past and future states for any given point (word) in the sequence (sentence) [11]. In practice, this means that when our model is processing a word in a Tweet, it has information about the words that came before and after it, allowing for an increased understanding of context.</p>
        <p>The LSTM layer consists of multiple stacked bidirectional LSTM cells. Each cell maintains a cell state and a hidden state, which are updated at each time step as the input sequence is processed. The number of layers is included in the hyperparameter optimization phase.</p>
        <p>Output layer: The final component of our model is a fully connected (dense) layer that takes the context vector produced by the attention mechanism as input. The output of this layer is one-dimensional, as our hate speech detection task has two classes. It is passed through a sigmoid function to produce a number between 0 and 1, and the class is assigned by comparing this output with a threshold (0.5).</p>
        <p>The optimal configuration for each LSTM-based model, resulting from Bayesian hyperparameter optimization, is detailed in the Appendix. (We ran both random search and Bayesian optimization; the best result came from the latter.)</p>
        <sec id="sec-2-4-1">
          <title>3.2. XLM-RoBERTa</title>
          <p>Multilingual RoBERTa (XLM-RoBERTa, or XLM-R) is a transformer-based model that builds upon the original BERT model and the monolingual RoBERTa (Robustly Optimized BERT Pretraining Approach) model [12]. It is designed to handle multiple languages, making it particularly suitable for our task of hate speech detection in Italian texts.</p>
          <p>XLM-RoBERTa is trained on 100 different languages and has a much larger vocabulary size (250k tokens) compared to both BERT (30k tokens) and RoBERTa (50k tokens).</p>
        </sec>
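<p>The three attention steps listed above can be sketched in a few lines of NumPy. Shapes and weight names are illustrative; in the actual model, the scoring layer is learned jointly with the rest of the network.</p>

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(lstm_out, W, b):
    """Attention on top of BiLSTM outputs, following the three steps in the text.

    lstm_out: (seq_len, 2*hidden) outputs of the bidirectional LSTM;
    W: (2*hidden,) weights and b: scalar bias of the scoring layer
    (illustrative shapes for a single attention head).
    """
    scores = lstm_out @ W + b       # (1) fully connected layer: a score per word
    weights = softmax(scores)       # (2) normalise into a pseudo-probability
    context = weights @ lstm_out    # (3) weighted sum -> context vector
    return context, weights
```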
        <sec id="sec-2-4-2">
          <title>3.3. Training</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>3.3.1. Experimental design</title>
        <p>In this section, we outline the experimental design we followed to obtain our results. We structured our experiments to systematically assess model performance under different training conditions and across various test sets.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Training sets</title>
        <p>We trained each model under two distinct scenarios: (<xref ref-type="bibr" rid="ref1">1</xref>) a training set comprising only data from the EVALITA labeled dataset, and (2) a training set comprising both EVALITA data and IstatHate data.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Evaluation</title>
        <p>We evaluate every model on three test datasets: (a) a test set comprising only data from the EVALITA test dataset, (b) a test set comprising only data from the IstatHate test set, and (c) a combined test set comprising data from both EVALITA and IstatHate test sets. None of the texts in these test sets are seen by the models during training, in any scenario.</p>
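<p>The macro F1 score reported on each test set averages the per-class F1 values; a self-contained sketch (label encoding assumed: 1 = hateful, 0 = not hateful):</p>

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because both classes contribute equally, macro F1 does not let the majority (non-hateful) class dominate the score, which matters on skewed hate speech test sets.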
      </sec>
      <sec id="sec-2-8">
        <p>Therefore, we have four different architectures and two training sets, resulting in eight distinct models.</p>
      </sec>
      <sec id="sec-2-9">
        <title>3.3.2. Model Training</title>
        <p>LSTM-based: We ran a Bayesian optimization process to automatically extract optimal hyperparameters. This optimization process is detailed in the Appendix. We trained the models for 10 epochs, and we extracted the best configuration based on validation loss.</p>
        <p>XLM-RoBERTa: Given the large size of XLM-RoBERTa models, we were not able to run Bayesian optimization, and instead employed grid search over a reduced subset of hyperparameters. We trained the models for 10 epochs, and extracted the weights from the run with the lowest validation loss. We follow a training procedure loosely based on the methodology outlined by [13], but with adaptations to the data and hyperparameters to optimise performance for our specific use case. A detailed description of the training hyperparameters can be found in Appendix A.1, and a detailed table comparing the training and inference times of the different models can be found in Appendix A.2.</p>
      </sec>
      <sec id="sec-2-10">
        <title>4. Results</title>
        <p>In this section, we present the results of our analysis, covering model performance, attention weight visualizations, and Hate Speech Index (HSI) predictions.</p>
        <p>4.1. Model performance: Table 3 highlights the performance of the models, presenting the macro F1 score across the different test sets. There are several observations that can be made about these results. First, there is a clear positive correlation between model size and performance, particularly evident in the XLM-RoBERTa models, where the larger variant consistently outperforms the smaller ones across all test sets. This is expected for a complex task like hate speech detection.</p>
        <p>A more interesting observation can be made about the effect of including IstatHate in the training set alongside EVALITA data: besides the expected increase in performance on the IstatHate test set, there is a case in which the performance on the EVALITA test set increases too, namely XLM-RoBERTa-large⋆. This non-trivial cross-dataset improvement suggests that training on both datasets enhances the model’s generalization capabilities, despite the fact that the datasets were labeled by different people. Finally, it is interesting to notice how a simpler model like AT-BiLSTM⋆ manages to outperform XLM-RoBERTa-base⋆ on all test sets.</p>
        <p>Results on the IstatHate test set are consistently lower than results on the EVALITA test set, but this was expected, as, even when included in the training, IstatHate is much smaller in size. The Full test set is a combination of the EVALITA test set and the IstatHate test set, and therefore the macro F1 scores on the Full test set are a weighted mean between the ones obtained on EVALITA and IstatHate. The best performing model across all test sets is XLM-RoBERTa-large⋆, i.e., the one fine-tuned on the training set combining both EVALITA and IstatHate.</p>
        <sec id="sec-2-10-1">
          <title>4.2. Attention visualization</title>
          <p>An advantage of an AT-BiLSTM model over a standard BiLSTM model is its ability to visualise attention scores for each word, making outputs more interpretable. (Attention scores can be visualized in BERT-based models too [14], but the XLM-RoBERTa tokenizer does not always split Italian text into complete words, making interpretation trickier.) Visualising attention scores provides a useful method for empirically examining the impact of training models on different datasets. For instance, the following are two Tweets classified by the AT-BiLSTM-EV model, along with their corresponding attention scores.</p>
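<p>A toy helper for inspecting per-word attention scores like those discussed in this section; the text-bar rendering and the function name are purely illustrative, not the paper's plotting code.</p>

```python
def show_attention(tokens, weights, top_k=3):
    """Print a simple text 'heat map' of attention weights and return the
    top-k most attended tokens (illustrative inspection helper)."""
    for tok, w in zip(tokens, weights):
        print(f"{tok:>12s} {'#' * int(round(w * 20))} {w:.2f}")
    ranked = sorted(zip(tokens, weights), key=lambda tw: -tw[1])
    return [tok for tok, _ in ranked[:top_k]]
```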
        </sec>
      </sec>
      <sec id="sec-2-11">
        <title>Tweet 1 (true: No Hate, predicted: Hate)</title>
        <p>IT poi rompe il caz**o a tutti perché ha accolto una
famiglia di profughi
EN then they break our ba**s because they hosted a
family of refugees</p>
      </sec>
      <sec id="sec-2-12">
        <p>The first Tweet is misclassified by the AT-BiLSTM-EV model. Analysing the attention scores, we can see how a
lot of emphasis was put on curse words both on Tweet
1 and Tweet 2. Figure 3 shows the attention scores
produced by the AT-BiLSTM⋆ model for Tweet 1 and Tweet
2, both texts are correctly classified. We can see how a
lot of attention is still put on curse words like ca**o and
bastardi, but a significant attention score is also given
to profughi ("refugees") in Tweet 1. Since the Tweet is
correctly classified as not hateful – it contains aggressive
language but not directed towards migrants or ethnic
minorities – we can assume that there is an increased
contextual understanding compared to AT-BiLSTM-EV.
Additionally, Figure 3 (bottom) shows how the
distribution of attention scores for the AT-BiLSTM⋆ model is
much more concentrated compared to AT-BiLSTM-EV.</p>
        <sec id="sec-2-12-1">
          <title>4.3. Hate Speech Index (HSI)</title>
        </sec>
      </sec>
      <sec id="sec-2-13">
        <p>In this section, we present and briefly discuss our preliminary Hate Speech Index (HSI) results. Firstly, the daily HSI is computed as follows:</p>
        <p>HSI_d = n_hate,d / (n_hate,d + n_nohate,d),</p>
        <p>where n_hate,d is the number of Tweets classified as hateful on day d, and n_nohate,d is the number of Tweets classified as not hateful on day d.</p>
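<p>Given per-day classification counts, the daily index reduces to a simple ratio; a minimal sketch (the record layout is our assumption):</p>

```python
from collections import Counter

def daily_hsi(records):
    """Daily Hate Speech Index: hateful Tweets over all classified Tweets.

    records: iterable of (day, label) pairs with label 1 = hateful,
    0 = not hateful. Returns {day: HSI_d}.
    """
    hate, total = Counter(), Counter()
    for day, label in records:
        total[day] += 1
        hate[day] += label
    return {day: hate[day] / total[day] for day in total}
```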
      </sec>
      <sec id="sec-2-14">
        <p>One immediately noticeable difference between the models trained solely on EVALITA and the models trained on EVALITA and IstatHate is the consistently lower level of the predictions coming from the latter compared to the former, for all settings. In particular, the minimum decrease is recorded by BiLSTM models (−0.01), while the maximum decrease is achieved by XLM-RoBERTa-base (−0.09). The lowest mean value for the HSI is achieved by XLM-RoBERTa-base⋆, with an average indicating a percentage of 11.7% hateful Tweets over the total Tweets in the corpus. The best performing model, XLM-RoBERTa-large⋆, predicts 14.1% of hateful Tweets.</p>
        <p>With respect to the standard deviation, we observe that XLM-RoBERTa models show lower variability compared to LSTM-based models. For XLM-RoBERTa and BiLSTM models, the standard deviation decreases when including IstatHate in the dataset.</p>
        <p>Correlation: The dynamics of the moving averages of the indices appear to be relatively coherent between models, as confirmed by correlations in the range between 0.81 (AT-BiLSTM⋆ vs XLM-RoBERTa-base-EV) and 0.98 (BiLSTM⋆ vs BiLSTM-EV). The lowest correlation between models with the same architecture and different training sets amounts to 0.88 (XLM-RoBERTa-base⋆ vs XLM-RoBERTa-base-EV).</p>
        <p>We can now analyse a few peaks in the daily time series to empirically assess the quality of the estimates and the ability of the models to detect specific events.</p>
        <p>October 24, 2018: This date refers to the diffusion of the news about an unfortunate event in which a 16-year-old girl was raped and killed by a group of men from Senegal and Nigeria. If we look at the trends in Figure 5 (top) and Figure 6 (top) in Appendix B.1, we notice how the increase in the proportion of hate speech persists in the following period. In this case, we observe that all models detect the event, registering values more than twice their average.</p>
        <p>July 25, 2021: This peak refers to news about another 16-year-old Italian girl who was beaten up on the street by her 17-year-old Moroccan boyfriend. From Figure 5 (bottom) and Figure 6 (bottom) in Appendix B.1, we can see how not all models detect this event. In particular, of the models trained on both EVALITA and IstatHate, only XLM-RoBERTa-large⋆ and AT-BiLSTM⋆ show a clear peak in the trend, while LSTM-based models trained only on EVALITA struggle to identify this peak. The only model that detects the peak in both cases is XLM-RoBERTa-large, further empirically confirming its robustness.</p>
        <p>We also inspected the negative shift at the beginning of 2021, detected by every model. Analysing the single days, it appears that it is more of a trend than a response to a specific event or series of events.</p>
      </sec>
      <sec id="sec-2-15">
        <title>5. Conclusion</title>
        <p>This study addressed the issue of hate speech detection on social media, specifically focusing on X (formerly Twitter) and on migrants and ethnic minorities. Given the complexities of natural language on these platforms, we explored different approaches, including lighter bidirectional LSTM models with and without attention mechanisms, and fine-tuned XLM-RoBERTa models in both their base and large formats. We trained our models on EVALITA 2020 HaSpeeDe 2 data and also introduced a small labeled dataset, IstatHate, which improves the performance of the already best performing model, XLM-RoBERTa-large, when included in the training set.</p>
        <p>Despite longer inference times and the higher computational resources required for large amounts of data, heavier models like XLM-RoBERTa-large achieve significantly higher performance and generalization capabilities. Yet, AT-BiLSTM⋆ (i.e., the AT-BiLSTM model that includes both EVALITA and IstatHate data in the training) outperforms XLM-RoBERTa-base⋆ across all test sets, a notable achievement considering the difference in model size and inference time.</p>
        <p>We compared the predictions of AT-BiLSTM-EV against AT-BiLSTM⋆ by visualising the attention scores they assigned to the same Tweets. Empirical evidence shows that including IstatHate in the training set may improve contextual understanding and mitigate the bias that simpler models like LSTMs may have when classifying hate speech in the presence of curse words.</p>
        <p>The preliminary computation of the Hate Speech Index (HSI) reveals significantly different levels of hate speech detection across different models and training sets, even though the training data has very similar characteristics. Fine-tuned XLM-RoBERTa models produce the lower estimates in levels, especially when IstatHate is included in the training set. Furthermore, when analysing hate peaks, XLM-RoBERTa-large⋆ predictions highly correlate with major events.</p>
        <p>Future work will focus on expanding and validating the IstatHate dataset, exploiting the sampling weights, refining model architectures, and exploring additional features to enhance detection capabilities.</p>
      </sec>
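<p>The correlations between smoothed indices reported above can be computed as below; the 7-day window and the synthetic series in the test are our assumptions, not the paper's exact settings.</p>

```python
import numpy as np

def moving_average(x, window=7):
    """Rolling mean used to smooth a daily HSI series before comparison."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def index_correlation(hsi_a, hsi_b, window=7):
    """Pearson correlation between the smoothed daily indices of two models,
    in the spirit of the 0.81-0.98 range reported in the text."""
    a = moving_average(np.asarray(hsi_a), window)
    b = moving_average(np.asarray(hsi_b), window)
    return np.corrcoef(a, b)[0, 1]
```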
    </sec>
    <sec id="sec-3">
      <title>A. Optimization</title>
      <p>[Table: optimal hyperparameter configuration for each architecture (LSTM, XLM-R-base, XLM-R-large).]</p>
    </sec>
    <sec id="sec-4">
      <title>B. Results</title>
      <sec id="sec-4-1">
        <title>B.1. Peaks</title>
        <sec id="sec-4-1-1">
          <p>Here, we show the daily index of the different models for the dates mentioned in the results section of the paper. The results come from the models trained on both EVALITA and IstatHate.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the evalita 2018 hate speech detection task</article-title>
          ,
          <source>in: Ceur workshop proceedings</source>
          , volume
          <volume>2263</volume>
          ,
          CEUR
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>