<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Hate Speech Index with Attention-based LSTMs and XLM-RoBERTa</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mauro Bruno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Catanese</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Ortame</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istat – Italian National Institute of Statistics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The diffusion of hate speech on social media requires robust detection mechanisms to measure its harmful impact. However, detecting hate speech, particularly in the complex linguistic environments of social media, presents significant challenges due to slang, sarcasm, and neologisms. State-of-the-art methods like Large Language Models (LLMs) demonstrate strong contextual understanding, but they often require prohibitive computational resources (the number of parameters in LLMs ranges between a few billion and hundreds of billions, while the large version of XLM-RoBERTa “only” has 561 million parameters). To address this, we propose two solutions: (1) a bidirectional long short-term memory network with an attention mechanism (AT-BiLSTM) to enhance the model's interpretability and natural language understanding, and (2) fine-tuned multilingual robustly optimized BERT (XLM-RoBERTa) models. Building on the promising results from EVALITA campaigns in hate speech detection, we develop robust classifiers to analyse 20.4 million Tweets related to migrants and ethnic minorities. Further, we utilise an additional custom labeled dataset (IstatHate) for benchmarking and training, and we show how its inclusion can improve classification performance. Our best model outperforms top entries from previous EVALITA campaigns. Finally, we introduce Hate Speech Indices (HSI), which capture the dynamics of hate speech over time, and assess whether their main peaks correlate with major events.</p>
      </abstract>
      <kwd-group>
        <kwd>hate speech detection</kwd>
        <kwd>deep learning</kwd>
        <kwd>attention mechanism</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>artificial intelligence</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>XLM-RoBERTa (large) model, benchmarked against its base, smaller version. We use two labeled training sets: (a) the EVALITA 2020 HaSpeeDe 2 task dataset, and (b) a custom, smaller labeled dataset, which we refer to as IstatHate. Our study explores the impact of training models on both the EVALITA dataset alone and a combined dataset that includes EVALITA and IstatHate, evaluating their performance across multiple test sets.</p>
      <p>Finally, we present a preliminary version of the Hate Speech Index (HSI), designed to quantify the proportion of hate speech by classifying 20.4 million Italian Tweets related to migrants and ethnic minorities from January 2018 to February 2023.</p>
      <sec id="sec-1-1">
        <title>2. Data</title>
        <p>This section describes the data used for training, validating, and testing the models, and the corpus of Tweets on which we compute the hate speech index (HSI).</p>
        <p>2.1. Corpus: The prediction corpus consists of 20.4 million unlabeled Tweets from January 2018 to February 2023. The Tweets are obtained through a two-step filtering procedure: first, a general 250-keyword filter gathers Tweets directly from X’s API; second, a smaller, immigration-related keyword filter retrieves the relevant Tweets from the database. Thematic experts, borrowing the contents of discrimination survey questionnaires, have derived a preliminary filter. These regular or stemmed expressions have been validated by means of topic modelling analysis and word embeddings. For instance, the word cinese (“Chinese”) was almost always related to markets or products and has therefore been removed. We also noticed that the generic term stranieri (“foreigners”) gives rise to some residual out-of-scope and irrelevant conversations. These issues only affect around 5% of the total texts. The final filter consists of 21 stemmed expressions (e.g., immigrat-) or complete words.</p>
        <sec id="sec-1-1-1">
          <title>2.2. Training data</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>EVALITA</title>
        <p>Most of the labeled training data comes from the EVALITA 2020 HaSpeeDe 2 task. The distribution of the labels in the training dataset is shown in Table 1.</p>
        <p>IstatHate: Additionally, we use a custom-labeled dataset, i.e., IstatHate, derived from our corpus in the following way: (<xref ref-type="bibr" rid="ref1">1</xref>) we fit a Latent Dirichlet Allocation (LDA) model [6] on the entire corpus; (2) we identify clusters likely to contain hateful Tweets, i.e., those with offensive language, such as “fate schifo” (“you suck”), “avete rotto i c****oni” (“you pi**ed us off”), and a few others; (3) we retrieve Tweets from these clusters, identifying the expressions with a probability of 1 of belonging to the clusters. This approach isolates 242,000 Tweets, of which 67,000 are unique. It is worth noticing that viral Tweets (the ones that are repeated/retweeted several times) need to be annotated with a higher probability. A common practice to draw a much more efficient sample than simple random sampling is to use stratified sampling, an effective method for handling skewed distributions; in particular, we adopted [7]. (4) We employ stratified sampling using the total number of Tweets as the target variable, and we divided that variable into five classes, using them as stratification criteria. (5) The Tweets are then stratified into the classes based on the number of retweets, with the final class being a take-all stratum, resulting in 681 sampled texts and ensuring a coefficient of variation of 5%. (6) These 681 Tweets are then manually labeled by Istat researchers adopting the following criteria: if the language is vulgar/aggressive but generic, it is not labeled as hateful; if, on the contrary, it is related to migrants and/or ethnic minorities and the hate/prejudice is clearly directed towards them, it is labeled as hateful. The weighted estimate indicates that 34% of the Tweets contain hateful language, serving as a rough upper bound of the hate proportion within our prediction corpus. Even if our sample dataset likely over-represents hateful content, we disregard the weighting at this preliminary phase, simply adding IstatHate to the EVALITA dataset.</p>
      </sec>
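<p>Steps (4) and (5) of the IstatHate construction, five retweet-count classes with a take-all top stratum, can be sketched as follows. This is a minimal illustration: the function name, the class boundaries, and the per-stratum sample sizes are invented for the example; the paper's actual allocation follows [7].</p>

```python
import random

def stratified_sample(tweets, boundaries, sizes, seed=42):
    """Stratified sample over retweet-count classes with a take-all top stratum.

    tweets: list of (text, n_retweets); boundaries: upper bounds of the first
    strata (anything above the last bound falls in the take-all stratum);
    sizes: per-stratum sample sizes for the non-take-all strata.
    All names and values are illustrative.
    """
    rng = random.Random(seed)
    strata = [[] for _ in range(len(boundaries) + 1)]
    for tweet in tweets:
        for i, bound in enumerate(boundaries):
            if tweet[1] <= bound:
                strata[i].append(tweet)
                break
        else:
            strata[-1].append(tweet)  # most-retweeted class: take-all stratum
    sample = []
    for stratum, n in zip(strata[:-1], sizes):
        sample.extend(rng.sample(stratum, min(n, len(stratum))))
    sample.extend(strata[-1])  # every viral Tweet enters the sample
    return sample
```

A take-all stratum for viral Tweets mirrors the paper's observation that heavily retweeted texts must be annotated with a higher probability.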
    </sec>
    <sec id="sec-2">
      <title>3. Methodology</title>
      <sec id="sec-2-1">
        <p>In this section, we present the methodology adopted in our study and outline the experimental design. We begin by introducing the model architectures, followed by a detailed description of the training procedure.</p>
        <sec id="sec-2-1-1">
          <title>3.1. AT-BiLSTM model architecture</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>The architecture of our attention-based bidirectional LSTM (AT-BiLSTM) model comprises four main components: an embedding layer, a bidirectional LSTM layer, an attention layer, and an output layer. We will detail each component sequentially.</p>
        <p>Embedding layer: We pre-train a FastText [8]
embedding model on the prediction corpus and extract the word
vectors to initialise the weights of the embedding
matrix. Table 2 presents the main training parameters of our
model: each word is represented by a 300-dimensional
vector, the training considers a distance window between
words of up to 8 positions, and the model is trained for
25 epochs using a continuous bag-of-words algorithm.</p>
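<p>A minimal sketch of how pretrained FastText word vectors can seed the embedding-layer weights. The helper name, the padding-row convention, and the random initialisation for out-of-vocabulary words are our illustrative assumptions, not details from the paper.</p>

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim=300, seed=0):
    """Initialise embedding-layer weights from pretrained word vectors.

    vocab: {word: row index, starting at 1}; vectors: {word: ndarray(dim)}
    from a pretrained model (here, FastText trained with dim=300, window=8,
    25 epochs, CBOW, as in Table 2). Row 0 is reserved for padding and
    stays zero; out-of-vocabulary words get small random vectors.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(vocab) + 1, dim))      # +1 for the padding row
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]           # copy the pretrained vector
        else:
            matrix[idx] = rng.normal(0, 0.1, dim) # random init for OOV words
    return matrix
```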
      </sec>
      <sec id="sec-2-3">
        <title>Attention mechanism</title>
        <p>In deep learning, attention mechanisms can improve model performance by focusing on important features of input sequences.</p>
        <p>In our model, the attention mechanism is implemented on top of the LSTM layer to focus on the most relevant parts of the input sequence for predictions [9]. Our attention mechanism works as follows:</p>
        <list list-type="bullet">
          <list-item><p>Transform the LSTM output using a fully connected layer to get attention scores for each word.</p></list-item>
          <list-item><p>Normalise these scores into attention weights with a softmax function, creating a pseudo-probability distribution.</p></list-item>
          <list-item><p>Compute a context vector by taking a weighted sum of the LSTM outputs using the attention weights. This context vector emphasizes the most important parts of the input sequence for the classification task. (We also experimented with attention masking; however, this negatively impacted accuracy. Upon inspecting the attention scores, we observed that the model naturally assigns negligible weights to padding tokens.)</p></list-item>
        </list>
        <p>The attention mechanism allows our model to dynamically focus on different parts of the input for different examples.</p>
      </sec>
      <sec id="sec-2-4">
        <title>LSTM layer</title>
        <p>The core of our model is a bidirectional Long Short-Term Memory (LSTM) network. LSTMs are a specialized type of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data [10]. The bidirectional aspect of our LSTM processes the input sequence in both forward and backward directions. This bidirectionality provides the network with context from both past and future states for any given point (word) in the sequence (sentence) [11]. In practice, this means that when our model is processing a word in a Tweet, it has information about the words that came before and after it, allowing for an increased understanding of context.</p>
        <p>The LSTM layer consists of multiple stacked bidirectional LSTM cells. Each cell maintains a cell state and a hidden state, which are updated at each time step as the input sequence is processed. The number of layers is included in the hyperparameter optimization phase.</p>
        <p>Output layer: The final component of our model is a fully connected (dense) layer that takes the context vector produced by the attention mechanism as input. The output of this layer is one-dimensional, as our hate speech detection task has two classes. It is passed through a sigmoid function to produce a number between 0 and 1, and the class is assigned by comparing this output with a threshold (0.5).</p>
        <p>The optimal configuration for each LSTM-based model, resulting from Bayesian hyperparameter optimization, is detailed in the Appendix. (We ran both random search and Bayesian optimization; the best result came from the latter.)</p>
        <sec id="sec-2-4-1">
          <title>3.2. XLM-RoBERTa</title>
          <p>Multilingual RoBERTa (XLM-RoBERTa, or XLM-R) is a transformer-based model that builds upon the original BERT model and the monolingual RoBERTa (Robustly Optimized BERT Pretraining Approach) model [12]. It is designed to handle multiple languages, making it particularly suitable for our task of hate speech detection in Italian texts.</p>
          <p>XLM-RoBERTa is trained on 100 different languages and has a much larger vocabulary size (250k tokens) compared to both BERT (30k tokens) and RoBERTa (50k tokens).</p>
        </sec>
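<p>The three attention steps listed above can be sketched in a few lines of NumPy. Shapes and weight names are illustrative; in the actual model, the scoring layer is learned jointly with the rest of the network.</p>

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(lstm_out, W, b):
    """Attention on top of BiLSTM outputs, following the three steps in the text.

    lstm_out: (seq_len, 2*hidden) outputs of the bidirectional LSTM;
    W: (2*hidden,) weights and b: scalar bias of the scoring layer
    (illustrative shapes for a single attention head).
    """
    scores = lstm_out @ W + b       # (1) fully connected layer: a score per word
    weights = softmax(scores)       # (2) normalise into a pseudo-probability
    context = weights @ lstm_out    # (3) weighted sum -> context vector
    return context, weights
```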
        <sec id="sec-2-4-2">
          <title>3.3. Training</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>3.3.1. Experimental design</title>
        <p>In this section, we outline the experimental design we followed to obtain our results. We structured our experiments to systematically assess model performance under different training conditions and across various test sets.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Training sets</title>
        <p>We trained each model under two distinct scenarios: (<xref ref-type="bibr" rid="ref1">1</xref>) a training set comprising only data from the EVALITA labeled dataset, and (2) a training set comprising both EVALITA data and IstatHate data.</p>
      </sec>
      <sec id="sec-2-7">
        <title>Evaluation</title>
        <p>We evaluate every model on three test datasets: (a) a test set comprising only data from the EVALITA test dataset, (b) a test set comprising only data from the IstatHate test set, and (c) a combined test set comprising data from both EVALITA and IstatHate test sets. None of the texts in these test sets are seen by the models during training, in any scenario.</p>
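<p>The macro F1 score reported on each test set averages the per-class F1 values; a self-contained sketch (label encoding assumed: 1 = hateful, 0 = not hateful):</p>

```python
def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because both classes contribute equally, macro F1 does not let the majority (non-hateful) class dominate the score, which matters on skewed hate speech test sets.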
      </sec>
      <sec id="sec-2-8">
        <p>Therefore, we have four different architectures and two training sets, resulting in eight distinct models.</p>
      </sec>
      <sec id="sec-2-9">
        <title>3.3.2. Model Training</title>
        <p>LSTM-based: We ran a Bayesian optimization process to automatically extract optimal hyperparameters. This optimization process is detailed in the Appendix. We trained the models for 10 epochs, and we extracted the best configuration based on validation loss.</p>
        <p>XLM-RoBERTa: Given the large size of XLM-RoBERTa models, we were not able to run Bayesian optimization, and instead employed grid search over a reduced subset of hyperparameters. We trained the models for 10 epochs, and extracted the weights from the run with the lowest validation loss. We follow a training procedure loosely based on the methodology outlined by [13], but with adaptations to the data and hyperparameters to optimise performance for our specific use case. A detailed description of the training hyperparameters can be found in Appendix A.1, and a detailed table comparing the training and inference times of the different models can be found in Appendix A.2.</p>
      </sec>
      <sec id="sec-2-10">
        <title>4. Results</title>
        <p>In this section, we present the results of our analysis, covering model performance, attention weight visualizations, and Hate Speech Index (HSI) predictions.</p>
        <p>4.1. Model performance: Table 3 highlights the performance of the models, presenting the macro F1 score across the different test sets. There are several observations that can be made about these results. First, there is a clear positive correlation between model size and performance, particularly evident in the XLM-RoBERTa models, where the larger variant consistently outperforms the smaller ones across all test sets. This is expected for a complex task like hate speech detection.</p>
        <p>A more interesting observation can be made about the effect of including IstatHate in the training set alongside EVALITA data: besides the expected increase in performance on the IstatHate test set, there is a case in which the performance on the EVALITA test set increases too, namely XLM-RoBERTa-large⋆. This non-trivial cross-dataset improvement suggests that training on both datasets enhances the model’s generalization capabilities, despite the fact that the datasets were labeled by different people. Finally, it is interesting to notice how a simpler model like AT-BiLSTM⋆ manages to outperform XLM-RoBERTa-base⋆ on all test sets.</p>
        <p>Results on the IstatHate test set are consistently lower than results on the EVALITA test set, but this was expected, as, even when included in the training, IstatHate is much smaller in size. The Full test set is a combination of the EVALITA test set and the IstatHate test set, and therefore the macro F1 scores on the Full test set are a weighted mean between the ones obtained on EVALITA and IstatHate. The best performing model across all test sets is XLM-RoBERTa-large⋆, i.e., the one fine-tuned on the training set combining both EVALITA and IstatHate.</p>
        <sec id="sec-2-10-1">
          <title>4.2. Attention visualization</title>
          <p>An advantage of an AT-BiLSTM model over a standard BiLSTM model is its ability to visualise attention scores for each word, making outputs more interpretable. (Attention scores can be visualized in BERT-based models too [14], but the XLM-RoBERTa tokenizer does not always split Italian text into complete words, making interpretation trickier.) Visualising attention scores provides a useful method for empirically examining the impact of training models on different datasets. For instance, the following are two Tweets classified by the AT-BiLSTM-EV model, along with their corresponding attention scores.</p>
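<p>A toy helper for inspecting per-word attention scores like those discussed in this section; the text-bar rendering and the function name are purely illustrative, not the paper's plotting code.</p>

```python
def show_attention(tokens, weights, top_k=3):
    """Print a simple text 'heat map' of attention weights and return the
    top-k most attended tokens (illustrative inspection helper)."""
    for tok, w in zip(tokens, weights):
        print(f"{tok:>12s} {'#' * int(round(w * 20))} {w:.2f}")
    ranked = sorted(zip(tokens, weights), key=lambda tw: -tw[1])
    return [tok for tok, _ in ranked[:top_k]]
```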
        </sec>
      </sec>
      <sec id="sec-2-11">
        <title>Tweet 1 (true: No Hate, predicted: Hate)</title>
        <p>IT poi rompe il caz**o a tutti perché ha accolto una
famiglia di profughi
EN then they break our ba**s because they hosted a
family of refugees</p>
      </sec>
      <sec id="sec-2-12">
        <p>The first Tweet is misclassified by the AT-BiLSTM-EV model. Analysing the attention scores, we can see how a
lot of emphasis was put on curse words both on Tweet
1 and Tweet 2. Figure 3 shows the attention scores
produced by the AT-BiLSTM⋆ model for Tweet 1 and Tweet
2, both texts are correctly classified. We can see how a
lot of attention is still put on curse words like ca**o and
bastardi, but a significant attention score is also given
to profughi ("refugees") in Tweet 1. Since the Tweet is
correctly classified as not hateful – it contains aggressive
language but not directed towards migrants or ethnic
minorities – we can assume that there is an increased
contextual understanding compared to AT-BiLSTM-EV.
Additionally, Figure 3 (bottom) shows how the
distribution of attention scores for the AT-BiLSTM⋆ model is
much more concentrated compared to AT-BiLSTM-EV.</p>
        <sec id="sec-2-12-1">
          <title>4.3. Hate Speech Index (HSI)</title>
        </sec>
      </sec>
      <sec id="sec-2-13">
        <p>In this section, we present and briefly discuss our preliminary Hate Speech Index (HSI) results. Firstly, the daily HSI is computed as follows:</p>
        <p>HSI_d = n_hate,d / (n_hate,d + n_nohate,d),</p>
        <p>where n_hate,d is the number of Tweets classified as hateful on day d, and n_nohate,d is the number of Tweets classified as not hateful on day d.</p>
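<p>Given per-day classification counts, the daily index reduces to a simple ratio; a minimal sketch (the record layout is our assumption):</p>

```python
from collections import Counter

def daily_hsi(records):
    """Daily Hate Speech Index: hateful Tweets over all classified Tweets.

    records: iterable of (day, label) pairs with label 1 = hateful,
    0 = not hateful. Returns {day: HSI_d}.
    """
    hate, total = Counter(), Counter()
    for day, label in records:
        total[day] += 1
        hate[day] += label
    return {day: hate[day] / total[day] for day in total}
```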
      </sec>
      <sec id="sec-2-14">
        <p>One immediately noticeable difference between the models trained solely on EVALITA and the models trained on EVALITA and IstatHate is the consistently lower level of the predictions coming from the latter compared to the former, for all settings. In particular, the minimum decrease is recorded by BiLSTM models (−0.01), while the maximum decrease is achieved by XLM-RoBERTa-base (−0.09). The lowest mean value for the HSI is achieved by XLM-RoBERTa-base⋆, with an average indicating a percentage of 11.7% hateful Tweets over the total Tweets in the corpus. The best performing model, XLM-RoBERTa-large⋆, predicts 14.1% of hateful Tweets.</p>
        <p>With respect to the standard deviation, we observe that XLM-RoBERTa models show lower variability compared to LSTM-based models. For XLM-RoBERTa and BiLSTM models, the standard deviation decreases when including IstatHate in the dataset.</p>
        <p>Correlation: The dynamics of the moving averages of the indices appear to be relatively coherent between models, as confirmed by correlations in the range between 0.81 (AT-BiLSTM⋆ vs XLM-RoBERTa-base-EV) and 0.98 (BiLSTM⋆ vs BiLSTM-EV). The lowest correlation between models with the same architecture and different training sets amounts to 0.88 (XLM-RoBERTa-base⋆ vs XLM-RoBERTa-base-EV).</p>
        <p>We can now analyse a few peaks in the daily time series to empirically assess the quality of the estimates and the ability of the models to detect specific events.</p>
        <p>October 24, 2018: This date refers to the diffusion of the news about an unfortunate event in which a 16-year-old girl was raped and killed by a group of men from Senegal and Nigeria. If we look at the trends in Figure 5 (top) and Figure 6 (top) in Appendix B.1, we notice how the increase in the proportion of hate speech persists in the following period. In this case, we observe that all models detect the event, registering values more than twice their average.</p>
        <p>July 25, 2021: This peak refers to news about another 16-year-old Italian girl who was beaten up on the street by her 17-year-old Moroccan boyfriend. From Figure 5 (bottom) and Figure 6 (bottom) in Appendix B.1, we can see how not all models detect this event. In particular, of the models trained on both EVALITA and IstatHate, only XLM-RoBERTa-large⋆ and AT-BiLSTM⋆ show a clear peak in the trend, while LSTM-based models trained only on EVALITA struggle to identify this peak. The only model that detects the peak in both cases is XLM-RoBERTa-large, further empirically confirming its robustness.</p>
        <p>We also inspected the negative shift at the beginning of 2021, detected by every model. Analysing the single days, it appears that it is more of a trend than a response to a specific event or series of events.</p>
      </sec>
      <sec id="sec-2-15">
        <title>5. Conclusion</title>
        <p>This study addressed the issue of hate speech detection on social media, specifically focusing on X (formerly Twitter) and on migrants and ethnic minorities. Given the complexities of natural language on these platforms, we explored different approaches, including lighter bidirectional LSTM models with and without attention mechanisms, and fine-tuned XLM-RoBERTa models in both their base and large formats. We trained our models on EVALITA 2020 HaSpeeDe 2 data and also introduced a small labeled dataset, IstatHate, which improves the performance of the already best performing model, XLM-RoBERTa-large, when included in the training set.</p>
        <p>Despite longer inference times and the higher computational resources required for large amounts of data, heavier models like XLM-RoBERTa-large achieve significantly higher performance and generalization capabilities. Yet, AT-BiLSTM⋆ (i.e., the AT-BiLSTM model that includes both EVALITA and IstatHate data in the training) outperforms XLM-RoBERTa-base⋆ across all test sets, a notable achievement considering the difference in model size and inference time.</p>
        <p>We compared the predictions of AT-BiLSTM-EV against AT-BiLSTM⋆ by visualising the attention scores they assigned to the same Tweets. Empirical evidence shows that including IstatHate in the training set may improve contextual understanding and mitigate the bias that simpler models like LSTMs may have when classifying hate speech in the presence of curse words.</p>
        <p>The preliminary computation of the Hate Speech Index (HSI) reveals significantly different levels of hate speech detection across different models and training sets, even though the training data has very similar characteristics. Fine-tuned XLM-RoBERTa models produce the lower estimates in levels, especially when IstatHate is included in the training set. Furthermore, when analysing hate peaks, XLM-RoBERTa-large⋆ predictions highly correlate with major events.</p>
        <p>Future work will focus on expanding and validating the IstatHate dataset, exploiting the sampling weights, refining model architectures, and exploring additional features to enhance detection capabilities.</p>
      </sec>
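<p>The correlations between smoothed indices reported above can be computed as below; the 7-day window and the synthetic series in the test are our assumptions, not the paper's exact settings.</p>

```python
import numpy as np

def moving_average(x, window=7):
    """Rolling mean used to smooth a daily HSI series before comparison."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

def index_correlation(hsi_a, hsi_b, window=7):
    """Pearson correlation between the smoothed daily indices of two models,
    in the spirit of the 0.81-0.98 range reported in the text."""
    a = moving_average(np.asarray(hsi_a), window)
    b = moving_average(np.asarray(hsi_b), window)
    return np.corrcoef(a, b)[0, 1]
```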
    </sec>
    <sec id="sec-3">
      <title>A. Optimization</title>
      <p>[Table: optimal hyperparameter configuration for each architecture (LSTM, XLM-R-base, XLM-R-large).]</p>
    </sec>
    <sec id="sec-4">
      <title>B. Results</title>
      <sec id="sec-4-1">
        <title>B.1. Peaks</title>
        <sec id="sec-4-1-1">
          <p>Here, we show the daily index of the different models for the dates mentioned in the results section of the paper. The results come from the models trained on both EVALITA and IstatHate.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the evalita 2018 hate speech detection task</article-title>
          ,
          <source>in: Ceur workshop proceedings</source>
          , volume
          <volume>2263</volume>
          ,
          CEUR
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>