<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CHILab at HaSpeeDe3 Task A: Political Hate Speech Detection - Textual</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Siragusa</string-name>
          <email>irene.siragusa02@unipa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pirrone</string-name>
          <email>roberto.pirrone@unipa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria, Università degli Studi di Palermo</institution>
          ,
          <addr-line>Viale delle Scienze, Edificio 6, 90129 Palermo</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This technical report illustrates the system developed by the CHILab team for the HaSpeeDe3 competition as part of the EVALITA 2023 campaign. The key idea for HaSpeeDe3 task A - Political Hate Speech Detection - Textual was to develop different systems arranged as suitable combinations of a Pre-Trained Language Model (PTLM) used for embedding extraction, a neural architecture for further elaboration of the embeddings, and a classifier. In particular, dense layers, LSTM, BiLSTM and Transformer modules were used. The best performing system among the ones investigated in this report couples embeddings extracted via XLM-RoBERTa with a BiLSTM, and reaches a macro-F1 score of 0.876.</p>
      </abstract>
      <kwd-group>
        <kwd>hate speech detection</kwd>
        <kwd>BiLSTM</kwd>
        <kwd>language model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The continuous spread and usage of social media has become a problem when dealing with hate online. All social platforms use artificial intelligence techniques to detect, and report or remove, dangerous content in terms of hate or violence. Interest in this respect is also high in the scientific community: in fact, different international campaigns for detecting hateful speech have been proposed in recent years, such as OffensEval [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>], HatEval [<xref ref-type="bibr" rid="ref3">3</xref>] and HaHackathon [4]. Detection of hateful content in Italian has been addressed by the HaSpeeDe evaluation competitions [5, 6].</p>
      <p>This paper introduces the architecture proposed by the CHILab team for the EVALITA 2023 campaign [7], and in particular for the Hate Speech Detection task (HaSpeeDe3 task A - Political Hate Speech Detection, textual) [8]. The general approach relies on encoding the text into suitable word embeddings that are processed via neural architectures like LSTM, BiLSTM or Transformers; finally, an output classifier detects the presence of hateful content. No generative models [9, 10] were considered in this respect to derive embeddings. Moreover, we decided not to fine-tune our PTLMs, in order to stress the use of light networks that can be trained with low computing resources. Finally, we set up a unique approach for all the tasks we participated in at EVALITA 2023.</p>
      <p>The paper is arranged as follows: Section 2 reports a description of our systems along with data pre-processing, while results are reported and discussed in Section 3. Concluding remarks are in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of the system</title>
      <p>The focus of HaSpeeDe3 was on political and religious hate, where strongly polarized opinions can be found. The data set used in this edition for task A is the PolicyCorpus XL [11], which contains 7000 manually annotated tweets, with a presence of hate labels above 40%. The training data set was released for the campaign with a total of 5600 samples: for development purposes, the given data set was randomly split into a training and a validation set using an 80-20 ratio, resulting in 4480 and 1120 samples respectively.</p>
      <sec id="sec-2-1">
        <title>2.1. Pre-processing</title>
        <p>The [URL] tag, mention references, and retweet notes were removed since they were not considered meaningful: in particular, mentions refer to anonymized accounts, thus they add no special information. This was done after an analysis of the most cited words and hashtags<sup>1</sup>. As reported in Table 1, the [URL] tag is the most frequent one in both classes and adds no information. Overall, no other relevant words appeared that suggest a strong separation between classes. The same considerations can be made for the hashtags, as reported in Table 2.</p>
        <p>[Tables 1 and 2: frequency counts of the most cited words and hashtags, overall (All), in non-hateful (NH) and in hateful (H) tweets; the word and hashtag labels were lost in extraction.]</p>
        <p>Although there are some hashtags that are hateful (such as salvinipagliaccio, speranzadimettiti and governodeipeggiori), the most frequent ones are just either politicians' or parties' names and politics-related words that do not express any polarized content. Moreover, since a strong and significant distinction between hateful and non-hateful hashtags can be made, their information has been kept as a word inside the tweet, thus preserving the crucial information, while the hashtag symbol was removed.</p>
        <p>Similar considerations were made for emojis: also in this case, a strong polarization in the use of emojis did not arise, not even for the ones that are most associated with disgust and hate (Table 3). Since emojis are deeply used in social media communication, they were kept. No further elaboration was made over the tweets: words were not reduced to their lower case form, thus allowing a more accurate extraction of embeddings for the case-sensitive PTLMs. Like emojis, uppercase text has a specific meaning in social media communication in terms of prosodic and emotional interpretation [12, 13].</p>
        <p><sup>1</sup> For this analysis, all the words were reported in their lower case form.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Network architectures</title>
        <p>Different models were developed that share the same macro structure shown in Figure 1. The key idea was to stress, as much as possible, existing neural architectures for sequence processing, namely LSTM [14], BiLSTM and Transformers [15]. Those architectures are used to further process the extracted embeddings.</p>
        <p>After pre-processing, the input sentences were padded to h + 2 tokens, where h is the size of the longest sentence and the remaining two tokens are respectively the [CLS] and the [SEP] one. Either a pre-trained language model or a static context-free embedding model was used for embedding generation. In the latter case, fastText [16] was used, which generates a 300-dimensional embedding, while a 768-dimensional embedding is obtained as usual from the different PTLMs. We used the following encoder-based language models in the experiments: BERT base multilingual cased [17], BERT base italian uncased [18], XLM-RoBERTa [19] and AlBERTo [20], provided by the HuggingFace Transformers library. The embeddings were extracted from the last layer of the PTLMs without fine-tuning. Fine-tuning in these configurations is an option that was not taken into account, since the main idea is to stress the use of light networks to be trained with low computing resources.</p>
        <p>The generated embedding is fed into a module for feature extraction that consists of an LSTM, a BiLSTM or a Transformer (the corresponding architectures are named according to the specific neural module). The output feature vector has the same size as the word embedding, with the exception of the BiLSTM, which generates a double-length output. Finally, the feature vector is passed to a classifier made of either 300 or 768 linear units, depending on the length of the embedding, and a sigmoidal output to achieve binary classification. Some experiments were run by inserting a ReLU dense layer, with exactly the same size, before the aforementioned one. Those architectures are referred to as LSTM-Deep, BiLSTM-Deep and Transformer-Deep (Figure 1.b).</p>
        <p>The illustrated architectures were trained only on the given data set, using a machine equipped with two Intel Xeon E5 CPUs, 96GB RAM and an NVIDIA TITAN Xp GPU with 12GB RAM. Hyperparameters were selected as follows: dropout values in {0.1, 0.2}, batch size 32, Adam optimizer [21] with learning rate 0.01, and a Binary Cross Entropy loss. Models were trained for a maximum of 1000 epochs with a patience value of 50.</p>
        <p>Different feature extractors were implemented using 1, 2 or 3 LSTM/BiLSTM/Transformer layers, but the best results were obtained by the single-layer feature extraction modules. In addition, the developed models are relatively small: the trainable parameters range from 1M to 10M.</p>
      </sec>
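      <p>The pre-processing steps of Section 2.1 can be sketched in Python as follows. This is a minimal illustrative sketch, not the team's released code: it assumes URLs appear as a literal [URL] tag and mentions as @-prefixed tokens.</p>

```python
import re

def preprocess_tweet(text):
    # Remove the literal [URL] tag: it is the most frequent token in both
    # classes and adds no information (Table 1).
    text = re.sub(r"\[URL\]", " ", text)
    # Remove mentions: accounts are anonymized, so they carry no signal.
    text = re.sub(r"@\w+", " ", text)
    # Remove retweet markers.
    text = re.sub(r"\bRT\b", " ", text)
    # Keep the hashtag content as a plain word, dropping only the symbol.
    text = re.sub(r"#(\w+)", r"\1", text)
    # Case and emojis are deliberately preserved.
    return " ".join(text.split())
```

      <p>Note that upper case and emojis are deliberately left untouched, matching the choices motivated above.</p>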
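      <p>The dimension bookkeeping of the classifier described in Section 2.2 - 300- or 768-dimensional embeddings, a BiLSTM that doubles the feature size, and a dense layer (optionally preceded by a ReLU layer of the same size in the "-Deep" variants) with a sigmoidal output - can be sketched as follows. This is a NumPy sketch with random placeholder weights, not the actual trained models.</p>

```python
import numpy as np

def feature_dim(embed_dim, extractor):
    # A BiLSTM concatenates forward and backward states, doubling the size;
    # LSTM and Transformer extractors keep the embedding size.
    return 2 * embed_dim if extractor == "bilstm" else embed_dim

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Classifier:
    """Sketch of the classifier head: one linear layer sized like the
    embedding (300 or 768 units), an optional extra ReLU dense layer of
    the input size (the '-Deep' variants), and a sigmoidal output for
    binary hate detection. Weights are random placeholders."""
    def __init__(self, in_dim, hidden, deep=False, seed=0):
        rng = np.random.default_rng(seed)
        self.deep = deep
        self.W_deep = rng.normal(0.0, 0.02, (in_dim, in_dim))
        self.W1 = rng.normal(0.0, 0.02, (in_dim, hidden))
        self.w_out = rng.normal(0.0, 0.02, (hidden,))
    def forward(self, feats):
        if self.deep:  # ReLU dense layer of the same size, as in *-Deep
            feats = np.maximum(0.0, feats @ self.W_deep)
        h = feats @ self.W1
        return sigmoid(h @ self.w_out)  # probability of hateful content
```

      <p>For example, the XLM-RoBERTa/BiLSTM configuration would use a 2 * 768 = 1536-dimensional feature vector feeding a 768-unit dense layer.</p>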
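      <p>The early-stopping schedule used in training (a patience of 50 epochs over a maximum of 1000) can be illustrated by the following sketch; stop_epoch is a hypothetical helper, not part of the team's code.</p>

```python
def stop_epoch(val_losses, patience=50, max_epochs=1000):
    """Return the epoch at which training stops: after `patience` epochs
    without improvement of the validation loss, or at the last epoch."""
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if best > loss:                       # validation loss improved
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:  # patience exhausted: stop
            return epoch
    return min(len(val_losses), max_epochs) - 1
```
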
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>The best macro-F1 performances obtained on the test set by our models are reported in Table 4. The submitted models were the best runs with respect to the validation set, namely AlBERTo/BiLSTM (run 1) and fastText/BiLSTM (run 2). After the release of the golden labels, it was possible to measure the actual performance of all the developed systems: the XLM-RoBERTa/BiLSTM architecture gives the best results, ranking at the 7th place on the leaderboard, while the submitted runs are at the last places, as shown in Table 5. The best results are obtained when a PTLM is coupled with an LSTM/BiLSTM feature extractor and a single dense layer, while the Transformer-based networks better exploit a context-free embedding by using a two-layer classifier. (In Table 4 some experiments and configurations, like the BiLSTM-Deep one, are not reported because they performed poorly with respect to the submitted architectures.)</p>
      <p>It is worth noticing that only the models that use fastText benefit from removing stopwords, while the PTLMs perform almost equally over LSTM and BiLSTM configurations, as was expected. In the training phase, AlBERTo outperformed the other PTLMs, since it uses a more accurate tokenization compared to the others and takes advantage of its inner knowledge: AlBERTo was trained on a corpus of Italian tweets that shares the same linguistic macro-structure as the PolicyCorpus. On the other hand, the best model is the one based on XLM-RoBERTa: this can be explained by its tokenizer, which owns an inner representation for emojis and considers them as unique tokens rather than as [UNK].</p>
      <sec id="sec-3-1">
        <title>3.1. Error analysis</title>
        <p>Besides the aforementioned differences between the PTLMs used, another analysis was made on the mis-classified tweets by comparing the results of the best architectures (AlBERTo/LSTM, AlBERTo/LSTM-Deep, XLM-RoBERTa and fastText/Transformer-Deep) and the submitted models. The models agree in mis-classifying 32 tweets, and 25 of them are labeled as hateful.</p>
        <p>None of these mis-classified tweets contains emoji, that is, their presence or absence is not a source of bias in those models. Moreover, the majority of those tweets contain hashtags or expressions referring to politicians and to topics of interest in the political debate, which per se are not hateful. On the contrary, tweets containing the hashtag speranzadimettiti, considered hateful as in Table 2, can be found among non-hateful tweets. In those tweets the author disapproves of the governmental behaviour of a minister: in this case the tweet cannot be considered hateful, since it expresses a negative opinion without insulting.</p>
        <p>On the other side, hateful tweets usually contain profanities and vulgar expressions: the hateful tweets that are not correctly classified by the developed models lack those expressions or put them in an unconventional way (self-obfuscation, or embedding in other words), and this leads to their mis-classification.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This paper reported the architectures developed by the CHILab team for HaSpeeDe3 task A, promoted at the EVALITA 2023 campaign. Our models show that a relatively small classical pipeline, made of embedding extraction plus further neural elaboration, can achieve good performance in hate speech detection without the need of fine-tuning PTLMs, and using few computational resources. The use of such a "minimalist" architecture is intended to allow for the future development of compact explainable models, where explicit linguistic knowledge is injected into the network to improve its performance.</p>
      <sec id="sec-4-1">
        <title>Acknowledgments</title>
        <p>This work is supported by the PO FESR 2014-2020 grant n. 086201000543, "SCuSi - Smart Culture in Sicily".</p>
      </sec>
      <sec id="sec-4-2">
        <title>References</title>
        <p>[4] J. A. Meaney, S. Wilson, L. Chiruzzo, A. Lopez, W. Magdy, SemEval 2021 task 7: HaHackathon, detecting and rating humor and offense, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, Online, 2021, pp. 105-119. URL: https://aclanthology.org/2021.semeval-1.9. doi:10.18653/v1/2021.semeval-1.9.</p>
        <p>[5] C. Bosco, D. Felice, F. Poletto, M. Sanguinetti, T. Maurizio, et al., Overview of the EVALITA 2018 hate speech detection task, in: CEUR Workshop Proceedings, volume 2263, CEUR, 2018, pp. 1-9.</p>
        <p>[6] M. Sanguinetti, G. Comandini, E. D. Nuovo, S. Frenda, M. A. Stranisci, C. Bosco, T. Caselli, V. Patti, I. Russo, HaSpeeDe 2 @ EVALITA2020: Overview of the EVALITA 2020 hate speech detection task, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020).</p>
        <p>[7] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, EVALITA 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for Italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[8] M. Lai, F. Celli, A. Ramponi, S. Tonelli, C. Bosco, V. Patti, HaSpeeDe3 at EVALITA 2023: Overview of the political and religious hate speech detection task, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        <p>[9] G. Mialon, R. Dessì, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozière, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, et al., Augmented language models: a survey, arXiv preprint arXiv:2302.07842 (2023).</p>
        <p>[10] Y. Liu, T. Han, S. Ma, J. Zhang, Y. Yang, J. Tian, H. He, A. Li, M. He, Z. Liu, Z. Wu, D. Zhu, X. Li, N. Qiang, D. Shen, T. Liu, B. Ge, Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models, 2023. URL: http://arxiv.org/abs/2304.01852. doi:10.48550/arXiv.2304.01852. arXiv:2304.01852 [cs].</p>
        <p>[11] F. Celli, M. Lai, A. Duzha, C. Bosco, V. Patti, PolicyCorpus XL: An Italian corpus for the detection of hate speech against politics, in: CLiC-it, 2021.</p>
        <p>[12] M. Heath, Orthography in social media: Pragmatic and prosodic interpretations of caps lock, Proceedings of the Linguistic Society of America 3 (2018) 55:1-13. URL: https://journals.linguisticsociety.org/proceedings/index.php/PLSA/article/view/4350. doi:10.3765/plsa.v3i1.4350.</p>
        <p>[13] S. Chan, A. Fyshe, Social and emotional correlates of capitalization on Twitter, in: Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, Association for Computational Linguistics, New Orleans, Louisiana, USA, 2018, pp. 10-15. URL: https://aclanthology.org/W18-1102. doi:10.18653/v1/W18-1102.</p>
        <p>[14] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997) 1735-1780. URL: https://doi.org/10.1162/neco.1997.9.8.1735. doi:10.1162/neco.1997.9.8.1735.</p>
        <p>[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
        <p>[16] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.</p>
        <p>[17] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
        <p>[18] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</p>
        <p>[19] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.</p>
        <p>[20] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, V. Basile, AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets, in: Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481, CEUR, 2019. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85074851349&amp;partnerID=40&amp;md5=7abed946e06f76b3825ae5e294ffac14.</p>
        <p>[21] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malmasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Farra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval)</article-title>
          ,
          <source>in: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota, USA,
          <year>2019</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>86</lpage>
          . URL: https://aclanthology.org/S19-2010. doi:10.18653/v1/S19-2010.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zampieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          , G. Karadzhov,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Derczynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pitenis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ç.</given-names>
            <surname>Çöltekin</surname>
          </string-name>
          ,
          <article-title>SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)</article-title>
          , in: Proceedings of SemEval,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Rangel Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter</article-title>
          ,
          <source>in: Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 54-63. URL: https://aclanthology.org/S19-2007. doi:10.18653/v1/S19-2007.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>