<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERTicelli at HaSpeeDe 3: Fine-tuning and Cross-validating Large Language Models for Hate Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Grotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Quick</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiPS Research Center, University of Antwerp</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universiteit Antwerpen, Faculty of Arts</institution>
          ,
          <addr-line>Prinsstraat 13, B-2000, Antwerp</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The present paper describes the results of the experiments carried out for the HaSpeeDe 3 shared task, an Italian-language Hate Speech (HS) detection task at EVALITA 2023. Two BERT-based language models were selected: UmBERTo (cased) and Italian BERT (cased). For the Textual task, the models were fine-tuned and cross-validated across 5 folds. For the Contextual task, we adopted an ensemble approach: the additional features were combined with the fine-tuned models' outputs through the GradientBoostingClassifier algorithm. The models perform better than the baselines (DummyClassifier and LogisticRegression) and above the average performance of participants in the shared task. While the addition of contextual features did not improve the performance of UmBERTo, it significantly improved the results obtained with Italian BERT.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech detection</kwd>
        <kwd>Italian language</kwd>
        <kwd>BERT-based language models</kwd>
        <kwd>Fine-tuning</kwd>
        <kwd>Contextual features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The escalating issue of toxic language has been amplified by the rapid growth in social media usage over the past decade [1]. Platforms such as Facebook and Twitter have transformed the way individuals interact, making interaction faster and often anonymous, thereby creating an ideal environment for the propagation of harmful content [2]. Furthermore, previous studies have shown that this content can be targeted at and posted by both individuals and groups, inciting and driving violent acts in the offline world [2, 3, 4].</p>
      <p>As such, countering the phenomenon of toxic language has garnered significant attention from legal authorities, social media platforms, and companies [5]. Platforms like Facebook, Twitter, YouTube, and other websites have taken measures to combat toxic language by implementing bans. However, research has pointed out the limitations of companies' control systems and their heavy reliance on user reports to identify problematic comments or posts [6]. The manual filtering of messages containing toxic language has proven to be not only time-consuming but also detrimental to human annotators [7]. Additionally, studies have revealed that human-labeled data can be influenced by individual annotators' biases [8].</p>
      <p>Such interest was reflected in the field of Natural Language Processing (NLP), which has witnessed a surge in interest and popularity in automatic toxic language detection [8]. Researchers aim to develop models that can alleviate the harm caused by online HS [9]. Automating the detection process not only overcomes the challenges of manual filtering but also enables efficient analysis of large volumes of data.</p>
      <p>As a reminder, we here use the term HS as an umbrella term and do not distinguish between its subcategories on a theoretical level. For a more extensive discussion of HS hierarchies and definitions, refer to Zampieri [7] and Caselli et al. [10]. It is worth noting that scholars often do not agree on what constitutes HS and how it differs from, e.g., offensive or aggressive language [11].</p>
      <p>EVALITA 2023: Final Workshop of the 8th evaluation campaign, September 08–09, 2021, Parma, IT. lgrotti@uantwerpen.be (L. Grotti); patrick.quick@student.uantwerpen.be (P. Quick). https://github.com/corvusMidnight (L. Grotti). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <p>As we have mentioned, the growing interest in addressing toxic language is evident in the numerous tasks dedicated to its detection and its various subcategories. These include Aggression Identification [12], Offensive Language Identification [7], and HS detection in Italian Facebook and Twitter messages [13], among others. Over time, the quantity and quality of available models for toxic language detection have significantly increased. Markov, Gevers, and Daelemans (2021) note that the advent of transformer-based pre-trained language models, coupled with the abundance of user-generated content on social media, has greatly improved detection accuracy.</p>
      <p>Despite the overall improvement of the models, a series of challenges remain. For instance, it has been shown how the lack of data in languages other than English [14] has exacerbated already existing issues, such as the high occurrence of code words and misspellings in HS text</p>
      <p>The task organizers [20] provided development data consisting of 5,600 Italian-language tweets from the Policycorpus XL corpus, a manually-annotated HS corpus [21]. The testing data consists of one subset of in-domain data and one subset of out-of-domain data. The in-domain data consists of 1,400 Italian-language tweets from Policycorpus XL, and the out-of-domain data consists of 3,000 Italian-language tweets from the Italian subset of the ReligiousHate corpus, a manually-annotated religious HS corpus [22].</p>
      <p>3.3. Models</p>
      <p>Fine-tuning, a common technique in NLP, is a form of transfer learning that involves training a pre-trained model on new data to adapt it for a specific downstream task [24]. As mentioned in Section 2, there are notable benefits in fine-tuning models when it comes to HS detection tasks. Furthermore, this approach has been widely used in Italian HS detection, see, e.g., Eric et al. (2020), Tamburini (2020), Nozza et al. (2022). Thus, we fine-tune two large pre-trained language models:</p>
    </sec>
    <sec id="sec-2">
      <title>-</title>
      <p>1. For an extensive explanation of how these factors have improved the performance of HS detection systems, see Yin and Zubiaga (2021).</p>
      <p>• UmBERTo-commoncrawl-cased (Run 1) is a RoBERTa-based model trained on the Italian section of the OSCAR (Open Super-large Crawled ALMAnaCH coRpus) large corpus. The model has been used for both Named Entity Recognition (NER) and Part-of-Speech (POS) tagging and reached excellent performance on different datasets.</p>
      <p>• bert-base-italian-cased (Run 2) is a BERT-based model which was trained on over 13GB of data (roughly two billion tokens). The model was pre-trained on a combination of data which includes the OPUS corpus as well as a Wikipedia dump. Note that, for ease of readability, we will from now on refer to this model as BERT-ita.</p>
      <p>For the Contextual task, the output labels of both BERT-based models, together with the additional contextual features, are used as input features for GradientBoostingClassifier.</p>
      <p>To further assess the performance of our models, we build four baselines: one LogisticRegression and one DummyClassifier for each task. For the Textual and Contextual tasks, the models were trained on the textual data and evaluated on the in-domain test data. For the cross-domain task, they were trained on the same textual data but evaluated on the out-of-domain test set instead. It is worth mentioning that no additional data was used at any stage of our experiments.</p>
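A minimal sketch of these two baselines in scikit-learn; the toy tweets below and the TF-IDF featurization for LogisticRegression are assumptions, since the paper does not specify the baselines' input representation:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy stand-ins for the labeled tweets (1 = HS, 0 = non-HS)
texts = ["odio tutti", "buona giornata", "che schifo", "bella foto"] * 25
labels = [1, 0, 1, 0] * 25

# DummyClassifier ignores the input and samples from the label distribution
dummy = DummyClassifier(strategy="stratified", random_state=0).fit(texts, labels)
# LogisticRegression over a bag-of-words representation
logreg = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

print(logreg.predict(["che schifo"]))  # -> [1]
```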
      <p>For the Textual tasks (both in- and out-of-domain), our experimental setup consists of three stages. To begin with, we apply two basic preprocessing steps, which consist of substituting the pseudo-random user identifiers (e.g., '@12020569') with '@USER' and removing the hash symbol (i.e., '#') from hashtags. Such steps are applied to avoid excessively long tweets2 and to remove unnecessary noise.</p>
      <p>Then, both models are fine-tuned and cross-validated across five folds using PyTorch Trainer and the Transformers library. The development data is shuffled and divided into five folds. For each fold, the models are fine-tuned on 80% of the development data and evaluated on the remaining 20% across 10 epochs with an EarlyStopping patience of 3. We employ cross-validation to ensure that the obtained results are not dependent on a particular data split but rather generalize well across multiple folds.</p>
      <p>During this stage, we also tune the learning rate (1e-3, 2e-5, and 5e-05)3. We do not tune the batch size, as the test data was not available at this stage and increasing the batch size may have improved development set performance but worsened generalizability on unseen data (see He et al., 2019). Once the stability of the results has been established through cross-validation, the models are fine-tuned on 85% of the training data4 (after shuffling), and the resulting model is saved and used to output predictions on both test sets.</p>
      <p>For the Contextual task, additional features are incorporated into the model using GradientBoostingClassifier. This ensemble algorithm sequentially trains weak models, resulting in a strong model that is a weighted combination of the weak models. Unlike other algorithms, GradientBoostingClassifier employs decision trees as weak learners and is optimized through gradient descent.</p>
    </sec>
    <sec id="sec-2b">
      <title>4. Results and Discussion</title>
      <p>In this section, we describe the results obtained for HaSpeeDe 2023. For each model, we report precision, recall, and F1 score (for both classes). We submitted results for every sub-task except for Task B's XPoliticalHate sub-task. All results are compared with the respective baselines.</p>
      <sec id="sec-2b-1">
        <title>4.1. Baselines</title>
        <p>As a reference point, Table 2 first presents the baseline results for both in- and out-of-domain tasks for each class. The DummyClassifier performs slightly above random chance for the in-domain task, with an average F1 score of 0.52. However, the out-of-domain results are, as expected, poorer, reaching an average F1 score of 0.42. LogisticRegression, on the other hand, achieves competitive results, with average F1 scores of 0.86 for in-domain data and 0.52 for out-of-domain data. However, upon further inspection, we can observe how LogisticRegression is fairly limited. With a precision of 0.80 for the non-hate speech (¬HS) class, the model exhibits a relatively high rate of false positives. Additionally, the recall of 0.96 for ¬HS implies that a high proportion of ¬HS instances is captured, but at the expense of potentially overlooking some true negatives. These results suggest that the model may be overly biased towards predicting instances as ¬HS, potentially missing some actual HS instances. Similarly, while a high precision of 0.953 is achieved for the HS class, the model shows a fairly low recall of 0.75. In turn, this pattern implies the model's inability to identify a significant portion of actual HS instances, resulting in false negatives.</p>
        <p>It is worth mentioning that the high performance of LogisticRegression in the in-domain task is likely related to the balanced nature of the data (700 HS v. 700 ¬HS). When out-of-domain, unbalanced test data (see Table 1) is used, performance drastically drops.</p>
      </sec>
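The fold-wise procedure described in Section 3.4 can be sketched with scikit-learn's KFold; here a TF-IDF plus LogisticRegression pipeline stands in for the fine-tuned transformer, since the full PyTorch Trainer loop is beyond a short listing, and the texts and labels are toy stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline

# toy stand-ins for the development tweets and binary HS labels
texts = ["tweet numero %d" % i for i in range(100)]
labels = [i % 2 for i in range(100)]

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(texts):
    # fit on 80% of the development data, evaluate on the held-out 20%
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    preds = model.predict([texts[i] for i in val_idx])
    scores.append(f1_score([labels[i] for i in val_idx], preds, average="macro"))

print(sum(scores) / len(scores))  # mean macro-F1 across the five folds
```

Agreement of the per-fold scores is what establishes that results do not depend on a particular split.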
    </sec>
    <sec id="sec-3">
      <title>-</title>
      <p>2. The presence of multiple user tags in some of the tweets caused a mismatch in Tensor size and consequently a RuntimeError.</p>
      <p>3. These learning rates are found in Nozza et al. (2022), HuggingFace's fine-tuning guide, and in the standard training parameters, respectively.</p>
      <p>4. Such a configuration is selected to mirror the task's original train-test split of 5600-1400, see Celli et al. (2021).</p>
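The two preprocessing steps from Section 3.4 (see also footnote 2) can be sketched as follows; the paper does not give the exact patterns, so the regex for the pseudo-random user identifiers is an assumption:

```python
import re

def preprocess(tweet: str) -> str:
    """Replace pseudo-random user identifiers with '@USER' and
    strip the '#' symbol from hashtags (patterns are assumptions)."""
    # '@' followed by digits, e.g. '@12020569' -> '@USER'
    tweet = re.sub(r"@\d+", "@USER", tweet)
    # drop the hash symbol but keep the hashtag text
    tweet = tweet.replace("#", "")
    return tweet

print(preprocess("@12020569 che vergogna #politica"))
# -> '@USER che vergogna politica'
```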
      <sec id="sec-3-4">
        <title>4.2. Task A</title>
        <p>(Table 2: precision, recall, and F1 per class for the DummyClassifier and LogisticRegression baselines.)</p>
        <p>Our models achieve competitive results in both Task A's sub-tasks5, as shown in Table 3 (Textual) and Table 4 (Contextual) for each class. Starting from the former, both models perform above both baselines. However, there seems to be a substantial difference between UmBERTo's (Run 1) and BERT-ita's (Run 2) performance: while the first reaches an F1 average of 0.89, the second reaches 0.86, a difference between the scores of over .03. Indeed, even if the second run's results are close to the LogisticRegression baseline, the model's predictions (i.e., precision and recall) are more balanced across the two classes. Thus, the F1 for the HS class is higher for BERT-ita compared to the baseline.</p>
        <p>The reason for the discrepancy between the two
models’ performance is likely related to the size of the
pretraining data: UmBERTo was trained on over 70GB
(against the 13GB of BERT-ita). As such, the model likely
has more sub-embeddings and sentence-embeddings
available, which in turn allows for better results.</p>
        <p>For the Contextual sub-task (Table 4), we added a set of extra features (i.e., 'anonymized description', 'retweet count', 'favorite count', 'is reply', 'is retweet', 'is quote', 'statuses count', 'followers count', and 'friends count') to the output labels through GradientBoostingClassifier. Both models once again reach competitive results. While the first run's results are not affected by the inclusion of contextual features, BERT-ita (Run 2) significantly benefits from their addition. The model performs on the same level as UmBERTo, with an F1 of 0.902 for ¬HS and 0.892 for HS. The inclusion of contextual information during the training stage likely enables BERT-ita to capture more diverse linguistic patterns and generalize better to the classification task.</p>
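The ensemble step can be sketched as follows: the transformer's predicted labels are concatenated with the contextual features and passed to GradientBoostingClassifier. The feature values below are synthetic stand-ins, not the task's actual data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 200

# predicted label from the fine-tuned BERT model (one column)
bert_labels = rng.integers(0, 2, size=(n, 1))
# synthetic stand-ins for contextual features such as
# 'retweet count', 'followers count', 'is reply', ...
context = rng.integers(0, 1000, size=(n, 8))

# stack the model's output label with the contextual features
X = np.hstack([bert_labels, context])
# toy gold labels that simply mirror the BERT output
y = bert_labels.ravel()

gbc = GradientBoostingClassifier(random_state=0).fit(X, y)
print(gbc.score(X, y))  # training accuracy
```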
      </sec>
    </sec>
    <sec id="sec-4">
      <title>-</title>
      <p>5. Note that the overall F1 average for each model and sub-task can be found in Table 6 below.</p>
    </sec>
    <sec id="sec-5">
      <title>4.3. Task B</title>
      <p>Though we did not formally submit results to Task B's sub-task XPoliticalHate, we met the requirements of the task by submitting results for the Contextual sub-task of Task A, for which the same test data was used. We will thus report results for both sub-tasks of Task B, referring to Table 4 for the results of Task B's sub-task XPoliticalHate.</p>
      <p>Our model performs competitively in the XPoliticalHate sub-task, which made use of in-domain test data, while our model for the sub-task XReligiousHate performed poorly in the context of out-of-domain test data. We make no further observations regarding the XPoliticalHate sub-task, as we did not take any additional steps for it.</p>
      <p>The models' performance on out-of-domain data (Table 5) is much lower than the average F1 score (0.57) but still higher than the baseline (0.52). Such low scores may relate to the imbalance between the two classes in the test data and to limitations in transfer learning. As noted by Ada et al. (2019), performance on the source task may not reflect performance on the target task. Also, the model may overfit on the data on which it was fine-tuned [30].</p>
      <p>Table 5 (out-of-domain results):
Run 1, ¬HS: Precision 0.849, Recall 0.950
Run 1, HS: Precision 0.330, Recall 0.127
Run 2, ¬HS: Precision 0.848, Recall 0.942
Run 2, HS: Precision 0.306, Recall 0.131</p>
    </sec>
    <sec id="sec-6">
      <title>-</title>
      <p>Overall, our models have consistently outperformed the baselines, demonstrating significant improvements across the board. However, it is worth noting that Table 6 reveals that some runs were below the competition's averages. In particular, Run 2 in Task A (Textual) and Task B (XReligiousHate) failed to meet our expectations, as discussed in detail in Sections 4.2 and 4.3. These underperforming results can be attributed to the previously highlighted factors.</p>
      <p>Table 6. F1 averages for our models and the average for all models submitted to the task:
Task A, Textual: Run 1 0.89759; Run 2 0.86516; Avg 0.88263
Task A, Contextual: Run 1 0.89759; Run 2 0.89687; Avg 0.88616
Task B, XPoliticalHate: Run 1 0.89759; Run 2 0.89687; Avg 0.88866
Task B, XReligiousHate: Run 1 0.54011; Run 2 0.53841; Avg 0.57439</p>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusion</title>
      <p>In this paper, we introduced our approach to detecting Italian-language HS in Twitter posts and replies. We were asked to address the issue in two different tasks with two sub-tasks each. Two models were fine-tuned and cross-validated across five folds: UmBERTo and BERT-ita. Task A comprised a Textual and a Contextual sub-task: here, UmBERTo performed competitively in both sub-tasks, reaching above the baseline and the competition average. However, the model did not benefit from the addition of contextual features. BERT-ita, on the other hand, performed above the baselines but significantly lower than the task average. In contrast to UmBERTo, BERT-ita's results improved significantly with the addition of contextual features, reaching the first model's performance.</p>
      <p>For Task B, we did not submit any results for the XPoliticalHate sub-task. As such, the results obtained for Task A (Contextual) were assumed to be valid for this sub-task, given that the test data was the same. Finally, both our models performed well below the competition average for the out-of-domain task.</p>
      <p>Future work should look at the potential benefits of including additional training data for the out-of-domain task. Also, the addition of contextual features could be tested in combination with different language models.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[1] M. K. Aljero, N. Dimililer, A novel stacked ensemble for hate speech recognition, Applied Sciences 11 (2021) 1–15. doi:10.3390/app112411684.</p>
      <p>[2] F. Del Vigna, A. Cimino, F. Dell'Orletta, M. Petrocchi, M. Tesconi, Hate Me, Hate Me Not: Hate Speech Detection on Facebook, ITASEC (2017) 86–95.</p>
      <p>[3] A. A. Siegel, Online Hate Speech, Cambridge University Press, 2020. doi:10.1017/9781108890960.</p>
      <p>[4] K. P. De Maiti, D. Fišer, N. Ljubešić, Nonstandard linguistic features of Slovene socially unacceptable discourse on Facebook, Znanstvena založba Filozofske fakultete (2020).</p>
      <p>[5] M. Sanguinetti, F. Poletto, C. Bosco, V. Patti, M. Stranisci, An Italian Twitter Corpus of Hate Speech against Immigrants, Language Resources and Evaluation (2018) 1–8.</p>
      <p>[6] M. Sanguinetti, G. Comandini, E. D. Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, I. Russo, HaSpeeDe 2 @ EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020) 93–101. doi:10.4000/books.aaccademia.6897.</p>
      <p>[7] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval), Proceedings of the 13th International Workshop on Semantic Evaluation (2019). doi:10.18653/v1/s19-2010.</p>
      <p>[8] I. Markov, W. Daelemans, Improving Cross-Domain Hate Speech Detection by Reducing the False Positive Rate, Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (2021). doi:10.18653/v1/2021.nlp4if-1.3.</p>
      <p>[9] W. Yin, A. Zubiaga, Towards generalisable hate speech detection: a review on obstacles and solutions, PeerJ Computer Science 7 (2021) e598. doi:10.7717/peerj-cs.598.</p>
      <p>[10] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics, Online, 2021, pp. 17–25. doi:10.18653/v1/2021.woah-1.3.</p>
      <p>[11] T. Caselli, V. Patti, N. Novielli, P. Rosso, Evalita 2018: Overview on the 6th evaluation campaign of natural language processing and speech tools for italian, EVALITA Evaluation of NLP and Speech Tools for Italian (2018) 3–8. doi:10.4000/books.aaccademia.4437.</p>
      <p>[12] R. Kumar, B. Lahiri, A. Ojha, Aggressive and offensive language identification in hindi, bangla, and english: A comparative study, SN Computer Science 2 (2021). doi:10.1007/s42979-020-00414-6.</p>
      <p>[13] C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, M. Tesconi, Overview of the evalita 2018 hate speech detection task, EVALITA Evaluation of NLP and Speech Tools for Italian (2018) 67–74. doi:10.4000/books.aaccademia.4503.</p>
      <p>[14] A. Arango, J. Pérez, B. Poblete, Cross-lingual hate speech detection based on multilingual domain-specific word embeddings, CoRR abs/2104.14728 (2021). arXiv:2104.14728.</p>
      <p>[15] P. Fortuna, S. Nunes, A Survey on Automatic Detection of Hate Speech in Text, ACM Computing Surveys 51 (2019) 1–30. doi:10.1145/3232676.</p>
      <p>[16] C. Corazza, S. Menini, E. Cabrio, S. T. S. Villata, Cross-Platform Evaluation for Italian Hate Speech Detection, Le Centre pour la Communication Scientifique Directe - HAL - Université de Nantes (2019).</p>
      <p>[17] I. Markov, I. Gevers, W. Daelemans, An Ensemble Approach for Dutch Cross-Domain Hate Speech Detection, Natural Language Processing and Information Systems (2022) 3–15. doi:10.1007/978-3-031-08473-7_1.</p>
      <p>[18] D. Njagi, Z. Zuping, D. Hanyurwimfura, J. Long, A lexicon-based approach for hate speech detection, International Journal of Multimedia and Ubiquitous Engineering 10 (2015) 215–230. doi:10.14257/ijmue.2015.10.4.21.</p>
      <p>[19] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, Proceedings of the Eleventh International Conference on Web and Social Media (2017) 512–521. doi:10.5555/3290605.3300749.</p>
      <p>[20] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[21] F. Celli, M. Lai, A. Duzha, C. Bosco, V. Patti, Policycorpus XL: an italian corpus for the detection of hate speech against politics, in: E. Fersini, M. Passarotti, V. Patti (Eds.), Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021, Milan, Italy, January 26-28, 2022, volume 3033 of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-3033/paper38.pdf.</p>
      <p>[22] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Addressing religious hate online: from taxonomy creation to automated detection, PeerJ Comput. Sci. 8 (2022) e1128. doi:10.7717/peerj-cs.1128.</p>
      <p>[23] Twitter, Documentation, Twitter Developer Documentation, 2023. URL: https://developer.twitter.com/en/docs. Accessed: 13th June 2023.</p>
      <p>[24] S. J. Pan, Q. Yang, A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 1345–1359. doi:10.1109/tkde.2009.191.</p>
      <p>[25] L. Eric, R. Saini, G. Kovács, K. Murphy, TheNorth @ HaSpeeDe 2: BERT-based Language Model Fine-tuning for Italian Hate Speech Detection, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020) 142–147. doi:10.4000/books.aaccademia.6989.</p>
      <p>[26] F. Tamburini, How "BERTology" Changed the State-of-the-Art also for Italian NLP, Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020 (2020) 415–421. doi:10.4000/books.aaccademia.8920.</p>
      <p>[27] D. Nozza, F. Bianchi, G. Attanasio, HATE-ITA: Hate Speech Detection in Italian Social Media Text, Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH) (2022). doi:10.18653/v1/2022.woah-1.24.</p>
      <p>[28] F. He, T. Liu, D. Tao, Control batch size and learning rate to generalize well: Theoretical and empirical evidence, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/dc6a70712a252123c40d2adba6a11d84-Paper.pdf.</p>
      <p>[29] S. E. Ada, E. Ugur, H. L. Akin, Generalization in transfer learning, CoRR abs/1909.01331 (2019). arXiv:1909.01331.</p>
      <p>[30] L. Shao, F. Zhu, X. Li, Transfer learning for visual categorization: A survey, IEEE Transactions on Neural Networks and Learning Systems 26 (2015) 1019–1034. doi:10.1109/TNNLS.2014.2330900.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>