<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Impact of Hate Speech Synthetic Data on Model Fairness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Camilla Casula</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Tonelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Although attention has been devoted to the issue of online hate speech, some phenomena, such as ableism or ageism, are scarcely represented by existing datasets and case studies. This can lead to hate speech detection systems that do not perform well on underrepresented identity groups. Given the unprecedented capabilities of LLMs in producing high-quality data, we investigate the possibility of augmenting existing data with generative language models, reducing target imbalance. We experiment with augmenting 1,000 posts from the Measuring Hate Speech corpus, an English dataset annotated with target identity information, adding around 30,000 synthetic examples using both simple data augmentation methods and different types of generative models, comparing autoregressive and sequence-to-sequence approaches. We focus our evaluation on the performance of models on different identity groups, finding that performance can differ greatly for different targets and that "simpler" data augmentation approaches can improve classification better than state-of-the-art language models. Warning: this paper contains examples that may be offensive or upsetting.</p>
      </abstract>
      <kwd-group>
<kwd>hate speech detection</kwd>
        <kwd>synthetic data</kwd>
        <kwd>model fairness</kwd>
        <kwd>hate speech target</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Generic hate speech detection models can nowadays achieve high performance on benchmark datasets, especially for high-resource languages [<xref ref-type="bibr" rid="ref1">1</xref>]. However, these models can still present a number of issues and weaknesses. In particular, the creation and maintenance of corpora for this task can be problematic due to the relative scarcity of hateful data online [<xref ref-type="bibr" rid="ref2">2</xref>], the negative psychological impact on annotators [<xref ref-type="bibr" rid="ref3">3</xref>], dataset decay and therefore reproducibility of results [<xref ref-type="bibr" rid="ref4">4</xref>], and more.</p>
      <p>Hate speech detection models have also been found to often have a tendency to over-rely on specific identity terms, in particular minority group mentions and other identity-related terms [<xref ref-type="bibr" rid="ref5">5, 6, 7</xref>]. Another issue with existing datasets and systems for this task is related to the representation of identity groups that are targets of hate, which is rather unbalanced. For example, misogyny has been covered in several datasets [8, 9], while other phenomena have received much less attention, such as religious hate [10] or hate against LGBTQIA+ people [11, 12, 13]. Furthermore, phenomena such as ageism and ableism have only been marginally addressed, as shown in the survey by Yu et al. [14]. This disparity in turn affects system fairness, because offenses against less-represented targets will be classified with a lower accuracy, further impacting communities that are already marginalized [15]. By fairness, in this work we mean group fairness, which implies independence between model classification outputs and sensitive attributes [16].</p>
      <p>A potential solution that has been proposed for many of the issues with hate speech detection data is the creation of synthetic data [17]. Indeed, recent research has shown it to be a promising solution [18, 19, 20, 21], albeit with mixed results [22, 23]. However, no in-depth analysis of the effects of data augmentation (DA) for less represented hate speech targets has been carried out, while it could be beneficial not only to make systems more accurate and robust, but also fairer, with comparable performance on hate speech targeting different demographic groups [16]. Another aspect we investigate in this work is a comparison between recent generative language models and more traditional approaches to data augmentation with regards to hate speech detection, since increasing the amount of training data with synthetic examples has been successfully exploited well before the advent of generative large language models, and can lead to improvements although these methods have a much lower computational cost [24].</p>
      <p>In this work, we therefore address the following research questions: (Q1) What is the impact of data augmentation on model performance for specific target identities? (Q2) Can information about identity groups in the generation process help the creation of better and more representative synthetic examples? (Q3) Can certain data augmentation setups enhance the performance of models on underrepresented targets, therefore improving their fairness by reducing differences in performance across different identity groups?</p>
      <p>We aim at answering these questions through a set of experiments in which we focus on the performance of models by target identity. In addition, we introduce two novel elements compared to previous work on generative DA: (i) we experiment with setups in which we exploit target identity information during generation, attempting to increase the relative representation of scarcely represented targets, with the aim of positively impacting model fairness, and (ii) we experiment with instruction-finetuned large language models (LLMs), which have recently been shown to be able to improve downstream task performance [25]. We also further investigate potential fairness-related weaknesses of models using the HateCheck test suite [7] combined with a manual analysis of generated examples.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. Contact: ccasula@fbk.eu (C. Casula); satonelli@fbk.eu (S. Tonelli). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
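The group-fairness notion used in this work (independence between model predictions and sensitive attributes [16]) can be illustrated with a minimal sketch; the function name is ours, purely for illustration:

```python
def prediction_rates_by_group(predictions, groups):
    """Share of positive (hateful) predictions per identity group.

    Under strict independence (group fairness), these rates would be
    identical across groups; large gaps signal unequal treatment.
    """
    totals = {}
    positives = {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + pred

    return {g: positives[g] / totals[g] for g in totals}

# Toy run: binary predictions for posts about two identity groups.
rates = prediction_rates_by_group([1, 0, 1, 1], ["age", "age", "race", "race"])
```

In practice, the paper compares per-group classification quality (F1) rather than raw prediction rates, but the underlying idea is the same: performance statistics should not depend on the identity group a post is about.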
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
<p>The field of hateful content detection has gained a large amount of traction in recent years, with increased effort from the research community in establishing common guidelines and benchmarks (e.g. Basile et al. [26], Zampieri et al. [27]) across different languages and targets of hate [28, 29, 11, 30].</p>
      <p>A potential way that has been proposed to mitigate some of the issues with hate speech datasets, such as data scarcity [<xref ref-type="bibr" rid="ref2">2</xref>] and negative psychological impact on annotators [<xref ref-type="bibr" rid="ref3">3</xref>], is data augmentation, which could also benefit the performance of hate speech detection systems. Data augmentation refers to a family of approaches aimed at increasing the diversity of training data without collecting new samples [31]. While DA is widely used to make models more robust across many machine learning applications, it has not been as frequently adopted or researched in NLP [32, 33] until recently, with LLMs that are capable of generating realistic text [34, 35].</p>
      <p>DA for the detection of hate speech has recently been explored using generative LLMs: Juuti et al. [36] use GPT-2 [37] to augment toxic language data in extremely low-resource scenarios. Similarly, Wullach et al. [18] and D'Sa et al. [19] successfully augment toxic language datasets using GPT-2. Fanton et al. [38] combine GPT-2 and human validation to create counter-narratives that cover multiple hate targets. More recently, Ocampo et al. [39] have applied data augmentation to increase the number of instances for the minority class in implicit and subtle examples of hate speech. Casula and Tonelli [22] show that generative data augmentation for hate speech detection using GPT-2 is in some cases challenged by a simple oversampling baseline, while Casula et al. [23] analyse the qualitative differences between original and paraphrased hate speech data. Finally, Hartvigsen et al. [20] use manually curated (through a human-in-the-loop process) prompts to generate implicitly hateful sequences with GPT-3 [40].</p>
      <p>To our knowledge, no dedicated analyses have been carried out on the impact data augmentation can have on the performance of models for specific targets of hate, or into the exploitation of target identity information to potentially improve fully automated data augmentation processes.</p>
      <sec id="sec-2-1">
        <title>3. Data</title>
        <p>For our experiments, we use the Measuring Hate Speech (MHS) Corpus [41, 42], a dataset consisting of social media posts in English from three social media platforms (Reddit, Twitter, and YouTube). While the corpus is meant to capture different levels of hatefulness on a scale, it also includes binary hate speech labels for benchmarking purposes, which we use in our experiments.</p>
        <p>The MHS corpus features labels regarding the binary identification of pre-specified identity groups and subgroups in texts. Importantly, this annotation is present regardless of hatefulness, resulting in target annotations even for posts containing supportive or counter-speech. In the MHS dataset (available at https://huggingface.co/datasets/ucberkeley-dlab/measuring-hate-speech) we find annotations for seven target identity groups: race, religion, origin, gender, sexuality, age, and disability. Their distribution in the data can be seen in Figure 1, which shows how the most widely studied targets of hate speech, race and gender, are also the most widely represented in the MHS corpus.</p>
        <p>Given that the MHS corpus uses disaggregated annotations, we aggregate them so that each example has a unique label and set of targets. First, we consider each example to be about or targeting all the identity groups identified by at least half of the annotators who annotated it. Since the hatespeech label in the dataset can assume three values (0: non hateful, 1: unclear, 2: hateful), we binarize these by averaging all the annotations for a given post, mapping it to hateful if the average score is higher than 1 and to non hateful if it is lower. (While we are aware this does not exploit the most novel and interesting features of the MHS dataset, the exploration of annotator (dis)agreement with regards to data augmentation is beyond the scope of this work, and is left for future research.) After this process, we are left with 35,243 annotated posts, of which 9,046 are annotated as containing hate speech.</p>
      </sec>
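The aggregation scheme described above can be sketched as follows; this is a minimal illustration assuming each post comes with a list of per-annotator records, and the helper names are ours, not part of the MHS release (an average of exactly 1 is left undecided here, since the text only specifies the higher/lower cases):

```python
from collections import Counter

def aggregate_targets(annotations):
    """Keep the identity groups marked by at least half of a post's annotators."""
    n = len(annotations)
    counts = Counter(t for ann in annotations for t in ann["targets"])
    return sorted(t for t, c in counts.items() if 2 * c >= n)

def binarize_label(annotations):
    """Average the 0/1/2 hatespeech scores: above 1 is hateful,
    below 1 is non hateful; exactly 1 is left undecided here."""
    avg = sum(ann["hatespeech"] for ann in annotations) / len(annotations)
    if avg > 1:
        return "hateful"
    if 1 > avg:
        return "non hateful"
    return None

# Toy post with three disaggregated annotations.
post = [
    {"hatespeech": 2, "targets": ["race", "gender"]},
    {"hatespeech": 2, "targets": ["race"]},
    {"hatespeech": 0, "targets": ["race"]},
]
```

For this toy post, only race reaches the majority threshold (3 of 3 annotators), and the average score of 4/3 maps the post to the hateful class.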
    </sec>
<sec id="sec-3">
      <title>4. Methodology</title>
      <p>For our experiments, we compare different generation strategies to train hate speech detection models of different sizes, aiming at assessing the impact of data augmentation based on language models on specific target identities. In order to do this, we evaluate both decoder-only and encoder-decoder models, experimenting also with their instruction-tuned counterparts. Additionally, we experiment with the inclusion of target identity information in the prompts, with the assumption that this information might lead to more varied and representative generated texts. We then use two different methods of exploiting existing information and data to generate new sequences: finetuning and few-shot prompting.</p>
      <sec id="sec-3-1">
        <title>4.1. Generative Models</title>
        <p>While most of the work on generation-based data augmentation for this task focuses on decoder-only Transformer models [22], other works have shown encoder-decoder Transformers to be potentially effective as well [43]. Since no work has been carried out on comparing decoder-only with encoder-decoder models for this type of data augmentation, we experiment with both. Then, based on work showing how instruction-tuning can improve generalization to unseen tasks [25, 44], we aim at experimenting also with instruction-finetuned models.</p>
        <p>To favor reproducibility, we choose to only use openly available models for our experiments. We employ Llama 3.1 8B in its base and Instruct versions [45], OPT in its base and IML (instruction-tuned) versions [46], and T5 in its base and FLAN (instruction-tuned) versions [47, 44]. We use the 1.3B parameter version of OPT and OPT-IML and the Large version of T5 and Flan-T5 (770M), aiming at capturing in our analyses the effects of this kind of methodology with different model sizes.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4.2. Target Identity Information</title>
      <p>In addition to performing DA with different types of models and techniques, we investigate for the first time the possibility of including target identity information both when finetuning models and when prompting them, with the hypothesis that the inclusion of this kind of information might help in generating more varied data with regards to identity group mentions for both hateful and non-hateful messages. By generating target-specific examples also for the non-hateful class, we ideally aim at implicitly contrasting identity term bias. In order to do this, we encode target identity information into the prompts given to the models in various ways.</p>
      <sec id="sec-4-1">
        <title>4.3. Finetuning vs Few-Shot Prompting</title>
        <p>A large number of works on data augmentation based on generative models rely on finetuning a model on a small set of gold data, and then generating new data with the finetuned model, encoding the label information within the text sequences in some form (e.g. Anaby-Tavor et al. [34], Kumar et al. [35]). Other works use few-shot demonstration-based prompting, in which the pretrained model is prompted with one or more sequences similar to what the model is expected to generate, with no finetuning (e.g. Hartvigsen et al. [20], Azam et al. [43], Ashida and Komachi [48]). We experiment with both strategies.</p>
        <p>Finetuning (FT). For finetuning, we follow an approach similar to that of Anaby-Tavor et al. [34], in which a generative LLM is finetuned on annotated sequences that are concatenated with labels. At generation time, the desired label information is fed into the model, and the model is expected to generate a sequence belonging to the specified class. We discuss the details of the formatting of the label information in Section 4.4. This method has the upside of theoretically being more likely to generate examples that are closer to the original distribution of the data to be augmented. However, this can also be a downside, if the desired effect is increasing the variety of the data. In addition, finetuning is more computationally expensive than few-shot prompting. For models finetuned with target identity information, given that each sequence can be associated with more than one target (in cases of intersectional hate speech, for instance), a different label-encoding sequence will be used to include all target identities represented in that post. An example of prompt to produce a post about gender that is hateful is Write a hateful social media post about gender.</p>
        <p>Few-shot prompting (FS). Following the large amount of works focusing on few-shot demonstration-based instructions, especially with instruction-finetuned models [49, 44], we also experiment with demonstration-based prompting, in which the models are shown 3 examples belonging to the desired label (and target identity, if available), and then asked to produce a new one. With models exploiting target identity information for few-shot prompting, we associate the desired label and target information with the demonstrations.</p>
        <p>We aim at using the same type of prompting layout across experiments. We choose to use prompting sequences in natural language, given that they have been found to lead to generally more realistic generated examples for this purpose [22]. In order to find prompts in natural language that could be leveraged by our models, we consulted the FLAN corpus [25], which is part of the finetuning data of both FLAN-T5 and OPT-IML. Among the instruction templates, we find one of the CommonGen templates [50] to fit with our aims: 'Write a sentence about the following things: [concepts], [target]'. We reformulate it to obtain a prompting sequence that reflects our application, and can be exploited by instruction-finetuned models: Write a [∅/hateful] social media post [∅/about t], where t is a target identity category.</p>
      </sec>
    </sec>
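The prompt layouts described above can be sketched as follows; this is a minimal illustration of the reformulated template Write a [∅/hateful] social media post [∅/about t], with helper names of our own (the exact formatting used in the experiments may differ):

```python
def build_prompt(label, target=None):
    """Instruction in the style of the reformulated CommonGen template."""
    parts = ["Write a"]
    if label == "hateful":
        parts.append("hateful")
    parts.append("social media post")
    if target is not None:
        parts.append("about " + target)
    return " ".join(parts)

def few_shot_prompt(label, demos, target=None):
    """Demonstration-based variant: show 3 examples of the desired label
    (and target, if available), then ask for a new one."""
    lines = [build_prompt(label, target) + ". Examples:"]
    lines.extend("- " + d for d in demos[:3])
    lines.append("Write another one:")
    return "\n".join(lines)
```

The same builder covers both setups: for finetuning the instruction is prepended to each gold sequence as the label-encoding prefix, while for few-shot prompting it heads a block of three demonstrations.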
<sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>Setup</title>
        <p>For all experiments, we simulate a setup in which we have a small amount of gold data available prior to augmentation.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Baselines</title>
        <p>We implement three baselines using DeBERTa: (i) the classifier finetuned on the starting 1k gold examples; (ii) the same classifier finetuned on an oversampled version of the training data (repeating the initial 1k sequences until we get to 31k, the size of the augmented setups), which has been found effective even in cross-dataset scenarios [22]; and (iii) as a stronger baseline, we also compare all of our models with models trained on data augmented using Easy Data Augmentation (EDA) [52]. EDA consists of four operations: synonym replacement, random insertion, random swap, and random deletion of tokens. Similarly to our other setups, we produce 30k new sequences with EDA, of which 7,500 with each operation, on the initial 1,000 examples in each fold. We then also experiment with the mixture of EDA and generative DA, in which instead of augmenting the initial gold data with 30k synthetic sequences obtained with EDA or generative DA, we randomly select 15k examples of each and concatenate them.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>In this section we report the results of our experiments, averaged across 5 data folds using different random seeds. The performance of our baselines and models trained on generation-augmented data, in terms of macro-averaged F1 score and hateful-class F1 (h-F1), both globally and by target identity group, is reported in Table 1. All models are tested on a held-out portion of the gold data from the MHS corpus.</p>
      <p>Considering simply the no augmentation baseline, it is clear that performance can vary greatly across target groups, with up to 27% hate-F1 differences between them. In particular, the model appears to struggle with posts about origin (Or), religion (Re), and age (Ag), while, although underrepresented compared to other target groups, posts about disability (Di) tend to be classified more accurately on average. This suggests that performance might also be influenced by factors other than the representation of targets in the dataset, such as how broad a target category is or how much variation there is within it. For instance, origin can include any type of discrimination based on geographical origin, potentially making it harder to generalize for, and religion as a category encompasses any type of religious discourse, in spite of each religion being targeted through specific offense types [10]. This makes classification challenging, especially for systems that rely primarily on lexical features.</p>
      <p>Most of the models trained on generation-augmented data outperform the no augmentation baseline across targets, with different improvements based on target identity group (origin, religion, and age in particular). Strikingly, however, EDA performs better than all generation-based DA configurations, regardless of prompting type or access to target information, for all targets but age. We hypothesize EDA is effective because small perturbations can make models more robust, especially with regard to the hateful class, while generative models do increase performance, but they are also more likely to inject noise.</p>
      <p>The impact of finetuning vs. few-shot prompting seems model-dependent, with differences across models also regarding the impact of target information. Interestingly, the amount of synthetic examples labeled as hateful that pass filtering does not appear to be linked with better performances of models trained on synthetic data.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Qualitative Analysis</title>
      <p>In this section, we look into the synthetically generated texts and the models trained on them from a qualitative point of view. First we carry out a manual annotation on the generated texts. Then, we turn to the HateCheck test suite [7], which includes examples aimed at exploring the weaknesses of hate speech models, especially their out-of-distribution generalization, again focusing on performance by target. HateCheck targets are in some cases more specific than those present in our dataset, thus providing a complementary view on our models' performance.</p>
      <sec id="sec-7-2">
        <title>7.1. Manual Annotation</title>
        <p>A total of 1,120 generated texts filtered with DeBERTa were annotated by two annotators with a background in linguistics and experience in hate speech research. For each combination of finetuning/prompting/target presence for each model, they annotated 70 examples, evenly distributed across labels and, where available, targets. The examples were annotated according to label correctness, target category correctness (where available), and realism. For the examples generated without access to target information, the target dimension was not annotated.</p>
        <p>Consider for example the following sentence, generated giving 'age' as target information: 'F*ckin white men are trashy like a muthaf*cker'. In this case, Label would be 'hateful', Realism would be 'Yes', but Target would be 'No', because the target identity category of the generated example is 'race' and not 'age'.</p>
        <p>Inter-annotator agreement was calculated using Krippendorff's alpha on 10% of the manually analyzed data (112 examples). The annotators showed moderate agreement with regards to label correctness (α = 0.76), while the scores were higher for category correctness (α = 0.83) and realism (α = 0.82).</p>
        <p>The results of the manual analysis are reported in Table 2. In most cases, the addition of target information results in more realistic texts and, in general, more accurate label assignment. However, this is not directly associated with improved model performance from augmented data. In addition, the rate of realistic texts and the accuracy of the identity categories are still somewhat low compared to the correctness of label assignment, showing that the generative models we tested might have difficulties dealing with more than one type of constraint/instruction. Indeed, while few-shot (FS) approaches sometimes lead to more realistic generated sequences, this often entails lower label or category correctness and vice versa.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption><p>Generated texts labeled as correct by human annotators in terms of labels, target categories, and realism. N/A refers to cases in which all of the generated texts were nonsensical (0% realistic), with impossible assignment of labels or categories.</p></caption>
          <table>
            <thead>
              <tr><th>Model</th><th></th><th>Tar</th><th>Label</th><th>Target</th><th>Realism</th></tr>
            </thead>
            <tbody>
              <tr><td>Llama 3.1 8B</td><td>FT</td><td>Y</td><td>89%</td><td>72%</td><td>88%</td></tr>
              <tr><td></td><td></td><td>N</td><td>78%</td><td>/</td><td>69%</td></tr>
              <tr><td></td><td>FS</td><td>Y</td><td>93%</td><td>53%</td><td>86%</td></tr>
              <tr><td></td><td></td><td>N</td><td>90%</td><td>/</td><td>84%</td></tr>
              <tr><td>Llama 3.1 8B Inst.</td><td>FT</td><td>Y</td><td>87%</td><td>66%</td><td>79%</td></tr>
              <tr><td></td><td></td><td>N</td><td>87%</td><td>/</td><td>73%</td></tr>
              <tr><td></td><td>FS</td><td>Y</td><td>89%</td><td>61%</td><td>81%</td></tr>
              <tr><td></td><td></td><td>N</td><td>83%</td><td>/</td><td>79%</td></tr>
              <tr><td>OPT</td><td>FT</td><td>Y</td><td>93%</td><td>63%</td><td>66%</td></tr>
              <tr><td></td><td></td><td>N</td><td>N/A</td><td>/</td><td>0%</td></tr>
              <tr><td></td><td>FS</td><td>Y</td><td>90%</td><td>39%</td><td>83%</td></tr>
              <tr><td></td><td></td><td>N</td><td>81%</td><td>/</td><td>70%</td></tr>
              <tr><td>OPT-IML</td><td>FT</td><td>Y</td><td>96%</td><td>53%</td><td>66%</td></tr>
              <tr><td></td><td></td><td>N</td><td>N/A</td><td>/</td><td>0%</td></tr>
              <tr><td></td><td>FS</td><td>Y</td><td>90%</td><td>57%</td><td>79%</td></tr>
              <tr><td></td><td></td><td>N</td><td>81%</td><td>/</td><td>73%</td></tr>
              <tr><td>T5</td><td>FT</td><td>Y</td><td>83%</td><td>59%</td><td>80%</td></tr>
              <tr><td></td><td></td><td>N</td><td>74%</td><td>/</td><td>30%</td></tr>
              <tr><td></td><td>FS</td><td>Y</td><td>N/A</td><td>N/A</td><td>0%</td></tr>
              <tr><td></td><td></td><td>N</td><td>N/A</td><td>/</td><td>0%</td></tr>
              <tr><td>Flan-T5</td><td>FT</td><td>Y</td><td></td><td></td><td></td></tr>
              <tr><td></td><td></td><td>N</td><td></td><td></td><td></td></tr>
              <tr><td></td><td>FS</td><td>Y</td><td></td><td></td><td></td></tr>
              <tr><td></td><td></td><td>N</td><td></td><td></td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-7-3">
        <title>7.2. HateCheck</title>
        <p>We perform a second qualitative analysis using the HateCheck test suite [7], a collection of functional testing examples that enable targeted diagnostic insights of hate speech detection models. All HateCheck test cases mention a specific target identity, to allow the exploration of unintended biases against different target groups. However, the target groups used in HateCheck do not fully overlap with the target identity groups in the MHS corpus (Figure 1).</p>
        <p>We report in Table 3 the results obtained by the models trained on augmented data on HateCheck in terms of hate-class F1 score, divided by target identity group according to the identity categories used in HateCheck. Overall, we can notice significant improvements over the no augmentation baseline across different setups and target identities, although these improvements are variable, and again overshadowed by Easy Data Augmentation, resulting in more evenly distributed scores across all targets.</p>
      </sec>
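The by-target evaluation underlying Table 1 and Table 3 can be sketched as follows; this is a plain-Python illustration of hateful-class F1 computed separately per identity group, with helper names of our own:

```python
def f1_hateful(gold, pred):
    """F1 score of the positive (hateful) class from parallel 0/1 lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_by_target(examples, predictions):
    """Hateful-class F1 per target identity group; an example annotated
    with several targets counts toward each of them."""
    groups = {}
    for ex, pred in zip(examples, predictions):
        for t in ex["targets"]:
            gold_list, pred_list = groups.setdefault(t, ([], []))
            gold_list.append(ex["label"])
            pred_list.append(pred)
    return {t: f1_hateful(g, p) for t, (g, p) in groups.items()}

# Toy evaluation: three posts, the last one misclassified as non-hateful.
examples = [
    {"label": 1, "targets": ["age"]},
    {"label": 0, "targets": ["age"]},
    {"label": 1, "targets": ["race", "age"]},
]
scores = f1_by_target(examples, [1, 0, 0])
```

Comparing the spread of these per-group scores (rather than only the global F1) is what reveals the fairness gaps discussed in Section 6.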
<sec id="sec-7-1">
        <p>The target identities in HateCheck are: women (which would fall under gender in MHS), trans people (gender in MHS), gay people (sexuality in MHS), black people (race in MHS), disabled people (disability in MHS), Muslims (religion in MHS), and immigrants (origin in MHS). The age category is present in the MHS corpus but entirely missing in HateCheck.</p>
        <p>Interestingly, however, even overlapping targets between the MHS dataset and HateCheck show different improvements across generative DA setups. For instance, the HateCheck target people with disabilities apparently benefits from the inclusion of target information more than the Disability category in the MHS corpus, suggesting that the inclusion of identity group information could affect performances differently in and out of domain.</p>
        <p>Furthermore, HateCheck test instances are manually created starting from templates that include specific mentions of identity-group terms, which could lead to sentences that are very different from 'real' examples that could be found on social media. For instance, one of the hateful templates used in HateCheck is 'I hate [IDENTITY]', in which [IDENTITY] is substituted with women, disabled people, and so on for all of the targets reported above. This can result in somewhat stereotypical examples, while hateful posts in the MHS corpus often contain slurs and offensive terms instead of more neutral names to refer to people belonging to a certain identity group.</p>
        <sec id="sec-8">
          <title>8. Conclusions</title>
          <p>We have investigated the impact of data augmentation with generative models on specific targets of hate, experimenting with instruction-finetuned models and the addition of target information when generating new sequences. Overall, it appears that DA methods have different types of impact on different targets, but they can improve performance even for scarcely represented identity categories (Q1). However, we observed that generative data augmentation alone is not as strong as simpler methods such as EDA.</p>
          <p>Through a qualitative analysis, we also emphasized the fact that including target information when generating synthetic examples can facilitate the creation of examples that are more realistic and exhibit more correct label assignments (Q2), although further work could investigate why these characteristics do not directly correlate with downstream task performance.</p>
          <p>Overall, our analysis shows that there is potential in data augmentation with regards to model group fairness (Q3), implying independence between model classification output and sensitive attributes [16]. However, although potentially useful, this type of DA can still lead to unpredictable results, and it is not guaranteed to always improve the performance of models across all identity groups with regards to hate speech. We plan to further explore this research direction in the future, considering also intersectionality and more specific targets (e.g. groups such as trans women rather than the gender category). In addition, we worked on English data because of the availability of the Measuring Hate Speech corpus, which was large enough to perform our DA experiments and presented the kind of fine-grained target annotation required in our study. However, we are aware that DA would benefit more classification with lower-resourced languages, so we plan to work on different languages in the future.</p>
          <p>In summary, we show that data augmentation with generative language models can be beneficial, even when using only openly available models. However, given their high computational costs, alternatives like EDA could be considered if limited resources are available, because they can still yield performance improvements compared to a low-resource setting. Again, there seems to be no one-fits-all solution or approach to generation or data augmentation in this kind of scenario.</p>
          <p>We acknowledge that data augmentation techniques may be used also for malicious purposes, for example to create thousands of hateful examples with the goal of hurting the same groups that we want to support. Because of this, we provide all the necessary details for the reproduction of our results, but we do not plan to openly release the code or to upload the generated data produced by our experiments, especially in order to avoid it being crawled and ending up in the training data of LLMs in the future. We are, however, open to sharing the data with other researchers who might be interested.</p>
        </sec>
        <sec id="sec-9">
          <title>Acknowledgments</title>
          <p>This work was funded by the European Union's CERV fund under grant agreement No. 101143249 (HATEDEMICS).</p>
        </sec>
        <sec id="sec-10">
          <title>References</title>
          <p>… in Text Classification, in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, ACM, New Orleans LA USA, 2018, pp. 67–73. URL: https://dl.acm.org/doi/10.1145/3278721.3278729. doi:10.1145/3278721.3278729.</p>
          <p>… Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
        </sec>
        <p>3278729. doi:10.1145/3278721.3278729. [14] Z. Yu, I. Sen, D. Assenmacher, M. Samory, L.
Fröh[6] B. Kennedy, X. Jin, A. Mostafazadeh Davani, M. De- ling, C. Dahn, D. Nozza, C. Wagner, The unseen
hghani, X. Ren, Contextualizing hate speech clas- targets of hate: A systematic review of hateful
sifiers with post-hoc explanation, in: Proceed- communication datasets, Social Science Computer
ings of the 58th Annual Meeting of the Associa- Review (2024) 08944393241258771. doi:10.1177/
tion for Computational Linguistics, Association for 08944393241258771.</p>
        <p>Computational Linguistics, Online, 2020, pp. 5435– [15] Z. Talat, J. Bingel, I. Augenstein, Disembodied
ma5442. URL: https://aclanthology.org/2020.acl-main. chine learning: On the illusion of objectivity in nlp,
483. doi:10.18653/v1/2020.acl-main.483. ArXiv abs/2101.11974 (2021).
[7] P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, [16] J. Anthis, K. Lum, M. Ekstrand, A. Feller,
H. Margetts, J. Pierrehumbert, HateCheck: Func- A. D’Amour, C. Tan, The Impossibility of Fair LLMs,
tional tests for hate speech detection models, in: 2024. URL: http://arxiv.org/abs/2406.03198. doi:10.
Proceedings of the 59th Annual Meeting of the As- 48550/arXiv.2406.03198, arXiv:2406.03198 [cs,
sociation for Computational Linguistics and the stat].
11th International Joint Conference on Natural Lan- [17] B. Vidgen, L. Derczynski, Directions in abusive
language Processing (Volume 1: Long Papers), Associa- guage training data, a systematic review: Garbage
tion for Computational Linguistics, Online, 2021, pp. in, garbage out, PLOS ONE 15 (2020) e0243300.
41–58. URL: https://aclanthology.org/2021.acl-long. doi:10.1371/journal.pone.0243300.
4. doi:10.18653/v1/2021.acl-long.4. [18] T. Wullach, A. Adler, E. Minkov, Fight fire
[8] S. Bhattacharya, S. Singh, R. Kumar, A. Bansal, with fire: Fine-tuning hate detectors using large
A. Bhagat, Y. Dawer, B. Lahiri, A. K. Ojha, De- samples of generated hate speech, in:
Findveloping a multilingual annotated corpus of misog- ings of the Association for Computational
Linyny and aggression, in: Proceedings of the Second guistics: EMNLP 2021, Association for
ComputaWorkshop on Trolling, Aggression and Cyberbul- tional Linguistics, Punta Cana, Dominican Republic,
lying, European Language Resources Association 2021, pp. 4699–4705. URL: https://aclanthology.org/
(ELRA), Marseille, France, 2020, pp. 158–168. URL: 2021.findings-emnlp.402. doi: 10.18653/v1/2021.
https://aclanthology.org/2020.trac-1.25. findings-emnlp.402.
[9] E. Guest, B. Vidgen, A. Mittos, N. Sastry, G. Tyson, [19] A. G. D’Sa, I. Illina, D. Fohr, D. Klakow,
H. Margetts, An Expert Annotated Dataset for the D. Ruiter, Exploring Conditional Language
Detection of Online Misogyny, in: Proceedings of Model Based Data Augmentation Approaches
the 16th Conference of the European Chapter of the for Hate Speech&amp;#xa0;Classification, in: Text,
Association for Computational Linguistics: Main Speech, and Dialogue: 24th International
ConVolume, Association for Computational Linguistics, ference, TSD 2021, Olomouc, Czech Republic,
Online, 2021, pp. 1336–1350. September 6–9, 2021, Proceedings, Springer-Verlag,
[10] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Address- Berlin, Heidelberg, 2021, pp. 135–146. doi:10.1007/
ing religious hate online: from taxonomy creation 978-3-030-83527-9_12.
to automated detection, PeerJ Computer Science 8 [20] T. Hartvigsen, S. Gabriel, H. Palangi, M. Sap,
(2022) e1128. D. Ray, E. Kamar, ToxiGen: A large-scale
[11] B. R. Chakravarthi, R. Priyadharshini, R. Pon- machine-generated dataset for adversarial and
imnusamy, P. K. Kumaresan, K. Sampath, D. Then- plicit hate speech detection, in: Proceedings of
mozhi, S. Thangasamy, R. Nallathambi, J. P. McCrae, the 60th Annual Meeting of the Association for
Dataset for identification of homophobia and tran- Computational Linguistics (Volume 1: Long
Pasophobia in multilingual youtube comments, 2021. pers), Association for Computational Linguistics,
arXiv:2109.00227. Dublin, Ireland, 2022, pp. 3309–3326. URL: https://
[12] D. Locatelli, G. Damo, D. Nozza, A cross-lingual aclanthology.org/2022.acl-long.234. doi:10.18653/
study of homotransphobia on twitter, in: Proceed- v1/2022.acl-long.234.
ings of the First Workshop on Cross-Cultural Con- [21] C. Casula, E. Leonardelli, S. Tonelli, Don’t
augsiderations in NLP (C3NLP), 2023, pp. 16–24. ment, rewrite? assessing abusive language
detec[13] D. Nozza, A. T. Cignarella, G. Damo, T. Caselli, tion with synthetic data, in: L.-W. Ku, A. Martins,
V. Patti, HODI at EVALITA 2023: Overview of V. Srikumar (Eds.), Findings of the Association for
the Homotransphobia Detection in Italian Task, in: Computational Linguistics: ACL 2024, Association
for Computational Linguistics, Bangkok, Thailand, science/article/pii/S0306457322002199. doi:https:
2024, pp. 11240–11247. URL: https://aclanthology. //doi.org/10.1016/j.ipm.2022.103118.
org/2024.findings-acl.669/. doi: 10.18653/v1/2024. [29] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova,
findings-acl.669. G. Karadzhov, H. Mubarak, L. Derczynski, Z.
Pite[22] C. Casula, S. Tonelli, Generation-based data aug- nis, Ç. Çöltekin, SemEval-2020 task 12:
Mulmentation for ofensive language detection: Is it tilingual ofensive language identification in
soworth it?, in: Proceedings of the 17th Conference cial media (OfensEval 2020), in: Proceedings of
of the European Chapter of the Association for the Fourteenth Workshop on Semantic Evaluation,
Computational Linguistics, Association for Com- International Committee for Computational
Linputational Linguistics, Dubrovnik, Croatia, 2023, guistics, Barcelona (online), 2020, pp. 1425–1447.
pp. 3359–3377. URL: https://aclanthology.org/2023. URL: https://aclanthology.org/2020.semeval-1.188.
eacl-main.244. doi:10.18653/v1/2020.semeval-1.188.
[23] C. Casula, S. Vecellio Salto, A. Ramponi, S. Tonelli, [30] E. Leonardelli, C. Casula, S. Vecellio Salto, J. E. Bak,
Delving into qualitative implications of synthetic E. Muratore, A. Kołos, T. Louf, S. Tonelli,
MuLTadata for hate speech detection, in: Y. Al- Telegram: A Fine-Grained Italian and Polish Dataset
Onaizan, M. Bansal, Y.-N. Chen (Eds.), Pro- for Hate Speech and Target Detection, in:
Proceedceedings of the 2024 Conference on Empirical ings of the Eleventh Italian Conference on
CompuMethods in Natural Language Processing, As- tational Linguistics (CLiC-it 2025), 2025.
sociation for Computational Linguistics, Miami, [31] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi,
Florida, USA, 2024, pp. 19709–19726. URL: https: T. Mitamura, E. Hovy, A survey of data
aug//aclanthology.org/2024.emnlp-main.1099/. doi:10. mentation approaches for NLP, in: Findings
18653/v1/2024.emnlp-main.1099. of the Association for Computational
Linguis[24] J. Chen, D. Tam, C. Rafel, M. Bansal, D. Yang, An tics: ACL-IJCNLP 2021, Association for
CompuEmpirical Survey of Data Augmentation for Limited tational Linguistics, Online, 2021, pp. 968–988.
Data Learning in NLP, Transactions of the Associa- URL: https://aclanthology.org/2021.findings-acl.84.
tion for Computational Linguistics 11 (2023) 191– doi:10.18653/v1/2021.findings-acl.84.
211. URL: https://direct.mit.edu/tacl/article-pdf/doi/ [32] L. F. A. O. Pellicer, T. M. Ferreira, A. H. R. Costa,
10.1162/tacl_a_00542/2074871/tacl_a_00542.pdf. Data augmentation techniques in natural language
[25] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, processing, Applied Soft Computing 132 (2023)
B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned 109803. doi:10.1016/j.asoc.2022.109803.
language models are zero-shot learners, in: In- [33] M. Bayer, M.-A. Kaufhold, C. Reuter, A Survey on
ternational Conference on Learning Representa- Data Augmentation for Text Classification, ACM
tions, 2022. URL: https://openreview.net/forum?id= Computing Surveys 55 (2022) 146:1–146:39. doi:10.
gEZrGCozdqR. 1145/3544558.
[26] V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. [34] A. Anaby-Tavor, B. Carmeli, E. Goldbraich, A.
KanRangel Pardo, P. Rosso, M. Sanguinetti, SemEval- tor, G. Kour, S. Shlomov, N. Tepper, N. Zwerdling,
2019 Task 5: Multilingual Detection of Hate Speech Do Not Have Enough Data? Deep Learning to the
Against Immigrants and Women in Twitter, in: Rescue!, in: Proceedings of the AAAI Conference
Proceedings of the 13th International Workshop on Artificial Intelligence, volume 34, 2020, pp. 7383–
on Semantic Evaluation, Association for Compu- 7390. doi:10.1609/aaai.v34i05.6233.
tational Linguistics, Minneapolis, Minnesota, USA, [35] V. Kumar, A. Choudhary, E. Cho, Data
augmen2019, pp. 54–63. doi:10.18653/v1/S19-2007. tation using pre-trained transformer models, in:
[27] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, Proceedings of the 2nd Workshop on Life-long
N. Farra, R. Kumar, SemEval-2019 Task 6: Identi- Learning for Spoken Language Systems,
Associafying and Categorizing Ofensive Language in So- tion for Computational Linguistics, Suzhou, China,
cial Media (OfensEval), in: Proceedings of the 2020, pp. 18–26. URL: https://aclanthology.org/2020.
13th International Workshop on Semantic Evalu- lifelongnlp-1.3.
ation, Association for Computational Linguistics, [36] M. Juuti, T. Gröndahl, A. Flanagan, N. Asokan,
Minneapolis, Minnesota, USA, 2019, pp. 75–86. A little goes a long way: Improving toxic
landoi:10.18653/v1/S19-2010. guage classification despite data scarcity, in:
Find[28] C. Bosco, V. Patti, S. Frenda, A. T. Cignarella, M. Pa- ings of the Association for Computational
Linciello, F. D’Errico, Detecting racial stereotypes: An guistics: EMNLP 2020, Association for
CompuItalian social media corpus where psychology meets tational Linguistics, Online, 2020, pp. 2991–3009.
NLP, Information Processing and Management 60 doi:10.18653/v1/2020.findings-emnlp.269.
(2023) 103118. URL: https://www.sciencedirect.com/ [37] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever, Language Models are Unsupervised son, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao,
Multitask Learners, 2019. Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean,
[38] M. Fanton, H. Bonaldi, S. S. Tekiroğlu, M. Guerini, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei,
ScalHuman-in-the-Loop for Data Collection: A Multi- ing instruction-finetuned language models, 2022.
Target Counter Narrative Dataset to Fight Online arXiv:2210.11416.</p>
        <p>Hate Speech, in: Proceedings of the 59th Annual [45] M. A. Llama Team, The llama 3 herd of
modMeeting of the Association for Computational Lin- els, 2024. URL: https://arxiv.org/abs/2407.21783.
guistics and the 11th International Joint Conference arXiv:2407.21783.
on Natural Language Processing (Volume 1: Long [46] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen,
Papers), Association for Computational Linguis- S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T.
Mitics, Online, 2021, pp. 3226–3240. doi:10.18653/ haylov, M. Ott, S. Shleifer, K. Shuster, D. Simig,
v1/2021.acl-long.250. P. S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer,
[39] N. Ocampo, E. Sviridova, E. Cabrio, S. Villata, An Opt: Open pre-trained transformer language
modin-depth analysis of implicit and subtle hate speech els, 2022. arXiv:2205.01068.
messages, in: Proceedings of the 17th Conference [47] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
of the European Chapter of the Association for M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
Computational Linguistics, Association for Com- limits of transfer learning with a unified
text-toputational Linguistics, Dubrovnik, Croatia, 2023, text transformer, Journal of Machine Learning
Repp. 1997–2013. URL: https://aclanthology.org/2023. search 21 (2020) 1–67. URL: http://jmlr.org/papers/
eacl-main.147. v21/20-074.html.
[40] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, [48] M. Ashida, M. Komachi, Towards automatic
genJ. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, eration of messages countering online hate speech
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, and microaggressions, in: Proceedings of the Sixth
G. Krueger, T. Henighan, R. Child, A. Ramesh, Workshop on Online Abuse and Harms (WOAH),
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, Association for Computational Linguistics, Seattle,
M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, Washington (Hybrid), 2022, pp. 11–23. URL: https:
J. Clark, C. Berner, S. McCandlish, A. Radford, //aclanthology.org/2022.woah-1.2. doi:10.18653/
I. Sutskever, D. Amodei, Language Models are v1/2022.woah-1.2.</p>
        <p>Few-Shot Learners, arXiv:2005.14165 [cs] (2020). [49] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig,
arXiv:2005.14165. P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, et al.,
[41] C. J. Kennedy, G. Bacon, A. Sahn, C. von Vacano, Opt-iml: Scaling language model instruction meta
Constructing interval variables via faceted Rasch learning through the lens of generalization, 2022.
measurement and multitask deep learning: a hate arXiv:2212.12017.
speech application, 2020. URL: http://arxiv.org/ [50] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C.
Bhagavatabs/2009.10277. doi:10.48550/arXiv.2009.10277, ula, Y. Choi, X. Ren, CommonGen: A constrained
arXiv:2009.10277 [cs]. text generation challenge for generative
common[42] P. Sachdeva, R. Barreto, G. Bacon, A. Sahn, C. von sense reasoning, in: Findings of the Association
Vacano, C. Kennedy, The measuring hate speech for Computational Linguistics: EMNLP 2020,
Ascorpus: Leveraging rasch measurement theory for sociation for Computational Linguistics, Online,
data perspectivism, in: Proceedings of the 1st 2020, pp. 1823–1840. URL: https://aclanthology.org/
Workshop on Perspectivist Approaches to NLP 2020.findings-emnlp.165. doi: 10.18653/v1/2020.
@LREC2022, European Language Resources Asso- findings-emnlp.165.
ciation, Marseille, France, 2022, pp. 83–94. URL: [51] P. He, J. Gao, W. Chen, Debertav3:
Improvhttps://aclanthology.org/2022.nlperspectives-1.11. ing deberta using electra-style pre-training with
[43] U. Azam, H. Rizwan, A. Karim, Exploring data gradient-disentangled embedding sharing, 2023.
augmentation strategies for hate speech detection arXiv:2111.09543.
in Roman Urdu, in: Proceedings of the Thir- [52] J. Wei, K. Zou, EDA: Easy data augmentation
techteenth Language Resources and Evaluation Con- niques for boosting performance on text
classifiference, European Language Resources Associa- cation tasks, in: Proceedings of the 2019
Confertion, Marseille, France, 2022, pp. 4523–4531. URL: ence on Empirical Methods in Natural Language
https://aclanthology.org/2022.lrec-1.481. Processing and the 9th International Joint
Con[44] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, ference on Natural Language Processing
(EMNLPW. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, IJCNLP), Association for Computational
LinguisA. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, tics, Hong Kong, China, 2019, pp. 6382–6388. URL:
A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robin- https://aclanthology.org/D19-1670. doi:10.18653/</p>
      </sec>
      <sec id="sec-7-2">
        <title>Below are examples of the sequences and prompts used for training and prompting our models.</title>
        <sec id="sec-7-2-1">
          <title>FT-no target Write a (hateful) social media post: {text}</title>
        </sec>
        <sec>
          <title>FT-target Write a (hateful) social media post about {target}: {text}</title>
        </sec>
        <sec id="sec-7-2-2">
          <title>FS-target Write a (hateful) social media post about</title>
          <p>{target}: {text} [...]</p>
        </sec>
        <sec id="sec-7-2-3">
          <title>Write a (hateful) social media post about {target}: {text}</title>
        </sec>
        <sec id="sec-7-2-4">
          <title>FS-no target Write a (hateful) social media post: {text}</title>
          <p>[...]</p>
        </sec>
        <sec id="sec-7-2-5">
          <title>Write a (hateful) social media post: {text}</title>
          <p>The values used for ‘target’ are the identity group
names in the MHS dataset, reported in Sec. 3.</p>
        </sec>
      </sec>
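<p>As an illustration, the templates above can be instantiated with a few lines of Python; the dictionary and function names below are ours, introduced only for this sketch, and are not code from our experiments:</p>

```python
# Illustrative sketch: instantiating the prompt templates above.
# TEMPLATES and build_prompt are hypothetical names for this example.
TEMPLATES = {
    "FT-target":    "Write a (hateful) social media post about {target}: ",
    "FT-no-target": "Write a (hateful) social media post: ",
}

def build_prompt(setting: str, target: str = "") -> str:
    """Fill the selected template with an MHS identity group name."""
    # str.format ignores the unused 'target' kwarg for no-target templates.
    return TEMPLATES[setting].format(target=target)
```

<p>For example, <monospace>build_prompt("FT-target", "gender")</monospace> yields the FT-target prompt filled with the <italic>gender</italic> identity group.</p>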
    </sec>
    <sec id="sec-8">
      <title>B. Hyperparameters and Reproducibility</title>
      <sec id="sec-9-1">
        <p>For all of our experiments, we employ the HuggingFace
Python library. All the hyperparameters we use that are
not specified in this section are the default ones from
its TrainingArguments class. The classifiers we use
as baselines and for filtering are trained for 5 epochs.</p>
        <p>We finetune all generative models with batch size 16
and a learning rate of 1e-3. For generation, we set top-p = 0.9
and the min and max lengths of generated sequences to 5 and 150
tokens, respectively. Finally, we block the repetition of 4-grams.</p>
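<p>As a sketch, these decoding settings correspond to the following arguments of the HuggingFace <monospace>generate</monospace> API (the dictionary name is ours, introduced only for this example):</p>

```python
# Decoding configuration for all generative models (sketch);
# keys follow the HuggingFace generate() API.
GENERATION_KWARGS = dict(
    do_sample=True,          # sampling, so that top-p takes effect
    top_p=0.9,               # nucleus sampling threshold
    min_length=5,            # minimum sequence length, in tokens
    max_length=150,          # maximum sequence length, in tokens
    no_repeat_ngram_size=4,  # never repeat a 4-gram
)
# Usage: model.generate(**inputs, **GENERATION_KWARGS)
```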
        <p>All the classifiers trained on augmented data are
trained for 3 epochs with batch size 16 and a learning rate
of 5e-6. In this case, at the end of training, we keep the
model from the epoch with the lowest evaluation cross-entropy
loss.</p>
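<p>The corresponding HuggingFace TrainingArguments would look roughly as follows (a sketch: the output path is a placeholder, and unlisted arguments keep their defaults, as stated above):</p>

```python
from transformers import TrainingArguments

# Sketch of the classifier training setup; any argument not listed
# here keeps its HuggingFace default.
args = TrainingArguments(
    output_dir="clf-augmented",         # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-6,
    eval_strategy="epoch",              # "evaluation_strategy" in older versions
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the epoch with the
    metric_for_best_model="eval_loss",  # lowest evaluation loss
    greater_is_better=False,
)
```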
        <p>The random seeds we used for shuffling, subsampling
the gold data, and initializing both generative and
classification models are 522, 97, 709, 16, and 42. These were
chosen randomly. Finetuning all classifiers and
generative models, including baselines and models trained on
augmented data, took 70 hours, of which 55 were on an NVIDIA
V100 GPU and 15 on an NVIDIA A40. Generating all of the
sequences (a total of 8 million generated texts) took
∼400 hours.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>