<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LMU at HaSpeeDe3: Multi-Dataset Training for Cross-Domain Hate Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Hangya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Fraser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Information and Language Processing, LMU Munich</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Munich Center for Machine Learning</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe LMU Munich's hate speech detection system for participating in the cross-domain track of the HaSpeeDe3 shared task at EVALITA 2023. The task focuses on the politics and religion domains, having no in-domain training data for the latter. Our submission combines multiple training sets from various domains in a multitask prompt-training system. We experimented with both Italian and English source datasets as well as monolingual Italian and multilingual pre-trained language models. We found that the Italian out-of-domain datasets are the most influential on the performance in the test domains and that combining both monolingual and multilingual language models using an ensemble gives the best results. Our system ranked second in both domains.</p>
      </abstract>
      <kwd-group>
        <kwd>hate speech detection</kwd>
        <kwd>multitask learning</kwd>
        <kwd>prompt-training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Due to the sheer amount of social media content, manual filtering for hate speech is impossible, which makes building high-performance and reliable hate speech classifiers important. To promote research in the field, various datasets were built [1, 2] and shared tasks were organized [3, 4, 5], where the best performing systems are based on pre-trained language models (PLMs) [6, 7].</p>
      <p>The HaSpeeDe3 shared task [8] is the third iteration of the series on hate speech detection in Italian social media posts (tweets) organized at EVALITA 2023 [9], focusing on strongly polarized debates on political and religious topics. Two subtasks were organized: Task A – Political Hate Speech Detection, which on top of textual inputs allows for the use of contextual information, such as metadata of tweets and authors; and Task B – Cross-domain Hate Speech Detection, which involves only textual inputs, where the main objective is to explore cross-domain hate speech detection in the politics and religious domains by allowing the use of external datasets (open track). In contrast to the politics domain, where in-domain training data is given, in the religious domain such data is not provided. Our team participated only in Task B.</p>
      <p>Cross-domain training is a crucial problem in machine learning, aiming to build high-quality models for the target domain by leveraging labeled samples from out-of-domain sources as well [10]. For hate speech detection, [11] experimented with training classifiers using out-of-domain training examples and showed a significant performance drop on the test sets compared to in-domain training. By simply combining multiple datasets of different domains, including the target domain, they achieved only slight improvements. In a similar work, one source and one target domain were explored [12], but the authors showed mixed results, i.e., improvements on some domains but a decrease on others. Similarly, [13] applied the general domain adaptation technique of [10] and showed improvements when incorporating some out-of-domain datasets into the final model, even though the approach seemed sensitive to the chosen out-of-domain dataset. In addition, [14] showed negative performance on the target domain in German when using additional source-domain English training examples.</p>
      <p>Following previous work, we rely on transfer learning to leverage out-of-domain (external) datasets to build our classifiers for the political and religious domains. We experiment with various external datasets containing both Italian and English hate speech inputs. Additionally, in contrast to previous work which used datasets with matching label sets, we use corpora annotated with different label sets, e.g., stereotype. To avoid negative results, we combine multiple datasets in a multitask training fashion in order to build robust systems. Additionally, we train our systems in a two-step process, where we first pre-finetune our models on the external datasets, followed by fine-tuning them on the target task. As the basis of our models, we take various PLMs based on the BERT [15] and RoBERTa [16] architectures, including both Italian-only and multilingual models. Furthermore, in order to facilitate information sharing across the used datasets, we perform prompt-training, which eliminates dedicated classification heads for each dataset.</p>
      <p>Our experiments show that using only Italian external datasets is more beneficial compared to leveraging English as well. In contrast, we find that both monolingual and multilingual PLMs perform comparably well, and that they can support each other when combined using model ensembling.1</p>
      <p>The Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), September 07–08, 2023, Parma, Italy.</p>
      <p>* Corresponding author. hangyav@cis.lmu.de (V. Hangya); fraser@cis.lmu.de (A. Fraser). https://www.cis.uni-muenchen.de/~hangyav (V. Hangya); https://www.cis.uni-muenchen.de/~fraser (A. Fraser). ORCID: 0000-0002-5144-3069 (V. Hangya); 0000-0003-4891-682X (A. Fraser).</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <p>Our approach consists of two steps: we first pre-finetune a given PLM on external datasets (see Section 3.1), followed by in-domain fine-tuning in the case of the political domain, where such data is provided. Instead of classification heads, we leverage model prompting.</p>
        <p>Prompt-Training Prompt-training was shown to be effective and more reliable for various NLP tasks, including classification [17]. Instead of using classification heads on top of PLMs, which add additional parameters to the model, it relies on the masked language modeling (MLM) task. Using pattern-verbalizer pairs (PVPs), an input sentence is first transformed using the pattern, e.g., I hate you. → Is this hate speech? I hate you. [MASK], and the task is to predict the masked token. Finally, the verbalizer maps the highest-probability token, out of a set of valid tokens, to the labels of a given dataset, e.g., Yes → Hate or No → nonHate. During training all model parameters are fine-tuned using the MLM objective.</p>
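        <p>The pattern and verbalizer mechanics described above can be sketched as follows; the pattern wording and label names follow the example in the text, while the toy probability function standing in for a real PLM's [MASK] distribution is purely an illustrative assumption:</p>
        <preformat>
```python
from typing import Callable, Dict

# Verbalizer: maps valid answer tokens at the [MASK] position to dataset labels.
VERBALIZER: Dict[str, str] = {"Yes": "Hate", "No": "nonHate"}

def apply_pattern(text: str) -> str:
    """Pattern: 'I hate you.' -> 'Is this hate speech? I hate you. [MASK]'."""
    return f"Is this hate speech? {text} [MASK]"

def classify(text: str, mask_probs: Callable[[str], Dict[str, float]]) -> str:
    """Restrict the MLM's [MASK] distribution to the verbalizer's valid
    tokens, take the argmax, and map it to a dataset label."""
    probs = mask_probs(apply_pattern(text))
    best = max(VERBALIZER, key=lambda tok: probs.get(tok, 0.0))
    return VERBALIZER[best]

# Toy stand-in for a PLM's [MASK] distribution; a real system would query
# the MLM head of, e.g., AlBERTo here.
def toy_mlm(prompt: str) -> Dict[str, float]:
    body = prompt.removeprefix("Is this hate speech? ")  # look at the input only
    return {"Yes": 0.8, "No": 0.2} if "hate" in body.lower() else {"Yes": 0.1, "No": 0.9}

print(classify("I hate you.", toy_mlm))  # -> Hate
```
        </preformat>
        <p>Because each dataset only changes the PVP rather than adding a classification head, all datasets share the same underlying MLM parameters.</p>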
        <p>Ensembling To further improve the robustness of our final models, we employ model ensembling to combine the outputs of multiple models. We ensemble models along two dimensions: we combine models of the same setup trained with 3 different random seeds, and models based on different PLM architectures as defined below. We simply take the mean of the probabilities of the considered models for a given input sample.</p>
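        <p>A minimal sketch of this probability-mean ensemble; the class order and example numbers are hypothetical:</p>
        <preformat>
```python
def ensemble_probs(model_outputs):
    """Average per-class probability vectors over models.
    model_outputs: list of [p(Hate), p(nonHate)] vectors, one per model
    (e.g., 3 seeds of one setup, or different PLM architectures)."""
    n = len(model_outputs)
    return [sum(out[i] for out in model_outputs) / n
            for i in range(len(model_outputs[0]))]

# Three models disagreeing on one sample; the mean decides the label.
probs = ensemble_probs([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
label = "Hate" if probs[0] > probs[1] else "nonHate"  # label == "Hate"
```
        </preformat>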
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>Next, we list our external dataset setups followed by the introduction of the official shared task data. We define the following groups of external datasets:</p>
      </sec>
      <sec id="sec-3-2">
        <title>HaSpeeDe</title>
        <p>We leverage Italian datasets from previous HaSpeeDe iterations. More precisely, we take i) the training data containing 2 400 Facebook posts annotated with binary hate speech labels from HaSpeeDe1 [18], ii) 5 470 binary hate speech annotated Twitter posts from HaSpeeDe2 [5] and iii) the same Twitter posts annotated for binary stereotype detection.</p>
        <p>It In addition to the datasets mentioned in the HaSpeeDe set, we used further Italian abusive language related datasets. Tweets from the AMI18 misogyny detection shared task [19]: i) a 3 200 tweet binary and ii) a 1 460 tweet fine-grained (discredit, stereotype, dominance, harassment, derailing) training set, as well as iii) a 1 454 tweet binary target detection set (individual, group). Furthermore, we took binary iv) hate (3 271) and v) stereotype (441) annotated training sets from the IHSC corpus [20] containing tweets related to immigrants.</p>
        <p>Step 1 Given a set of external training corpora D = {D_i : i = 1..n}, we randomly select a single dataset D_i and a batch of samples from it in each training step. For each dataset we apply a dedicated PVP (see Section 3.2) in order to handle datasets of different label sets, and use a cross-entropy loss to perform a single model update. This way we mix the available external datasets during pre-finetuning instead of performing sequential model updates, which could lead to catastrophic forgetting. Additionally, we make sure that we exhaust all datasets in D in each epoch, i.e., the model is trained on each input sample once per epoch.</p>
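        <p>The dataset-mixing scheme described above can be sketched as follows; the dataset names and sizes are hypothetical, and in the real system each scheduled step would apply the dataset's PVP and perform one gradient update:</p>
        <preformat>
```python
import random

def mixed_epoch_schedule(datasets, batch_size, seed=0):
    """Build one epoch of (dataset_name, batch) steps: batches from all
    datasets are interleaved at random, and every sample appears exactly once."""
    rng = random.Random(seed)
    steps = []
    for name, samples in datasets.items():
        shuffled = samples[:]            # keep the caller's list intact
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled), batch_size):
            steps.append((name, shuffled[i:i + batch_size]))
    rng.shuffle(steps)                   # mix datasets instead of sequential passes
    return steps

schedule = mixed_epoch_schedule(
    {"haspeede1": list(range(6)), "ami18_binary": list(range(4))}, batch_size=2)
print(len(schedule))  # -> 5 steps (3 + 2), exhausting both datasets
```
        </preformat>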
        <p>Step 2 In the case of the political test domain, we apply a second round of model fine-tuning given the in-domain training dataset. We follow the same training procedure as in Step 1 but use only a single training corpus instead of multiple corpora. The goal of this step is to specialize our model to the target domain, given the pre-finetuned base model, which is already aware of general hate speech language phenomena. This is in contrast to standard multitask training, where i) the goal is to build a single model supporting multiple target tasks (datasets) and, more importantly, ii) the model is trained by optimizing a joint objective function across all datasets.</p>
        <p>No in-domain training data was provided for the religious test set. In this case, we omit Step 2 and apply the model resulting from Step 1 in a zero-shot transfer learning fashion, i.e., the model is only trained on the external (source) datasets but not on the target corpus.</p>
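        <p>The two-step control flow can be made concrete with stub training functions that only record which steps ran; the function and dataset names are placeholders, as the real system fine-tunes a PLM at each step:</p>
        <preformat>
```python
# Stubs that record steps instead of training; names are placeholders.
def pre_finetune(model, datasets):
    return model + ["step1:" + "+".join(datasets)]

def finetune(model, corpus):
    return model + ["step2:" + corpus]

def build_model(external_datasets, target_train=None):
    model = pre_finetune([], external_datasets)  # Step 1: multitask pre-finetuning
    if target_train is not None:                 # Step 2 only if in-domain data exists
        model = finetune(model, target_train)
    return model

politics = build_model(["HaSpeeDe1", "HaSpeeDe2"], target_train="PolicyCorpusXL")
religious = build_model(["HaSpeeDe1", "HaSpeeDe2"])  # zero-shot: Step 1 only
```
        </preformat>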
      </sec>
      <sec id="sec-3-3">
        <p>1. Our code is available at https://cistern.cis.lmu.de/multi_hs</p>
      </sec>
      <sec id="sec-3-4">
        <title>Mixed</title>
        <p>Finally, to test the effect of leveraging English training data as well, in addition to the datasets contained in the HaSpeeDe set we used 7 078 politics-related tweets annotated for binary hate speech detection released in [21].</p>
        <p>Official HaSpeeDe3 datasets The HaSpeeDe3 shared task focuses on strongly polarized debates in two domains. For the politics domain, the binary hate speech labeled PolicyCorpusXL was built [22], containing 5 600 train and 1 400 test tweets. For the religious domain, the ReligiousHate corpus [23] contains 3 000 test tweets and no training set.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.2. Setup</title>
        <p>Due to GPU memory limitations, we used batch size 4 with gradient accumulation steps 4 for BERT based models, while we used batch size 1 with gradient accumulation steps 16 for RoBERTa based models. We train our models for a single epoch in Step 1 of our approach, while we perform early stopping in Step 2 based on the performance on the development set. During the development of our system, we split the official political training set into train/dev/test splits. Since no labeled sets were provided for the religious domain for development, we simulated zero-shot transfer experiments on the politics domain.</p>
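        <p>Note that with gradient accumulation, several forward passes are accumulated before one parameter update, so the two settings yield the same effective batch size (batch size times accumulation steps); a quick check:</p>
        <preformat>
```python
def effective_batch(batch_size, grad_accum_steps):
    """Gradients are summed over grad_accum_steps mini-batches before one update."""
    return batch_size * grad_accum_steps

# BERT setting (4 x 4) and RoBERTa setting (1 x 16) match.
assert effective_batch(4, 4) == effective_batch(1, 16) == 16
```
        </preformat>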
        <p>Preprocessing We also experimented with two sets of data manipulation methods. To clean tweets, we applied standard Twitter preprocessing steps: user mention and hashtag removal, and HTML and repeated-character unification. Since hate speech datasets often suffer from label imbalance, we additionally tested random oversampling, class weighting and focal loss. However, none of these approaches led to consistent improvements, thus we omitted them from our final systems.</p>
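        <p>The cleaning steps can be sketched as below; the exact rules used in the submitted system are not specified, so these regular expressions are assumptions:</p>
        <preformat>
```python
import html
import re

def clean_tweet(text: str) -> str:
    text = html.unescape(text)                   # HTML entity unification
    text = re.sub(r"@\w+", "", text)             # user mention removal
    text = re.sub(r"#\w+", "", text)             # hashtag removal
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # cap repeated characters at two
    return " ".join(text.split())                # normalize whitespace

print(clean_tweet("@user I haaaaate this!!!! #politics"))  # -> I haate this!!
```
        </preformat>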
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <p>We evaluate our systems using macro-averaged F1 scores, as this is the official metric of the shared task. First, we present the comparison of various external dataset setups (Table 2), followed by the comparison of different PLMs and their combination with ensembling (Table 3). Finally, we present our official results in Table 4.</p>
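        <p>For concreteness, macro-averaged F1 computes a per-class F1 and takes its unweighted mean over the classes; this sketch should be equivalent to sklearn's f1_score with average="macro", and the example labels are hypothetical:</p>
        <preformat>
```python
def macro_f1(gold, pred, labels=("Hate", "nonHate")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

gold = ["Hate", "Hate", "nonHate", "nonHate"]
pred = ["Hate", "nonHate", "nonHate", "nonHate"]
print(round(macro_f1(gold, pred), 4))  # -> 0.7333
```
        </preformat>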
      </sec>
      <sec id="sec-4-2">
        <title>Models</title>
        <p>As the base PLMs we experiment with two monolingual Italian and two multilingual models. AlBERTo was trained purely on Italian social media texts (Twitter in particular), based on the BERT base architecture [24]. We selected this model since it performs well on social media texts. Similarly, we experiment with UmBERTo [25], which is based on the RoBERTa base architecture and was trained with whole word masking on an Italian CommonCrawl corpus. As for the multilingual models, we used the highly popular mBERT [15] and XLM-R [16] PLMs.</p>
        <p>We used the OpenPrompt toolkit for implementation [26], and used standard hyperparameter values; batch sizes were limited by the memory of an Nvidia GTX 1080 Ti.</p>
        <p>PVPs We aimed at keeping our patterns and verbalizers simple and uniform across datasets. Both patterns and verbalizers are presented in Table 1. For the binary hate and misogyny datasets we used patterns 1 and 2 for the Italian and English datasets, respectively. Similarly, we used pattern 3 for the binary stereotype datasets. As verbalizers, we used 1 and 2 for the two languages. For the AMI18 misogyny fine-grained and target sets we used patterns 4 and 5, respectively, with verbalizers 3 and 4.</p>
        <p>External datasets As the baseline system to measure the effectiveness of the external datasets, we only perform Step 2 of our approach, i.e., we fine-tune the off-the-shelf PLM (mBERT) using only the HaSpeeDe3 politics training corpus without any pre-finetuning on the external datasets. As mentioned, no in-domain data is provided for the religious domain, thus we perform zero-shot transfer learning, i.e., we only perform pre-finetuning on the external datasets. Additionally, since not even a development set was provided for this domain, we simulate zero-shot transfer on the politics dataset. The gold labels of the religious test set were released after the shared task deadline, thus we are able to present (oracle) results for comparison. The results in Table 2 show the positive impact of the external datasets, as the baseline systems were outperformed by a large margin. Comparing the different external dataset setups, we found that they perform comparably. On the politics domain the HaSpeeDe setup performed the best, although both It and Mixed lagged behind by less than half a percentage point in the two-step setting, while on the simulated zero-shot experiments the gap between HaSpeeDe and It2 is around 2 percentage points. These findings indicate that the misogyny detection tasks in the It setup could be slightly detrimental to the binary hate speech detection task. Furthermore, the additional English politics related dataset in the Mixed setup does not lead to further improvements on the politics domain, although it is from the same domain, indicating that leveraging only Italian external datasets is an important factor. Looking at the results on the gold religious test set, we found similar trends. The use of additional training datasets on top of the HaSpeeDe3 politics training set improves the performance3. Although the HaSpeeDe set performs well, interestingly the best performance was achieved by Mixed, which includes English politics tweets; this needs further investigation. Nonetheless, based on these findings, we used the HaSpeeDe setup in our final system submission and in the following experiments.</p>
      </sec>
      <sec id="sec-4-3">
        <p>2. Due to the inclusion of politics related training data in the baseline and Mixed setups, these are not applicable in the simulated zero-shot case.</p>
        <p>3. Note that we also included the politics HaSpeeDe3 train set in the HaSpeeDe, It and Mixed sets when training our models for the religious domain.</p>
        <p>Model variations In Table 3 we compare the four mentioned PLMs and their combinations. In the mono-ens. ensemble setup we combine the monolingual Italian models (AlBERTo and UmBERTo), while in mix-ens. we combine all PLMs (AlBERTo, UmBERTo, mBERT and XLM-R). We found that the monolingual models outperform the multilingual models in most cases, especially on the politics domain. AlBERTo has the best performance on average, which is due to its pre-training on social media content. Interestingly, comparing the BERT (AlBERTo and mBERT) and RoBERTa (UmBERTo and XLM-R) architectures, the former outperform the latter, which is a somewhat contradictory result, as the latter often performs better. The ensemble results, however, show that although the results of different PLMs vary, they can support each other, and by ensembling their outputs the performance can be further increased. Similarly to the individual models, the monolingual ensemble performed the best during our system development; however, the combination of all models does not lag much behind. Furthermore, mix-ens. outperformed mono-ens. on the gold religious test set.</p>
        <p>Final Submission The shared task allowed two submitted runs for each domain. Based on our findings during development, our official systems were mono-ens. (Run 1) and mix-ens. (Run 2) using the HaSpeeDe external dataset setup. We note that in the case of the religious domain, we also included the HaSpeeDe3 politics training set as an external dataset. Our official results are shown in Table 4. We achieved the second-best result in both domains.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <sec id="sec-5-1">
        <p>We presented the LMU Munich team’s systems for the HaSpeeDe3 shared task, participating in the cross-domain hate speech detection task. Our approach involves a two-step method for the politics domain: pre-finetuning on external datasets followed by a second step of fine-tuning on the target domain. In the case of the religious domain, we used a zero-shot transfer setup involving training on the external datasets only. Additionally, we performed prompt-training instead of using classification heads, allowing a more seamless combination of external datasets with different label sets. By comparing various external datasets, including both Italian and English ones, we found that the Italian datasets are more beneficial. Similarly, by comparing various PLMs, we found that individually the monolingual models perform better than the multilingual ones. On the other hand, when combining multiple PLMs with model ensembling, we found that different models can support each other, leading to improved performance. Our best result on the political domain was achieved by combining monolingual PLMs only, while combining all PLMs performed best on the religious domain.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>We thank the anonymous reviewers for their helpful feedback and the Cambridge LMU Strategic Partnership for funding this project.4 The work was additionally supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (No. 640550) and by the German Research Foundation (DFG; grant FR 2829/4-1).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <sec id="sec-7-1">
        <p>[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423.pdf.</p>
        <p>[16] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451. URL: https://www.aclweb.org/anthology/2020.acl-main.747/.</p>
        <p>[17] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 255–269. URL: https://aclanthology.org/2021.eacl-main.20.</p>
        <p>[18] C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, M. Tesconi, Overview of the EVALITA 2018 hate speech detection task, in: CEUR Workshop Proceedings, 2018, pp. 1–9.</p>
        <p>[19] E. Fersini, D. Nozza, P. Rosso, Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI), in: Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, 2018, pp. 59–66. URL: https://pdfs.semanticscholar.org/05d5/17f3fa5f47b16265b378c81a0839ed760ba0.pdf.</p>
        <p>[20] M. Sanguinetti, F. Poletto, C. Bosco, V. Patti, M. Stranisci, An Italian Twitter corpus of hate speech against immigrants, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018. URL: https://aclanthology.org/L18-1443.</p>
        <p>[21] C. Toraman, F. Şahinuç, E. Yilmaz, Large-scale hate speech detection with cross-domain transfer, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 2215–2225. URL: https://aclanthology.org/2022.lrec-1.238.</p>
        <p>[22] F. Celli, M. Lai, A. Duzha, C. Bosco, V. Patti, Policycorpus XL: An Italian Corpus for the Detection of Hate Speech Against Politics, in: Proceedings of the Eighth Italian Conference on Computational Linguistics, 2021. URL: https://ceur-ws.org/Vol-3033/paper38.pdf.</p>
        <p>[23] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Addressing religious hate online: from taxonomy creation to automated detection, PeerJ Computer Science 8 (2022) e1128. URL: https://peerj.com/articles/cs-1128.</p>
        <p>[24] M. Polignano, V. Basile, P. Basile, M. de Gemmis, G. Semeraro, AlBERTo: Modeling Italian social media language with BERT, IJCoL. Italian Journal of Computational Linguistics 5 (2019) 11–31. URL: https://journals.openedition.org/ijcol/472.</p>
        <p>[25] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.</p>
        <p>[26] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H. Zheng, M. Sun, OpenPrompt: An open-source framework for prompt-learning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 105–113. URL: https://aclanthology.org/2022.acl-demo.10. doi:10.18653/v1/2022.acl-demo.10.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>