<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LMU at HaSpeeDe3: Multi-Dataset Training for Cross-Domain Hate Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Viktor Hangya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Fraser</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Information and Language Processing, LMU Munich</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Munich Center for Machine Learning</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe LMU Munich's hate speech detection system for participating in the cross-domain track of the HaSpeeDe3 shared task at EVALITA 2023. The task focuses on the politics and religion domains, having no in-domain training data for the latter. Our submission combines multiple training sets from various domains in a multitask prompt-training system. We experimented with both Italian and English source datasets as well as monolingual Italian and multilingual pre-trained language models. We found that the Italian out-of-domain datasets are the most influential on the performance in the test domains and that combining both monolingual and multilingual language models using an ensemble gives the best results. Our system ranked second in both domains.</p>
      </abstract>
      <kwd-group>
        <kwd>hate speech detection</kwd>
        <kwd>multitask learning</kwd>
        <kwd>prompt-training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Due to the sheer amount of social media content, manual filtering for hate speech is impossible, which makes building high-performance and reliable hate speech classifiers important. To promote research in the field, various datasets were built [1, 2] and shared tasks were organized [3, 4, 5], where the best performing systems are based on pre-trained language models (PLMs) [6, 7].</p>
      <p>The HaSpeeDe3 shared task [8] is the third iteration of the series on hate speech detection in Italian social media posts (tweets) organized at EVALITA 2023 [9], focusing on strongly polarized debates on political and religious topics. Two subtasks were organized: Task A – Political Hate Speech Detection, which on top of textual inputs allows for the use of contextual information, such as metadata of tweets and authors; and Task B – Cross-domain Hate Speech Detection, which involves only textual inputs, where the main objective is to explore cross-domain hate speech detection in the politics and religious domains by allowing the use of external datasets (open track). In contrast to the politics domain, where in-domain training data is given, in the religious domain such data is not provided. Our team participated only in Task B.</p>
      <p>Cross-domain training is a crucial problem in machine learning, aiming to build high-quality models for the target domain by leveraging labeled samples from out-of-domain sources as well [10]. For hate speech detection, [11] experimented with training classifiers using out-of-domain training examples and showed a significant performance drop on the test sets compared to in-domain training. By simply combining multiple datasets of different domains, including the target domain, they achieved only slight improvements. In a similar work, one source and one target domain were explored [12], but the authors showed mixed results, i.e., improvements on some domains but a decrease on others. Similarly, [13] applied the general domain adaptation technique of [10] and showed improvements when incorporating some out-of-domain datasets into the final model, even though the approach seemed sensitive to the chosen out-of-domain dataset. In addition, [14] showed negative performance on the target domain in German when using additional source-domain English training examples.</p>
      <p>Following previous work, we rely on transfer learning to leverage out-of-domain (external) datasets to build our classifiers for the political and religious domains. We experiment with various external datasets containing both Italian and English hate speech inputs. Additionally, in contrast to previous work which used datasets with matching label sets, we use corpora annotated with different label sets, e.g., stereotype. To avoid negative results, we combine multiple datasets in a multitask training fashion in order to build robust systems. Additionally, we train our systems in a two-step process, where we first pre-finetune our models on the external datasets, followed by fine-tuning them on the target task. As the basis of our models, we take various PLMs based on the BERT [15] and RoBERTa [16] architectures, including both Italian-only and multilingual models. Furthermore, in order to facilitate information sharing across the used datasets, we perform prompt-training, which eliminates dedicated classification heads for each dataset.</p>
      <p>Our experiments show that using only Italian external datasets is more beneficial compared to leveraging English as well. In contrast, we find that both monolingual and multilingual PLMs perform comparably well, and that they can support each other when combined using model ensembling.1</p>
      <p>The Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), September 07–08, 2023, Parma, Italy.</p>
      <p>* Corresponding author. hangyav@cis.lmu.de (V. Hangya); fraser@cis.lmu.de (A. Fraser). https://www.cis.uni-muenchen.de/~hangyav (V. Hangya); https://www.cis.uni-muenchen.de/~fraser (A. Fraser). ORCID: 0000-0002-5144-3069 (V. Hangya); 0000-0003-4891-682X (A. Fraser).</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <sec id="sec-2-1">
        <p>Our approach consists of two steps: we first pre-finetune a given PLM on external datasets (see Section 3.1), followed by in-domain fine-tuning in the case of the political domain, where such data is provided. Instead of classification heads, we leverage model prompting.</p>
        <p>Prompt-Training Prompt-training was shown to be effective and more reliable for various NLP tasks, including classification [17]. Instead of using classification heads on top of PLMs, which add additional parameters to the model, it relies on the masked language modeling (MLM) task. Using pattern-verbalizer pairs (PVPs), an input sentence is first transformed using the pattern, e.g., I hate you. → Is this hate speech? I hate you. [MASK], and the task is to predict the masked token. Finally, the verbalizer maps the highest-probability token, out of a set of valid tokens, to the labels of a given dataset, e.g., Yes → Hate or No → nonHate. During training all model parameters are fine-tuned using the MLM objective.</p>
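        <p>The pattern and verbalizer mechanics described above can be sketched as follows; the pattern wording and label names follow the example in the text, while the toy probability function standing in for a real PLM's [MASK] distribution is purely an illustrative assumption:</p>
        <preformat>
```python
from typing import Callable, Dict

# Verbalizer: maps valid answer tokens at the [MASK] position to dataset labels.
VERBALIZER: Dict[str, str] = {"Yes": "Hate", "No": "nonHate"}

def apply_pattern(text: str) -> str:
    """Pattern: 'I hate you.' -> 'Is this hate speech? I hate you. [MASK]'."""
    return f"Is this hate speech? {text} [MASK]"

def classify(text: str, mask_probs: Callable[[str], Dict[str, float]]) -> str:
    """Restrict the MLM's [MASK] distribution to the verbalizer's valid
    tokens, take the argmax, and map it to a dataset label."""
    probs = mask_probs(apply_pattern(text))
    best = max(VERBALIZER, key=lambda tok: probs.get(tok, 0.0))
    return VERBALIZER[best]

# Toy stand-in for a PLM's [MASK] distribution; a real system would query
# the MLM head of, e.g., AlBERTo here.
def toy_mlm(prompt: str) -> Dict[str, float]:
    body = prompt.removeprefix("Is this hate speech? ")  # look at the input only
    return {"Yes": 0.8, "No": 0.2} if "hate" in body.lower() else {"Yes": 0.1, "No": 0.9}

print(classify("I hate you.", toy_mlm))  # -> Hate
```
        </preformat>
        <p>Because each dataset only changes the PVP rather than adding a classification head, all datasets share the same underlying MLM parameters.</p>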
        <p>Ensembling To further improve the robustness of our final models, we employ model ensembling to combine the outputs of multiple models. We ensemble models along two dimensions: we combine models of the same setup trained with 3 different random seeds, and models based on different PLM architectures as defined below. We simply take the mean of the probabilities of the considered models for a given input sample.</p>
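        <p>A minimal sketch of this probability-mean ensemble; the class order and example numbers are hypothetical:</p>
        <preformat>
```python
def ensemble_probs(model_outputs):
    """Average per-class probability vectors over models.
    model_outputs: list of [p(Hate), p(nonHate)] vectors, one per model
    (e.g., 3 seeds of one setup, or different PLM architectures)."""
    n = len(model_outputs)
    return [sum(out[i] for out in model_outputs) / n
            for i in range(len(model_outputs[0]))]

# Three models disagreeing on one sample; the mean decides the label.
probs = ensemble_probs([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
label = "Hate" if probs[0] > probs[1] else "nonHate"  # label == "Hate"
```
        </preformat>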
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>Next, we list our external dataset setups followed by the introduction of the official shared task data. We define the following groups of external datasets:</p>
      </sec>
      <sec id="sec-3-2">
        <title>HaSpeeDe</title>
        <p>We leverage Italian datasets from previous HaSpeeDe iterations. More precisely, we take i) the training data containing 2 400 Facebook posts annotated with binary hate speech labels from HaSpeeDe1 [18], ii) 5 470 binary hate speech annotated Twitter posts from HaSpeeDe2 [5] and iii) the same Twitter posts annotated for binary stereotype detection.</p>
        <p>It In addition to the datasets mentioned in the HaSpeeDe set, we used further Italian abusive language related datasets. Tweets from the AMI18 misogyny detection shared task [19]: i) a 3 200 tweet binary and ii) a 1 460 tweet fine-grained (discredit, stereotype, dominance, harassment, derailing) training set, as well as iii) a 1 454 tweet binary target detection set (individual, group). Furthermore, we took binary iv) hate (3 271) and v) stereotype (441) annotated training sets from the IHSC corpus [20] containing tweets related to immigrants.</p>
        <p>Step 1 Given a set of external training corpora D = {D_i : i = 1..n}, we randomly select a single dataset D_i and a batch of samples from it in each training step. For each dataset we apply a dedicated PVP (see Section 3.2) in order to handle datasets of different label sets, and use a cross-entropy loss to perform a single model update. This way we mix the available external datasets during pre-finetuning instead of performing sequential model updates, which could lead to catastrophic forgetting. Additionally, we make sure that we exhaust all datasets in D in each epoch, i.e., the model is trained on each input sample once per epoch.</p>
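        <p>The dataset-mixing scheme described above can be sketched as follows; the dataset names and sizes are hypothetical, and in the real system each scheduled step would apply the dataset's PVP and perform one gradient update:</p>
        <preformat>
```python
import random

def mixed_epoch_schedule(datasets, batch_size, seed=0):
    """Build one epoch of (dataset_name, batch) steps: batches from all
    datasets are interleaved at random, and every sample appears exactly once."""
    rng = random.Random(seed)
    steps = []
    for name, samples in datasets.items():
        shuffled = samples[:]            # keep the caller's list intact
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled), batch_size):
            steps.append((name, shuffled[i:i + batch_size]))
    rng.shuffle(steps)                   # mix datasets instead of sequential passes
    return steps

schedule = mixed_epoch_schedule(
    {"haspeede1": list(range(6)), "ami18_binary": list(range(4))}, batch_size=2)
print(len(schedule))  # -> 5 steps (3 + 2), exhausting both datasets
```
        </preformat>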
        <p>Step 2 In the case of the political test domain, we apply a second round of model fine-tuning given the in-domain training dataset. We follow the same training procedure as in Step 1 but use only a single training corpus instead of multiple corpora. The goal of this step is to specialize our model to the target domain, given the pre-finetuned base model, which is already aware of general hate speech language phenomena. This is in contrast to standard multitask training, where i) the goal is to build a single model supporting multiple target tasks (datasets) and, more importantly, ii) the model is trained by optimizing a joint objective function across all datasets.</p>
        <p>No in-domain training data was provided for the religious test set. In this case, we omit Step 2 and apply the model resulting from Step 1 in a zero-shot transfer learning fashion, i.e., the model is only trained on the external (source) datasets but not on the target corpus.</p>
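        <p>The two-step control flow can be made concrete with stub training functions that only record which steps ran; the function and dataset names are placeholders, as the real system fine-tunes a PLM at each step:</p>
        <preformat>
```python
# Stubs that record steps instead of training; names are placeholders.
def pre_finetune(model, datasets):
    return model + ["step1:" + "+".join(datasets)]

def finetune(model, corpus):
    return model + ["step2:" + corpus]

def build_model(external_datasets, target_train=None):
    model = pre_finetune([], external_datasets)  # Step 1: multitask pre-finetuning
    if target_train is not None:                 # Step 2 only if in-domain data exists
        model = finetune(model, target_train)
    return model

politics = build_model(["HaSpeeDe1", "HaSpeeDe2"], target_train="PolicyCorpusXL")
religious = build_model(["HaSpeeDe1", "HaSpeeDe2"])  # zero-shot: Step 1 only
```
        </preformat>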
      </sec>
      <sec id="sec-3-3">
        <p>1. Our code is available at https://cistern.cis.lmu.de/multi_hs</p>
      </sec>
      <sec id="sec-3-4">
        <title>Mixed</title>
        <p>Finally, to test the effect of leveraging English training data as well, in addition to the datasets contained in the HaSpeeDe set we used 7 078 politics-related tweets annotated for binary hate speech detection released in [21].</p>
        <p>Official HaSpeeDe3 datasets The HaSpeeDe3 shared task focuses on strongly polarized debates in two domains. For the politics domain, the binary hate speech labeled PolicyCorpusXL was built [22], containing 5 600 train and 1 400 test tweets. For the religious domain, the ReligiousHate corpus [23] contains 3 000 test tweets and no training set.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.2. Setup</title>
        <p>Due to GPU memory limitations, we used batch size 4 with gradient accumulation steps 4 for BERT based models, while we used batch size 1 with gradient accumulation steps 16 for RoBERTa based models. We train our models for a single epoch in Step 1 of our approach, while we perform early stopping in Step 2 based on the performance on the development set. During the development of our system, we split the official political training set into train/dev/test splits. Since no labeled sets were provided for the religious domain for development, we simulated zero-shot transfer experiments on the politics domain.</p>
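        <p>Note that with gradient accumulation, several forward passes are accumulated before one parameter update, so the two settings yield the same effective batch size (batch size times accumulation steps); a quick check:</p>
        <preformat>
```python
def effective_batch(batch_size, grad_accum_steps):
    """Gradients are summed over grad_accum_steps mini-batches before one update."""
    return batch_size * grad_accum_steps

# BERT setting (4 x 4) and RoBERTa setting (1 x 16) match.
assert effective_batch(4, 4) == effective_batch(1, 16) == 16
```
        </preformat>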
        <p>Preprocessing We also experimented with two sets of data manipulation methods. To clean tweets, we applied standard Twitter preprocessing steps: user mention and hashtag removal, and HTML and repeated-character unification. Since hate speech datasets often suffer from label imbalance, we additionally tested random oversampling, class weighting and focal loss. However, none of these approaches led to consistent improvements, thus we omitted them from our final systems.</p>
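        <p>The cleaning steps can be sketched as below; the exact rules used in the submitted system are not specified, so these regular expressions are assumptions:</p>
        <preformat>
```python
import html
import re

def clean_tweet(text: str) -> str:
    text = html.unescape(text)                   # HTML entity unification
    text = re.sub(r"@\w+", "", text)             # user mention removal
    text = re.sub(r"#\w+", "", text)             # hashtag removal
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # cap repeated characters at two
    return " ".join(text.split())                # normalize whitespace

print(clean_tweet("@user I haaaaate this!!!! #politics"))  # -> I haate this!!
```
        </preformat>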
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <p>We evaluate our systems using macro-averaged F1 scores, as this is the official metric of the shared task. First, we present the comparison of various external dataset setups (Table 2), followed by the comparison of different PLMs and their combination with ensembling (Table 3). Finally, we present our official results in Table 4.</p>
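        <p>For concreteness, macro-averaged F1 computes a per-class F1 and takes its unweighted mean over the classes; this sketch should be equivalent to sklearn's f1_score with average="macro", and the example labels are hypothetical:</p>
        <preformat>
```python
def macro_f1(gold, pred, labels=("Hate", "nonHate")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

gold = ["Hate", "Hate", "nonHate", "nonHate"]
pred = ["Hate", "nonHate", "nonHate", "nonHate"]
print(round(macro_f1(gold, pred), 4))  # -> 0.7333
```
        </preformat>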
      </sec>
      <sec id="sec-4-2">
        <title>Models</title>
        <p>As the base PLMs we experiment with two monolingual Italian and two multilingual models. AlBERTo was trained purely on Italian social media texts (Twitter in particular), based on the BERT base architecture [24]. We selected this model since it performs well on social media texts. Similarly, we experiment with UmBERTo [25], which is based on the RoBERTa base architecture and was trained with whole word masking on an Italian CommonCrawl corpus. As for the multilingual models, we used the highly popular mBERT [15] and XLM-R [16] PLMs.</p>
        <p>We used the OpenPrompt toolkit for implementation [26], and used standard hyperparameter values; batch sizes were limited by the memory of an Nvidia GTX 1080 Ti.</p>
        <p>PVPs We aimed at keeping our patterns and verbalizers simple and uniform across datasets. Both patterns and verbalizers are presented in Table 1. For the binary hate and misogyny datasets we used patterns 1 and 2 for the Italian and English datasets, respectively. Similarly, we used pattern 3 for the binary stereotype datasets. As verbalizers, we used 1 and 2 for the two languages. For the AMI18 misogyny fine-grained and target sets we used patterns 4 and 5, respectively, with verbalizers 3 and 4.</p>
        <p>External datasets As the baseline system to measure the effectiveness of the external datasets, we only perform Step 2 of our approach, i.e., we fine-tune the off-the-shelf PLM (mBERT) using only the HaSpeeDe3 politics training corpus without any pre-finetuning on the external datasets. As mentioned, no in-domain data is provided for the religious domain, thus we perform zero-shot transfer learning, i.e., we only perform pre-finetuning on the external datasets. Additionally, since not even a development set was provided for this domain, we simulate zero-shot transfer on the politics dataset. The gold labels of the religious test set were released after the shared task deadline, thus we are able to present (oracle) results for comparison. The results in Table 2 show the positive impact of the external datasets, as the baseline systems were outperformed by a large margin. Comparing the different external dataset setups, we found that they perform comparably. On the politics domain the HaSpeeDe setup performed the best, although both It and Mixed lagged behind by less than half a percentage point in the two-step setting, while on the simulated zero-shot experiments the gap between HaSpeeDe and It2 is around 2 percentage points. These findings indicate that the misogyny detection tasks in the It setup could be slightly detrimental to the binary hate speech detection task. Furthermore, the additional English politics related dataset in the Mixed setup does not lead to further improvements on the politics domain, although it is from the same domain, indicating that leveraging only Italian external datasets is an important factor. Looking at the results on the gold religious test set, we found similar trends. The use of additional training datasets on top of the HaSpeeDe3 politics training set improves the performance3. Although the HaSpeeDe set performs well, interestingly the best performance was achieved by Mixed, which includes English politics tweets; this needs further investigation. Nonetheless, based on these findings, we used the HaSpeeDe setup in our final system submission and in the following experiments.</p>
      </sec>
      <sec id="sec-4-3">
        <p>2. Due to the inclusion of politics related training data in the baseline and Mixed setups, these are not applicable in the simulated zero-shot case.</p>
        <p>3. Note that we also included the politics HaSpeeDe3 train set in the HaSpeeDe, It and Mixed sets when training our models for the religious domain.</p>
        <p>Model variations In Table 3 we compare the four mentioned PLMs and their combinations. In the mono-ens. ensemble setup we combine the monolingual Italian models (AlBERTo and UmBERTo), while in mix-ens. we combine all PLMs (AlBERTo, UmBERTo, mBERT and XLM-R). We found that the monolingual models outperform the multilingual models in most cases, especially on the politics domain. AlBERTo has the best performance on average, which is due to its pre-training on social media content. Interestingly, comparing the BERT (AlBERTo and mBERT) and RoBERTa (UmBERTo and XLM-R) architectures, the former outperform the latter, which is a somewhat contradictory result, as the latter often performs better. The ensemble results, however, show that although the results of different PLMs vary, they can support each other, and by ensembling their outputs the performance can be further increased. Similarly to the individual models, the monolingual ensemble performed the best during our system development; however, the combination of all models does not lag much behind. Furthermore, mix-ens. outperformed mono-ens. on the gold religious test set.</p>
        <p>Final Submission The shared task allowed two submitted runs for each domain. Based on our findings during development, our official systems were mono-ens. (Run 1) and mix-ens. (Run 2) using the HaSpeeDe external dataset setup. We note that in the case of the religious domain, we also included the HaSpeeDe3 politics training set as an external dataset. Our official results are shown in Table 4. We achieved the second-best result in both domains.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <sec id="sec-5-1">
        <p>We presented the LMU Munich team’s systems for the HaSpeeDe3 shared task, participating in the cross-domain hate speech detection task. Our approach involves a two-step method for the politics domain: pre-finetuning on external datasets followed by a second step of fine-tuning on the target domain. In the case of the religious domain, we used a zero-shot transfer setup involving training on the external datasets only. Additionally, we performed prompt-training instead of using classification heads, allowing a more seamless combination of external datasets with different label sets. By comparing various external datasets, including both Italian and English ones, we found that the Italian datasets are more beneficial. Similarly, by comparing various PLMs, we found that individually the monolingual models perform better than the multilingual ones. On the other hand, when combining multiple PLMs with model ensembling, we found that different models can support each other, leading to improved performance. Our best result on the political domain was achieved by combining monolingual PLMs only, while combining all PLMs performed best on the religious domain.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>We thank the anonymous reviewers for their helpful feedback and the Cambridge LMU Strategic Partnership for funding this project.4 The work was additionally supported by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (No. 640550) and by the German Research Foundation (DFG; grant FR 2829/4-1).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <sec id="sec-7-1">
        <p>[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423.pdf.</p>
        <p>[16] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8440–8451. URL: https://www.aclweb.org/anthology/2020.acl-main.747/.</p>
        <p>[17] T. Schick, H. Schütze, Exploiting cloze-questions for few-shot text classification and natural language inference, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 255–269. URL: https://aclanthology.org/2021.eacl-main.20.</p>
        <p>[18] C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, M. Tesconi, Overview of the EVALITA 2018 hate speech detection task, in: CEUR Workshop Proceedings, 2018, pp. 1–9.</p>
        <p>[19] E. Fersini, D. Nozza, P. Rosso, Overview of the Evalita 2018 Task on Automatic Misogyny Identification (AMI), in: Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, 2018, pp. 59–66. URL: https://pdfs.semanticscholar.org/05d5/17f3fa5f47b16265b378c81a0839ed760ba0.pdf.</p>
        <p>[20] M. Sanguinetti, F. Poletto, C. Bosco, V. Patti, M. Stranisci, An Italian Twitter corpus of hate speech against immigrants, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018. URL: https://aclanthology.org/L18-1443.</p>
        <p>[21] C. Toraman, F. Şahinuç, E. Yilmaz, Large-scale hate speech detection with cross-domain transfer, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 2215–2225. URL: https://aclanthology.org/2022.lrec-1.238.</p>
        <p>[22] F. Celli, M. Lai, A. Duzha, C. Bosco, V. Patti, Policycorpus XL: An Italian Corpus for the Detection of Hate Speech Against Politics, in: Proceedings of the Eighth Italian Conference on Computational Linguistics, 2021. URL: https://ceur-ws.org/Vol-3033/paper38.pdf.</p>
        <p>[23] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Addressing religious hate online: from taxonomy creation to automated detection, PeerJ Computer Science 8 (2022) e1128. URL: https://peerj.com/articles/cs-1128.</p>
        <p>[24] M. Polignano, V. Basile, P. Basile, M. de Gemmis, G. Semeraro, AlBERTo: Modeling Italian social media language with BERT, IJCoL. Italian Journal of Computational Linguistics 5 (2019) 11–31. URL: https://journals.openedition.org/ijcol/472.</p>
        <p>[25] L. Parisi, S. Francia, P. Magnani, UmBERTo: an Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.</p>
        <p>[26] N. Ding, S. Hu, W. Zhao, Y. Chen, Z. Liu, H. Zheng, M. Sun, OpenPrompt: An open-source framework for prompt-learning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2022, pp. 105–113. URL: https://aclanthology.org/2022.acl-demo.10. doi:10.18653/v1/2022.acl-demo.10.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>