<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERTicelli at HaSpeeDe 3: Fine-tuning and Cross-validating Large Language Models for Hate Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Grotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Quick</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiPS Research Center, University of Antwerp</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universiteit Antwerpen, Faculty of Arts</institution>
          ,
          <addr-line>Prinsstraat 13, B-2000, Antwerp</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The present paper describes the results of the experiments carried out for the HaSpeeDe 3 shared task, an Italian-language Hate Speech (HS) detection task at EVALITA 2023. Two BERT-based language models were selected: UmBERTo (cased) and Italian BERT (cased). For the Textual task, the models were fine-tuned and cross-validated across 5 folds. For the Contextual task, we adopted an ensemble approach: the additional features were combined with the fine-tuned models' outputs through the GradientBoostingClassifier algorithm. The models perform better than the baselines (DummyClassifier and LogisticRegression) and above the average performance of participants in the shared task. While the addition of contextual features did not improve the performance of UmBERTo, it significantly improved the results obtained with Italian BERT.</p>
      </abstract>
      <kwd-group>
        <kwd>Hate Speech detection</kwd>
        <kwd>Italian language</kwd>
        <kwd>BERT-based language models</kwd>
        <kwd>Fine-tuning</kwd>
        <kwd>Contextual features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The escalating issue of toxic language has been amplified by the rapid growth in social media usage over the past decade [1]. Platforms such as Facebook and Twitter have transformed the way individuals interact, making interaction faster and often anonymous, thereby creating an ideal environment for the propagation of harmful content [2]. Furthermore, previous studies have shown that this content can be targeted at and posted by both individuals and groups, inciting and driving violent acts in the offline world [2, 3, 4].</p>
      <p>As such, countering the phenomenon of toxic language has garnered significant attention from legal authorities, social media platforms, and companies [5]. Platforms like Facebook, Twitter, YouTube, and other websites have taken measures to combat toxic language by implementing bans. However, research has pointed out the limitations of companies' control systems and their heavy reliance on user reports to identify problematic comments or posts [6]. The manual filtering of messages containing toxic language has proven to be not only time-consuming but also detrimental to human annotators [7]. Additionally, studies have revealed that human-labeled data can be influenced by individual annotators' biases [8].</p>
      <p>Such interest was reflected in the field of Natural Language Processing (NLP), which has witnessed a surge in interest and popularity in automatic toxic language detection [8]. Researchers aim to develop models that can alleviate the harm caused by online HS [9]. Automating the detection process not only overcomes the challenges of manual filtering but also enables efficient analysis of large volumes of data.</p>
      <p>As a reminder, we here use the term HS as an umbrella term and do not distinguish between its subcategories on a theoretical level. For a more extensive discussion of HS hierarchies and definitions, refer to Zampieri [7] and Caselli et al. [10]. It is worth noting that scholars often do not agree on what constitutes HS and how it differs from, e.g., offensive or aggressive language [11].</p>
      <p>EVALITA 2023: Final Workshop of the 8th evaluation campaign, September 08–09, 2021, Parma, IT. lgrotti@uantwerpen.be (L. Grotti); patrick.quick@student.uantwerpen.be (P. Quick). https://github.com/corvusMidnight (L. Grotti). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <p>As we have mentioned, the growing interest in addressing toxic language is evident in the numerous tasks dedicated to its detection and its various subcategories. These include Aggression Identification [12], Offensive Language Identification [7], and HS detection in Italian Facebook and Twitter messages [13], among others. Over time, the quantity and quality of available models for toxic language detection have significantly increased. Markov, Gevers, and Daelemans (2021) note that the advent of transformer-based pre-trained language models, coupled with the abundance of user-generated content on social media, has greatly improved detection accuracy.</p>
      <p>Despite the overall improvement of the models, a series of challenges remain. For instance, it has been shown how the lack of data in languages other than English [14] has exacerbated already existing issues, such as the high occurrence of code words and misspellings in HS text</p>
      <p>The task organizers [20] provided development data consisting of 5,600 Italian-language tweets from the Policycorpus XL corpus, a manually-annotated HS corpus [21]. The testing data consists of one subset of in-domain data and one subset of out-of-domain data. The in-domain data consists of 1,400 Italian-language tweets from Policycorpus XL, and the out-of-domain data consists of 3,000 Italian-language tweets from the Italian subset of the ReligiousHate corpus, a manually-annotated religious HS corpus [22].</p>
      <p>3.3. Models</p>
      <p>Fine-tuning, a common technique in NLP, is a form of transfer learning that involves training a pre-trained model on new data to adapt it for a specific downstream task [24]. As mentioned in Section 2, there are notable benefits in fine-tuning models when it comes to HS detection tasks. Furthermore, this approach has been widely used in Italian HS detection, see, e.g., Eric et al. (2020), Tamburini (2020), Nozza et al. (2022). Thus, we fine-tune two large pre-trained language models:</p>
    </sec>
    <sec id="sec-2">
      <title>-</title>
      <p>1. For an extensive explanation of how these factors have improved the performance of HS detection systems, see Yin and Zubiaga (2021).</p>
      <p>• UmBERTo-commoncrawl-cased (Run 1) is a RoBERTa-based model trained on the Italian section of the OSCAR (Open Super-large Crawled ALMAnaCH coRpus) large corpus. The model has been used for both Named Entity Recognition (NER) and Part-of-Speech (POS) tagging and reached excellent performance on different datasets.</p>
      <p>• bert-base-italian-cased (Run 2) is a BERT-based model which was trained on over 13GB of data (roughly two billion tokens). The model was pre-trained on a combination of data which includes the OPUS corpus as well as a Wikipedia dump. Note that, for ease of readability, we will from now on refer to this model as BERT-ita.</p>
      <p>For the Contextual task, the output labels of both BERT-based models, together with the additional contextual features, are used as input features for GradientBoostingClassifier.</p>
      <p>To further assess the performance of our models, we build four baselines: one LogisticRegression and one DummyClassifier for each task. For the Textual and Contextual tasks, the models were trained on the textual data and evaluated on the in-domain test data. For the cross-domain task, they were trained on the same textual data but evaluated on the out-of-domain test set instead. It is worth mentioning that no additional data was used at any stage of our experiments.</p>
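A minimal sketch of these two baselines in scikit-learn; the toy tweets below and the TF-IDF featurization for LogisticRegression are assumptions, since the paper does not specify the baselines' input representation:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy stand-ins for the labeled tweets (1 = HS, 0 = non-HS)
texts = ["odio tutti", "buona giornata", "che schifo", "bella foto"] * 25
labels = [1, 0, 1, 0] * 25

# DummyClassifier ignores the input and samples from the label distribution
dummy = DummyClassifier(strategy="stratified", random_state=0).fit(texts, labels)
# LogisticRegression over a bag-of-words representation
logreg = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

print(logreg.predict(["che schifo"]))  # -> [1]
```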
      <p>For the Textual tasks (both in- and out-of-domain), our experimental setup consists of three stages. To begin with, we apply two basic preprocessing steps, which consist of substituting the pseudo-random user identifiers (e.g., '@12020569') with '@USER' and removing the hash symbol (i.e., '#') from hashtags. Such steps are applied to avoid excessively long tweets2 and to remove unnecessary noise.</p>
      <p>Then, both models are fine-tuned and cross-validated across five folds using PyTorch Trainer and the Transformers library. The development data is shuffled and divided into five folds. For each fold, the models are fine-tuned on 80% of the development data and evaluated on the remaining 20% across 10 epochs with an EarlyStopping patience of 3. We employ cross-validation to ensure that the obtained results are not dependent on a particular data split but rather generalize well across multiple folds.</p>
      <p>During this stage, we also tune the learning rate (1e-3, 2e-5, and 5e-05)3. We do not tune the batch size, as the test data was not available at this stage and increasing the batch size may have improved development set performance but worsened generalizability on unseen data (see He et al., 2019). Once the stability of the results has been established through cross-validation, the models are fine-tuned on 85% of the training data4 (after shuffling), and the resulting model is saved and used to output predictions on both test sets.</p>
      <p>For the Contextual task, additional features are incorporated into the model using GradientBoostingClassifier. This ensemble algorithm sequentially trains weak models, resulting in a strong model that is a weighted combination of the weak models. Unlike other algorithms, GradientBoostingClassifier employs decision trees as weak learners and is optimized through gradient descent.</p>
    </sec>
    <sec id="sec-2b">
      <title>4. Results and Discussion</title>
      <p>In this section, we describe the results obtained for HaSpeeDe 2023. For each model, we report precision, recall, and F1 score (for both classes). We submitted results for every sub-task except for Task B's XPoliticalHate sub-task. All results are compared with the respective baselines.</p>
      <sec id="sec-2b-1">
        <title>4.1. Baselines</title>
        <p>As a reference point, Table 2 first presents the baseline results for both in- and out-of-domain tasks for each class. The DummyClassifier performs slightly above random chance for the in-domain task, with an average F1 score of 0.52. However, the out-of-domain results are, as expected, poorer, reaching an average F1 score of 0.42. LogisticRegression, on the other hand, achieves competitive results, with average F1 scores of 0.86 for in-domain data and 0.52 for out-of-domain data. However, upon further inspection, we can observe how LogisticRegression is fairly limited. With a precision of 0.80 for the non-hate speech (¬HS) class, the model exhibits a relatively high rate of false positives. Additionally, the recall of 0.96 for ¬HS implies that a high proportion of ¬HS instances is captured, but at the expense of potentially overlooking some true negatives. These results suggest that the model may be overly biased towards predicting instances as ¬HS, potentially missing some actual HS instances. Similarly, while a high precision of 0.953 is achieved for the HS class, the model shows a fairly low recall of 0.75. In turn, this pattern implies the model's inability to identify a significant portion of actual HS instances, resulting in false negatives.</p>
        <p>It is worth mentioning that the high performance of LogisticRegression in the in-domain task is likely related to the balanced nature of the data (700 HS v. 700 ¬HS). When out-of-domain, unbalanced test data (see Table 1) is used, performance drastically drops.</p>
      </sec>
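The fold-wise procedure described in Section 3.4 can be sketched with scikit-learn's KFold; here a TF-IDF plus LogisticRegression pipeline stands in for the fine-tuned transformer, since the full PyTorch Trainer loop is beyond a short listing, and the texts and labels are toy stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline

# toy stand-ins for the development tweets and binary HS labels
texts = ["tweet numero %d" % i for i in range(100)]
labels = [i % 2 for i in range(100)]

scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(texts):
    # fit on 80% of the development data, evaluate on the held-out 20%
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    preds = model.predict([texts[i] for i in val_idx])
    scores.append(f1_score([labels[i] for i in val_idx], preds, average="macro"))

print(sum(scores) / len(scores))  # mean macro-F1 across the five folds
```

Agreement of the per-fold scores is what establishes that results do not depend on a particular split.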
    </sec>
    <sec id="sec-3">
      <title>-</title>
      <p>2. The presence of multiple user tags in some of the tweets caused a mismatch in Tensor size and consequently a RuntimeError.</p>
      <p>3. These learning rates are found in Nozza et al. (2022), HuggingFace's fine-tuning guide, and in the standard training parameters, respectively.</p>
      <p>4. Such a configuration is selected to mirror the task's original train-test split of 5600-1400, see Celli et al. (2021).</p>
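The two preprocessing steps from Section 3.4 (see also footnote 2) can be sketched as follows; the paper does not give the exact patterns, so the regex for the pseudo-random user identifiers is an assumption:

```python
import re

def preprocess(tweet: str) -> str:
    """Replace pseudo-random user identifiers with '@USER' and
    strip the '#' symbol from hashtags (patterns are assumptions)."""
    # '@' followed by digits, e.g. '@12020569' -> '@USER'
    tweet = re.sub(r"@\d+", "@USER", tweet)
    # drop the hash symbol but keep the hashtag text
    tweet = tweet.replace("#", "")
    return tweet

print(preprocess("@12020569 che vergogna #politica"))
# -> '@USER che vergogna politica'
```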
      <sec id="sec-3-4">
        <title>4.2. Task A</title>
        <p>(Table 2: precision, recall, and F1 per class for the DummyClassifier and LogisticRegression baselines.)</p>
        <p>Our models achieve competitive results in both Task A's sub-tasks5, as shown in Table 3 (Textual) and Table 4 (Contextual) for each class. Starting from the former, both models perform above both baselines. However, there seems to be a substantial difference between UmBERTo's (Run 1) and BERT-ita's (Run 2) performance: while the first reaches an F1 average of 0.89, the second reaches 0.86, a difference between the scores of over .03. Indeed, even if the second run's results are close to the LogisticRegression baseline, the model's predictions (i.e., precision and recall) are more balanced across the two classes. Thus, the F1 for the HS class is higher for BERT-ita compared to the baseline.</p>
        <p>The reason for the discrepancy between the two
models’ performance is likely related to the size of the
pretraining data: UmBERTo was trained on over 70GB
(against the 13GB of BERT-ita). As such, the model likely
has more sub-embeddings and sentence-embeddings
available, which in turn allows for better results.</p>
        <p>For the Contextual sub-task (Table 4), we added a set of extra features (i.e., 'anonymized description', 'retweet count', 'favorite count', 'is reply', 'is retweet', 'is quote', 'statuses count', 'followers count', and 'friends count') to the output labels through GradientBoostingClassifier. Both models once again reach competitive results. While the first run's results are not affected by the inclusion of contextual features, BERT-ita (Run 2) significantly benefits from their addition. The model performs on the same level as UmBERTo, with an F1 of 0.902 for ¬HS and 0.892 for HS. The inclusion of contextual information during the training stage likely enables BERT-ita to capture more diverse linguistic patterns and generalize better to the classification task.</p>
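The ensemble step can be sketched as follows: the transformer's predicted labels are concatenated with the contextual features and passed to GradientBoostingClassifier. The feature values below are synthetic stand-ins, not the task's actual data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 200

# predicted label from the fine-tuned BERT model (one column)
bert_labels = rng.integers(0, 2, size=(n, 1))
# synthetic stand-ins for contextual features such as
# 'retweet count', 'followers count', 'is reply', ...
context = rng.integers(0, 1000, size=(n, 8))

# stack the model's output label with the contextual features
X = np.hstack([bert_labels, context])
# toy gold labels that simply mirror the BERT output
y = bert_labels.ravel()

gbc = GradientBoostingClassifier(random_state=0).fit(X, y)
print(gbc.score(X, y))  # training accuracy
```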
      </sec>
    </sec>
    <sec id="sec-4">
      <title>-</title>
      <p>5. Note that the overall F1 average for each model and sub-task can be found in Table 6 below.</p>
    </sec>
    <sec id="sec-5">
      <title>4.3. Task B</title>
      <p>Though we did not formally submit results to Task B's sub-task XPoliticalHate, we met the requirements of the task by submitting results for the Contextual sub-task of Task A, for which the same test data was used. We will thus report results for both sub-tasks of Task B, referring to Table 4 for the results of Task B's sub-task XPoliticalHate.</p>
      <p>Our model performs competitively in the XPoliticalHate sub-task, which made use of in-domain test data, while our model for the sub-task XReligiousHate performed poorly in the context of out-of-domain test data. We make no further observations regarding the XPoliticalHate sub-task, as we did not take any additional steps for it.</p>
      <p>The models' performance on out-of-domain data (Table 5) is much lower than the average F1 score (0.57) but still higher than the baseline (0.52). Such low scores may relate to the imbalance between the two classes in the test data and to limitations in transfer learning. As noted by Ada et al. (2019), performance on the source task may not reflect performance on the target task. Also, the model may overfit on the data on which it was fine-tuned [30].</p>
      <p>Table 5 (out-of-domain results):
Run 1, ¬HS: Precision 0.849, Recall 0.950
Run 1, HS: Precision 0.330, Recall 0.127
Run 2, ¬HS: Precision 0.848, Recall 0.942
Run 2, HS: Precision 0.306, Recall 0.131</p>
    </sec>
    <sec id="sec-6">
      <title>-</title>
      <p>Overall, our models have consistently outperformed the baselines, demonstrating significant improvements across the board. However, it is worth noting that Table 6 reveals that some runs were below the competition's averages. In particular, Run 2 in Task A (Textual) and Task B (XReligiousHate) failed to meet our expectations, as discussed in detail in Sections 4.2 and 4.3. These underperforming results can be attributed to the previously highlighted factors.</p>
      <p>Table 6. F1 averages for our models and the average for all models submitted to the task:
Task A, Textual: Run 1 0.89759; Run 2 0.86516; Avg 0.88263
Task A, Contextual: Run 1 0.89759; Run 2 0.89687; Avg 0.88616
Task B, XPoliticalHate: Run 1 0.89759; Run 2 0.89687; Avg 0.88866
Task B, XReligiousHate: Run 1 0.54011; Run 2 0.53841; Avg 0.57439</p>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusion</title>
      <p>In this paper, we introduced our approach to detecting Italian-language HS in Twitter posts and replies. We were asked to address the issue in two different tasks with two sub-tasks each. Two models were fine-tuned and cross-validated across five folds: UmBERTo and BERT-ita. Task A comprised a Textual and a Contextual sub-task: here, UmBERTo performed competitively in both sub-tasks, reaching above the baseline and the competition average. However, the model did not benefit from the addition of contextual features. BERT-ita, on the other hand, performed above the baselines but significantly lower than the task average. In contrast to UmBERTo, BERT-ita's results improved significantly with the addition of contextual features, reaching the first model's performance.</p>
      <p>For Task B, we did not submit any results for the XPoliticalHate sub-task. As such, the results obtained for Task A (Contextual) were assumed to be valid for this sub-task, given that the test data was the same. Finally, both our models performed well below the competition average for the out-of-domain task.</p>
      <p>Future work should look at the potential benefits of including additional training data for the out-of-domain task. Also, the addition of contextual features could be tested in combination with different language models.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[1] M. K. Aljero, N. Dimililer, A novel stacked ensemble for hate speech recognition, Applied Sciences 11 (2021) 1–15. doi:10.3390/app112411684.</p>
      <p>[2] F. Del Vigna, A. Cimino, F. Dell'Orletta, M. Petrocchi, M. Tesconi, Hate Me, Hate Me Not: Hate Speech Detection on Facebook, ITASEC (2017) 86–95.</p>
      <p>[3] A. A. Siegel, Online Hate Speech, Cambridge University Press, 2020. doi:10.1017/9781108890960.</p>
      <p>[4] K. P. De Maiti, D. Fišer, N. Ljubešić, Nonstandard linguistic features of Slovene socially unacceptable discourse on Facebook, Znanstvena založba Filozofske fakultete (2020).</p>
      <p>[5] M. Sanguinetti, F. Poletto, C. Bosco, V. Patti, M. Stranisci, An Italian Twitter Corpus of Hate Speech against Immigrants, Language Resources and Evaluation (2018) 1–8.</p>
      <p>[6] M. Sanguinetti, G. Comandini, E. D. Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, I. Russo, HaSpeeDe 2 @ EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020) 93–101. doi:10.4000/books.aaccademia.6897.</p>
      <p>[7] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval), Proceedings of the 13th International Workshop on Semantic Evaluation (2019). doi:10.18653/v1/s19-2010.</p>
      <p>[8] I. Markov, W. Daelemans, Improving Cross-Domain Hate Speech Detection by Reducing the False Positive Rate, Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (2021). doi:10.18653/v1/2021.nlp4if-1.3.</p>
      <p>[9] W. Yin, A. Zubiaga, Towards generalisable hate speech detection: a review on obstacles and solutions, PeerJ Computer Science 7 (2021) e598. doi:10.7717/peerj-cs.598.</p>
      <p>[10] T. Caselli, V. Basile, J. Mitrović, M. Granitzer, HateBERT: Retraining BERT for abusive language detection in English, in: Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics, Online, 2021, pp. 17–25. doi:10.18653/v1/2021.woah-1.3.</p>
      <p>[11] T. Caselli, V. Patti, N. Novielli, P. Rosso, Evalita 2018: Overview on the 6th evaluation campaign of natural language processing and speech tools for italian, EVALITA Evaluation of NLP and Speech Tools for Italian (2018) 3–8. doi:10.4000/books.aaccademia.4437.</p>
      <p>[12] R. Kumar, B. Lahiri, A. Ojha, Aggressive and offensive language identification in hindi, bangla, and english: A comparative study, SN Computer Science 2 (2021). doi:10.1007/s42979-020-00414-6.</p>
      <p>[13] C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, M. Tesconi, Overview of the evalita 2018 hate speech detection task, EVALITA Evaluation of NLP and Speech Tools for Italian (2018) 67–74. doi:10.4000/books.aaccademia.4503.</p>
      <p>[14] A. Arango, J. Pérez, B. Poblete, Cross-lingual hate speech detection based on multilingual domain-specific word embeddings, CoRR abs/2104.14728 (2021). arXiv:2104.14728.</p>
      <p>[15] P. Fortuna, S. Nunes, A Survey on Automatic Detection of Hate Speech in Text, ACM Computing Surveys 51 (2019) 1–30. doi:10.1145/3232676.</p>
      <p>[16] C. Corazza, S. Menini, E. Cabrio, S. T. S. Villata, Cross-Platform Evaluation for Italian Hate Speech Detection, Le Centre pour la Communication Scientifique Directe - HAL - Université de Nantes (2019).</p>
      <p>[17] I. Markov, I. Gevers, W. Daelemans, An Ensemble Approach for Dutch Cross-Domain Hate Speech Detection, Natural Language Processing and Information Systems (2022) 3–15. doi:10.1007/978-3-031-08473-7_1.</p>
      <p>[18] D. Njagi, Z. Zuping, D. Hanyurwimfura, J. Long, A lexicon-based approach for hate speech detection, International Journal of Multimedia and Ubiquitous Engineering 10 (2015) 215–230. doi:10.14257/ijmue.2015.10.4.21.</p>
      <p>[19] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, Proceedings of the Eleventh International Conference on Web and Social Media (2017) 512–521. doi:10.5555/3290605.3300749.</p>
      <p>[20] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[21] F. Celli, M. Lai, A. Duzha, C. Bosco, V. Patti, Policycorpus XL: an italian corpus for the detection of hate speech against politics, in: E. Fersini, M. Passarotti, V. Patti (Eds.), Proceedings of the Eighth Italian Conference on Computational Linguistics, CLiC-it 2021, Milan, Italy, January 26-28, 2022, volume 3033 of CEUR Workshop Proceedings, CEUR-WS.org, 2021. URL: https://ceur-ws.org/Vol-3033/paper38.pdf.</p>
      <p>[22] A. Ramponi, B. Testa, S. Tonelli, E. Jezek, Addressing religious hate online: from taxonomy creation to automated detection, PeerJ Comput. Sci. 8 (2022) e1128. doi:10.7717/peerj-cs.1128.</p>
      <p>[23] Twitter, Documentation, Twitter Developer Documentation, 2023. URL: https://developer.twitter.com/en/docs. Accessed: 13th June 2023.</p>
      <p>[24] S. J. Pan, Q. Yang, A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering 22 (2010) 1345–1359. doi:10.1109/tkde.2009.191.</p>
      <p>[25] L. Eric, R. Saini, G. Kovács, K. Murphy, TheNorth @ HaSpeeDe 2: BERT-based Language Model Fine-tuning for Italian Hate Speech Detection, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020) 142–147. doi:10.4000/books.aaccademia.6989.</p>
      <p>[26] F. Tamburini, How "BERTology" Changed the State-of-the-Art also for Italian NLP, Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020 (2020) 415–421. doi:10.4000/books.aaccademia.8920.</p>
      <p>[27] D. Nozza, F. Bianchi, G. Attanasio, HATE-ITA: Hate Speech Detection in Italian Social Media Text, Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH) (2022). doi:10.18653/v1/2022.woah-1.24.</p>
      <p>[28] F. He, T. Liu, D. Tao, Control batch size and learning rate to generalize well: Theoretical and empirical evidence, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 32, Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper/2019/file/dc6a70712a252123c40d2adba6a11d84-Paper.pdf.</p>
      <p>[29] S. E. Ada, E. Ugur, H. L. Akin, Generalization in transfer learning, CoRR abs/1909.01331 (2019). arXiv:1909.01331.</p>
      <p>[30] L. Shao, F. Zhu, X. Li, Transfer learning for visual categorization: A survey, IEEE Transactions on Neural Networks and Learning Systems 26 (2015) 1019–1034. doi:10.1109/TNNLS.2014.2330900.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>