TheNorth @ HaSpeeDe 2: BERT-based Language Model Fine-tuning for Italian Hate Speech Detection

Eric Lavergne, Rajkumar Saini, György Kovács and Killian Murphy
Luleå Tekniska Universitet
eric.lavergne@gmx.fr, rajkumar.saini@ltu.se, gyorgy.kovacs@ltu.se, killian.murphy@telecom-sudparis.eu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. This report describes the systems submitted by the team "TheNorth" for the HaSpeeDe 2 shared task organised within EVALITA 2020. To address the main task, hate speech detection, we fine-tuned BERT-based models. We evaluated both multilingual and Italian language models trained with the data provided and with additional data. We also studied the contributions of multitask learning considering both the hate speech detection and stereotype detection tasks.

1 Introduction

Organised as part of the 7th EVALITA evaluation campaign (Basile et al., 2020), the HaSpeeDe 2 shared task focuses on the detection of online hate speech (Sanguinetti et al., 2020) in Italian. Hate speech occurs frequently on social media. It is defined as "any communication that disparages a person or a group on the basis of some characteristics such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics" (Nockleby, 2000). Regulating all user messages is very time-consuming for a human, which is one of the reasons why automatic methods are important.

Besides the main task of binary hate speech classification - aimed at deciding whether a message contains hate speech or not - the HaSpeeDe 2 shared task has two more sub-tasks: stereotype detection and the identification of nominal utterances. All tasks are evaluated both on in-domain data (tweets) and on out-of-domain data (newspaper headlines). Here, we tackle both the main task and the first sub-task of Stereotype Detection, which is potentially useful for the main task. For this sub-task the organisers use the following definition of Stereotype: "a standardized mental picture that is held in common by members of a group and that represents an oversimplified opinion, prejudiced attitude, or uncritical judgment" (Merriam-Webster, 2020).

We thus have two binary classification tasks. A simple way to perform text classification is based on a bag-of-words representation counting the number of occurrences of each word within the text. It is often combined with the term frequency-inverse document frequency (TF-IDF) representation (Sparck Jones, 1988), which normalizes the frequencies according to how often the words appear across all documents. With the rise of neural networks, word vectors have provided useful features for text classification tasks. Recurrent Neural Networks such as the Bidirectional Long Short-Term Memory (BiLSTM) network (Schuster and Paliwal, 1997) have then been used to encode the long-term dependencies between words. Such systems were the most successful in the previous HaSpeeDe campaign (Bosco et al., 2018).

In (Aluru et al., 2020), the authors showed that when dealing with very low monolingual resources, multilingual approaches can be interesting for hate speech. In (Polignano et al., 2019b), the AlBERTo monolingual Italian BERT-based language model was trained, and it outperformed the state-of-the-art on the HaSpeeDe 2018 evaluation (Polignano et al., 2019a).

We have chosen to deepen the approach of fine-tuning a BERT-based language model, comparing multilingual and monolingual settings. We also assessed the contribution of additional hate speech data from different online sources. Finally, we submitted the results of the same model fine-tuned with and without multitask learning between the hate speech and stereotype detection tasks.
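As a concrete point of reference for the systems described below, a TF-IDF bag-of-words classifier of the kind mentioned above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not our exact baseline code; `texts` and `labels` are placeholder names for the training tweets and their binary labels.

```python
# Minimal TF-IDF bag-of-words baseline sketch (illustrative, not the exact
# baseline code). `texts` and `labels` are placeholders for the training
# tweets and their binary hate speech annotations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(),                    # word counts re-weighted by inverse document frequency
    LogisticRegression(max_iter=1000),    # simple linear classifier on top
)

# Macro F1 over 5 folds, mirroring the cross-validation protocol used later.
scores = cross_val_score(baseline, texts, labels, cv=5, scoring="f1_macro")
print(scores.mean())
```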
2 System Description

2.1 Fine-tuning process

The chosen classification approach is to fine-tune a BERT-based language model. This kind of approach is the state-of-the-art for many text classification tasks today (Sun et al., 2019; Seganti et al., 2019). BERT is a language model which aims to learn the distribution of language (Devlin et al., 2018). It is trained by predicting masked tokens in a text. The next sentence prediction task that was originally used alongside it has been removed in some later BERT-based models such as RoBERTa (Liu et al., 2019). BERT is a Transformer; in a Transformer, the recurrence of Recurrent Neural Networks is replaced by the attention mechanism (Vaswani et al., 2017).

It has been shown that these models can be fine-tuned for many downstream natural language processing tasks, including the one we are interested in, text classification. This is achieved by removing the language modelling head and replacing it with a head appropriate for the target task. The designers of BERT prepared for this by adding a token at the beginning of each text sequence, named CLS for classification. The purpose of this token is to accumulate, by the end of the forward pass, the information useful for the classification task. A classifier head can then take this CLS token as input to classify the whole text sequence. In our case we decided to add a simple linear layer with a softmax on top of it, for simplicity and because it is efficient enough given that the other layers are fine-tuned.

2.2 Layer-wise learning rate

An important consideration in fine-tuning, described in (Sun et al., 2019), is the choice of the learning rate. Besides being, as usual, the most important hyper-parameter of the gradient descent learning algorithm, it could also be responsible here for catastrophic forgetting if it were too high. Catastrophic forgetting refers to erasing the information contained in the weights of the pretrained model, and it can happen when the gradient updates are too large.

Moreover, the learning rate can be gradually decreased in the first layers of the model. This aims at limiting the updates in these first layers, which have been shown to contain the most primal information about the language. One can think of the classical example of computer vision neural networks, where basic shape features are extracted by the first layers and task-specific combinations are processed in the last ones. We therefore applied a layer-wise learning rate following a geometric progression: the learning rate of a layer is that of the following layer multiplied by a decay factor γ between 0 and 1,

LR_{k-1} = γ × LR_k

where LR_k is the learning rate of the k-th layer. The case γ = 1 is classic fine-tuning with the same learning rate everywhere, and the case γ = 0 is feature extraction, with all the language model weights frozen and only the parameters of the classification head trainable. The hyper-parameter γ was tuned together with the others during the hyper-parameter tuning process.
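To make the geometric schedule concrete, the sketch below builds per-layer parameter groups for a BERT-style encoder with the Hugging Face transformers library. The checkpoint name is an example (UmBERTo on the Hugging Face hub), the grouping assumes the standard BERT/RoBERTa module layout in transformers, and AdamW is a common default rather than a detail reported here.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Illustrative values; the tuned values used for our submissions are in Table 1.
base_lr, gamma = 2e-4, 0.35

# Any BERT-style checkpoint with a classification head over the CLS token works here.
model = AutoModelForSequenceClassification.from_pretrained(
    "Musixmatch/umberto-commoncrawl-cased-v1", num_labels=2
)

# Embeddings count as the lowest "layer", then the Transformer blocks in order.
encoder = model.base_model
blocks = [encoder.embeddings] + list(encoder.encoder.layer)

# LR_{k-1} = gamma * LR_k: the top block trains at base_lr and each block
# below it is scaled down by gamma; gamma = 0 would freeze the whole encoder.
param_groups = [
    {"params": block.parameters(), "lr": base_lr * gamma ** (len(blocks) - 1 - k)}
    for k, block in enumerate(blocks)
]
# The freshly initialised classification head gets the full learning rate.
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})

optimizer = torch.optim.AdamW(param_groups)
```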
2.3 Monolingual and multilingual language models

We compared the use of several language models. Many models similar to BERT have been trained since 2018, and many are available for use. Although such models are often first and foremost trained for English, multilingual models have been trained on data from several languages in order to counteract the lack of data for some of them; this is the case for mBERT and XLM-RoBERTa (Conneau et al., 2020). Machine learning researchers have also trained monolingual models for their own languages, such as CamemBERT for French and AlBERTo or UmBERTo for Italian.

Multilingual models have the advantage of being trainable on data in different languages, which is very useful for low-resource tasks. However, they are expected to perform in dozens of languages while monolingual models focus on just one, with the same number of parameters. For this reason, monolingual models often perform better when sufficient data is available, as we show here.

We evaluated two multilingual models, mBERT and XLM-RoBERTa, and three Italian monolingual models, AlBERTo, UmBERTo, and PoliBERT. AlBERTo was pretrained on TWITA, a collection of Italian tweets (Polignano et al., 2019b). UmBERTo was pretrained on Commoncrawl ITA, exploiting the large OSCAR Italian corpus (Parisi et al., 2020). Finally, PoliBERT was fine-tuned for sentiment analysis on Italian tweets by its creators (Barone, 2020).

We also tried to use more data, with different settings. For the multilingual models, we could use hate speech data of any language. For the monolingual models, we used the little data available for Italian, but we also tried translated multilingual data. These additions were not conclusive, so we stuck to the HaSpeeDe 2 data for the submissions.

2.4 Random search hyper-parameter tuning

The tuning of the hyper-parameters is relevant in order to get good results; this is especially the case for the learning rate and the layer-wise decay factor γ. We tuned the hyper-parameters with random search, which has been shown to often be more efficient than grid search (Bergstra and Bengio, 2012). The hyper-parameters to be tuned are the batch size, the learning rate, the layer-wise multiplier, and the length of the model (maximum number of tokens). We ran ten trials for each language model. The number of epochs is selected with early stopping on the validation macro F1-score, with an 80/20 split. Table 1 shows the best hyper-parameters obtained, which were used for the submitted systems.

Hyper-parameter    Value
Learning rate      2×10⁻⁴
Layer-wise γ       0.35
Batch size         32
Max length         100
Language model     UmBERTo

Table 1: Hyper-parameters used for our HaSpeeDe 2 submissions after the tuning process.

It is very important that the learning rate and the layer-wise multiplier γ are tuned simultaneously, because the choice of the multiplier strongly modifies the amplitude of the gradient updates.
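The random search itself can be as simple as sampling each hyper-parameter independently for each trial. The ranges below are illustrative assumptions (the report does not specify the exact search space), and `train_and_validate` is a hypothetical helper standing for one training run that returns the validation macro F1-score.

```python
import random

# One random-search trial: sample each hyper-parameter independently.
# The sampling ranges are illustrative assumptions, not the exact search space.
def sample_trial():
    return {
        "learning_rate": 10 ** random.uniform(-5, -3),  # log-uniform learning rate
        "gamma": random.uniform(0.0, 1.0),              # layer-wise decay factor
        "batch_size": random.choice([16, 32, 64]),
        "max_length": random.choice([50, 100, 128]),
    }

# Ten trials per language model, keeping the configuration with the best
# validation macro F1 (train_and_validate is a hypothetical helper).
best = max(
    (sample_trial() for _ in range(10)),
    key=lambda hp: train_and_validate(hp),
)
```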
2.5 Multitask Learning

We evaluated the use of multitask learning between the two classification tasks of the competition, hate speech detection and stereotype detection. Multitask learning consists of learning to perform several tasks. It can be done by learning the tasks simultaneously with common first layers but task-specific heads (Ruder, 2017); in our case each task has its own output linear layer. When the tasks can be based on similar representations, this is expected to provide good regularization through useful shared representations; it is then a kind of transfer learning. The error analysis conducted on the HaSpeeDe 2018 evaluation suggests a significant correlation between the usage of stereotype and hate speech (Francesconi et al., 2019). Moreover, the authors showed that the false positive rate on hate speech tweets is slightly higher for tweets with stereotype.

A question that arises when doing multitasking is how to combine the losses of the tasks into one. The simple solution is to sum them uniformly. This might not be the best solution when there is imbalance between the tasks, for instance when the scale of the outputs of one task is much higher than that of the others. A solution proposed by (Kendall et al., 2017) is to use trainable weights based on uncertainty. (Liebel and Körner, 2018) improves the regularisation term of this solution, and (Gong et al., 2019) shows in a benchmark that this last solution is often the best. We evaluated this solution and compared it with the single-task setting.
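As a concrete illustration of the uncertainty-based weighting, the module below is a sketch of the published formulation, following (Kendall et al., 2017) with the log(1 + σ²) regularisation term of (Liebel and Körner, 2018); it is not necessarily our exact training code, and the loss names in the usage comment are placeholders.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Trainable uncertainty-based loss weighting: a sketch following
    Kendall et al. (2017), with the log(1 + sigma^2) regularisation term
    of Liebel and Körner (2018)."""

    def __init__(self, num_tasks: int = 2):
        super().__init__()
        # One trainable log-variance s_i = log(sigma_i^2) per task, initialised
        # to 0 (sigma_i = 1), i.e. a uniform sum of the losses at the start.
        self.log_var = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        precision = torch.exp(-self.log_var)            # 1 / sigma_i^2
        penalty = torch.log1p(torch.exp(self.log_var))  # log(1 + sigma_i^2)
        return (precision * task_losses + penalty).sum()

combine = UncertaintyWeightedLoss(num_tasks=2)
# hs_loss and stereotype_loss would be the cross-entropy losses of the two
# task-specific heads (placeholder names):
# total_loss = combine(torch.stack([hs_loss, stereotype_loss]))
```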
2.6 Cross-validation ensembling and submitted models

Two submissions were allowed during the HaSpeeDe 2 test phase. We chose to submit a fine-tuned UmBERTo trained separately for each of the two tasks, and a fine-tuned UmBERTo with multitasking on both Stereotype and Hate Speech detection. The hyper-parameters used to train these models were presented in Table 1. Since we compared the different language models with 5-fold cross-validation, we then ensembled the 5 models obtained for each fold in order to get the final model. The ensembling was done by taking the mean of the probabilities returned by each model.

3 Data Description

The organisers provided a training dataset of 6,839 tweets, annotated with Hate Speech and Stereotype labels (as described in Table 2). The test data of HaSpeeDe 2 consists of two subsets: an in-domain set (1,263 tweets) and an out-of-domain set (500 newspaper headlines).

Dataset                      HS      Ster
Development data (tweets)    0.404   0.445
Test data (tweets)           0.492   0.450
Test data (news)             0.362   0.350

Table 2: Distribution of Hate Speech (HS) and Stereotype (Ster) labels in the HaSpeeDe 2 data.

The hate speech labels are slightly unbalanced towards non-hate speech. We therefore tried adapted losses to counteract a tendency towards non-hate speech predictions. We used a class-weighted loss, which assigns a higher weight to the observations from the minority class when computing the loss. We also tried a smoothed F1-score, a differentiable loss in phase with the F1. Neither approach improved the results in a significant way.

The pre-processing was simple. We removed emoticons and hashtags, and we replaced URLs and user names with the associated tags used in the evaluation data. Each tweet was padded to a length of 100 tokens. We then used the pre-processing and tokenization pipeline specific to each language model, as provided by the authors of the models.
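For illustration, this pre-processing can be written as a few regular expressions. The "URL" and "@user" placeholder tags and the emoji pattern below are assumptions made for this sketch, chosen to mimic the tags found in the evaluation data.

```python
import re

# Illustrative version of the pre-processing step; the placeholder tags
# "URL" and "@user" and the emoji range are assumptions for this sketch.
def preprocess(tweet: str) -> str:
    tweet = re.sub(r"https?://\S+", "URL", tweet)          # replace links
    tweet = re.sub(r"@\w+", "@user", tweet)                # replace user mentions
    tweet = re.sub(r"#\w+", "", tweet)                     # remove hashtags
    tweet = re.sub(r"[\U0001F300-\U0001FAFF]", "", tweet)  # remove (most) emoji
    return tweet.strip()

print(preprocess("@mario Guarda qui https://t.co/abc #politica"))
# -> "@user Guarda qui URL"
```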
4 Results

4.1 Macro F1-score

The metric used for the evaluation is the macro F1-score. The F1-score of a class is the harmonic mean of the precision and recall for that class. The macro F1-score is the mean of the F1-scores over all classes; it is less sensitive to class imbalance.

4.2 Baselines

We used several baselines to evaluate our results during the development process. The first are dummy classifiers: one that always predicts the most frequent class, and one that makes a random stratified prediction according to the distribution of the classes in the training data. We also computed the results of more developed systems, namely a TF-IDF bag-of-words model and a BiLSTM with trainable word vector inputs.

The HaSpeeDe 2 organisers provided two baseline systems after the results were submitted. The first is a most frequent class predictor; the second is a linear SVM with unigrams, char-grams, and TF-IDF representation.

4.3 Validation Results

We tuned the hyper-parameters for each evaluated language model as described in Section 2.4. For each language model, we then computed 5-fold cross-validation results on the HaSpeeDe 2 training data. The averages of the 5 macro F1-scores are shown in Table 3.

System                          HS      Ster
Baselines
  Most frequent class           0.374   0.353
  TF-IDF bag-of-words           0.703   0.677
  Word vectors + BiLSTM         0.721   0.654
Multilingual language models
  mBERT                         0.757   0.716
  XLM-RoBERTa                   0.761   0.677
Italian language models
  AlBERTo                       0.773   0.716
  PoliBERT                      0.795   0.733
  UmBERTo                       0.799   0.733

Table 3: Macro F1-scores averaged over 5-fold cross-validation on the HaSpeeDe 2 training data.

4.4 Test Results

The scores of our two systems evaluated on the HaSpeeDe 2 test data are summarized in Table 4. These systems are 5 UmBERTo models trained on each of the 5 training folds and ensembled. The second system is the same as the first, with the addition of multitask learning.

System                       Tweets   News
Hate Speech Detection
  Most frequent class        0.337    0.389
  Classic features + SVM     0.721    0.621
  UmBERTo                    0.790    0.671
  UmBERTo + multitasking     0.809    0.660
  Best HaSpeeDe 2            0.809    0.774
Stereotype Detection
  Most frequent class        0.355    0.394
  Classic features + SVM     0.715    0.669
  UmBERTo                    0.772    0.685
  UmBERTo + multitasking     0.768    0.647
  Best HaSpeeDe 2            0.772    0.720

Table 4: Macro F1-scores on the HaSpeeDe 2 test datasets.

5 Discussion

5.1 Multilingual and monolingual models

According to Table 3, multilingual models performed worse than monolingual models when trained on HaSpeeDe 2 data alone, although they achieved respectable results. Moreover, even when we used additional data from other languages to train the multilingual models, they still did not manage to outperform the monolingual models, as we were hoping they would.

Among the Italian models, UmBERTo and PoliBERT performed better than AlBERTo on these tasks. While the good performance of PoliBERT can be linked to its pre-training on a tweet classification task (sentiment analysis) potentially useful for hate speech detection, it is more difficult to explain the competitiveness of UmBERTo, which was trained on data not coming from Twitter and less plentiful than that used for AlBERTo. One explanation could be the better quality of this data, or a better optimisation by its creators.

5.2 Out-of-domain and in-domain data

Our results on the HaSpeeDe 2 test dataset are summarized in Table 4. The results obtained on in-domain data correspond to what we expected from our cross-validation results. Our systems achieved the best macro F1-scores on the in-domain test set (Tweets) for both hate speech and stereotype detection. However, the results on out-of-domain data (News) are far from being as good. This can be explained by the different distribution of this data compared to the training data.

Table 5 shows the confusion matrix of our first system evaluated on out-of-domain data. The error is mostly due to the high number of false negatives: the classifier predicts too many sequences as non-hate speech. This suggests that this classifier, trained on hate speech from Twitter, struggles to detect hate speech in newspaper headlines. It can be assumed that hate speech in newspapers is more subtle, with less of the coarseness and aggressiveness that make it easier to detect on Twitter.

              Predicted False   Predicted True
Actual False  312               7
Actual True   117               64

Table 5: Hate speech confusion matrix for UmBERTo evaluated on the news test data.

5.3 Multitasking Benefits

We chose to submit one system with multitask learning on both Stereotype and Hate Speech detection and one without, in order to study its benefits. Indeed, the system with multitask learning performed much better on the in-domain data for the hate speech detection task. This is not the case, however, for the out-of-domain data, nor for the stereotype detection task.

Table 6 describes in more detail the differences between the predictions of the two systems for data containing stereotypes and data not containing stereotypes. We observed that the improvement linked to multitask learning consists mainly in a reduction of the number of false positives in favour of the number of true negatives on data not labeled as Stereotype. Assuming that hate speech makes significant use of stereotype, one could suppose that the multitask model has learned to discard some data that do not have the characteristics of stereotypes and are therefore unlikely to contain hate speech.

Data labeled as Stereotype
              Predicted False   Predicted True
Actual False  +3                -3
Actual True   +7                -7

Data not labeled as Stereotype
              Predicted False   Predicted True
Actual False  +28               -28
Actual True   +1                -1

Table 6: Hate speech confusion matrix of the multitask system minus that of the single-task system, for Stereotype and non-Stereotype tweet test data.

6 Conclusion

In this work, we compared the fine-tuning of multilingual and monolingual BERT-based language models for hate speech detection. We also investigated the addition of multitask learning with the Stereotype detection task, which is linked to hate speech. We obtained the best macro F1-scores of HaSpeeDe 2 on the in-domain test data. However, the results were worse on the out-of-domain test data, and further research could be conducted to better understand the reasons for this and to address it.

References

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2020. Deep Learning Models for Multilingual Hate Speech Detection.

Gianfranco Barone. 2020. Politic BERT based Sentiment Analysis. https://huggingface.co/unideeplearning/polibert_sa. Accessed on Sept 18, 2020.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.
James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization. The Journal of Machine Learning Research, 13:281–305.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In EVALITA@CLiC-it.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Chiara Francesconi, Cristina Bosco, Fabio Poletto, and Manuela Sanguinetti. 2019. Error Analysis in a Hate Speech Detection Task: The Case of HaSpeeDe-TW at EVALITA 2018. In CLiC-it.

Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz Elibol. 2019. A comparison of loss weighting strategies for multi-task learning in deep neural networks. IEEE Access.

Alex Kendall, Yarin Gal, and Roberto Cipolla. 2017. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CoRR, abs/1705.07115.

Lukas Liebel and Marco Körner. 2018. Auxiliary Tasks in Multi-task Learning. CoRR, abs/1805.06334.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.

Merriam-Webster. 2020. stereotype, noun. https://www.merriam-webster.com/dictionary/stereotype. Accessed on 2020-11-05.

John T. Nockleby. 2000. Hate Speech. Macmillan, New York.

Loreto Parisi, Simone Francia, and Paolo Magnani. 2020. UmBERTo: an Italian Language Model trained with whole word Masking. https://github.com/musixmatchresearch/umberto. Accessed on Sept 18, 2020.

Marco Polignano, Pierpaolo Basile, Marco De Gemmis, and Giovanni Semeraro. 2019a. Hate Speech Detection through AlBERTo Italian Language Understanding Model. In NL4AI@AI*IA.

Marco Polignano, Pierpaolo Basile, Marco De Gemmis, Giovanni Semeraro, and Valerio Basile. 2019b. AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481. CEUR.

Sebastian Ruder. 2017. An Overview of Multi-Task Learning in Deep Neural Networks. CoRR, abs/1706.05098.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. Overview of the EVALITA 2020 Second Hate Speech Detection Task (HaSpeeDe 2). In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Alessandro Seganti, Helena Sobol, Iryna Orlova, Hannam Kim, Jakub Staniszewski, Tymoteusz Krumholc, and Krystian Koziel. 2019. NLPR@SRPOL at SemEval-2019 Task 6 and Task 5: Linguistically enhanced deep learning offensive sentence classifier. In SemEval@NAACL-HLT.

Karen Sparck Jones. 1988. A Statistical Interpretation of Term Specificity and Its Application in Retrieval, pages 132–142. Taylor Graham Publishing, GBR.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to Fine-Tune BERT for Text Classification? In Maosong Sun, Xuanjing Huang, Heng Ji, Zhiyuan Liu, and Yang Liu, editors, Chinese Computational Linguistics, pages 194–206, Cham. Springer International Publishing.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.