Fontana-Unipi @ HaSpeeDe2: Ensemble of Transformers for the Hate Speech Task at Evalita

Michele Fontana
Dipartimento di Informatica, Università di Pisa
m.fontana12@studenti.unipi.it

Giuseppe Attardi
Dipartimento di Informatica, Università di Pisa
attardi@di.unipi.it


Abstract

We describe our approach and experiments to tackle Task A of the second edition of HaSpeeDe, within the Evalita 2020 evaluation campaign. The proposed model consists of an ensemble of classifiers built from three variants of a common neural architecture. Each classifier uses contextual representations from transformers trained on Italian texts, fine-tuned on the training set of the challenge. We tested the proposed model on the two official test sets, the in-domain test set containing just tweets and the out-of-domain one also including news headlines. Our submissions ranked 4th on the tweets test set and 17th on the second test set.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The spread of hateful messages on social media has become a serious issue, so techniques for hate speech detection have become increasingly relevant. The goal of the Hate Speech Detection task (Sanguinetti et al., 2020) at Evalita 2020 (Basile et al., 2020) is to improve the automatic detection of hate messages in Italian tweets. The organizers provided participants with the HaSpeeDe2 dataset, which consists of 6,837 Italian tweets containing, besides the raw text, hashtags and emojis. Task A can be cast as a binary classification task: the model has to predict whether a given message contains hate speech or not.

Approaches based on transformer models have become quite popular recently and have proved effective in reaching state-of-the-art scores on major NLP tasks such as those of the GLUE benchmark (Wang et al., 2018). With our experiments we try to assess the effectiveness of transformers trained on Italian documents in a task involving Italian texts from different sources. We experiment with both a transformer model trained specifically on Italian tweets and one trained on generic web documents.

We combine several instances of classifiers based on these transformers, in order to address the problem of overfitting due to the small size of the training set.

For this edition of the Evalita HaSpeeDe task, the organizers released two test sets, an in-domain one consisting of tweets and an out-of-domain one that also contains news headlines.

The ensemble model of our official submission achieved a competitive score of 78.03 Macro-F1 on the in-domain test set but did not perform as well on the second test set.

We make the source code for our experiments available as Open Source at https://github.com/mikelefonty/Haspeede2.

2 Related Work

The first edition of HaSpeeDe was held in 2018. The results produced during this contest were the starting point of our research. As described in (Bosco et al., 2018), most of the systems were based on neural networks and used word embeddings, such as FastText (Grave et al., 2018) or word2vec (Polignano and Basile, 2018), in the first layer of their architecture. The embedding layer was usually followed by a Recurrent Network or a Convolutional Neural Network to get an internal representation of the input text. This hidden representation was provided as input to a series of dense layers to obtain the final classification result.

Over the last couple of years, the trend in approaches to language analysis has changed considerably, as can be seen by examining the models used in competitions like SemEval 2020 OffensEval 2 (Zampieri et al., 2020). In these new models, to get a better text representation, the embedding layer is often replaced by a Transformer (Vaswani et al., 2017) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), or Multilingual BERT (Devlin et al., 2019).

We followed this trend, but we also focused our attention on the problem raised by the small size of the dataset. As Risch and Krestel (2020) mention, transformer models tend to have a high variance with respect to the input dataset, which often leads to overfitting. The authors therefore suggest building an ensemble of classifiers to reduce the variance and consequently improve the generalization capabilities of the trained model.

In the following, we describe a similar approach based on the Bagging technique (Breiman, 1996), where we apply three different transformer-based classifiers to populate the ensemble and to get the final prediction.

3 System Architecture

During the design phase of our classifier, we looked for a transformer trained directly on a significantly large collection of Italian texts, and particularly on Italian tweets, in order to compensate for the small size of the training data. We found two possible models based on BERT: AlBERTo (Polignano et al., 2019)¹ and DBMDZ². The former is trained on TWITA (Basile et al., 2018), a 191 GB collection of Italian tweets gathered by the authors, and tested on the SENTIPOLC task during the EVALITA 2016 campaign, where it achieved state-of-the-art accuracy in subjectivity, polarity, and irony detection on Italian tweets. We considered this model suitable for hate speech detection, since it was trained on Italian tweets and the SENTIPOLC task is a classification task similar to ours. DBMDZ instead is trained on a more general domain, from a 13 GB dataset, which includes a dump of Italian Wikipedia and texts from web pages selected from the Opus Corpora.³ We decided to test both transformer models, assessing their performance through a validation phase on a development set.

   ¹ https://github.com/marcopoli/AlBERTo-it
   ² https://huggingface.co/dbmdz/bert-base-italian-uncase
   ³ http://opus.nlpl.eu/

These transformers were used in the input stage of all our architectures, providing contextual embeddings for sentences that were fine-tuned during training. We designed three architecture variants, which were employed as the basic building blocks to construct the ensembles:

   • ALB-SINGLE: It consists of a first layer provided by the AlBERTo transformer, followed by a single neuron with a sigmoid activation function.

   • DB-SINGLE: It follows the same structure as ALB-SINGLE; it just replaces AlBERTo with DBMDZ in the first layer.

   • DB-MLP: Compared to DB-SINGLE, it adds a new dense layer, using a ReLU activation function, between the transformer and the output neuron.

The final model is an ensemble consisting of a number of instances of each of the above architectures. For each architecture, e.g. ALB-SINGLE, we construct instances in the following way. After initializing the weights randomly within a given interval and generating the training data by applying the bootstrap technique to the original dataset, we start training the model. When that phase is over, we insert the resulting model in the ensemble. We repeat this process several times with different random weight initializations. Note that, due to the random initialization, no two classifiers in the ensemble are identical to each other. More formally, the model consists of N elements,

      N = N_AL + N_DB + N_MLP

where N_AL, N_DB, N_MLP represent, respectively, the number of instances of ALB-SINGLE, DB-SINGLE and DB-MLP classifiers.

In retrospect, it might have been worthwhile to vary the instances of each architecture more thoroughly than just in their initial weights, for example by changing the hyper-parameters or the number of layers.

Our classification algorithm is a slight generalization of the most classical one, which collects results from each member of the ensemble and outputs the class which gets the majority of predictions over all iterations. The process, described by Algorithm 1, performs n_run iterations. During the i-th iteration, the algorithm starts by sampling randomly from the ensemble a given number of instances for each type of classifier (lines 3-5) and initializes to 0 the variable class1, which contains the total number of votes that the hate class receives during the iteration.
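The instance-construction procedure described in this section (a fresh random initialization plus a bootstrap sample of the training data for every instance) can be sketched as follows. This is only a minimal illustration under stated assumptions: `train_fn` and `dummy_train` are hypothetical stand-ins for the actual fine-tuning routine, the integer `init_seed` merely represents the random weight initialization, and the toy dataset is invented. The value of 30 instances per architecture is the one used in the experiments.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (tweet text, label); label 1 = hateful


def bootstrap_sample(data: Sequence[Example], rng: random.Random) -> List[Example]:
    """Draw len(data) examples uniformly at random, with replacement."""
    return [data[rng.randrange(len(data))] for _ in range(len(data))]


def build_ensemble(
    data: Sequence[Example],
    train_fn: Callable[[str, List[Example], int], object],
    instances_per_arch: Dict[str, int],
    seed: int = 0,
) -> Dict[str, list]:
    """Train the requested number of instances of every architecture."""
    rng = random.Random(seed)
    ensemble: Dict[str, list] = {}
    for arch, count in instances_per_arch.items():
        ensemble[arch] = []
        for _ in range(count):
            sample = bootstrap_sample(data, rng)  # per-instance training data
            init_seed = rng.randrange(2**32)      # stands in for random weight init
            ensemble[arch].append(train_fn(arch, sample, init_seed))
    return ensemble


def dummy_train(arch: str, sample: List[Example], init_seed: int) -> dict:
    """Hypothetical placeholder for fine-tuning: records only its inputs."""
    return {"arch": arch, "n_train": len(sample), "init_seed": init_seed}


toy_data = [(f"tweet {i}", i % 2) for i in range(10)]  # invented examples
sizes = {"ALB-SINGLE": 30, "DB-SINGLE": 30, "DB-MLP": 30}
ensemble = build_ensemble(toy_data, dummy_train, sizes)

# N = N_AL + N_DB + N_MLP
total_instances = sum(len(models) for models in ensemble.values())
```

Because each bootstrap sample has the same size as the original dataset but is drawn with replacement, every instance sees a slightly different training set, which is the source of diversity that bagging relies on.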
After resetting class1 to 0 (line 7), the algorithm collects the predictions of the selected models on the tweet t (lines 8-10). cl(t) ∈ {0, 1} represents the prediction of classifier cl for the tweet t; in particular, cl(t) = 1 if and only if cl classifies t as hateful. The output of iteration i is the most predicted class (line 11). The final result of the algorithm is then the class c_final ∈ {0, 1} which obtained the most votes over all the n_run iterations (lines 13-14). If c_final = 1, the tweet t has been classified as hateful.

Algorithm 1 Classification Algorithm
Input: t: the tweet to classify.
Input: (n_AL, n_DB, n_MLP): number of classifiers of each type to be sampled.
Input: (N_AL, N_DB, N_MLP): number of classifiers of each type in the ensemble.
Input: n_run: number of desired iterations.
Output: c_final: predicted class

 1: preds = []
 2: for run = 1 to n_run do
 3:    albs = sample_al(n_AL, N_AL)
 4:    dbs = sample_db(n_DB, N_DB)
 5:    mlps = sample_ml(n_MLP, N_MLP)
 6:    sampled_classif = albs ∪ dbs ∪ mlps
 7:    class1 = 0 // votes for class 1
 8:    for cl in sampled_classif do
 9:       class1 += cl(t) // cl's classification
10:    end for
11:    preds[run] = (class1 ≥ ⌈(n_AL + n_DB + n_MLP) / 2⌉)
12: end for
13: c_final = (Σ_{i=1..n_run} preds[i] ≥ ⌈n_run / 2⌉)
14: return c_final

A simpler variant of the algorithm would be to just add up the votes for each class from all classifiers in all iterations and return the class with the highest count. We plan to compare these two approaches in future work.

4 Experiments

In this section we describe the experiments we performed to tune the hyper-parameters of our model. We will focus on the search for the best values of n_DB, n_AL, n_MLP, that is, how many instances to select at each iteration of the classification algorithm.

Before starting the experiments, we divided the dataset into two disjoint subsets, a development set and an internal test set, in the proportion of 80% and 20%, respectively. The split was done by means of Stratified Sampling, according to the distribution of the target variable hs. We applied the Stratified 3-fold-CV technique to validate our model. Given that we are solving a binary classification problem, we picked the Binary Cross Entropy as our loss. We chose AdamW as our optimizer; we set the first 10% of the total steps as warmup steps. We conducted the experiments on a GPU offered by Google Colab⁴. Our models are implemented in PyTorch (Paszke et al., 2019). To extract as much information as possible from input texts, we preprocessed them through hashtag segmentation by means of Tweet Preprocessor.⁵ We also converted emojis into their Italian description by using the emoji⁶ and Google Translate⁷ libraries.

   ⁴ https://colab.research.google.com/
   ⁵ https://pypi.org/project/tweet-preprocessor/
   ⁶ https://pypi.org/project/emoji
   ⁷ https://pypi.org/project/googletrans/

We analyzed the behaviour of the three baseline architectures we planned to include in the ensemble.

We trained each model for a maximum of 4 epochs, using a batch size of 16 and setting the maximum text length to 100. A grid search revealed that the optimal learning rate is 5 · 10^-5 for DB-MLP and 6 · 10^-5 for the remaining models. The optimal number of neurons in the hidden layer of DB-MLP is 50.

   Classifier      Macro-F1    Std
   ALB-SINGLE      76.896      0.7266
   DB-SINGLE       77.613      0.3251
   DB-MLP          78.562      0.521

Table 1: Results of the experiments comparing the baseline architectures. We report the mean and the standard deviation of the F1 score computed with respect to the 3 validation folds.

Table 1 highlights the following aspect: DB-SINGLE achieves better performance than ALB-SINGLE, even though AlBERTo was trained on a large collection of tweets. The macro-F1 values obtained are the baselines of our work.

We next describe the results obtained with the ensemble model.
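Since the ensemble experiments tune the sample sizes (n_DB, n_MLP, n_AL) used by Algorithm 1, a small Python sketch of that voting procedure may make their role concrete. It is an illustration under stated assumptions, not the authors' implementation: classifiers are modelled as plain functions returning 0 or 1, instances are sampled without replacement (the paper does not say which sampling scheme is used), and the value n_run = 5 and the constant toy classifiers are invented for the example.

```python
import math
import random
from typing import Callable, Dict, Optional, Sequence

Classifier = Callable[[str], int]  # returns 1 for hateful, 0 otherwise


def classify(
    tweet: str,
    ensemble: Dict[str, Sequence[Classifier]],
    n_sample: Dict[str, int],
    n_run: int,
    rng: Optional[random.Random] = None,
) -> int:
    """Two-level majority vote following the structure of Algorithm 1."""
    rng = rng or random.Random(0)
    preds = []
    for _ in range(n_run):
        # Lines 3-6: sample instances of each architecture.
        # ASSUMPTION: sampling is done without replacement.
        sampled = []
        for arch, k in n_sample.items():
            sampled.extend(rng.sample(list(ensemble[arch]), k))
        # Lines 7-10: count the votes for the hate class.
        class1 = sum(cl(tweet) for cl in sampled)
        # Line 11: majority vote inside this iteration.
        preds.append(int(class1 >= math.ceil(len(sampled) / 2)))
    # Lines 13-14: majority vote across iterations.
    return int(sum(preds) >= math.ceil(n_run / 2))


def always_hateful(tweet: str) -> int:
    """Toy classifier that always votes for the hate class."""
    return 1


def always_neutral(tweet: str) -> int:
    """Toy classifier that never votes for the hate class."""
    return 0


toy_ensemble = {
    "ALB-SINGLE": [always_hateful] * 30,
    "DB-SINGLE": [always_hateful] * 30,
    "DB-MLP": [always_neutral] * 30,
}
n_sample = {"DB-SINGLE": 20, "DB-MLP": 25, "ALB-SINGLE": 30}
result = classify("some tweet", toy_ensemble, n_sample, n_run=5)
```

In this toy setting every iteration collects 50 hate votes out of 75 sampled classifiers, which exceeds the threshold ⌈75/2⌉ = 38, so `result` is 1. Note that, with the "≥" comparison of Algorithm 1, an exact tie is resolved in favour of the hate class.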
To build the ensemble classifier, we trained 30 instances of each architecture, keeping the same hyper-parameters obtained from the previous grid search. We thus set:

      N_AL = N_DB = N_MLP = 30

We noted that the generalization capability of the ensemble is strictly related to the triple (n_DB, n_MLP, n_AL), so we performed another grid search, looking for the optimal combination of the three parameters. Table 2 shows the five best configurations found by this search. The optimal values for the triple, (20, 25, 30), allow the ensemble to achieve a macro-F1 of 80.06, a gain of about 1.5 points over the score of a single DB-MLP (78.562, see Table 1).

   n_DB    n_MLP    n_AL    Macro-F1    Std
   20      25       30      80.057      0.534
   15      20       25      80.038      0.580
   15      30       30      80.036      0.585
   15      25       30      80.026      0.563
   15      30       15      80.020      0.481

Table 2: Ranking of the 5 best configurations we found, varying the number of instances selected from the ensemble. n_DB stands for the number of instances of the DB-SINGLE model, and similarly for n_MLP and n_AL. We report the mean and the standard deviation of the F1 score computed with respect to the 3 validation folds.

We analyzed the contribution of each architecture individually to the ensemble combination. As shown in Table 3, the best results are obtained with instances of all three architectures. Nevertheless, the results in Table 2 show that a better-tuned combination of sample sizes achieves a slightly higher macro-F1 (80.057 for (20, 25, 30) against 79.832 for (30, 30, 30)).

   n_DB    n_MLP    n_AL    Macro-F1    Std
   30      0        0       79.074      0.300
   0       30       0       79.581      0.3787
   0       0        30      79.482      0.596
   30      30       30      79.832      0.525

Table 3: Scores by each architecture, both individually and together in the ensemble. We report the average value and the standard deviation of the F1 score computed with respect to the 3 validation folds.

We picked the first configuration from Table 2 for our final model and tested it on the internal test set, obtaining the results shown in Table 4.

   Accuracy    Precision    Recall    F1
   79.313      78.510       78.685    78.592

Table 4: Results of the final model on the internal test set.

5 Results and Discussion

The results of our final model on the two official test sets of the competition are shown in Table 5. The model performs well on the in-domain dataset, reaching the 4th position in the rankings. However, it did not rank as well in detecting hate speech on the out-of-domain dataset, obtaining a macro-F1 of just 65.46. The low recall for the hate class (31.49) highlights that the model fails too often to identify news headlines containing some form of hate speech. In comparison with the official top rankings, listed in Table 6, our model scored about 12 points below the top score of 77.44 F1.

Surprised by this fact, we investigated more deeply, looking for an explanation for such a poor result on the out-of-domain dataset.

We randomly sampled from the test set some hateful headlines missed by the model, some of which are shown in Table 7.

In these headlines, the qualification as hate is implicit and harder to recognize, since it seems due more to the presence of stereotypes (nomads, asylum seekers, Muslims, foreigners) than to the presence of explicit hate expressions.

Broadly speaking, we identified some possible reasons for the difference in performance across the two test sets:

   • Linguistic register: Tweets often exhibit a more informal and colloquial language, while headlines employ a more formal lexicon and a more objective tone. This is a crucial difference in identifying hateful messages: while in tweets the feeling of hatred transpires clearly and directly, in headlines this message is conveyed in a more subtle way, often alluding to concepts from political propaganda or common stereotypes. Prior knowledge about the subject and inference might be necessary
     to decipher the presence of hate. Examining the entire body of the article might have been helpful.

   • Length of text: Tweets are usually longer than news headlines. Thus, the model has fewer elements to exploit to correctly classify a piece of news.

These difficulties seem to be shared with other submissions, which all got lower scores on the out-of-domain dataset. We expected that pretrained contextual embeddings would be more effective in addressing the domain adaptation issue. Further experiments would be needed to improve the resilience of our model.

             NOT HATE                      HATE
             Precision  Recall  F1         Precision  Recall  F1        Macro-F1   Position
   Tweets    81.93      72.85   77.12      74.89      83.44   78.94     78.03      4
   News      71.88      99.37   83.42      96.61      31.49   47.50     65.46      17

Table 5: Results of the submitted model on the official blind test sets.

   Tweets                        News
   Position   F1 score           Position   F1 score
   1          80.88              1          77.44
   2          78.97              2          73.14
   3          78.93              3          72.56
   4          78.03 (ours)       4          71.83
   5          77.82              5          70.2
   6          77.66              17         65.46 (ours)

Table 6: Comparison between our final results and the top-5 F1-scores. The values are taken from the official rankings.

   Hateful News Headlines

   anziana rapinata sull'autobus, i due nomadi in fuga si rifugiano al campo di via Candoni
   (elderly woman robbed on the bus, the two fleeing nomads take refuge at the camp on via Candoni)

   Expo: Bordonali, richiedenti asilo in campo base simbolo fallimento governo.
   (Expo: Bordonali, asylum seekers in base camp government failure symbol.)

   Il cardinale Müller: "non possiamo pregare come o con i musulmani"
   ("we cannot pray like or with Muslims")

   Salvini: "Il calcio? Rimpiango i tre stranieri in campo"
   (Salvini: "Soccer? I regret the three foreigners on the field")

Table 7: Examples of hateful headlines, randomly picked from the out-of-domain test set, that are misclassified by our model.

6 Conclusions

We described an ensemble of neural classifiers, relying on contextual embeddings from transformers, for automated detection of hateful content in Italian texts. We presented the general architecture of our base classification models and how they were combined into an ensemble through a bagging technique. We performed extensive experiments to tune our models and the ensemble on a validation set. The results achieved by our ensemble model on the in-domain test set confirm its ability to detect hateful tweets; however, the same model performed poorly on the out-of-domain dataset, showing in particular an inability to adapt to news headlines. We plan to investigate this issue in future research.

References

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Elena Cabrio, Alessandro Mazzei, and Fabio Tamburini, editors, Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018, volume 2253 of CEUR Workshop Proceedings. CEUR-WS.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

L. Breiman. 1996. Bagging predictors. Machine Learning, 24:123–140.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Marco Polignano and Pierpaolo Basile. 2018. Hansel: Italian hate speech detection through ensemble learning and deep neural networks. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Raffaella Bernardi, Roberto Navigli, and Giovanni Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Julian Risch and Ralf Krestel. 2020. Bagging BERT models for robust aggression identification. In Ritesh Kumar, Atul Kr. Ojha, Bornini Lahiri, Marcos Zampieri, Shervin Malmasi, Vanessa Murdock, and Daniel Kadar, editors, Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, TRAC@LREC 2020, Marseille, France, May 2020, pages 55–61. European Language Resources Association (ELRA).

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguistics.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çagri Çöltekin. 2020. SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). CoRR, abs/2006.07235.