DH-FBK @ HaSpeeDe2: Italian Hate Speech Detection via Self-Training and Oversampling

Elisa Leonardelli, Stefano Menini, Sara Tonelli
Fondazione Bruno Kessler, Trento, Italy
eleonardelli@fbk.eu, menini@fbk.eu, satonelli@fbk.eu

Abstract

We describe in this paper the system submitted by the DH-FBK team to the HaSpeeDe evaluation task on Italian hate speech detection (Task A). While we adopt a standard approach for fine-tuning AlBERTo, the Italian BERT model trained on tweets, we propose to improve the final classification performance through two additional steps, i.e. self-training and oversampling. Specifically, we extend the initial training data with additional silver data, carefully sampled from domain-specific tweets and obtained after first training our system only on the task training data. We then re-train the classifier by merging silver and task training data, but oversampling the latter so that the resulting model is more robust to possible inconsistencies in the silver data. With this configuration, we obtain a macro-averaged F1 of 0.753 on tweets and 0.702 on news headlines.

1 Introduction

Although hate speech detection may seem a solved task for English, with more than 60 systems participating in the last OffensEval edition reaching an F1 > 0.90 (Zampieri et al., 2020), this goal has not been reached for other languages and settings. For example, at the last HaSpeeDe shared task on Italian (Bosco et al., 2018), the best systems reached 0.83 F1 on Facebook data and 0.80 on Twitter data (Cimino et al., 2018), but performance dropped below 0.70 F1 in a cross-domain setting, i.e. training on Facebook and testing on Twitter (Cimino et al., 2018), and vice versa (Corazza et al., 2018). Other recent studies have confirmed that detecting hate speech on different social media platforms requires a platform-specific setting, and that simply merging training data from different sources does not always improve performance, in particular when testing on Twitter (Corazza et al., 2019).

Developing hate speech detection systems that remain robust across different sources, or on data that vary over time, is however an understudied problem. The out-of-domain classification task introduced this year at HaSpeeDe is therefore particularly important, and will hopefully foster the development and evaluation of classifiers with good generalisation capabilities.

Concerning our classification approach, we build a standard pipeline based on AlBERTo (Polignano et al., 2019b), the Italian transformer-based model trained on Twitter data, since BERT-like models represent the state of the art for hate speech detection (Zampieri et al., 2020). We extend it in two ways. First, we use self-training: we build a first classifier with the task training data and use it to annotate a large set of tweets collected via Islam- and immigrant-specific hashtags. The silver data and the task training set are then merged to train a second, possibly more robust classifier, which we use to classify the test set. Second, when re-training, we introduce oversampling in one of the two runs submitted by our team, i.e. we repeat the task training data five times so that they are balanced with respect to the silver data. This, together with self-training, proved effective when evaluated in a six-fold cross-validation on the training set, outperforming a standard approach based only on fine-tuning AlBERTo.
2 Related Work

While most approaches to hate speech detection have been proposed for English, systems have recently been developed for a number of other languages, including Turkish, Arabic and Danish (Zampieri et al., 2020), German (Wiegand et al., 2018) and Spanish (Basile et al., 2019). Concerning Italian, the first Hate Speech Detection (HaSpeeDe) task was organized at EVALITA 2018 (Bosco et al., 2018). The task consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or absence) of hate speech. The participating systems adopted a wide range of approaches, including bi-LSTMs (la Peña Sarracén et al., 2018), SVMs (Santucci et al., 2018), ensemble classifiers (Polignano and Basile, 2018; Bai et al., 2018), RNNs (Fortuna et al., 2018), and CNNs and GRUs (von Grünigen et al., 2018). The authors of the best-performing system, ItaliaNLP (Cimino et al., 2018), experimented with three different classification models: one based on a linear SVM, one based on a 1-layer BiLSTM, and a newly-introduced one based on a 2-layer BiLSTM that exploits multi-task learning with additional data from the 2016 SENTIPOLC task (Barbieri et al., 2016).

The training and test sets released for HaSpeeDe have recently been used for other types of evaluation as well, for example to compare classifier performance and settings across different languages (Corazza et al., 2020), confirming the importance of domain-specific language models and the effectiveness of deep learning approaches (in that case, an LSTM with fastText embeddings). Since the development of BERT-like transformer-based models, however, these have become the state-of-the-art approach in several NLP tasks. This includes hate speech detection for Italian, where the BERT model AlBERTo (Polignano et al., 2019b) has recently achieved top scores in two out of three HaSpeeDe 2018 tasks (Polignano et al., 2019a). For this reason, we decided to develop a classifier using the same model and the same approach.

3 Task Description

For the 2020 edition of EVALITA (Basile et al., 2020), the HaSpeeDe task (Sanguinetti et al., 2020) focuses on three main phenomena relevant to online hate speech detection, proposing three different tasks:

• Task A (main task): binary classification, aimed at determining whether a message contains hate speech or not

• Task B: binary classification, aimed at determining whether a message contains stereotypes or not

• Task C: sequence labeling, aimed at recognizing nominal utterances in hateful tweets

We participate in Task A, which in 2020 also has the goal of investigating variation in language and time in hate speech detection. To this purpose, the training set contains Twitter data, accompanied by a test set that includes both in-domain and out-of-domain data (tweets and news headlines), collected in different time periods.

4 Data

In our experiments we use two types of data: the HaSpeeDe2 dataset provided by the task organisers, and domain-specific data collected from Twitter, which we include as silver data. The two datasets are described below.

4.1 HaSpeeDe2 Dataset

This dataset contains the training data provided by the organizers. These data specifically focus on the presence or absence of hateful content towards immigrants, Muslims or Roma people. It consists of 6,839 annotated tweets, with 2,766 messages annotated as hateful and 4,073 as non-hateful.

4.2 Silver Data Description

Since the task is focused on hate speech against immigrants and minorities, we decided to exploit a set of Italian tweets that covers similar topics and was collected within the European project Hatemeter (http://hatemeter.eu/) (Ferret et al., 2019). For this project, conducted between February 2018 and January 2020, we downloaded tweets using hashtags expressing hate towards the Islamic community, for example #nomoschee, #stopIslam, etc. Even if the dataset mainly covers Islam, references to other minorities, like Roma or immigrants in general, are also present. To ensure that these other minorities are well represented, we randomly select from this dataset tweets containing the most common words in the training data provided by the task organizers, i.e. Rom, nomade, migrante, straniero, profugo, islam, mussulmano (musulmano), terrorista. Overall, around 20,400 additional tweets were selected. We then perform a first round of classification of these "new" tweets, using the data provided by the organizers as training material. This results in a new silver dataset composed of 11,129 hate and 9,254 non-hate tweets. This additional dataset is then merged with the task gold data and used to re-train the classifier; details are reported in the following section.
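As an illustration of this selection step, the sketch below (in Python) filters a pool of downloaded tweets by the seed keywords listed above. The input file name and the exact matching strategy are our own assumptions rather than the authors' released code; in particular, plain token matching would need to be relaxed to catch inflected forms such as plurals.

```python
# Minimal sketch of the keyword-based selection of silver candidates; the
# file name is hypothetical, and exact token matching is a simplification
# (inflected forms such as "migranti" would need stem- or prefix-matching).
import re

KEYWORDS = {"rom", "nomade", "migrante", "straniero", "profugo",
            "islam", "mussulmano", "musulmano", "terrorista"}

def mentions_target_group(tweet: str) -> bool:
    """Return True if the tweet contains one of the seed keywords as a token."""
    return any(tok in KEYWORDS for tok in re.findall(r"\w+", tweet.lower()))

# hypothetical one-tweet-per-line dump of the Hatemeter collection
with open("hatemeter_tweets.txt", encoding="utf-8") as f:
    pool = [line.strip() for line in f if line.strip()]

silver_candidates = [t for t in pool if mentions_target_group(t)]
print(f"{len(silver_candidates)} candidate silver tweets selected")
```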
5 System Description

The classifier developed for both runs submitted by our team is based on the Italian BERT model trained on tweets, AlBERTo (Polignano et al., 2019b). After fine-tuning it on the task training data, we use the resulting classifier to automatically annotate the additional dataset described in Section 4.2. These silver data are then merged with the task training data and used to fine-tune AlBERTo a second time. For one of the two submitted runs, we also experiment with oversampling, as follows:

• Run1: we add the silver data to the tweets provided by the organizers for training, keeping 500 of the released tweets for validation. In this setting, the training set size is ∼27,000 tweets, including 20,400 silver instances.

• Run2: we add the silver data to the tweets provided by the organizers as in Run1, but the organizers' tweets are oversampled by repeating them five times (and shuffling) in the training set, while tweets from the silver dataset occur only once. In this setting, the training set includes ∼52,000 tweets, 39% of which are silver data.

We also tested the option of automatically assigning a tag to each tweet marking the presence of a certain topic (immigrants/Roma people/Islam), using a keyword-based approach. However, with this additional information the classifier performed worse than without any topic indicator, so we removed it from the final runs. Below we report a detailed description of the process used to select the best classification model, and of the preprocessing steps.
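The assembly of the two training sets can be sketched as follows, assuming gold_tweets and silver_tweets are lists of (text, label) pairs, where the silver labels come from the first round of classification; names and data are placeholders of ours, not the authors' code.

```python
# Minimal sketch of how the Run1 and Run2 training sets are assembled;
# the tiny gold_tweets/silver_tweets lists stand in for the ~6,300 gold
# and ~20,400 silver examples described in Sections 4.1 and 4.2.
import random

random.seed(42)

gold_tweets = [("esempio di tweet gold", 1), ("altro tweet gold", 0)]
silver_tweets = [("esempio di tweet silver", 0), ("altro tweet silver", 1)]

def build_training_set(gold, silver, gold_repeats=1):
    """Merge gold and silver data, optionally repeating the gold part."""
    merged = silver + gold * gold_repeats
    random.shuffle(merged)  # shuffle after repetition, as described above
    return merged

run1_train = build_training_set(gold_tweets, silver_tweets)                  # ~27,000 tweets in practice
run2_train = build_training_set(gold_tweets, silver_tweets, gold_repeats=5)  # ~52,000 tweets, ~39% silver
```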
5.1 Model Selection

The best performance in a wide variety of NLP tasks is currently obtained with approaches based on BERT (Devlin et al., 2019), a pre-trained transformer-based language model that can be fine-tuned and adapted to specific tasks by adding just one additional output layer to the neural network. As different BERT models exist, we first evaluated whether to use a multilingual version of BERT or the Italian version trained on Twitter data, AlBERTo (Polignano et al., 2019b).

The comparison and evaluation of the different models and approaches is done with a 6-fold cross-validation over the task training set. Each fold uses about 1,000 tweets as test data, while the remaining tweets are used for training and validation (500 tweets). The performance score is obtained as the average over the six folds, so that the final evaluation is as unbiased and independent as possible from the specific splits into training, validation and test data.

In our setup we tested two models: Multilingual BERT, covering 104 languages including Italian (12 layers, 768 hidden units, 12 attention heads, 110M parameters), and AlBERTo, which was trained with the official BERT source code on 200M Italian tweets. For the fine-tuning of AlBERTo we run 15 epochs, with a learning rate of 2e-5, 1,000 steps per loop and batches of 64 examples. Since AlBERTo performed better than Multilingual BERT on each fold, it was included in the final system configuration for the task. The cross-validation over 6 folds using only the task training set with AlBERTo resulted in an average macro-F1 of 83.12 for Run1 and 82.15 for Run2.

5.2 Data Preprocessing

The data, both from the dataset provided by the organisers and from the silver one, are preprocessed as follows. First, we split hashtags by adapting the Ekphrasis tool (Gimpel et al., 2010) to Italian; the tool recognises the tokens in a hashtag based on Google n-grams. With the same tool we also normalise the text, replacing all user mentions and URLs with <user> and <url> tags respectively. We also replace with a dedicated tag all instances of "money", "time", "date" and, in general, any "number". Finally, emojis are replaced with their textual descriptions, manually translated to Italian from the English descriptions at https://unicode.org/emoji/charts/full-emoji-list.html, in order to have a textual representation that can be used with AlBERTo.
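A minimal sketch of these preprocessing steps is given below, using the ekphrasis and emoji Python packages. The authors adapted Ekphrasis to Italian and used Italian emoji descriptions; this sketch keeps the default English resources, so it illustrates the sequence of steps rather than reproducing the exact output.

```python
# Preprocessing sketch with the ekphrasis and emoji packages; the English
# hashtag segmenter and emoji descriptions are stand-ins for the Italian
# resources used by the authors.
import emoji
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    # replace these token types with dedicated tags such as <user> and <url>
    normalize=["url", "user", "money", "time", "date", "number"],
    unpack_hashtags=True,  # split hashtags into their component words
    segmenter="twitter",   # word statistics used for hashtag segmentation
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

def preprocess(tweet: str) -> str:
    tokens = text_processor.pre_process_doc(tweet)
    # replace each emoji with a textual description such as "red_heart"
    return emoji.demojize(" ".join(tokens), delimiters=(" ", " "))

print(preprocess("#nomoschee Basta! 😡 https://example.com @utente alle 18:00"))
```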
6 Evaluation

We submitted two runs for each of the in-domain (tweets) and out-of-domain (news headlines) text types in Task A. The results obtained on the test set are reported in Table 1 and compared with two baselines provided by the task organisers: one obtained by always assigning the most frequent label (i.e. non-hateful), and one obtained by training an SVM classifier with unigrams, character n-grams and a TF-IDF representation as features. We also compare our results with the top-ranked system in each subtask (additional details on these systems had not been disclosed at the time of writing).

Table 1: Results of the two submitted runs for Task A on tweets and on news headlines. BaselineMF = most-frequent baseline; baselineSVM = linear SVM with unigrams, char-grams and TF-IDF representation.

Tweets           Hate class               Non-hate class           Macro avg.
System           P       R       F1       P       R       F1       F1
Run1             0.7237  0.7958  0.758    0.7806  0.7051  0.7409   0.7495
Run2             0.727   0.8006  0.762    0.7855  0.7083  0.7448   0.7534
baselineMF       0       0       0        0.5075  1.000   0.6733   0.3366
baselineSVM      0.7096  0.7347  0.7219   0.7334  0.7082  0.7206   0.7212
best system      -       -       -        -       -       -        0.8088

News             Hate class               Non-hate class           Macro avg.
System           P       R       F1       P       R       F1       F1
Run1             0.6833  0.453   0.5448   0.7395  0.8808  0.804    0.6744
Run2             0.6911  0.5193  0.593    0.7609  0.8683  0.8111   0.702
baselineMF       0       0       0        0.638   1.000   0.7789   0.3894
baselineSVM      0.6071  0.3756  0.4641   0.7087  0.862   0.7779   0.621
best system      -       -       -        -       -       -        0.7744

As expected, on out-of-domain data (news headlines) we obtain lower results than on tweets, since the training set is retrieved exclusively from Twitter. Furthermore, our approach does not include any specific tuning aimed at treating news headlines differently from tweets. On the contrary, the additional data used for self-training are all gathered from Twitter, which may negatively affect performance on out-of-domain data.

On both document types, Run2 performs better than Run1, showing that our oversampling strategy to reduce the weight of the silver data is effective. However, results obtained with 6-fold cross-validation on the training set alone were significantly higher, with macro-F1 > 0.80 for both runs. This may be explained by the fact that, as pointed out by the task organisers, tweets in the test set were collected in a different time period than those in the training set, which likely makes the two sets different in terms of topics.
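The scores in Table 1 and the confusion matrices in Tables 2 and 3 below can be reproduced from the system predictions with standard tooling; the following is a minimal sketch using scikit-learn, with placeholder labels in place of the actual gold and predicted values.

```python
# Evaluation sketch; y_true/y_pred are placeholders for the gold test
# labels and the classifier predictions (1 = hate, 0 = non-hate).
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0]  # placeholder gold labels
y_pred = [1, 0, 0, 1, 0, 1]  # placeholder system predictions

# macro-averaged F1, the official Task A metric
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))

# scikit-learn puts gold labels on rows and predictions on columns,
# i.e. the transpose of the layout used in Tables 2 and 3
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
```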
We report in Tables 2 and 3 the confusion matrices showing the number of true positives and negatives, and false positives and negatives, obtained with the two runs on tweets and on news headlines.

Table 2: Confusion matrices on the tweets test set (rows = predicted class, columns = actual class).

Run1              actual non-hate  actual hate
pred. non-hate    452              127
pred. hate        189              495

Run2              actual non-hate  actual hate
pred. non-hate    454              124
pred. hate        187              498

Table 3: Confusion matrices on the news headlines test set (rows = predicted class, columns = actual class).

Run1              actual non-hate  actual hate
pred. non-hate    281              99
pred. hate        38               82

Run2              actual non-hate  actual hate
pred. non-hate    277              87
pred. hate        42               94

While on tweets the performance on the hate class is overall good, in particular in terms of recall, this does not hold for news headlines, where recall on the hate class is low. The reason for this low score lies in the different linguistic expression of hate in tweets and in headlines: in tweets it is more direct, and more frequently connected with profanities that a classifier can easily recognise, whereas hateful content in news headlines is usually expressed in more subtle ways. As an example, we report below two headlines misclassified by our system. The first one (i) was classified as non-hateful, even though it conveys hateful content; the second one (ii) was instead classified as hateful, although it is not:

i) Sea Watch, l'ultima presa in giro degli immigrati all'Italia: i minori nati tutti lo stesso giorno (EN: Sea Watch, migrants making fun of Italy: all underage migrants born on the same day)

ii) Matera, Salvini contestato durante il comizio. E lui risponde: "Bravi, avete vinto dieci immigrati da mantenere" (EN: Matera, Salvini challenged at a rally, and he replies: "Congratulations, you won ten migrants to pay for")

Both examples have a similar structure, are written in standard Italian and mention migrants. Furthermore, the second example reports hateful direct speech, but the fact that it is reported does not imply that the journalist agrees with what was said by the politician Matteo Salvini.

7 Conclusions

In this paper we described the system developed by the DH-FBK team to participate in HaSpeeDe Task A. We submitted two runs, both based on AlBERTo and using in-domain silver data as additional training data in a self-training framework. The only difference between the two configurations is that, for Run2, the task training data were repeated five times, to balance the weight of the silver data.

Our evaluation shows that, both in a cross-validation setting and on the task test set, oversampling has a positive effect on the classification results. As expected, performance on in-domain data (i.e. training and testing on tweets) is better than on out-of-domain data (i.e. training on tweets and testing on news headlines). In the future, we may address this issue by including news headlines in the silver data, so that the specificity of this kind of text is also taken into account. To improve data quality, it may also be useful to select only the silver instances that have been automatically classified with high confidence.

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate speech detection in Italian social media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 sentiment polarity classification task.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, pages 1–9. CEUR.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing different supervised approaches to hate speech detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. 2019. Cross-platform evaluation for Italian hate speech detection. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019.
Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. 2020. A multilingual evaluation for online hate speech detection. ACM Transactions on Internet Technology, 20(2):10:1–10:22.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota, June.

Jérôme Ferret, Mario Laurent, Daniela Andreatta, Andrea Di Nicola, Elisa Martini, M. Guerini, S. Tonelli, Georgios Antonopoulos, and Parisa Diba. 2019. Hatemeter D18: Training module A for academics and research organisations.

Paula Fortuna, Ilaria Bonavita, and Sérgio Nunes. 2018. Merging datasets for hate speech classification in Italian. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2010. Part-of-speech tagging for Twitter: Annotation, features, and experiments. Technical report, Carnegie Mellon University, School of Computer Science.

Gretel Liz De la Peña Sarracén, Reynaldo Gil Pons, Carlos Enrique Muñiz-Cuza, and Paolo Rosso. 2018. Hate speech detection using attention-based LSTM. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Marco Polignano and Pierpaolo Basile. 2018. HanSEL: Italian hate speech detection through ensemble learning and deep neural networks. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, and Giovanni Semeraro. 2019a. Hate speech detection through AlBERTo Italian language understanding model. In Mehwish Alam, Valerio Basile, Felice Dell'Orletta, Malvina Nissim, and Nicole Novielli, editors, Proceedings of the 3rd Workshop on Natural Language for Artificial Intelligence co-located with the 18th International Conference of the Italian Association for Artificial Intelligence (AIIA 2019), Rende, Italy, November 19th-22nd, 2019, volume 2521 of CEUR Workshop Proceedings. CEUR-WS.org.
Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019b. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Raffaella Bernardi, Roberto Navigli, and Giovanni Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Valentino Santucci, Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari. 2018. Detecting hate speech for Italian language in social media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 shared task: Classification of offensive content in tweets using convolutional neural networks and gated recurrent units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), pages 1–10, Vienna, Austria. Austrian Academy of Sciences.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of the 14th International Workshop on Semantic Evaluation. Association for Computational Linguistics.