DH-FBK @ HaSpeeDe2: Italian Hate Speech Detection via Self-Training and Oversampling

Elisa Leonardelli, Stefano Menini, Sara Tonelli
Fondazione Bruno Kessler, Trento, Italy
eleonardelli@fbk.eu, menini@fbk.eu, satonelli@fbk.eu

Abstract

We describe in this paper the system submitted by the DH-FBK team to the HaSpeeDe evaluation task on Italian hate speech detection (Task A). While we adopt a standard approach for fine-tuning AlBERTo, the Italian BERT model trained on tweets, we propose to improve the final classification performance through two additional steps, i.e. self-training and oversampling. Specifically, we extend the initial training data with additional silver data, carefully sampled from domain-specific tweets and obtained after first training our system only on the task training data. We then re-train the classifier by merging silver and task training data, but oversampling the latter so that the resulting model is more robust to possible inconsistencies in the silver data. With this configuration, we obtain a macro-averaged F1 of 0.753 on tweets and 0.702 on news headlines.

1 Introduction

Although hate speech detection may seem a solved task for English, with more than 60 systems participating in the last OffensEval edition reaching an F1 > 0.90 (Zampieri et al., 2020), this goal has not been reached for other languages and settings. For example, at the last HaSpeeDe shared task on Italian (Bosco et al., 2018), the best systems reached 0.83 F1 on Facebook data and 0.80 on Twitter data (Cimino et al., 2018), but performance dropped below 0.70 F1 in a cross-domain setting, i.e. training on Facebook and testing on Twitter (Cimino et al., 2018), and vice versa (Corazza et al., 2018). Other recent studies have confirmed that detecting hate speech on different social media platforms requires a platform-specific setting, and that simply merging training data from different sources does not always improve performance, in particular when testing on Twitter (Corazza et al., 2019).

Developing hate speech detection systems that remain robust across different sources, or on data that vary over time, is however an understudied problem. The out-of-domain classification task introduced this year at HaSpeeDe is therefore particularly important, and will hopefully foster the development and evaluation of classifiers with good generalisation capabilities.

Concerning our classification approach, we build a standard pipeline based on AlBERTo (Polignano et al., 2019b), the Italian transformer-based model trained on Twitter data, since BERT-like models represent the state of the art for hate speech detection (Zampieri et al., 2020). We extend it in two ways. First, we use self-training: we build a first classifier with the task training data and use it to annotate a large set of tweets collected via Islam- and immigrant-specific hashtags. The silver data and the task training set are then merged to train a second, possibly more robust classifier, which we use to classify the test set. Second, when re-training, we introduce oversampling in one of the two runs submitted by our team, i.e. we repeat the task training data five times so that they are balanced with respect to the silver data. This, together with self-training, proved effective when evaluated in a six-fold cross-validation on the training set, outperforming a standard approach based only on fine-tuning AlBERTo.
2 Related Work

While most approaches to hate speech detection have been proposed for English, systems have recently been developed for a number of other languages, including Turkish, Arabic and Danish (Zampieri et al., 2020), German (Wiegand et al., 2018) and Spanish (Basile et al., 2019). Concerning Italian, the first Hate Speech Detection (HaSpeeDe) task was organized at EVALITA 2018 (Bosco et al., 2018). The task consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or absence) of hate speech. The participating systems adopted a wide range of approaches, including bi-LSTMs (la Peña Sarracén et al., 2018), SVMs (Santucci et al., 2018), ensemble classifiers (Polignano and Basile, 2018; Bai et al., 2018), RNNs (Fortuna et al., 2018), and CNNs and GRUs (von Grünigen et al., 2018). The authors of the best-performing system, ItaliaNLP (Cimino et al., 2018), experimented with three different classification models: one based on a linear SVM, one based on a 1-layer BiLSTM, and a newly-introduced one based on a 2-layer BiLSTM that exploits multi-task learning with additional data from the 2016 SENTIPOLC task (Barbieri et al., 2016).

The training and test sets released for HaSpeeDe have recently been used for other types of evaluation as well, for example to compare classifier performance and settings across different languages (Corazza et al., 2020), confirming the importance of domain-specific language models and the effectiveness of deep learning approaches (in that case, an LSTM with fastText embeddings). Since the development of BERT-like transformer-based models, however, these have become the state-of-the-art approach in several NLP tasks. This includes hate speech detection for Italian, where the BERT model AlBERTo (Polignano et al., 2019b) has recently achieved top scores in two out of three HaSpeeDe 2018 tasks (Polignano et al., 2019a). For this reason, we decided to develop a classifier using the same model and the same approach.

3 Task Description

For the 2020 edition of EVALITA (Basile et al., 2020), the HaSpeeDe task (Sanguinetti et al., 2020) focuses on three main phenomena relevant to online hate speech detection, proposing three different tasks:

• Task A (main task): binary classification, aimed at determining whether a message contains hate speech or not

• Task B: binary classification, aimed at determining whether a message contains stereotypes or not

• Task C: sequence labeling, aimed at recognizing nominal utterances in hateful tweets

We participate in Task A, which in 2020 also has the goal of investigating variation in language and time in hate speech detection. To this purpose, the training set contains Twitter data, accompanied by a test set that includes both in-domain and out-of-domain data (tweets and news headlines), collected in different time periods.

4 Data

In our experiments we use two types of data: the HaSpeeDe2 dataset provided by the task organisers, and domain-specific data collected from Twitter, which we include as silver data. The two datasets are described below.

4.1 HaSpeeDe2 Dataset

This dataset contains the training data provided by the organizers. These data specifically focus on the presence or absence of hateful content towards immigrants, Muslims or Roma people. It consists of 6,839 annotated tweets, with 2,766 messages annotated as hateful and 4,073 as non-hateful.

4.2 Silver Data Description

Since the task is focused on hate speech against immigrants and minorities, we decided to exploit a set of Italian tweets that covers similar topics and was collected within the European project Hatemeter (http://hatemeter.eu/) (Ferret et al., 2019). For this project, conducted between February 2018 and January 2020, we downloaded tweets using hashtags expressing hate towards the Islamic community, for example #nomoschee, #stopIslam, etc. Even if the dataset mainly covers Islam, references to other minorities, like Roma or immigrants in general, are also present. To ensure that these other minorities are well represented, we randomly select from this dataset tweets containing the most common words in the training data provided by the task organizers, i.e. Rom, nomade, migrante, straniero, profugo, islam, mussulmano (musulmano), terrorista. Overall, around 20,400 additional tweets were selected. We then perform a first round of classification of these "new" tweets, using the data provided by the organizers as training material. This results in a new silver dataset composed of 11,129 hate and 9,254 non-hate tweets. This additional dataset is then merged with the task gold data and used to re-train the classifier; details are reported in the following section.
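As an illustration of this selection step, the sketch below (in Python) filters a pool of downloaded tweets by the seed keywords listed above. The input file name and the exact matching strategy are our own assumptions rather than the authors' released code; in particular, plain token matching would need to be relaxed to catch inflected forms such as plurals.

```python
# Minimal sketch of the keyword-based selection of silver candidates; the
# file name is hypothetical, and exact token matching is a simplification
# (inflected forms such as "migranti" would need stem- or prefix-matching).
import re

KEYWORDS = {"rom", "nomade", "migrante", "straniero", "profugo",
            "islam", "mussulmano", "musulmano", "terrorista"}

def mentions_target_group(tweet: str) -> bool:
    """Return True if the tweet contains one of the seed keywords as a token."""
    return any(tok in KEYWORDS for tok in re.findall(r"\w+", tweet.lower()))

# hypothetical one-tweet-per-line dump of the Hatemeter collection
with open("hatemeter_tweets.txt", encoding="utf-8") as f:
    pool = [line.strip() for line in f if line.strip()]

silver_candidates = [t for t in pool if mentions_target_group(t)]
print(f"{len(silver_candidates)} candidate silver tweets selected")
```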
5 System Description

The classifier developed for both runs submitted by our team is based on the Italian BERT model trained on tweets, AlBERTo (Polignano et al., 2019b). After fine-tuning it on the task training data, we use the resulting classifier to automatically annotate the additional dataset described in Section 4.2. These silver data are then merged with the task training data and used to fine-tune AlBERTo a second time. For one of the two submitted runs, we also experiment with oversampling, as follows:

• Run1: we add the silver data to the tweets provided by the organizers for training, keeping 500 of the released tweets for validation. In this setting, the training set size is ∼27,000 tweets, including 20,400 silver instances.

• Run2: we add the silver data to the tweets provided by the organizers as in Run1, but the organizers' tweets are oversampled by repeating them five times (and shuffling) in the training set, while tweets from the silver dataset occur only once. In this setting, the training set includes ∼52,000 tweets, 39% of which are silver data.

We also tested the option of automatically assigning a tag to each tweet marking the presence of a certain topic (immigrants/Roma people/Islam), using a keyword-based approach. However, with this additional information the classifier performed worse than without any topic indicator, so we removed it from the final runs. Below we report a detailed description of the process used to select the best classification model, and of the preprocessing steps.
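The assembly of the two training sets can be sketched as follows, assuming gold_tweets and silver_tweets are lists of (text, label) pairs, where the silver labels come from the first round of classification; names and data are placeholders of ours, not the authors' code.

```python
# Minimal sketch of how the Run1 and Run2 training sets are assembled;
# the tiny gold_tweets/silver_tweets lists stand in for the ~6,300 gold
# and ~20,400 silver examples described in Sections 4.1 and 4.2.
import random

random.seed(42)

gold_tweets = [("esempio di tweet gold", 1), ("altro tweet gold", 0)]
silver_tweets = [("esempio di tweet silver", 0), ("altro tweet silver", 1)]

def build_training_set(gold, silver, gold_repeats=1):
    """Merge gold and silver data, optionally repeating the gold part."""
    merged = silver + gold * gold_repeats
    random.shuffle(merged)  # shuffle after repetition, as described above
    return merged

run1_train = build_training_set(gold_tweets, silver_tweets)                  # ~27,000 tweets in practice
run2_train = build_training_set(gold_tweets, silver_tweets, gold_repeats=5)  # ~52,000 tweets, ~39% silver
```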
5.1 Model Selection

The best performance in a wide variety of NLP tasks is currently obtained with approaches based on BERT (Devlin et al., 2019), a pre-trained transformer-based language model that can be fine-tuned and adapted to specific tasks by adding just one additional output layer to the neural network. As different BERT models exist, we first evaluated whether to use a multilingual version of BERT or the Italian version trained on Twitter data, AlBERTo (Polignano et al., 2019b).

The comparison and evaluation of the different models and approaches is done with a 6-fold cross-validation over the task training set. Each fold uses about 1,000 tweets as test data, while the remaining tweets are used for training and validation (500 tweets). The performance score is obtained as the average over the six folds, so that the final evaluation is as unbiased and independent as possible from the specific splits into training, validation and test data.

In our setup we tested two models: Multilingual BERT, covering 104 languages including Italian (12 layers, 768 hidden units, 12 attention heads, 110M parameters), and AlBERTo, which was trained with the official BERT source code on 200M Italian tweets. For the fine-tuning of AlBERTo we run 15 epochs, with a learning rate of 2e-5, 1,000 steps per loop and batches of 64 examples. Since AlBERTo performed better than Multilingual BERT on each fold, it was included in the final system configuration for the task. The cross-validation over 6 folds using only the task training set with AlBERTo resulted in an average macro-F1 of 83.12 for Run1 and 82.15 for Run2.

5.2 Data Preprocessing

The data, both from the dataset provided by the organisers and from the silver one, are preprocessed as follows. First, we split hashtags by adapting the Ekphrasis tool (Gimpel et al., 2010) to Italian; the tool recognises the tokens in a hashtag based on Google n-grams. With the same tool we also normalise the text, replacing all user mentions and URLs with <user> and <url> tags respectively. We also replace with a dedicated tag all instances of "money", "time", "date" and, in general, any "number". Finally, emojis are replaced with their textual descriptions, manually translated to Italian from the English descriptions at https://unicode.org/emoji/charts/full-emoji-list.html, in order to have a textual representation that can be used with AlBERTo.
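A minimal sketch of these preprocessing steps is given below, using the ekphrasis and emoji Python packages. The authors adapted Ekphrasis to Italian and used Italian emoji descriptions; this sketch keeps the default English resources, so it illustrates the sequence of steps rather than reproducing the exact output.

```python
# Preprocessing sketch with the ekphrasis and emoji packages; the English
# hashtag segmenter and emoji descriptions are stand-ins for the Italian
# resources used by the authors.
import emoji
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

text_processor = TextPreProcessor(
    # replace these token types with dedicated tags such as <user> and <url>
    normalize=["url", "user", "money", "time", "date", "number"],
    unpack_hashtags=True,  # split hashtags into their component words
    segmenter="twitter",   # word statistics used for hashtag segmentation
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

def preprocess(tweet: str) -> str:
    tokens = text_processor.pre_process_doc(tweet)
    # replace each emoji with a textual description such as "red_heart"
    return emoji.demojize(" ".join(tokens), delimiters=(" ", " "))

print(preprocess("#nomoschee Basta! 😡 https://example.com @utente alle 18:00"))
```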
6 Evaluation

We submitted two runs for each of the in-domain (tweets) and out-of-domain (news headlines) text types in Task A. The results obtained on the test set are reported in Table 1 and compared with two baselines provided by the task organisers: one obtained by always assigning the most frequent label (i.e. non-hateful), and one obtained by training an SVM classifier with unigrams, character n-grams and a TF-IDF representation as features. We also compare our results with the top-ranked system in each subtask (additional details on these systems had not been disclosed at the time of writing).

Table 1: Results of the two submitted runs for Task A on tweets and on news headlines. BaselineMF = most-frequent baseline; baselineSVM = linear SVM with unigrams, char-grams and TF-IDF representation.

Tweets           Hate class               Non-hate class           Macro avg.
System           P       R       F1       P       R       F1       F1
Run1             0.7237  0.7958  0.758    0.7806  0.7051  0.7409   0.7495
Run2             0.727   0.8006  0.762    0.7855  0.7083  0.7448   0.7534
baselineMF       0       0       0        0.5075  1.000   0.6733   0.3366
baselineSVM      0.7096  0.7347  0.7219   0.7334  0.7082  0.7206   0.7212
best system      -       -       -        -       -       -        0.8088

News             Hate class               Non-hate class           Macro avg.
System           P       R       F1       P       R       F1       F1
Run1             0.6833  0.453   0.5448   0.7395  0.8808  0.804    0.6744
Run2             0.6911  0.5193  0.593    0.7609  0.8683  0.8111   0.702
baselineMF       0       0       0        0.638   1.000   0.7789   0.3894
baselineSVM      0.6071  0.3756  0.4641   0.7087  0.862   0.7779   0.621
best system      -       -       -        -       -       -        0.7744

As expected, on out-of-domain data (news headlines) we obtain lower results than on tweets, since the training set is retrieved exclusively from Twitter. Furthermore, our approach does not include any specific tuning aimed at treating news headlines differently from tweets. On the contrary, the additional data used for self-training are all gathered from Twitter, which may negatively affect performance on out-of-domain data.

On both document types, Run2 performs better than Run1, showing that our oversampling strategy to reduce the weight of the silver data is effective. However, results obtained with 6-fold cross-validation on the training set alone were significantly higher, with macro-F1 > 0.80 for both runs. This may be explained by the fact that, as pointed out by the task organisers, tweets in the test set were collected in a different time period than those in the training set, which likely makes the two sets different in terms of topics.
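The scores in Table 1 and the confusion matrices in Tables 2 and 3 below can be reproduced from the system predictions with standard tooling; the following is a minimal sketch using scikit-learn, with placeholder labels in place of the actual gold and predicted values.

```python
# Evaluation sketch; y_true/y_pred are placeholders for the gold test
# labels and the classifier predictions (1 = hate, 0 = non-hate).
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0]  # placeholder gold labels
y_pred = [1, 0, 0, 1, 0, 1]  # placeholder system predictions

# macro-averaged F1, the official Task A metric
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))

# scikit-learn puts gold labels on rows and predictions on columns,
# i.e. the transpose of the layout used in Tables 2 and 3
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
```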
We report in Tables 2 and 3 the confusion matrices showing the number of true positives and negatives, and false positives and negatives, obtained with the two runs on tweets and on news headlines.

Table 2: Confusion matrices on the tweets test set (rows = predicted class, columns = actual class).

Run1              actual non-hate  actual hate
pred. non-hate    452              127
pred. hate        189              495

Run2              actual non-hate  actual hate
pred. non-hate    454              124
pred. hate        187              498

Table 3: Confusion matrices on the news headlines test set (rows = predicted class, columns = actual class).

Run1              actual non-hate  actual hate
pred. non-hate    281              99
pred. hate        38               82

Run2              actual non-hate  actual hate
pred. non-hate    277              87
pred. hate        42               94

While on tweets the performance on the hate class is overall good, in particular in terms of recall, this does not hold for news headlines, where recall on the hate class is low. The reason for this low score lies in the different linguistic expression of hate in tweets and in headlines: in tweets it is more direct, and more frequently connected with profanities that a classifier can easily recognise, whereas hateful content in news headlines is usually expressed in more subtle ways. As an example, we report below two headlines misclassified by our system. The first one (i) was classified as non-hateful, even though it conveys hateful content; the second one (ii) was instead classified as hateful, although it is not:

i) Sea Watch, l'ultima presa in giro degli immigrati all'Italia: i minori nati tutti lo stesso giorno (EN: Sea Watch, migrants making fun of Italy: all underage migrants born on the same day)

ii) Matera, Salvini contestato durante il comizio. E lui risponde: "Bravi, avete vinto dieci immigrati da mantenere" (EN: Matera, Salvini challenged at a rally, and he replies: "Congratulations, you won ten migrants to pay for")

Both examples have a similar structure, are written in standard Italian and mention migrants. Furthermore, the second example reports hateful direct speech, but the fact that it is reported does not imply that the journalist agrees with what was said by the politician Matteo Salvini.

7 Conclusions

In this paper we described the system developed by the DH-FBK team to participate in HaSpeeDe Task A. We submitted two runs, both based on AlBERTo and using in-domain silver data as additional training data in a self-training framework. The only difference between the two configurations is that, for Run2, the task training data were repeated five times, to balance the weight of the silver data.

Our evaluation shows that, both in a cross-validation setting and on the task test set, oversampling has a positive effect on the classification results. As expected, performance on in-domain data (i.e. training and testing on tweets) is better than on out-of-domain data (i.e. training on tweets and testing on news headlines). In the future, we may address this issue by including news headlines in the silver data, so that the specificity of this kind of text is also taken into account. To improve data quality, it may also be useful to select only the silver instances that have been automatically classified with high confidence.

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate speech detection in Italian social media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the EVALITA 2016 sentiment polarity classification task.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, pages 1–9. CEUR.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing different supervised approaches to hate speech detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. 2019. Cross-platform evaluation for Italian hate speech detection. In Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019.
Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. 2020. A multilingual evaluation for online hate speech detection. ACM Transactions on Internet Technology, 20(2):10:1–10:22.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, Minnesota, June.

Jérôme Ferret, Mario Laurent, Daniela Andreatta, Andrea Di Nicola, Elisa Martini, M. Guerini, S. Tonelli, Georgios Antonopoulos, and Parisa Diba. 2019. Hatemeter D18: Training module A for academics and research organisations.

Paula Fortuna, Ilaria Bonavita, and Sérgio Nunes. 2018. Merging datasets for hate speech classification in Italian. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018.

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2010. Part-of-speech tagging for Twitter: Annotation, features, and experiments. Technical report, Carnegie Mellon University, School of Computer Science.

Gretel Liz De la Peña Sarracén, Reynaldo Gil Pons, Carlos Enrique Muñiz-Cuza, and Paolo Rosso. 2018. Hate speech detection using attention-based LSTM. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Marco Polignano and Pierpaolo Basile. 2018. HanSEL: Italian hate speech detection through ensemble learning and deep neural networks. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, and Giovanni Semeraro. 2019a. Hate speech detection through AlBERTo Italian language understanding model. In Mehwish Alam, Valerio Basile, Felice Dell'Orletta, Malvina Nissim, and Nicole Novielli, editors, Proceedings of the 3rd Workshop on Natural Language for Artificial Intelligence co-located with the 18th International Conference of the Italian Association for Artificial Intelligence (AIIA 2019), Rende, Italy, November 19th-22nd, 2019, volume 2521 of CEUR Workshop Proceedings. CEUR-WS.org.
Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019b. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Raffaella Bernardi, Roberto Navigli, and Giovanni Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Valentino Santucci, Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari. 2018. Detecting hate speech for Italian language in social media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 shared task: Classification of offensive content in tweets using convolutional neural networks and gated recurrent units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018), pages 1–10, Vienna, Austria. Austrian Academy of Sciences.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin. 2020. SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). In Proceedings of the 14th International Workshop on Semantic Evaluation. Association for Computational Linguistics.