Fontana-Unipi @ HaSpeeDe2: Ensemble of Transformers for the Hate Speech Task at Evalita

Michele Fontana
Dipartimento di Informatica, Università di Pisa
m.fontana12@studenti.unipi.it

Giuseppe Attardi
Dipartimento di Informatica, Università di Pisa
attardi@di.unipi.it

Abstract

We describe our approach and experiments to tackle Task A of the second edition of HaSpeeDe, within the Evalita 2020 evaluation campaign. The proposed model consists of an ensemble of classifiers built from three variants of a common neural architecture. Each classifier uses contextual representations from transformers trained on Italian texts, fine-tuned on the training set of the challenge. We tested the proposed model on the two official test sets: the in-domain test set, containing just tweets, and the out-of-domain one, which also includes news headlines. Our submissions ranked 4th on the tweets test set and 17th on the second test set.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The spreading of hateful messages on social media has become a serious issue, and techniques for hate speech detection have therefore become quite relevant. The goal of the Hate Speech Detection task (Sanguinetti et al., 2020) at Evalita 2020 (Basile et al., 2020) is to improve the automatic detection of hate messages in Italian tweets. The organizers provided the participants with the HaSpeeDe2 dataset, which consists of 6,837 Italian tweets containing, besides the raw text, hashtags and emojis. Task A can be cast as a binary classification task: the model has to predict whether a given message contains hate speech or not.

Approaches based on transformer models have become quite popular recently and have proved effective in reaching state-of-the-art scores on major NLP tasks such as those of the GLUE benchmark (Wang et al., 2018). With our experiments we try to assess the effectiveness of transformers trained on Italian documents in a task involving Italian texts from different sources. We experiment with both a transformer model trained specifically on Italian tweets and one trained on generic web documents.

We combine several instances of classifiers based on these transformers in order to address the problem of overfitting due to the small size of the training set.

For this edition of the Evalita HaSpeeDe task, the organizers released two test sets: an in-domain one consisting of tweets and an out-of-domain one that also contains news headlines.

The ensemble model of our official submission achieved a competitive score of 78.03 Macro-F1 on the in-domain test set but did not perform as well on the second test set.

We make the source code for our experiments available as Open Source at https://github.com/mikelefonty/Haspeede2.

2 Related Work

The first edition of HaSpeeDe was held in 2018. The results produced during this contest were the starting point of our research. As described in (Bosco et al., 2018), most of the systems were based on neural networks and used word embeddings, such as FastText (Grave et al., 2018) or word2vec (Polignano and Basile, 2018), in the first layer of their architecture. The embedding layer was usually followed by a Recurrent Network or a Convolutional Neural Network to get an internal representation of the input text. This hidden representation was provided as input to a series of dense layers to obtain the final classification result.

Over the last couple of years, the trend in approaches to language analysis has changed considerably, as can be seen by examining the models used in competitions like SemEval 2020 OffensEval 2 (Zampieri et al., 2020). In these new models, to get a better text representation, the embedding layer is often replaced by a Transformer (Vaswani et al., 2017) such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), or Multilingual BERT (Devlin et al., 2019).

We followed this trend but we also focused our attention on the problem raised by the small size of the dataset. As Risch and Krestel (2020) mention, transformer models tend to have a high variance with respect to the input dataset, which often leads to overfitting. The authors therefore suggest implementing an ensemble of classifiers to reduce the variance and consequently improve the generalization capabilities of the trained model.

In the following, we describe a similar approach based on the Bagging technique (Breiman, 1996), where we apply three different transformer-based classifiers to populate the ensemble and to get the final prediction.

3 System Architecture

During the design phase of our classifier, we looked for a transformer trained directly on a significantly large collection of Italian texts, and particularly on Italian tweets, in order to compensate for the small size of the training data. We found two possible models based on BERT: AlBERTo (Polignano et al., 2019)¹ and DBMDZ². The former is trained on TWITA (Basile et al., 2018), a 191 GB collection of Italian tweets gathered by the authors, and tested on the SENTIPOLC task during the EVALITA 2016 campaign, where it achieved state-of-the-art accuracy in subjectivity, polarity, and irony detection on Italian tweets. We considered this model suitable for hate speech detection, since its training data are Italian tweets and the SENTIPOLC task is a classification task similar to ours. DBMDZ is instead trained on a more general domain, from a 13 GB dataset which includes a dump of Italian Wikipedia and texts from web pages selected from the Opus Corpora.³ We decided to test both transformer models, assessing their performance through a validation phase on a development set.

¹ https://github.com/marcopoli/AlBERTo-it
² https://huggingface.co/dbmdz/bert-base-italian-uncase
³ http://opus.nlpl.eu/

These transformers were used in the input stage of all our architectures, providing contextual embeddings for sentences that were fine-tuned during training.

We designed three architecture variants, which were employed as the basic building blocks to construct the ensembles:

• ALB-SINGLE: It consists of a first layer provided by the AlBERTo transformer, followed by a single neuron with a sigmoid activation function.

• DB-SINGLE: It follows the same structure as ALB-SINGLE; it just replaces AlBERTo with DBMDZ in the first layer.

• DB-MLP: Compared to DB-SINGLE, it adds a new dense layer, using a ReLU activation function, between the transformer and the output neuron.

The final model is an ensemble consisting of a number of instances of each of the above architectures. For each architecture, e.g. ALB-SINGLE, we construct instances in the following way. After initializing the weights randomly within a given interval and generating the training data by applying the bootstrap technique to the original dataset, we train the model. When that phase is over, we insert the resulting model in the ensemble. We repeat this process several times with different random weight initializations. Note that, due to the random initialization, no two classifiers in the ensemble are identical to each other. More formally, the model consists of $N$ elements,

$N = N_{AL} + N_{DB} + N_{MLP}$

where $N_{AL}$, $N_{DB}$, $N_{MLP}$ represent, respectively, the number of instances of ALB-SINGLE, DB-SINGLE and DB-MLP classifiers.

In retrospect, it might have been worthwhile to consider instances of each architecture that vary more thoroughly than just in the initial weights, for example by changing the hyper-parameters or the number of layers.

Our classification algorithm is a slight generalization of classical majority voting, in which the results of all members of the ensemble are collected and the class with the majority of predictions is output. Our version repeats the vote over several iterations, each on a random sample of the ensemble, and outputs the class that wins the majority of the iterations. The process, described by Algorithm 1, performs $n_{run}$ iterations. During the $i$-th iteration, the algorithm starts by sampling randomly from the ensemble a given number of instances for each type of classifier (lines 3-5) and initializing to 0 the variable class1, which contains the total number of votes that the hate class receives during the iteration (line 7). It then collects the predictions of the selected models on the tweet $t$ (lines 8-10). $cl(t) \in \{0, 1\}$ represents the prediction of classifier $cl$ for the tweet $t$; in particular, $cl(t) = 1$ if and only if $cl$ classifies $t$ as hateful. The output of iteration $i$ is the most predicted class (line 11). The final result of the algorithm is then the class $c_{final} \in \{0, 1\}$ which obtained the most votes over all the $n_{run}$ iterations (lines 13-14). If $c_{final} = 1$, the tweet $t$ has been classified as hateful.

Algorithm 1 Classification Algorithm
Input: $t$: the tweet to classify.
Input: $(n_{AL}, n_{DB}, n_{MLP})$: number of classifiers of each type to be sampled.
Input: $(N_{AL}, N_{DB}, N_{MLP})$: number of classifiers of each type in the ensemble.
Input: $n_{run}$: number of desired iterations.
Output: $c_{final}$: predicted class
 1: preds = []
 2: for run = 1 to $n_{run}$ do
 3:     albs = sample_al($n_{AL}$, $N_{AL}$)
 4:     dbs = sample_db($n_{DB}$, $N_{DB}$)
 5:     mlps = sample_ml($n_{MLP}$, $N_{MLP}$)
 6:     sampled_classif = albs ∪ dbs ∪ mlps
 7:     class1 = 0                  // votes for class 1
 8:     for cl in sampled_classif do
 9:         class1 += cl(t)         // cl's classification
10:     end for
11:     preds[run] = $\left( class1 \geq \left\lceil \frac{n_{AL} + n_{DB} + n_{MLP}}{2} \right\rceil \right)$
12: end for
13: $c_{final} = \left( \sum_{i=1}^{n_{run}} preds[i] \geq \left\lceil \frac{n_{run}}{2} \right\rceil \right)$
14: return $c_{final}$

A simpler variant of the algorithm would be to just add up the counts of each class by all classifiers in all iterations and return the class with the highest count. We plan to compare these two approaches in future work.

4 Experiments

In this section we describe the experiments we performed to tune the hyper-parameters of our model. We focus on the search for the best values of $n_{DB}$, $n_{AL}$, $n_{MLP}$, that is, how many instances to select at each iteration of the classification algorithm.

Before starting the experiments, we divided the dataset into two disjoint subsets, a development set and an internal test set, in the proportion of 80% and 20%, respectively. The split was done by means of Stratified Sampling, according to the distribution of the target variable hs. We applied the Stratified 3-fold-CV technique to validate our model. Given that we are solving a binary classification problem, we picked Binary Cross Entropy as our loss. We chose AdamW as our optimizer and set the first 10% of the total steps as warmup steps. We conducted the experiments on a GPU offered by Google Colab⁴. Our models are implemented in PyTorch (Paszke et al., 2019). To extract as much information as possible from input texts, we preprocessed them through hashtag segmentation by means of Tweet Preprocessor.⁵ We also converted emojis into their Italian description by using the emoji⁶ and Google Translate⁷ libraries.

⁴ https://colab.research.google.com/
⁵ https://pypi.org/project/tweet-preprocessor/
⁶ https://pypi.org/project/emoji
⁷ https://pypi.org/project/googletrans/

We analyzed the behaviour of the three baseline architectures we planned to include in the ensemble.

We trained each model for a maximum of 4 epochs, using a batch size of 16 and setting the maximum text length to 100. A grid search revealed that the optimal learning rate for DB-MLP is $5 \cdot 10^{-5}$, and $6 \cdot 10^{-5}$ for the remaining models. The optimal number of neurons in the hidden layer of DB-MLP is 50.

Classifier    Macro-F1   Std
ALB-SINGLE    76.896     0.7266
DB-SINGLE     77.613     0.3251
DB-MLP        78.562     0.521

Table 1: Results of the experiments comparing the baseline architectures. We report the mean and the standard deviation of the F1 score computed with respect to the 3 validation folds.

Table 1 highlights the following aspect: DB-SINGLE achieves better performance than ALB-SINGLE, even though the dataset used to train AlBERTo was composed of a large collection of tweets. The obtained values of the Macro-F1 are the baselines of our work.

We now describe the results obtained through the ensemble model. To build the classifier, we trained 30 instances of each architecture, keeping the same hyper-parameters obtained from the previous grid search. We thus set:

$N_{AL} = N_{DB} = N_{MLP} = 30$

We noted that the generalization capability of the ensemble is strictly related to the triple $(n_{DB}, n_{MLP}, n_{AL})$, so we performed another grid search, looking for the optimal combination of the three parameters. Table 2 shows the five best configurations found by this search. The optimal values for the triple, (20, 25, 30), allow the ensemble to achieve a Macro-F1 of 80.057, a gain of about 1.5 points with respect to the score of a single DB-MLP (78.562, see Table 1).

n_DB   n_MLP   n_AL   Macro-F1   Std
20     25      30     80.057     0.534
15     20      25     80.038     0.580
15     30      30     80.036     0.585
15     25      30     80.026     0.563
15     30      15     80.020     0.481

Table 2: Ranking of the 5 best configurations we found, varying the number of instances selected from the ensemble. $n_{DB}$ stands for the number of instances of the DB-SINGLE model, and similarly for $n_{MLP}$ and $n_{AL}$. We report the mean and the standard deviation of the F1 score computed with respect to the 3 validation folds.

We analyzed the contribution of each architecture individually to the ensemble combination. As shown in Table 3, the best results are obtained with instances of all three architectures. Nevertheless, the results presented in Table 2 show that sampling fewer instances of some architectures, as in (20, 25, 30), achieves a slightly higher Macro-F1 than sampling all 30 instances of each (80.057 vs. 79.832).

n_DB   n_MLP   n_AL   Macro-F1   Std
30     0       0      79.074     0.300
0      30      0      79.581     0.3787
0      0       30     79.482     0.596
30     30      30     79.832     0.525

Table 3: Scores by each architecture, both individually and together in the ensemble. We report the average value and the standard deviation of the F1 score computed with respect to the 3 validation folds.

We picked the first configuration from Table 2 for our final model and tested it on the internal test set, obtaining the results shown in Table 4.

Accuracy   Precision   Recall   F1
79.313     78.510      78.685   78.592

Table 4: Results of the final model on the internal test set.

5 Results and Discussion

The results of our final model applied to the data of the two official test sets of the competition are shown in Table 5. The model performs well on the in-domain dataset, reaching the 4th position in the rankings. However, it did not rank as well in detecting hate speech on the out-of-domain dataset, obtaining a Macro-F1 of just 65.46. The low recall for the hate class (31.49) highlights that the model fails too often to identify news headlines containing some form of hate speech. In comparison with the official top rankings, listed in Table 6, our model scored about 12 points below the top score of 77.44.

            NOT HATE                       HATE
Test set    Precision   Recall   F1       Precision   Recall   F1       Macro-F1   Position
Tweets      81.93       72.85    77.12    74.89       83.44    78.94    78.03      4
News        71.88       99.37    83.42    96.61       31.49    47.50    65.46      17

Table 5: Results of the submitted model on the official blind test sets.

Tweets                        News
Position   F1 score           Position   F1 score
1          80.88              1          77.44
2          78.97              2          73.14
3          78.93              3          72.56
4          78.03 (ours)       4          71.83
5          77.82              5          70.2
6          77.66              17         65.46 (ours)

Table 6: Comparison between our final results and the top-5 F1-scores. The values are taken from the official rankings.

Surprised by this fact, we investigated more deeply, looking for an explanation for such a poor result on the out-of-domain dataset.

We randomly sampled from the test set some hateful headlines missed by the model, some of which are shown in Table 7.

Hateful News Headlines

anziana rapinata sull'autobus, i due nomadi in fuga si rifugiano al campo di via Candoni
(elderly woman robbed on the bus, the two fleeing nomads take refuge at the camp on via Candoni)

Expo: Bordonali, richiedenti asilo in campo base simbolo fallimento governo.
(Expo: Bordonali, asylum seekers in base camp government failure symbol.)

Il cardinale Müller: "non possiamo pregare come o con i musulmani"
(Cardinal Müller: "we cannot pray like or with Muslims")

Salvini: "Il calcio? Rimpiango i tre stranieri in campo"
(Salvini: "Soccer? I regret the three foreigners on the field")

Table 7: Examples of hateful headlines, randomly picked from the out-of-domain test set, that are misclassified by our model.

In these headlines, the qualification as hate is implicit and harder to recognize, since it seems due more to the presence of stereotypes (nomads, asylum seekers, Muslims, foreigners) than to the presence of explicit hate expressions.

Broadly speaking, we identified some possible reasons for the difference in performance across the two test sets:

• Linguistic register: Tweets often exhibit a more informal and colloquial language, while headlines employ a more formal lexicon and a more objective tone. This is a crucial difference in identifying hateful messages: while in tweets the feeling of hatred transpires clearly and directly, in headlines this message is conveyed in a more subtle way, often alluding to concepts from political propaganda or common stereotypes. Prior knowledge about the subject and inference might be necessary to decipher the presence of hate. Examining the entire body of the article might have been helpful.

• Length of text: Tweets are usually longer than news headlines. Thus, the model has fewer elements to exploit to correctly classify a piece of news.

These difficulties seem to be shared with other submissions, which all got lower scores on the out-of-domain dataset. We expected that pretrained contextual embeddings would be more effective in addressing the domain adaptation issue. Further experiments would be needed to improve the resilience of our model.

6 Conclusions

We described an ensemble of neural classifiers, relying on contextual embeddings from transformers, for automated detection of hateful content in Italian texts. We presented the general architecture of our base classification models and how they were combined into an ensemble through a bagging technique. We performed extensive experiments to tune our models and the ensemble on a validation set. The results achieved by our ensemble model on the in-domain test set confirm its ability to detect hateful tweets; however, the same model performed poorly on the out-of-domain dataset, showing in particular an inability to adapt to news headlines. We plan to investigate this issue in future research.

References

Valerio Basile, Mirko Lai, and Manuela Sanguinetti. 2018. Long-term social media data collection at the University of Turin. In Elena Cabrio, Alessandro Mazzei, and Fabio Tamburini, editors, Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018, volume 2253 of CEUR Workshop Proceedings. CEUR-WS.org.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

L. Breiman. 1996. Bagging predictors. Machine Learning, 24:123–140.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Marco Polignano and Pierpaolo Basile. 2018. HanSEL: Italian hate speech detection through ensemble learning and deep neural networks. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy, December 12-13, 2018, volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

Marco Polignano, Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Raffaella Bernardi, Roberto Navigli, and Giovanni Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

Julian Risch and Ralf Krestel. 2020. Bagging BERT models for robust aggression identification. In Ritesh Kumar, Atul Kr. Ojha, Bornini Lahiri, Marcos Zampieri, Shervin Malmasi, Vanessa Murdock, and Daniel Kadar, editors, Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, TRAC@LREC 2020, Marseille, France, May 2020, pages 55–61. European Language Resources Association (ELRA).

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November. Association for Computational Linguistics.

Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çagri Çöltekin. 2020. SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020). CoRR, abs/2006.07235.
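
As a supplement to the description of Algorithm 1 in Section 3, the following is a minimal Python sketch of the two-level voting procedure. It is not taken from the authors' repository. The function name `ensemble_vote`, the dictionary-based pools and the constant stand-in classifiers `always_hate` and `never_hate` are hypothetical, introduced only for illustration, and the sketch assumes that classifiers are sampled without replacement, which the paper does not state explicitly.

```python
import math
import random
from typing import Callable, Dict, List

# A classifier is any callable mapping a text to 0 (not hate) or 1 (hate).
Classifier = Callable[[str], int]


def ensemble_vote(
    text: str,
    pools: Dict[str, List[Classifier]],
    n_sample: Dict[str, int],
    n_run: int,
    rng: random.Random,
) -> int:
    """Two-level majority vote in the spirit of Algorithm 1."""
    preds = []
    for _ in range(n_run):
        # Lines 3-6: draw n_sample[kind] classifiers from each pool.
        # Assumption: sampling is done without replacement.
        sampled: List[Classifier] = []
        for kind, pool in pools.items():
            sampled.extend(rng.sample(pool, n_sample[kind]))
        # Lines 7-10: count the votes for class 1 (hate).
        class1 = sum(cl(text) for cl in sampled)
        # Line 11: the iteration predicts hate if the vote count reaches
        # the ceiling of half the number of sampled classifiers.
        preds.append(int(class1 >= math.ceil(len(sampled) / 2)))
    # Line 13: the final class is hate if it wins at least
    # ceil(n_run / 2) iterations.
    return int(sum(preds) >= math.ceil(n_run / 2))


# Hypothetical stand-ins for fine-tuned models: constant classifiers.
def always_hate(text: str) -> int:
    return 1


def never_hate(text: str) -> int:
    return 0


# 30 instances per architecture, sampled as (n_DB, n_MLP, n_AL) = (20, 25, 30).
pools = {
    "ALB-SINGLE": [always_hate] * 30,
    "DB-SINGLE": [always_hate] * 30,
    "DB-MLP": [never_hate] * 30,
}
n_sample = {"ALB-SINGLE": 30, "DB-SINGLE": 20, "DB-MLP": 25}
label = ensemble_vote("esempio", pools, n_sample, n_run=5, rng=random.Random(0))
```

In this toy setting 50 of the 75 sampled classifiers vote for the hate class in every iteration, so `label` is 1. Because both thresholds use `>=` with a ceiling, an exact tie with an even number of voters is resolved in favour of the hate class, which follows from lines 11 and 13 as written.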
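
The relation between the per-class figures and the Macro-F1 column of Table 5 can be checked with a few lines of Python. The sketch below assumes the usual definitions: F1 is the harmonic mean of precision and recall, and Macro-F1 is the unweighted mean of the per-class F1 scores. The helper names `f1` and `macro_f1` are illustrative, and since the inputs are the rounded values printed in the table, agreement is expected only up to rounding.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1: list) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# (precision, recall) pairs copied from Table 5, in percent.
table5 = {
    "Tweets": {"NOT HATE": (81.93, 72.85), "HATE": (74.89, 83.44)},
    "News": {"NOT HATE": (71.88, 99.37), "HATE": (96.61, 31.49)},
}

scores = {
    test_set: macro_f1([f1(p, r) for p, r in classes.values()])
    for test_set, classes in table5.items()
}

for test_set, score in scores.items():
    print(f"{test_set}: Macro-F1 = {score:.2f}")
```

Under these assumptions the recomputed values agree with the reported 78.03 (Tweets) and 65.46 (News) to within rounding. On the News set, the hate-class recall of 31.49 pulls the hate-class F1 down to about 47.5, which accounts for most of the drop in Macro-F1.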