UR NLP @ HaSpeeDe 2 at EVALITA 2020: Towards Robust Hate Speech Detection with Contextual Embeddings

Julia Hoffmann (University of Regensburg) Julia1.Hoffmann@ur.de
Udo Kruschwitz (University of Regensburg) Udo.Kruschwitz@ur.de

Abstract

We describe our approach to address Task A of the EVALITA 2020 Hate Speech Detection (HaSpeeDe 2) challenge. We submitted two runs that are both based on contextual embeddings, which we chose due to their effectiveness in solving a wide range of NLP problems. For our baseline run we use stacked embeddings that serve as features in a linear SVM. Our second run is a simple ensemble of three SVMs with majority voting. Both approaches outperform the official baselines by a large margin, and the ensemble classifier in particular demonstrates robust performance on different types of test data, coming 6th (out of 27 runs) for news headlines and 10th (out of 27) for Twitter feeds.

1 Introduction

Hate speech in social media (and its automatic detection) has become a major problem in recent years. It can be generically defined as "language that is used to express hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group" (Davidson et al., 2017) and is often based on aspects like race, religion, ethnicity, and gender. The problem is that what is considered acceptable for some might not be for others. In addition, there is a fine line between freedom of expression on the one hand and censorship and illegal discrimination on the other (Zimmerman et al., 2018). In fact, this fine balance is reflected in the fundamental human rights (as outlined in articles 19 and 20 of (The United Nations, 1948) and (The United Nations General Assembly, 1966)), which simultaneously provide rights to freedom of expression and prevent censorship and illegal discrimination. All this contributes to making the automatic detection of hate speech a challenging task.

Nevertheless, social media platforms such as Twitter have defined clear guidelines prohibiting hateful conduct.[1] Accounts with such content can be reported and are subsequently deleted. The challenge is to be able to detect such content automatically with both high precision and high recall.

The EVALITA evaluation campaign introduced a hate speech detection challenge applied to Italian social media in 2018 (Bosco et al., 2018). Its success led to the continuation of the challenge in 2020, now called HaSpeeDe 2, which is split into three subtasks (Sanguinetti et al., 2020). This report discusses the two runs that we submitted to HaSpeeDe 2 Task A of EVALITA 2020 (Basile et al., 2020). We will first give some background on the problem aimed at motivating our choice of approach. We will then introduce our systems, report results and discuss some findings. We will also outline some scope for future developments.

[1] https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Background

We will provide some background that should motivate the system architectures we developed. There are several aspects to be mentioned here.

First of all, given the impressive advances in a broad range of natural language processing tasks using transformer-based architectures (Vaswani et al., 2017) capturing contextual embeddings – most prominently utilizing the various flavours of BERT (Devlin et al., 2019) – we decided to adopt a transformer architecture as well. There are two ways language models such as BERT can be used: pre-training plus fine-tuning, or feature-based without fine-tuning.

This leads us to the next design decision.
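To make the feature-based option concrete, the following minimal sketch trains only a light-weight classifier on top of a frozen encoder. This is illustrative rather than our actual implementation: the `encode` function is a stand-in for a frozen language model (in our system that role is played by document embeddings over XLM-R), and all names, labels and dimensions are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def encode(texts):
    """Stand-in for a frozen contextual encoder (e.g. XLM-R).

    Returns one fixed-size vector per document; the encoder itself is never
    updated, which is what 'feature-based without fine-tuning' means."""
    rng = np.random.RandomState(0)
    return rng.randn(len(texts), 16)  # placeholder 16-dimensional "embeddings"

texts = ["esempio uno", "esempio due", "esempio tre", "esempio quattro"]
labels = [0, 1, 0, 1]  # 0 = no hate speech, 1 = hate speech

X = encode(texts)                 # features come from the frozen model ...
clf = LinearSVC().fit(X, labels)  # ... and only the SVM is trained
```

In the fine-tuning route, by contrast, the encoder's own weights would also be updated on the task data.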
The winning team in the 2018 HaSpeeDe competition, ItaliaNLP, submitted as one of their runs an SVM with three different feature categories – raw and lexical text, morpho-syntactic features, and lexicon features – which performed extremely well, in particular when trained and tested on Twitter data (Cimino et al., 2018). Rather than designing an end-to-end neural architecture that would be fine-tuned on the available training data, we therefore opted for a simpler and slightly more transparent architecture with an SVM backbone as our classifier, i.e. the feature-based approach mentioned above.

Ensemble methods have repeatedly been shown to outperform individual classifiers for a variety of tasks including hate speech detection. For example, an ensemble of ten simple neural classifiers proposed by (Zimmerman et al., 2018) outperformed a BERT-based approach on the standard HatebaseTwitter benchmark dataset (MacAvaney et al., 2019). Other recent examples that demonstrate the effectiveness of ensemble methods for hate speech detection include (Alonso et al., 2020; Nourbakhsh et al., 2019; Seganti et al., 2019; Zampieri et al., 2020; Badjatiya et al., 2017; Park and Fung, 2017). We should add that these findings are not limited to the area of hate speech detection, as ensemble methods have a long history of being successfully utilized in a broad range of machine learning approaches, e.g. (Molteni et al., 1996). Simple but effective ensemble approaches have also been used, for example, in sentiment classification of tweets, e.g. (Hagen et al., 2015), and other social media classification tasks.

Finally, given the task definition – in which the classifier was to be trained on social media data but then tested on both social media and news headlines – we were aiming at an approach that would perform robustly across domains rather than being tailored specifically to one type of data.

One additional motivation for our work is the intention to develop approaches that can be applied to different languages (we will get back to that point when we outline future directions).

We will now demonstrate how those motivating considerations lead to the system architecture we propose.

3 System Architecture

We submitted two runs, of which the first one can be considered our own baseline approach. We first present both architectures at a conceptual level and will go into the technical details when we discuss the experimental setup in the next section. Our runs are:

• Model 1: Stacked embeddings as features of a linear SVM

• Model 2: Ensemble of several SVMs with different text representations – both contextual embeddings and TF-IDF-based.

Both models can be realised in many different ways. The core idea, as motivated before, is to experiment with transformer-based contextual embeddings but to avoid fine-tuning and instead deploy a traditional, more transparent SVM classifier. The ensemble can consist of a variety of different systems that can be aggregated in many ways. In this paper (and as submitted) we treat each system as equally important and use a simple majority vote.

Stacked embeddings have been shown to be effective in NLP applications, e.g. (Akbik et al., 2018; Akbik et al., 2019). Conceptually there is some similarity to ensemble approaches in that a combination of differently derived embedding models turns out to be more effective than each approach individually.

3.1 Model 1: Stacked embeddings + SVM

Our own baseline model combines two different document embeddings – transformer document embeddings and document pool embeddings – which are then fed into a linear SVM to train a classifier. We keep the architecture deliberately simple.

There is a wide range of transformer-based language models. One of our motivations was to train a classifier that generalises beyond a specific domain but also has the potential to generalise beyond a specific language. We therefore opted for XLM-RoBERTa (XLM-R), which has been shown to outperform alternative multilingual models such as mBERT in various NLP tasks (Conneau et al., 2020). XLM-R builds on XLM and RoBERTa and is trained on data covering 100 languages from a very large (2TB) CommonCrawl corpus. Transformer document embeddings are obtained from (the large version of) XLM-R. In addition we use document pool embeddings, which pool word embeddings obtained with Flair (Akbik et al., 2019). The exact experimental choices are described further down.

3.2 Model 2: Ensemble of SVMs

Our second system is an ensemble classifier consisting of three SVMs, each trained on a different text representation, namely:

• Transformer document embeddings using XLM-R

• Document pool embeddings

• Straightforward TF-IDF.

The first two of these are exactly the same as in Model 1, except that they are not stacked but fed into different classifiers. Again the general setup is kept simple to avoid overfitting to the specific problem at hand, thereby allowing more scope for future experiments.

4 Experimental Setup

We applied our systems to Task A – Hate Speech Detection (Main Task).

4.1 Data Sets

Training and test data are briefly described here.

• Training Data Set: the training data set consists of 6,839 tweets in total, 2,766 of them classified as hate speech. The corpus has three columns: tweet ID, text and label (0 = no hate speech, 1 = hate speech). Table 1 summarises these numbers.

Label   Training Data Set
0       4,073
1       2,766
Total   6,839

Table 1: Training Data

• Test Data Set: unlike the training data, which consists entirely of tweets, there were two sets of test data, the first one sampled from Twitter and the second one from news headlines. The Twitter test set has 1,263 entries in total, the news test set 500. The two columns in both sets are the ID and the text of the tweet or news headline, respectively. The classes 0 and 1 in the Twitter test set comprise 641 and 622 tweets, respectively. In the news headline test set, 319 entries have the label 0 and 181 the label 1 (see Table 2).

Label   Twitter Test Set   News Test Set
0       641                319
1       622                181
Total   1,263              500

Table 2: Test Data

4.2 Data Preprocessing

In line with our overall aim of simplicity and generalisability (rather than tuning) we applied a simple pre-processing pipeline that applies to Twitter data as well as news headlines. There are only small variations in the normalization steps, as follows.

For any embedding-based processing the text was lower-cased and punctuation was removed, so that any input, be it tweet or news headline, is represented as a string of unpunctuated tokens. For the calculation of our (sparse) TF-IDF representation the text was tokenized and, in addition, stopwords were removed. After that each token was vectorized using TF-IDF. Figure 1 shows an overview of the preprocessing.

Figure 1: Data Preprocessing

4.3 Implementation

All implementation was done in Python. For all text and document embeddings we used flairNLP.[2] Our SVMs were developed using scikit-learn (Pedregosa et al., 2011), and for the preprocessing of the TF-IDF version and the TF-IDF calculation we used NLTK[3] and scikit-learn.

Stacked embeddings + SVM: as outlined, we use stacked embeddings composed of Transformer Document and Document Pool Embeddings. The Transformer Document Embeddings are obtained using XLM-R. The Document Pool Embeddings are calculated using mean-pooling over all word embeddings; these consist of forward and backward embeddings for the Italian language as provided by flair (Akbik et al., 2018) and as recommended. An overview is given in Figure 2.

Figure 2: Embeddings in our Baseline (Model 1)

Flair allows for the easy combination of embeddings to create stacked embeddings – one for each input text. These vectors (together with the labels) are then used to train the SVM. Using grid search on the training data the most suitable parameter settings were determined, and Table 3 specifies the settings used in the submitted run.

Parameter   Value
C           1.0
kernel      'linear'
degree      3
gamma       1

Table 3: Parameters of the SVM (Baseline)

Ensemble of SVMs: three different feature representations are used to train one SVM each, as illustrated in Table 4. The first two incorporate the same representations as already seen in Figure 2.

Classifier   Features
SVM2.1       Transformer Document Embeddings
SVM2.2       Document Pool Embeddings
SVM2.3       TF-IDF

Table 4: Overview of SVM Ensemble

Again we used grid search for parameter tuning (see Table 5).

Parameter   SVM2.1     SVM2.2     SVM2.3
C           1.0        1.0        1.0
kernel      'linear'   'linear'   'rbf'
degree      3          3          3
gamma       1          1          1

Table 5: Parameters of the SVMs for Model 2 (Ensemble of SVMs)

Input is run against each classifier, and the final classification category is determined by majority voting over these three predictions.

[2] https://github.com/flairNLP/flair
[3] https://www.nltk.org

5 Results

We first present detailed results and then discuss our findings and insights. We start with our baseline approach and then move on to the classifier ensemble. Macro-F1 is the official metric for this competition. In addition we report Precision, Recall and F1 at category level and include confusion matrices for each approach (Model 1 and Model 2) and test set (Twitter data and news headlines). There were 27 runs submitted for each dataset, and the official baseline was a linear SVM with TF-IDF over word and character n-grams.

5.1 Model 1: Our Baseline

Twitter Data: Training and testing on Twitter data results in a Macro-F1 score of 0.7399, which corresponds to position 16 (out of 27). The official task baseline is 0.7212. Details are displayed in Table 6 and Figure 3.

Metric      0        1
Precision   0.7722   0.7137
Recall      0.6927   0.7894
F1          0.7303   0.7496

Table 6: Results: Model 1 (Stacked embeddings + SVM) on Twitter Data

Figure 3: Confusion Matrix: Model 1 (Stacked embeddings + SVM) on Twitter Data (p = predicted, t = true)

News Headlines: On the news headlines test data we obtain a Macro-F1 of 0.6684 against an official baseline of 0.6210 (rank 12). More details are in Table 7 and Figure 4.

Metric      0        1
Precision   0.7356   0.6780
Recall      0.8809   0.4420
F1          0.8017   0.5351

Table 7: Results: Model 1 (Stacked embeddings + SVM) on News Data

Figure 4: Confusion Matrix: Model 1 (Stacked embeddings + SVM) on News Data (p = predicted, t = true)

5.2 Model 2: Ensemble

Twitter Data: Our ensemble approach achieves a Macro-F1 of 0.7599 (rank 10). More details are included in Table 8 and Figure 5.

Metric      0        1
Precision   0.7894   0.7349
Recall      0.7192   0.8023
F1          0.7527   0.7671

Table 8: Results: Model 2 (Ensemble of SVMs) on Twitter Data

Figure 5: Confusion Matrix: Model 2 (Ensemble of SVMs) on Twitter Data (p = predicted, t = true)

News Headlines: On the news headlines test data we obtain a Macro-F1 of 0.6984 against an official baseline of 0.6210 (rank 6). More details can be found in Table 9 and Figure 6.

Metric      0        1
Precision   0.7445   0.8280
Recall      0.9498   0.4254
F1          0.8347   0.5620

Table 9: Results: Model 2 (Ensemble of SVMs) on News Data

Figure 6: Confusion Matrix: Model 2 (Ensemble of SVMs) on News Data (p = predicted, t = true)

6 Discussion

The first observation we derive from the results is that the ensemble approach we proposed for this task provides robust and solid performance – solid in that it scores well in the ranked list of systems, and robust in that it also ranks highly when applied to out-of-domain data (coming 6th out of 27 submitted runs on data it had not been trained on).
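For concreteness, the core of the majority-voting ensemble (Model 2) can be sketched as follows. This is a minimal sketch, not our exact submission code: the three feature matrices are random stand-ins for the transformer, pool and TF-IDF representations, while the SVM hyperparameters mirror Table 5.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 100)  # binary hate-speech labels
# Three stand-in feature views (transformer, pool and TF-IDF in the real system)
views = [rng.randn(100, 8) + y[:, None] for _ in range(3)]

# SVM2.1 and SVM2.2 use a linear kernel, SVM2.3 an RBF kernel (cf. Table 5)
clfs = [SVC(C=1.0, kernel="linear").fit(views[0], y),
        SVC(C=1.0, kernel="linear").fit(views[1], y),
        SVC(C=1.0, kernel="rbf", gamma=1).fit(views[2], y)]

# Each classifier votes on its own view; two or more of three votes decide
votes = np.stack([clf.predict(X) for clf, X in zip(clfs, views)])
majority = (votes.sum(axis=0) >= 2).astype(int)
```

Because the three classifiers are treated as equally important, no weighting or confidence calibration is needed; a tie is impossible with an odd number of voters.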
Given the simplicity of our system architecture and the composition of the official baseline system, we also note the superiority of transformer-based contextual embeddings over bag-of-words approaches (while this comes as no surprise, it is still worth pointing out). Moving from a feature-based to a pre-training plus fine-tuning approach would almost certainly push the scores up further.

Looking at the balance between precision and recall, we find that both our approaches have a tendency to return a fair number of false positives for the Twitter data set. This could indicate that words and phrases used to express hateful content are quite common in social media even when they do not actually represent hate speech. On the other hand, we record a large proportion of false negatives when classifying news headlines. This could be an indicator of a more subtle way in which hate speech is expressed in traditional news outlets.

Generally speaking, both models perform better on Twitter data than on news headlines – again an insight that was to be expected given the training data. However, the fact that our approach managed to score higher in the ranked list of systems for data it was not trained on confirms our initial assumption: that using a corpus with a very broad range of topics, styles and languages as our core language model would help the system transfer more easily to unseen input.

This leads us to an area of future research. While it would be possible to improve the performance of our system by making the preprocessing, the language model and any fine-tuning step match the expected test data more closely – e.g. by using AlBERTo, a BERT-based transformer trained on Italian Twitter data (Polignano et al., 2019) – we are actually aiming at something else. As part of the COURAGE research project[4] we are exploring ways to help teenagers manage social media exposure by providing a virtual companion that would, among other things, automatically identify examples of hate speech, bullying or other toxic content. Given that this is a multi-national effort, we are interested in architectures that work for languages including Italian, Spanish, German and English with as little fine-tuning as possible. The ensemble introduced here, with its multilingual transformer backbone, is a step in that direction.

[4] https://www.upf.edu/web/courage

7 Conclusion

We presented a simple but effective architecture to detect hate speech in Italian social media and news headlines. Our ensemble-based architecture relies on contextual embeddings trained on a large multilingual corpus, which we see as the basis for the robustness of the approach. There is plenty of room for further improvement, and the results we report here will serve as a benchmark in this development.

Acknowledgements

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564.

References

A. Akbik, D. Blythe, and R. Vollgraf. 2018. Contextual string embeddings for sequence labeling. In E. M. Bender, L. Derczynski, and P. Isabelle, editors, Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 1638–1649. Association for Computational Linguistics.

A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Demonstrations, pages 54–59, Minneapolis, Minnesota, June. Association for Computational Linguistics.

P. Alonso, R. Saini, and G. Kovács. 2020. Hate Speech Detection Using Transformer Ensembles on the HASOC Dataset. In A. Karpov and R. Potapova, editors, Speech and Computer, pages 13–21, Cham. Springer International Publishing.

P. Badjatiya, S. Gupta, M. Gupta, and V. Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

V. Basile, D. Croce, M. Di Maro, and L. C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

C. Bosco, F. Dell'Orletta, F. Poletto, M. Sanguinetti, and M. Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In T. Caselli, N. Novielli, V. Patti, and P. Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

A. Cimino, L. De Mattei, and F. Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In T. Caselli, N. Novielli, V. Patti, and P. Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), volume 2263 of CEUR Workshop Proceedings. CEUR-WS.org.

A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, editors, Proceedings of ACL, pages 8440–8451. Association for Computational Linguistics.

T. Davidson, D. Warmsley, M. W. Macy, and I. Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017, pages 512–515. AAAI Press.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.

M. Hagen, M. Potthast, M. Büchner, and B. Stein. 2015. Webis: An ensemble for twitter sentiment detection. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 582–589.

S. MacAvaney, H. Yao, E. Yang, K. Russell, N. Goharian, and O. Frieder. 2019. Hate speech detection: Challenges and solutions. PLoS ONE, 14:1–16.

F. Molteni, R. Buizza, T. N. Palmer, and T. Petroliagis. 1996. The ECMWF ensemble prediction system: Methodology and validation. Quarterly Journal of the Royal Meteorological Society, 122(529):73–119.

A. Nourbakhsh, F. Vermeer, G. Wiltvank, and R. van der Goot. 2019. sthruggle at SemEval-2019 task 5: An ensemble approach to hate speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 484–488, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.

J. H. Park and P. Fung. 2017. One-step and two-step classification for abusive language detection on twitter. In Proceedings of The First Workshop on Abusive Language Online, pages 41–45. Association for Computational Linguistics.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, and V. Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In R. Bernardi, R. Navigli, and G. Semeraro, editors, Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings. CEUR-WS.org.

M. Sanguinetti, G. Comandini, E. Di Nuovo, S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti, and I. Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

A. Seganti, H. Sobol, I. Orlova, H. Kim, J. Staniszewski, T. Krumholc, and K. Koziel. 2019. NLPR@SRPOL at SemEval-2019 task 6 and task 5: Linguistically enhanced deep learning offensive sentence classifier. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 712–721, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.

The United Nations. 1948. Universal Declaration of Human Rights. The United Nations, December.

The United Nations General Assembly. 1966. International covenant on civil and political rights. Treaty Series, 999:171, December.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, USA. Curran Associates Inc.

M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin. 2020. SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020). CoRR, abs/2006.07235.

S. Zimmerman, U. Kruschwitz, and C. Fox. 2018. Improving Hate Speech Detection with Deep Learning Ensembles. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).