RuG @ EVALITA 2018: Hate Speech Detection in Italian Social Media

Xiaoyu Bai*, Flavio Merenda*+, Claudia Zaghi*, Tommaso Caselli*, Malvina Nissim*
* Rijksuniversiteit Groningen, Groningen, The Netherlands
+ Università degli Studi di Salerno, Salerno, Italy
f.merenda|t.caselli|m.nissim@rug.nl
x.bai.5|c.zaghi@student.rug.nl

Abstract

English. We describe the systems the RuG Team developed in the context of the Hate Speech Detection Task in Italian Social Media at EVALITA 2018. We submitted a total of eight runs, participating in all four subtasks. The best macro-F1 score in all subtasks was obtained by a Linear SVM using hate-rich embeddings. Our best system obtains competitive results, ranking 6th (out of 14) in HaSpeeDe-FB, 3rd (out of 15) in HaSpeeDe-TW, 8th (out of 13) in Cross-HaSpeeDe FB, and 6th (out of 13) in Cross-HaSpeeDe TW.

Italiano (English translation). We illustrate the details of the two systems that the RuG Team developed within the evaluation exercise on the recognition of hate messages in Italian Social Media texts. We participated in all four subtasks, submitting a total of eight predictions. The best macro-F1 was obtained by an SVM using polarised embeddings, built by exploiting hate-rich content. Our best system obtained competitive results, ranking 6th (out of 14) in HaSpeeDe-FB, 3rd (out of 15) in HaSpeeDe-TW, 8th (out of 13) in Cross-HaSpeeDe FB, and 6th (out of 13) in Cross-HaSpeeDe TW.

1 Introduction

The use of "bad" words and "bad" language has been the battleground for freedom of speech for centuries. The spread of Social Media platforms, and especially of micro-blog platforms (e.g. Facebook and Twitter), has favoured the growth of on-line hate speech. Social media sites and platforms have been urged to deal with and remove offensive and/or abusive content, but the phenomenon is so pervasive that developing systems that automatically detect and classify offensive on-line content has become a pressing need (Bleich, 2014; Nobata et al., 2016; Kennedy et al., 2017).

The Natural Language Processing and Computational Social Science communities have been receptive to such urgency, and the automatic detection of abusive and/or offensive language, trolling, and cyberbullying (Waseem et al., 2017; Schmidt and Wiegand, 2017) has seen a growing interest. This has taken various forms: datasets in multiple languages (http://bit.ly/2RZUlKH), thematic workshops (https://sites.google.com/view/alw2018), and shared evaluation exercises, such as the GermEval 2018 Shared Task (Wiegand et al., 2018), and the SemEval 2019 Task 5: HatEval (http://bit.ly/2EEC7Me) and Task 6: OffensEval (http://bit.ly/2P7pTQ9). The EVALITA 2018 Hate Speech Detection task (haspeede, http://di.unito.it/haspeedeevalita18) (Bosco et al., 2018) also falls in the latter category, and focuses on the automatic identification of hate messages from Facebook comments and tweets in Italian. We participated in this shared task with two different models, exploiting the concept of polarised embeddings (Merenda et al., 2018). The details of our participation are the core of this paper. Code and outputs are available at https://github.com/tommasoc80/evalita2018-rug.
2 Task

The haspeede task derives from the harmonization of originally separate annotation efforts by two research groups, converging onto a uniform label granularity (Del Vigna et al., 2017; Poletto et al., 2017; Sanguinetti et al., 2018). For details on the data, see Section 3.1 and the task overview paper (Bosco et al., 2018).

The hate detection task is articulated in four binary (hate vs non-hate) sub-tasks, two in-domain and two cross-domain. The in-domain sub-tasks require training and test data to belong to the same text type, either Facebook (HaSpeeDe-FB) or Twitter (HaSpeeDe-TW), while the cross-domain sub-tasks require training on one text type and testing on the other: Facebook-Twitter (Cross-HaSpeeDe FB) and Twitter-Facebook (Cross-HaSpeeDe TW).

3 Data and Resources

All of our runs for all subtasks are based on supervised approaches, where data (and features) play a major role in the final results of a system. Furthermore, our contribution adopted a closed-task setting, i.e. we did not include any training data beyond what was provided within the task. We did, however, build enhanced distributed representations of words exploiting additional data (see Section 3.2). This section illustrates the datasets and language resources used in our submissions.

3.1 Resources Provided by the Organisers

The organizers provided a total of 6,000 labeled Italian messages for training, split as follows: 3,000 comments from Facebook and 3,000 messages from Twitter. For test, they subsequently made available 1,000 instances for each text type. Table 1 illustrates the distribution of the classes in the different text types, both in training and test data. Note that the distribution of labels in the test data was unknown at development time.

Table 1: Distribution of the labeled samples in the training and test data per text type.

Text type   Class      Training   Test
Facebook    non-hate   1,618      323
            hate       1,382      677
Twitter     non-hate   2,028      676
            hate       972        324

Although the task organisers balanced the datasets with respect to size and adopted the same annotation granularity (hate vs. non-hate), the two datasets are very different both in terms of class distribution (46.06% of messages labelled as hateful in Facebook vs. 32.40% in Twitter in training) and with regard to their contents. For instance, the Facebook data is concerned with general topics that may contain hateful messages, such as immigration, religion, politics, and gender issues, while the Twitter dataset is focused on specific targets, i.e., categories or groups of individuals who are likely to become victims of hate speech: migrants, Muslims, and Roma (the Romani, Romany, or Roma are an ethnic group of traditionally itinerant people who originated in northern India and are nowadays subject to ethnic discrimination). It is also interesting to note that the label distribution in the Facebook test data is flipped compared to training, with a strong majority of hateful comments.

3.2 Additional Resources: Source-Driven Embeddings

We addressed the task by adopting a closed-task setting. However, as a strategy to potentially increase the generalization capabilities of our systems and tune them towards better recognition of hate content, we developed hate- and offense-sensitive word embeddings. To do so, we scraped comments from a list of selected Facebook pages likely to contain offensive and/or hateful content in the form of comments to posts, extracting over 1M comments. We built word embeddings over the acquired data with the word2vec skip-gram model (Mikolov et al., 2013), using 300 dimensions, a context window of 5, and a minimum frequency of 1. In the remainder of this paper we refer to these representations as "hate-rich embeddings". More details on the creation process, including the complete list of Facebook pages used, and a preliminary evaluation of these specialised representations can be found in (Merenda et al., 2018).
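As an illustration, the reported hyperparameters correspond to the following minimal sketch, which uses gensim's word2vec implementation as a stand-in for the original word2vec tool; the input file name and the whitespace tokenisation are our own assumptions, not the authors' actual pipeline.

    # Training "hate-rich" embeddings: skip-gram, 300 dimensions,
    # context window 5, minimum frequency 1 (as reported above).
    from gensim.models import Word2Vec

    # Hypothetical input file: one scraped Facebook comment per line.
    with open("fb_comments.txt", encoding="utf-8") as f:
        sentences = [line.lower().split() for line in f]

    model = Word2Vec(
        sentences,
        sg=1,             # skip-gram
        vector_size=300,  # 300-dimensional vectors (size= in gensim < 4.0)
        window=5,         # context window of 5
        min_count=1,      # minimum frequency 1: keep every word
    )
    model.wv.save_word2vec_format("hate_rich_embeddings.txt")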
4 Systems and Runs

We detail in this section our final submissions. The models were developed in parallel to our participating systems at the GermEval 2018 Shared Task (Bai et al., 2018), sharing with them some core aspects.

4.1 Run 1: Binary SVM

Our first model is a Linear Support Vector Machine (SVM), built using the LinearSVC scikit-learn implementation (Pedregosa et al., 2011). We performed minimal pre-processing by removing stop words with the Python module stop-words (https://pypi.org/project/stop-words/) and lowercasing the tokens.

We used two groups of surface features, namely: i.) word n-grams in the range 1-3; and ii.) character n-grams in the range 2-4. The sparse vector representation of each (training) instance is then concatenated with its dense vector representation, as follows: for every word w in an instance i, we derive a 300-dimensional representation by means of a look-up in the hate-rich embeddings. We then perform max pooling over these word embeddings to obtain a 300-dimensional representation of the full instance. Words not covered by the hate-oriented embeddings are ignored. Finally, class weights are balanced and the SVM parameters use default values (C = 1).
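The feature construction just described can be sketched as follows; this is a minimal illustration under our own assumptions (file names, whitespace tokenisation, and the omission of the stop-word removal step), not the authors' released code.

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from gensim.models import KeyedVectors

    # Hate-rich embeddings from Section 3.2 (hypothetical file name).
    emb = KeyedVectors.load_word2vec_format("hate_rich_embeddings.txt")

    def max_pool(text, dim=300):
        # Dense instance vector: element-wise max over the embeddings
        # of the covered words; uncovered words are ignored.
        vecs = [emb[w] for w in text.split() if w in emb]
        return np.max(vecs, axis=0) if vecs else np.zeros(dim)

    word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 3))
    char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 4))

    def featurise(texts, fit=False):
        # Sparse word/character n-grams concatenated with the dense
        # max-pooled embedding representation of each instance.
        if fit:
            sparse = hstack([word_ngrams.fit_transform(texts),
                             char_ngrams.fit_transform(texts)])
        else:
            sparse = hstack([word_ngrams.transform(texts),
                             char_ngrams.transform(texts)])
        dense = csr_matrix(np.vstack([max_pool(t) for t in texts]))
        return hstack([sparse, dense])

    clf = LinearSVC(C=1.0, class_weight="balanced")
    # clf.fit(featurise(train_texts, fit=True), train_labels)
    # predictions = clf.predict(featurise(test_texts))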
4.2 Run 2: Binary Ensemble Model

Our second submission uses a binary ensemble model, which combines a Convolutional Neural Network (CNN) system and the linear SVM (Section 4.1), with a logistic regression meta-classifier on top. Predictions on the training data are obtained via ten-fold cross-validation.

In the ensemble model, each input instance to the meta-classifier is represented by the concatenation of four features: a) the class prediction for that instance made by the SVM, b) the prediction of the CNN, and c) two additional surface-level features: the instance's length in characters and the percentage of offensive terms in the instance. This latter feature is obtained via a look-up in a list of offensive terms in Italian derived from the article Le parole per ferire by Tullio De Mauro (https://bit.ly/2J4TPag) and the "bad words" category in the Italian Wiktionary. The feature is expressed as the ratio between the frequency of the instance's tokens contained in the list and the instance's length in tokens. Figure 1 shows the features fed to the ensemble meta-classifier.

Figure 1: Feature representation of each sample fed to the ensemble model. On top, the representation of a training sample; on the bottom, the representation of a test sample.
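A minimal sketch of this feature representation is given below; the lexicon file name, the whitespace tokenisation, and the way the predictions are passed in are assumptions of ours.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Offensive-term list from De Mauro and the Italian Wiktionary
    # (hypothetical file name, one term per line).
    with open("offensive_terms_it.txt", encoding="utf-8") as f:
        lexicon = {line.strip().lower() for line in f}

    def offensive_ratio(text):
        # Frequency of the instance's tokens found in the list,
        # divided by the instance's length in tokens.
        tokens = text.lower().split()
        return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

    def meta_features(texts, svm_preds, cnn_preds):
        # The four features fed to the meta-classifier (cf. Figure 1).
        return np.column_stack([
            svm_preds,                            # a) SVM class predictions
            cnn_preds,                            # b) CNN class predictions
            [len(t) for t in texts],              # c) length in characters
            [offensive_ratio(t) for t in texts],  # c) offensive-term ratio
        ])

    meta_clf = LogisticRegression()
    # On the training data, svm_preds and cnn_preds would come from
    # ten-fold cross-validation (e.g. sklearn's cross_val_predict).
    # meta_clf.fit(meta_features(train_texts, svm_cv, cnn_cv), train_labels)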
The CNN is an adaptation of available architectures for sentence classification (Kim, 2014; Zhang and Wallace, 2015), implemented in Keras (Chollet et al., 2015), and is composed of: i.) a word embeddings input layer using the hate-rich embeddings; ii.) a single convolutional layer; iii.) a single max-pooling layer; iv.) a single fully-connected layer; and v.) a sigmoid output layer. The max-pooling layer output is flattened, concatenated, and fed to the fully-connected layer composed of 50 hidden units with the ReLU activation function. The final output layer with the sigmoid activation function computes the distribution over the two labels. (Other network hyperparameters: number of filters: 6; filter sizes: 3, 5, 8; strides: 1.) We used binary cross-entropy as loss function and Adam as optimiser. In training, we set a batch size of 64 and ran for 10 epochs. We also applied two dropouts: 0.6 between the embeddings and the convolutional layer, and 0.8 between the max-pooling and the fully-connected layer.
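Under the hyperparameters listed above, the architecture can be sketched in Keras as follows; the vocabulary size, sequence length, the parallel-branch reading of the three filter sizes, and the ReLU activation on the convolutions are our assumptions.

    from tensorflow.keras import layers, Model

    MAX_LEN, VOCAB_SIZE, EMB_DIM = 100, 50000, 300  # assumed values

    inp = layers.Input(shape=(MAX_LEN,))
    # In the actual system, the weights of this layer would be
    # initialised from the hate-rich embeddings.
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    emb = layers.Dropout(0.6)(emb)  # dropout between embeddings and convolution

    pooled = []
    for size in (3, 5, 8):  # 6 filters per filter size, stride 1
        conv = layers.Conv1D(filters=6, kernel_size=size, strides=1,
                             activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    x = layers.Concatenate()(pooled)  # flattened, concatenated pooling output
    x = layers.Dropout(0.8)(x)        # dropout before the fully-connected layer
    x = layers.Dense(50, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = Model(inp, out)
    model.compile(loss="binary_crossentropy", optimizer="adam")
    # model.fit(X_train, y_train, batch_size=64, epochs=10)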
5 Results and Ranking

Table 2 reports the results and rankings of our runs for all four subtasks. We also include the scores of the CNN (not submitted to the official competition), marked with an asterisk. Being allowed to submit a maximum of two runs per subtask, we based our choice of models on the results of a 10-fold cross-validation of the three architectures on the training data. The SVM corresponds to run id 1 and the Ensemble model to run id 3 in the official submitted runs; see Submissions-Haspeede in the GitHub repository https://github.com/tommasoc80/evalita2018-rug/tree/master/Submissions-Haspeede.

Table 2: System results and ranking, including the out-of-competition runs for the CNN alone.

Subtask             Model      Rank    Macro F1
HaSpeeDe-FB         SVM        6/14    0.7751
                    Ensemble   9/14    0.7428
                    CNN*       n/a     0.7138
HaSpeeDe-TW         SVM        3/15    0.7934
                    Ensemble   9/15    0.7530
                    CNN*       n/a     0.7363
Cross-HaSpeeDe FB   SVM        8/13    0.5409
                    Ensemble   9/13    0.4845
                    CNN*       n/a     0.4692
Cross-HaSpeeDe TW   SVM        6/13    0.6021
                    Ensemble   7/13    0.5545
                    CNN*       n/a     0.6093

The SVM models obtain, by far, better results than the Ensemble models. It is likely that the Ensemble systems suffer from the lower performance of the CNN. We also observe differences in performance on the two datasets across the subtasks.

Table 3: SVM's performance per class (precision and recall).

                    non-hate           hate
Subtask             P       R          P       R
HaSpeeDe-FB         0.6990  0.6904     0.8531  0.8581
HaSpeeDe-TW         0.8577  0.8831     0.7401  0.6944
Cross-HaSpeeDe FB   0.8318  0.4023     0.3997  0.8302
Cross-HaSpeeDe TW   0.4375  0.6934     0.7971  0.5745

In-domain, in absolute terms, we do better on Twitter (0.7934) than on Facebook (0.7751), and this is even truer in relative terms, as the overall performance in the competition is better on Facebook (best: 0.8288) than on Twitter (best: 0.7993). Our high score on HaSpeeDe-TW comes from high precision and recall on non-hate, while for HaSpeeDe-FB we do well on the hate class. This can be due to the label distribution (hate is always the minority class, but more balanced in Facebook), but also to the fact that we use Facebook-based hate-rich embeddings, which might push towards better hate detection.

Cross-domain, results are globally lower, as expected, with best scores on Cross-HaSpeeDe FB and Cross-HaSpeeDe TW of 0.6541 and 0.6985, respectively (Bosco et al., 2018). Our models experience a more substantial loss when trained on Facebook and tested on Twitter (in Cross-HaSpeeDe FB we lose over 25 percentage points compared to HaSpeeDe-TW, where the Twitter test set is the same) than vice versa (we lose ca. 17 percentage points on the Facebook test set).
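The macro-averaged F1 in Table 2 and the per-class precision and recall in Table 3 correspond to standard scikit-learn metrics; the toy arrays and the 0/1 label encoding below are illustrative assumptions.

    from sklearn.metrics import f1_score, precision_recall_fscore_support

    # Toy predictions; 0 = non-hate, 1 = hate (encoding assumed).
    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [0, 1, 0, 1, 1, 0]

    print(f1_score(y_true, y_pred, average="macro"))  # macro F1, as in Table 2
    p, r, _, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
    print(p, r)                                       # per class, as in Table 3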
6 Discussion

The drop in performance in the cross-domain settings is likely due to topics and data collection strategies (general topics on Facebook, specific targets on Twitter). In other words, despite the use of hate-rich embeddings as a strategy to make the systems generalize better, our models remain too sensitive to the training data, which is strongly represented as word and character n-grams.

The impact of the hate-rich embeddings is most strongly seen in HaSpeeDe-FB and Cross-HaSpeeDe FB, with recall for the hate class being substantially higher than for the non-hate class. This could be due to the fact that the hate-rich embeddings were generated from comments on Facebook pages, that is, the same text type as the training data in these two tasks, so that possibly some jargon and topics are shared. While this has a positive effect when training and testing on Facebook (HaSpeeDe-FB), it has a detrimental effect when testing on Twitter (Cross-HaSpeeDe FB), since that dataset has a large majority of non-hate instances and we tend to over-predict the hate class (see Table 3).

In HaSpeeDe-TW and Cross-HaSpeeDe TW (training on Twitter), the impact of the hate-rich embeddings is a lot less clear. Indeed, recall for the hate class is always lower than for non-hate, with the large majority of errors (more than 50% in all runs) being hate messages wrongly classified as non-hateful, thus seemingly just following the class imbalance of the Twitter training set.

In both datasets, hate content is expressed either in a direct way, by means of "bad words" or direct insults to the target(s), or more implicitly and subtly. This latter type of hate message is definitely the main source of errors for our systems in all subtasks. Finally, we observe that in some cases the annotation of messages as hateful is subject to disagreement and debate. For instance, all messages containing the word rivoluzione [revolution] are marked as hateful, even though there is a lack of linguistic evidence.

7 Conclusion and Future Work

In developing our systems for the Hate Speech Detection in Italian Social Media task at EVALITA 2018, we focused on the generation of distributed representations of text that could not only enhance the generalisation power of the models, but also better capture the meaning of words in hate-rich contexts of use. We did so by exploiting Facebook on-line communities to generate hate-rich embeddings (Merenda et al., 2018).

A Linear SVM system outperformed a meta-classifier that used predictions from the SVM itself and a CNN, due to the low performance of the CNN component. Major errors of the systems are due to implicit hate messages, where even the hate-rich embeddings fail. A further aspect to consider in this task is the difference in text type and class balance between the two datasets. Both of these aspects have a major impact on system performance in the cross-genre settings.

Finally, to better generalize to unseen data and genres, future work will focus on developing systems able to further abstract from the actual lexical content of the messages by capturing general writing patterns of haters. One avenue to explore in this respect is "bleaching" text (van der Goot et al., 2018), a newly suggested technique used to fade the actual strings into more abstract, signal-preserving representations of tokens.

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG at GermEval: Detecting Offensive Speech in German Social Media. In Josef Ruppenhofer, Melanie Siegel, and Michael Wiegand, editors, Proceedings of the GermEval 2018 Workshop.

Erik Bleich. 2014. Freedom of expression versus racist hate speech: Explaining differences between high court regulations in the USA and Europe. Journal of Ethnic and Migration Studies, 40(2):283-300.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

François Chollet et al. 2015. Keras. https://keras.io.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), Venice, Italy, January 17-20, 2017, pages 86-95.

George Kennedy, Andrew McCollough, Edward Dixon, Alexei Bastidas, John Ryan, Chris Loo, and Saurav Sahay. 2017. Technology solutions to combat online harassment. In Proceedings of the First Workshop on Abusive Language Online, pages 73-77.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. Source-driven Representations for Hate Speech Detection. In Proceedings of the 5th Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145-153. International World Wide Web Conferences Steering Committee.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate speech annotation: Analysis of an Italian Twitter corpus. In CEUR Workshop Proceedings, volume 2006, pages 1-6. CEUR-WS.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10, Valencia, Spain. Association for Computational Linguistics.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 383-389.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78-84, Vancouver, BC, Canada. Association for Computational Linguistics.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Josef Ruppenhofer, Melanie Siegel, and Michael Wiegand, editors, Proceedings of the GermEval 2018 Workshop.

Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.