=Paper=
{{Paper
|id=Vol-2481/paper22
|storemode=property
|title=Cross-Platform Evaluation for Italian Hate Speech Detection
|pdfUrl=https://ceur-ws.org/Vol-2481/paper22.pdf
|volume=Vol-2481
|authors=Michele Corazza,Stefano Menini,Elena Cabrio,Sara Tonelli,Serena Villata
|dblpUrl=https://dblp.org/rec/conf/clic-it/CorazzaMCTV19
}}
==Cross-Platform Evaluation for Italian Hate Speech Detection==
Michele Corazza†, Stefano Menini‡, Elena Cabrio†, Sara Tonelli‡, Serena Villata†
† Université Côte d’Azur, CNRS, Inria, I3S, France
‡ Fondazione Bruno Kessler, Trento, Italy
michele.corazza@inria.fr, {menini,satonelli}@fbk.eu, {elena.cabrio,serena.villata}@unice.fr

===Abstract===
'''English.''' Despite the number of approaches recently proposed in NLP for detecting abusive language on social networks, the issue of developing hate speech detection systems that are robust across different platforms is still an unsolved problem. In this paper we perform a comparative evaluation on datasets for hate speech detection in Italian, extracted from four different social media platforms, i.e. Facebook, Twitter, Instagram and WhatsApp. We show that combining such platform-dependent datasets to take advantage of training data developed for other platforms is beneficial, although their impact varies depending on the social network under consideration.

'''Italiano''' (translated). Despite growing interest in NLP approaches that identify offensive language on social networks, the need for systems that maintain good performance across different platforms is still an open research topic. In this paper we present a comparative evaluation on hate speech detection datasets from four different platforms: Facebook, Twitter, Instagram and WhatsApp. The study shows that combining different datasets to increase the amount of training data improves classification performance, although the impact varies depending on the platform under consideration.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===
Given the well-acknowledged rise in the presence of toxic and abusive speech on social media platforms like Twitter and Facebook, there have been several efforts within the Natural Language Processing community to deal with this problem, since the computational analysis of language can be used to quickly identify offenses and ease the removal of abusive messages. Several workshops (Waseem et al., 2017; Fišer et al., 2018) and evaluation campaigns (Fersini et al., 2018; Bosco et al., 2018; Wiegand et al., 2018) have recently been organized to discuss existing approaches to hate speech detection, propose shared tasks and foster the development of benchmarks for system evaluation.

However, most of the available datasets and approaches for hate speech detection proposed so far concern the English language, and they frequently target a single social media platform (mainly Twitter). In low-resource scenarios it is therefore common to have smaller datasets for specific platforms, raising research questions such as: would it be advisable to combine such platform-dependent datasets to take advantage of training data developed for other platforms? Should such data just be added to the training set, or should they be selected in some way? And what happens if training data are available only for one platform and not for the other?
In this paper we address all the above questions, focusing on hate speech detection for Italian. After identifying a modular neural architecture that is rather stable and well-performing across different languages and platforms (Corazza et al., to appear), we perform our comparative evaluation on freely available datasets for hate speech detection in Italian, extracted from four different social media platforms, i.e. Facebook, Twitter, Instagram and WhatsApp. In particular, we test the same model while altering only some features and pre-processing aspects. Besides, we use a multi-platform training set but test on data taken from the single platforms. We show that the proposed solution of combining platform-dependent datasets in the training phase is beneficial for all platforms but Twitter, for which results obtained by training on tweets only outperform those obtained by training on the mixed dataset.

===2 Related work===
In 2018, the first Hate Speech Detection (HaSpeeDe) task for Italian (Bosco et al., 2018) was organized at EVALITA 2018 (http://www.evalita.it/2018), the evaluation campaign for NLP and speech processing tools for Italian. The task consists in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or not) of hate speech. Two cross-platform tasks (Cross-HaSpeeDe) were also proposed, where training was done on platform-specific data (Facebook or Twitter) and testing on data from the other platform (Twitter or Facebook). In general, as expected, results obtained for Cross-HaSpeeDe were lower than those obtained for the in-domain tasks, due to the heterogeneous nature of the datasets provided for the task, both in terms of class distribution and data composition. Indeed, not only are Facebook posts in the task dataset longer, but they are also on average more likely to contain hate speech (68% hate posts in the Facebook test set vs. 32% in the Twitter one). This led to a performance drop, with the best system scoring 0.8288 F1 on in-domain Facebook data, and 0.6068 when the same model was tested on Twitter data (Cimino et al., 2018).

The best performing systems on the cross-platform tasks were ItaNLP (Cimino et al., 2018) when training on Twitter data and testing on Facebook, and Inria-FBK (Corazza et al., 2018) in the other configuration. The former adopts a newly-introduced approach based on a 2-layer BiLSTM which exploits multi-task learning with additional data from the 2016 SENTIPOLC task (http://www.di.unito.it/~tutreeb/sentipolc-evalita16/index.html). The latter, instead, uses a simple recurrent model with one hidden layer of size 500, a GRU of size 200 and no dropout.

The Cross-HaSpeeDe tasks and the analysis of system performance in a cross-platform scenario are the starting point of this study. The task summary presented in (Bosco et al., 2018) listed some remarks on the elements affecting system robustness, which led us to extend the cross-platform experiments to new platforms, including WhatsApp and Instagram data. To our knowledge, there have been no attempts to develop Italian systems for hate speech detection on these two platforms, probably because of the lack of suitable datasets. We therefore annotate our own Instagram data for the task, while we take advantage of a recently developed dataset for cyberbullying detection to test our system on WhatsApp.

===3 Data and linguistic resources===
In the following, we present the datasets used to train and test our system and their annotations (Section 3.1). Then, we describe the word embeddings (Section 3.2) we have used in our experiments.

====3.1 Datasets====
'''Twitter dataset''': released for the HaSpeeDe (Hate Speech Detection) shared task organized at EVALITA 2018.
This dataset includes a total of 4,000 tweets (2,704 negative and 1,296 positive instances, i.e. containing hate speech), comprising for each tweet the respective annotation, as can be seen in Example 1. The two classes considered in the annotation are "hateful post" or "not".

1. Annotation: hateful. ''altro che profughi? sono zavorre e tutti uomini'' (EN: other than refugees? they are ballast and all men).

'''Facebook dataset''': also released for the HaSpeeDe (Hate Speech Detection) shared task. It consists of 4,000 Facebook comments collected from 99 posts crawled from web pages (1,941 negative and 2,059 positive instances), comprising for each comment the respective annotation, as can be seen in Example 2. The two classes considered in the annotation are "hateful post" or "not".

2. Annotation: hateful. ''Matteo serve un colpo di stato. Qua tra poco dovremo andare in giro tutti armati come in America.'' (EN: Matteo, we need a coup. Soon we will have to go around armed as in the U.S.).
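For reference, the class balances of the four datasets (the two HaSpeeDe sets above and the WhatsApp and Instagram sets described in the rest of Section 3.1) can be tabulated with a short script. The counts come from the paper; the helper itself is only an illustrative sketch:

```python
# Positive/total instance counts as reported in Section 3.1.
DATASETS = {
    "Twitter":   (1_296, 4_000),
    "Facebook":  (2_059, 4_000),
    "WhatsApp":  (595, 1_640),
    "Instagram": (200, 6_710),
}

def positive_ratio(positive: int, total: int) -> float:
    """Share of hateful (positive) instances in a dataset."""
    return positive / total

for name, (pos, tot) in DATASETS.items():
    print(f"{name}: {positive_ratio(pos, tot):.1%} positive")
```

The strong imbalance on Instagram (about 3% positive) helps explain the gap between the two F1 columns reported later in Table 1.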
'''WhatsApp dataset''': collected to study pre-teen cyberbullying (Sprugnoli et al., 2018). The dataset was collected through a WhatsApp experimentation with Italian lower secondary school students and contains 10 chats, subsequently annotated along different dimensions, such as the roles of the participants (e.g. bully, victim) and the presence of cyberbullying expressions in the message, distinguishing between different classes of insults, discrimination, sexual talk and aggressive statements. The annotation is carried out at token level. To create additional training instances for our model, we join subsequent sentences by the same author (to avoid cases in which the user writes one word per message), resulting in 1,640 messages (595 positive instances). We consider as positive instances of hate speech those in which at least one token was annotated as a cyberbullying expression, as in Example 3.

3. Annotation: cyberbullying expression. ''fai schifo, ciccione!'' (EN: you suck, fat guy).

'''Instagram dataset''': includes a total of 6,710 messages, which we randomly collected from Instagram focusing on students' profiles (6,510 negative and 200 positive instances) identified through the monitoring system described in (Menini et al., 2019). Since no Instagram datasets in Italian were available, and we wanted to include this platform in our study, we manually annotated the messages as "hateful post" (as in Example 4) or "not".

4. Annotation: hateful. ''Sei una troglodita'' (EN: you are a caveman).

====3.2 Word Embeddings====
In our experiments we test two types of embeddings, with the goal of comparing generic with social media-specific ones. In both cases, we rely on Fasttext embeddings (Bojanowski et al., 2017), since they include both word and subword information, tackling the issue of out-of-vocabulary words, which are very common in social media data:

* '''Generic embeddings''': we use embedding spaces obtained directly from the Fasttext website (https://fasttext.cc/docs/en/crawl-vectors.html) for Italian. In particular, we use the Italian embeddings trained on Common Crawl and Wikipedia (Grave et al., 2018) with size 300. A binary Fasttext model is also available and was therefore used;
* '''Domain-specific embeddings''': we trained Fasttext embeddings on a sample of Italian tweets (Basile and Nissim, 2013), with an embedding size of 300. We used the binary version of the model.

===4 System Description===
Since our goal is to compare the effect of various features, word embeddings and pre-processing techniques on hate speech detection applied to different platforms, we use a modular neural architecture for binary classification that is able to support both word-level and message-level features. The components are chosen to support the processing of social media-specific language.

====4.1 Modular neural architecture====
We use a modular neural architecture (see Figure 1) in Keras (Chollet et al., 2015). The architecture that constitutes the base for all the different models uses a single feed-forward hidden layer of 500 neurons with a ReLU activation, and a single output with a sigmoid activation. The loss used to train the model is binary cross-entropy. We choose this particular architecture because it showed good performance in the EVALITA shared task for cross-platform hate speech detection, as well as in other hate speech detection tasks for German and English (Corazza et al., to appear). The architecture is built to support both word-level (i.e. embeddings) and message-level features. In particular, we use a recurrent layer to learn an encoding (x_n in the figure) derived from word embeddings, obtained as the output of the recurrent layer at the last timestep. This encoding is then concatenated with the other selected features, obtaining a vector of message-level features.

[Figure 1: Modular neural architecture for Italian hate speech detection]
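As a concrete illustration of the modular design just described (a word-level encoding concatenated with message-level features, a 500-unit ReLU hidden layer, a sigmoid output), the sketch below implements a single forward pass. It is a structural sketch only, not the authors' implementation: the weights are random, the feature dimension is an arbitrary assumption, and a mean over word embeddings stands in for the GRU encoder for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 300   # word-embedding size used in the paper
FEAT_DIM = 6    # assumed number of message-level features (counts of hashtags, emojis, ...)
HIDDEN = 500    # size of the feed-forward hidden layer

# Random weights stand in for trained parameters.
W_hidden = rng.normal(scale=0.01, size=(EMB_DIM + FEAT_DIM, HIDDEN))
W_out = rng.normal(scale=0.01, size=(HIDDEN, 1))

def encode(word_vectors: np.ndarray) -> np.ndarray:
    """Stand-in for the recurrent encoder: mean over word vectors (x_n in Figure 1)."""
    return word_vectors.mean(axis=0)

def classify(word_vectors: np.ndarray, features: np.ndarray) -> float:
    """Concatenate the encoding with message-level features, then ReLU layer + sigmoid."""
    x = np.concatenate([encode(word_vectors), features])
    h = np.maximum(x @ W_hidden, 0.0)   # 500-unit hidden layer, ReLU activation
    logit = (h @ W_out)[0]              # single output unit
    return float(1.0 / (1.0 + np.exp(-logit)))  # sigmoid: P(hateful)

message = rng.normal(size=(12, EMB_DIM))          # 12 tokens, one embedding per row
feats = np.array([1, 0, 2, 0, 0, 3], dtype=float)  # toy message-level feature vector
score = classify(message, feats)
```

The key design point is the concatenation step: any combination of message-level features can be plugged in without changing the rest of the network.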
====4.2 Preprocessing====
The language used on social media platforms has some peculiarities with respect to standard language, such as the presence of URLs, "@" user mentions, emojis and hashtags. We therefore run the following pre-processing steps:

* '''URL and mention replacement''': both URLs and mentions are replaced by the strings "URL" and "username" respectively;
* '''Hashtag splitting''': since hashtags often provide important semantic content, we wanted to test how splitting them into single words would impact the performance of the classifier. To this end, we use the Ekphrasis tool (Baziotis et al., 2017) to do hashtag splitting and evaluate the classifier performance with and without splitting. Since the tool only supports English, it has been adapted to Italian by using language-specific Google ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html).

====4.3 Features====
* '''Word embeddings''': we evaluate the contribution of word embeddings extracted from social media data, compared with the performance obtained using generic embedding spaces, as described in Section 3.2.
* '''Emoji transcription''': we evaluate the impact of keeping emojis versus transcribing them in plain text. To this purpose, we use the official plaintext descriptions of the emojis (from the Unicode consortium website), translated to Italian with Google Translate and then manually corrected, as a substitute for emojis.
* '''Hurtlex''': we assess the impact of using a lexicon of hurtful words (Bassignana et al., 2018), created starting from the Italian hate lexicon developed by the linguist Tullio De Mauro and organized in 17 categories. This is used to associate to each message a score for "hurtfulness".
* '''Social media-specific features''': we consider a number of metrics related to the language used on social media platforms. In particular, we measure the number of hashtags and mentions, the number of exclamation and question marks, the number of emojis, and the number of words written in uppercase.
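The replacement and counting steps described in Sections 4.2 and 4.3 can be sketched with plain regular expressions. The exact patterns and the feature set below are illustrative assumptions, not the authors' code (which uses Ekphrasis for the hashtag-splitting part):

```python
import re

def preprocess(text: str) -> str:
    """Replace URLs and @-mentions with placeholder strings, as in Section 4.2."""
    text = re.sub(r"https?://\S+", "URL", text)
    text = re.sub(r"@\w+", "username", text)
    return text

def social_features(text: str) -> dict:
    """Simple social-media counts along the lines of Section 4.3."""
    return {
        "hashtags": len(re.findall(r"#\w+", text)),
        "mentions": len(re.findall(r"@\w+", text)),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        "uppercase_words": sum(w.isupper() and len(w) > 1 for w in text.split()),
    }

example = "@utente GUARDA qui! https://example.com #odio"
clean = preprocess(example)        # "username GUARDA qui! URL #odio"
feats = social_features(example)
```

Note that the features are computed on the raw text, before placeholder replacement, so that mention counts are not lost.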
===5 Experimental Setup===
In order to compare the results obtained while experimenting with different training datasets and features, we used fixed hyper-parameters, derived from our best submission at EVALITA 2018 for the cross-platform task that involved training on Facebook data and testing on Twitter. In particular, we used a GRU (Cho et al., 2014) of size 200 as the recurrent layer, and we applied no dropout to the feed-forward layer. Additionally, we used the provided test set for the two EVALITA tasks, using 20% of the development set for validation. For Instagram and WhatsApp, since no standard test set is available, we split the whole dataset using 60% of it for training, while the remaining 40% is split in half and used for validation and testing. For this purpose, we use the train_test_split function provided by sklearn (Pedregosa et al., 2011), using 42 as seed for the random number generator.

One of our goals was to establish whether merging data from multiple social media platforms can improve performance on single-platform test sets. In particular, we used the following datasets for training:

* '''Multi-platform''': we merge all the datasets mentioned in Section 3 for training.
* '''Multi-platform filtered by length''': we use the same datasets mentioned before, but only consider instances with a length lower than or equal to 280 characters, ignoring URLs and user mentions. This was done to match Twitter length restrictions.
* '''Same platform''': for each of the datasets, we trained and tested the model on data from the same platform.

In addition to the experiments performed on different datasets, we also compare the system performance obtained by using different embeddings. In particular, we train the system using Italian Fasttext word embeddings trained on Common Crawl and Wikipedia, and Fasttext word embeddings trained by us on a sample of Italian tweets (Basile and Nissim, 2013), with an embedding size of 300. As described in Section 4.3, we also train our models including either social-media or Hurtlex features. Finally, we compare classification performance with and without emoji transcription.
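The 60/20/20 split described above can be mirrored with a small stdlib helper. The paper uses sklearn's train_test_split with seed 42; this is an equivalent sketch of the proportions only, so the exact index assignment will differ from the authors' splits:

```python
import random

def split_indices(n: int, seed: int = 42):
    """60% train / 20% validation / 20% test split of n instance indices,
    mirroring the setup in Section 5 (sketch; not sklearn's exact shuffling)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * 0.6)
    n_val = int(n * 0.2)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# e.g. the WhatsApp dataset of 1,640 messages
train_idx, val_idx, test_idx = split_indices(1_640)
```

Fixing the seed keeps the splits reproducible across the feature and embedding ablations, so all configurations are evaluated on identical test instances.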
{| class="wikitable"
! Platform !! Training set !! Embeddings !! Features !! Emoji transcription !! F1 no hate !! F1 hate !! Macro avg
|-
| Instagram || Multi-platform || Twitter || Social || Yes || 0.984 || 0.432 || 0.708
|-
| Instagram || Single platform || Twitter || Social || Yes || 0.981 || 0.424 || 0.702
|-
| Facebook || Multi-platform || Twitter || Social || Yes || 0.773 || 0.871 || 0.822
|-
| Facebook || Single platform || Twitter || Social || Yes || 0.733 || 0.892 || 0.812
|-
| WhatsApp || Multi-platform || Twitter || Social || Yes || 0.852 || 0.739 || 0.796
|-
| WhatsApp || Single platform || Twitter || Social || Yes || 0.814 || 0.694 || 0.754
|-
| Twitter || Single platform || Twitter || Hurtlex || No || 0.879 || 0.717 || 0.798
|-
| Twitter || Filtered multi-platform || Twitter || Hurtlex || No || 0.858 || 0.720 || 0.789
|-
| Twitter || Multi-platform || Twitter || Hurtlex || No || 0.851 || 0.712 || 0.782
|}
Table 1: Classification results

===6 Results===
For each platform, we report in Table 1 the best performing configuration considering embedding type, features and emoji transcription. We also report the performance obtained by merging all training data (Multi-platform), by using only platform-specific training data (Single platform) and, when testing on Twitter, by filtering out training instances longer than 280 characters (Filtered multi-platform).

For Instagram, Facebook and WhatsApp, the best performing configuration is identical: they all use emoji transcription, Twitter embeddings and social-specific features. Using multi-platform training data is also helpful, and all the best performing models on these datasets use data obtained from multiple sources. However, the only substantial improvement can be observed on the WhatsApp dataset, probably because it is the smallest one and the classifier benefits from more training data.

The results obtained on the Twitter test set differ from the aforementioned ones in several ways. First of all, the in-domain training set is the best performing one, while the restricted-length dataset is slightly better than the non-restricted one. These results suggest that learning to detect hate speech on the short interactions that happen on Twitter does not benefit from using data from other platforms. This effect can be at least partially mitigated by restricting the length of the social interactions considered, retaining only the training instances that are more similar to Twitter ones.

Another remark concerning only Twitter is that Hurtlex is in this case more useful than social network-specific features. While the precise cause would require more investigation, one possible explanation is the fact that Twitter is known for having a relatively lenient approach to content moderation. This would let more hurtful words slip in, increasing the effectiveness of Hurtlex as a feature, in addition to word embeddings. Additionally, emoji transcription seems to be less useful for Twitter than for other platforms. This might be explained by the fact that the Twitter dataset has relatively fewer emojis than the others.

One final takeaway confirmed by the results is that embeddings trained on social media platforms (in this case Twitter) always outperform general-purpose embeddings. This shows that the language used on social platforms has peculiarities that might not be present in generic corpora, and that it is therefore advisable to use domain-specific resources.

===7 Conclusions===
In this paper, we examined the impact of using datasets from multiple platforms to classify hate speech on social media. While the results of our experiments demonstrated that using data from multiple sources helps the performance of our model in most cases, the resulting improvement is not always sizeable enough to be useful. Additionally, when dealing with tweets, using data from other social platforms slightly decreases performance, even when we filter the data to contain only short sequences of text. As for future work, further experiments could be performed by testing all possible combinations of training sources and test sets. This way, we could establish which social platforms share more traits when it comes to hate speech, allowing for better detection systems. At the moment, however, the size of the datasets varies too broadly to allow for a fair comparison, and we would need to extend some of the datasets. Finally, another approach could be tested, where a model trained on Facebook is used for longer sequences of text, while the Twitter model is applied to the shorter ones.

===Acknowledgments===
Part of this work was funded by the CREEP project (http://creep-project.eu/), a Digital Wellbeing Activity supported by EIT Digital in 2018 and 2019. This research was also supported by the HATEMETER project (http://hatemeter.eu/) within the EU Rights, Equality and Citizenship Programme 2014-2020.

===References===
* Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In ''Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis'', pages 100–107, Atlanta.
* Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. Hurtlex: A multilingual lexicon of words to hurt. In ''5th Italian Conference on Computational Linguistics, CLiC-it 2018'', volume 2253, pages 1–6. CEUR-WS.
* Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis. In ''Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)'', pages 747–754, Vancouver, Canada. Association for Computational Linguistics.
* Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. ''Transactions of the Association for Computational Linguistics'', 5:135–146.
* Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 hate speech detection task. In ''Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)'', Turin, Italy.
* Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In ''Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)'', pages 1724–1734. Association for Computational Linguistics.
* François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
* Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. In ''Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)'', Turin, Italy.
* Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing different supervised approaches to hate speech detection. In ''Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018) co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018)'', Turin, Italy.
* Michele Corazza, Stefano Menini, Elena Cabrio, Sara Tonelli, and Serena Villata. To appear. Robust hate speech detection: A cross-language evaluation. ''Transactions on Internet Technology''.
* Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the task on automatic misogyny identification at IberEval 2018. In ''IberEval@SEPLN'', volume 2150 of ''CEUR Workshop Proceedings'', pages 214–228. CEUR-WS.org.
* Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, and Jacqueline Wernimont, editors. 2018. ''Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)''. Association for Computational Linguistics.
* Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In ''Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)''.
* Stefano Menini, Giovanni Moretti, Michele Corazza, Elena Cabrio, Sara Tonelli, and Serena Villata. 2019. A system to monitor cyberbullying based on message classification and social network analysis. In ''Proceedings of the Third Workshop on Abusive Language Online'', pages 105–110, Florence, Italy. Association for Computational Linguistics.
* F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. ''Journal of Machine Learning Research'', 12:2825–2830.
* Rachele Sprugnoli, Stefano Menini, Sara Tonelli, Filippo Oncini, and Enrico Piras. 2018. Creating a WhatsApp dataset to study pre-teen cyberbullying. In ''Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)'', pages 51–59. Association for Computational Linguistics.
* Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault, editors. 2017. ''Proceedings of the First Workshop on Abusive Language Online''. Association for Computational Linguistics.
* Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 shared task on the identification of offensive language. In ''Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018)''.
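For completeness, the per-class and macro-averaged F1 scores used throughout Table 1 follow the standard definitions; a minimal reference implementation (not the authors' evaluation code) is:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Per-class F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred) -> float:
    """Macro-averaged F1 over the two classes (no hate = 0, hate = 1)."""
    scores = []
    for cls in (0, 1):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1]
```

Because the macro average weights both classes equally, a high "F1 no hate" cannot mask poor performance on the minority hate class, which is why it is the headline metric on the imbalanced Instagram data.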