Detecting Hate Speech for Italian Language in Social Media

Valentino Santucci, Stefania Spina
University for Foreigners of Perugia
{valentino.santucci, stefania.spina}@unistrapg.it

Alfredo Milani
University of Perugia
alfredo.milani@unipg.it

Giulio Biondi, Gabriele Di Bari
University of Florence
{giulio.biondi, gabriele.dibari}@unifi.it

Abstract

English. In this report we describe the hate speech detection system for the Italian language developed by a joint team of researchers from the two universities of Perugia (University for Foreigners of Perugia and University of Perugia). The experimental results obtained in the HaSpeeDe task of the Evalita 2018 evaluation campaign are analyzed. Finally, a suggestion for future research directions is provided in the conclusion.

Italiano. In questo documento descriviamo il sistema di hate speech detection per la lingua italiana sviluppato da una squadra di ricercatori dell'Università per Stranieri di Perugia e dell'Università degli Studi di Perugia. I risultati sperimentali ottenuti nel task HaSpeeDe, organizzato nell'ambito di Evalita 2018, sono riportati e analizzati. Infine, una possibile direzione di ricerca è fornita nelle conclusioni.

1 Introduction

In recent years, the exponential growth of social media has revolutionized communication and content publishing. However, social media are also increasingly exploited for the propagation of hate speech. This issue motivates the recent research on hate speech detection systems (Zhang and Luo, 2018; Waseem and Hovy, 2016; Del Vigna et al., 2017; Davidson et al., 2017; Badjatiya et al., 2017; Gitari et al., 2015).

In this paper, we describe our hate speech detection system for the Italian language. The system, namely HSD4I PG, has been developed by a joint team of researchers from the University for Foreigners of Perugia and the University of Perugia. The code of HSD4I PG is available online at https://github.com/Gabriele91/HSD4I_PG.

The rest of the paper is organized as follows. The overall system architecture is presented in Section 2, while the individual software components are described in Sections 3-6. Experimental results are provided in Section 7, and conclusions together with future lines of research are outlined in Section 8.

2 Architecture of the Hate Speech Detector

The hate speech detector we have developed, namely HSD4I PG, is composed of several software components:

• a tokenizer for Italian posts from social media,
• the popular FastText tool (Bojanowski et al., 2016), used to generate a word embedding model,
• a features generator that produces a vector of numeric features for each post to be classified,
• a (trainable) classifier that, for each post, predicts its class label.

Moreover, the following resources have been adopted:

• the Ita Twitter corpus (Spina, 2016), which includes 1,234,865 tweets extracted from the Italian timeline in a time span of seven months (November 2012 - May 2013). The tweets were extracted randomly, 2,000 per day, using the R package TwitteR (https://cran.r-project.org/web/packages/twitteR/);
• the Italian Lexicon of Hate Speech, which was collected on the basis of the Italian monolingual dictionary Il Nuovo De Mauro, also available online (https://dizionario.internazionale.it);
• the Sentix Italian lexicon for sentiment analysis (Basile and Nissim, 2013);
• the training sets of 3,000 Facebook posts and 3,000 tweets made available for the HaSpeeDe task of Evalita 2018.

As any other supervised classification system, HSD4I PG requires a training stage, which is depicted in Figure 1. The word embedding model is trained by FastText using the Ita Twitter corpus. Numeric features are obtained by aggregating the FastText features and by generating some ad-hoc extra-features. These numeric features are finally fed to a Support Vector Machine (SVM) (Cortes and Vapnik, 1995) in order to generate a classifier model.

Figure 1: Training in HSD4I PG

After the SVM classifier has been trained, the prediction of (unlabeled) posts is performed following the scheme depicted in Figure 2.

Figure 2: Classification in HSD4I PG
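To make the training and classification flow of Figures 1 and 2 concrete, the following self-contained Python sketch reproduces it on toy data. It is only an illustration of the architecture described above, not the actual HSD4I PG code: the corpus is a tiny placeholder, a single average aggregator stands in for the tuned feature generator of Section 5, and the extra-features and parameter tuning are omitted.

    # Minimal sketch of the training flow in Figure 1 and the classification
    # flow in Figure 2 (toy corpus and posts; not the actual HSD4I PG code).
    import fasttext
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # A tiny stand-in for the (tokenized) Ita Twitter corpus.
    with open("toy_corpus.txt", "w", encoding="utf8") as f:
        f.write("ciao a tutti\nche bella giornata\nodio questa gente\n" * 50)

    # 1) Word embedding model trained with the skipgram technique (Section 4).
    emb = fasttext.train_unsupervised("toy_corpus.txt", model="skipgram",
                                      dim=10, minCount=1)

    # 2) Fixed-length post representation: here simply the average token vector
    #    (the tuned system concatenates several aggregators plus extra-features).
    def features(post):
        vecs = [emb.get_word_vector(t) for t in post.split()]
        return [sum(v[i] for v in vecs) / len(vecs)
                for i in range(emb.get_dimension())]

    posts = ["che bella giornata", "odio questa gente"]
    labels = [0, 1]
    X = [features(p) for p in posts]

    # 3) Feature standardization and SVM training (Section 6).
    scaler = StandardScaler().fit(X)
    clf = SVC(kernel="rbf").fit(scaler.transform(X), labels)

    # Classification of a new, unlabeled post (Figure 2).
    print(clf.predict(scaler.transform([features("che brutta gente")])))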
3 The Tokenizer

A tokenizer for the Italian language used on social media has been designed by modifying the output produced by the "TweetTokenizer" class of the popular Python library NLTK (Bird et al., 2009).

A variety of corrections have been introduced. The most important ones are:

1. two or more consecutive occurrences of the same vowel have been replaced by a single occurrence (e.g., "ciaooo" is replaced with "ciao"),
2. alternative spellings of some bad words have been normalized (e.g., "vaffa" is replaced with its most popular form),
3. some common misspellings and abbreviations have been corrected (e.g., "cmq" is replaced with "comunque"),
4. hashtags have been split into multiple tokens using the Python library "compound-word-splitter",
5. apostrophes have been considered as token separators,
6. tokens composed of digit characters have been replaced with the token NUM,
7. tokens corresponding to Twitter mentions have been replaced with the token MEN,
8. tokens corresponding to web links have been replaced with the token URL,
9. emojis have been kept as tokens on their own, while other punctuation characters have been removed,
10. all the textual tokens have been replaced with their stemmed form by using the NLTK implementation of the Snowball stemming algorithm for the Italian language (Porter, 1980).

Moreover, in order to provide additional experimental results, we have also tried a lighter variant of the tokenizer that only performs the tasks numbered from 5 to 10; a minimal illustration of this lighter variant is sketched below.
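The following snippet gives a rough, self-contained Python rendering of the lighter variant (rules 5-10); the exact regular expressions and the emoji handling of HSD4I PG may differ, and the function name light_tokenize is purely illustrative.

    # Rough sketch of the lighter tokenizer variant (rules 5-10); the emoji
    # preservation of rule 9 is omitted for brevity.
    import re
    from nltk.tokenize import TweetTokenizer
    from nltk.stem.snowball import SnowballStemmer

    _tokenizer = TweetTokenizer()
    _stemmer = SnowballStemmer("italian")

    def light_tokenize(post):
        tokens = []
        # Rule 5: apostrophes act as token separators.
        for t in _tokenizer.tokenize(post.replace("'", " ").replace("’", " ")):
            if re.fullmatch(r"\d+", t):
                tokens.append("NUM")        # rule 6: digit-only tokens
            elif t.startswith("@"):
                tokens.append("MEN")        # rule 7: Twitter mentions
            elif t.startswith(("http://", "https://", "www.")):
                tokens.append("URL")        # rule 8: web links
            elif re.fullmatch(r"[^\w\s]+", t):
                continue                    # rule 9: drop punctuation tokens
            else:
                tokens.append(_stemmer.stem(t))  # rule 10: Italian Snowball stemming
        return tokens

    print(light_tokenize("@utente ciao!!! guarda https://esempio.it alle 18"))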
4 The Word Embedding Model

A word embedding model is generated by FastText (Bojanowski et al., 2016) using the skipgram technique. Fed with the Ita Twitter corpus, FastText produces a numeric vector representation for every n-gram contained in the corpus' posts, in such a way that the n-grams belonging to tokens appearing in similar contexts are close to each other in the continuous numerical space.

After the model has been generated, a numeric representation for a given token w can be simply computed by summing up the numeric representations of the n-grams that compose w.

Since out-of-vocabulary words are quite common in social media texts, we believe that the subword information contained in the n-grams is particularly useful in our scenario.

5 The Features Generator

The word embedding model allows us to generate a numeric representation for every token. Therefore, in order to produce a (constant length) numeric representation of the whole post, we need to aggregate the vectors corresponding to the tokens of the post. Six different aggregation functions have been considered: average (avg), standard deviation (std), minimum (min), maximum (max), median (med), and sum (sum). Any combination of these aggregators can be adopted, thus the features generator requires an experimental tuning (see Section 7).

Moreover, 20 additional extra-features have been introduced:

• number of hateful tokens, computed using the Italian Lexicon of Hate Speech (Spina, 2016),
• average sentiment polarity and intensity, computed using the Sentix lexicon (Basile and Nissim, 2013),
• number of web links,
• number of mentions,
• a boolean flag indicating whether the post is a reply tweet,
• number of hashtags,
• maximum length of a hashtag (in characters),
• a boolean flag indicating whether the post is a retweet,
• the percentage of capital letters,
• the percentage of tokens whose letters are all in capital case,
• number of exclamation marks,
• number of tokens composed of three or more dots,
• number of punctuation characters,
• number of emojis,
• number of repeated consecutive vowels,
• percentage of tokens representing a correct Italian word,
• post length in number of characters,
• post length in number of tokens.

As an illustrative example, let us consider that FastText has generated numeric vectors of size 300 for every single token w of a post p, and that the combination of the three aggregators sum, min, max has been chosen. Then, the numeric vector representing p has 300 × 3 + 20 = 920 dimensions and is formed by concatenating the three vectors, each one of size 300, produced by the chosen aggregators, together with the 20 extra-features.

Finally, in case the number of features is too large for the classifier, during the training phase we are able to reduce the dimensionality to a given number k by selecting the features with the largest mutual information with respect to the class labels.

6 The Classifier

After some preliminary experiments, we have decided to adopt a Support Vector Machine (SVM) classifier (Cortes and Vapnik, 1995). SVM is a supervised technique for training a classifier model by efficiently computing a separating hyperplane (between the two classes to be predicted) in an (implicitly) higher dimensional space (with respect to the features dimensionality). The SVM implementation of the Python library Scikit-Learn (Pedregosa et al., 2011) has been used.

Compared to the popular neural network models, the SVM technique has fewer parameters to be tuned, is computationally more efficient, and generally obtains comparable performances.

Finally, it is important to note that, before the training phase, all the training features have been standardized in such a way that their means and variances, across all the training instances, are, respectively, 0 and 1.
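The snippet below sketches how the feature construction of Section 5 and the SVM training of Section 6 can be wired together with Scikit-Learn. It uses random toy vectors in place of the real FastText embeddings and extra-features; the selected number of features k is only a placeholder, and the mutual-information selection step is shown for illustration even though, with 920 features, the filtering described above would not actually be triggered.

    # Sketch of feature aggregation, mutual-information selection, feature
    # standardization and SVM training (toy random data; k is a placeholder).
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def aggregate(token_vectors, extra_features):
        """Concatenate the sum, min and max aggregators with the extra-features."""
        m = np.vstack(token_vectors)
        return np.concatenate([m.sum(axis=0), m.min(axis=0), m.max(axis=0),
                               extra_features])

    # Toy data: 40 posts, each with five 300-dimensional token vectors and
    # 20 extra-features, i.e. 300 * 3 + 20 = 920 features per post.
    rng = np.random.default_rng(0)
    X = np.array([aggregate(rng.normal(size=(5, 300)), rng.normal(size=20))
                  for _ in range(40)])
    y = np.array([0, 1] * 20)

    model = make_pipeline(SelectKBest(mutual_info_classif, k=500),
                          StandardScaler(),
                          SVC(kernel="rbf", C=2.2, gamma="auto",
                              class_weight="balanced"))
    model.fit(X, y)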
7 Experiments

7.1 Experimental Setting

The parameters of the different software components of HSD4I PG have been tuned using a grid search approach and a 10-fold cross-validation scheme.

The FastText parameters have been chosen in the following ranges: number of epochs epoch ∈ {5, 20, 50, 100}, initial learning rate lr ∈ {0.05, 0.1}, negative sampling neg ∈ {5, 20, 50}, and window size ws ∈ {5, 10}. Moreover, the skipgram model has been considered, while the other FastText parameters have been set to constant values: dim = 300, minCount = 1, minn = 3, and maxn = 6.

Regarding the features generator (see Section 5), a combination of the six aggregators has to be chosen. Importantly, for combinations resulting in more than 1,000 features, the filtering procedure described at the end of Section 5 is performed.

After some preliminary experiments, we have decided to use the following ranges in order to tune the SVM parameters: kernel ∈ {rbf, linear} and C ∈ {1.8, 2, 2.2, 2.4}. Moreover, the gamma and class_weight parameters have been set to, respectively, auto and balanced.

The best parameter setting resulting from the experimental tuning is provided in Table 1.

Table 1: Tuned parameter setting

Component            Parameter     Value
FastText             epoch         50
                     lr            0.05
                     neg           50
                     ws            5
Features Generator   aggregators   sum, min, max
SVM                  kernel        rbf
                     C             2.2

This setting has been used to generate the results submitted as "run 2" to the HaSpeeDe task of Evalita 2018 by the team "Perugia1". By mistake, we submitted a wrong file as "run 1". Nevertheless, in the following section we also provide the results of three additional executions of HSD4I PG, all performed after the official HaSpeeDe evaluation:

Execution A) the same setting of Table 1, except that C = 2;

Execution B) the same setting of Table 1, except that the lighter variant of the tokenizer (see Section 3) has been adopted;

Execution C) the same setting of Table 1, except that C = 2 and the lighter variant of the tokenizer (see Section 3) has been adopted.
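The snippet below sketches how the SVM part of this grid search can be set up with Scikit-Learn. The feature matrix is a random toy placeholder, and the use of macro-averaged F1 as the tuning score is an assumption made here for illustration; the FastText and aggregator parameters are tuned in the same grid-search fashion but are omitted from the sketch.

    # Sketch of the SVM grid search with 10-fold cross-validation (toy data;
    # the f1_macro tuning score is an assumption, not stated in the text).
    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 920))   # placeholder feature matrix (Section 5)
    y_train = np.array([0, 1] * 50)         # placeholder binary labels

    param_grid = {"kernel": ["rbf", "linear"], "C": [1.8, 2.0, 2.2, 2.4]}
    search = GridSearchCV(SVC(gamma="auto", class_weight="balanced"),
                          param_grid,
                          scoring="f1_macro",
                          cv=StratifiedKFold(n_splits=10, shuffle=True,
                                             random_state=0))
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)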
7.2 Experimental Results

Table 2 provides the results obtained by HSD4I PG in the four proposed subtasks. In particular, the Macro-Average F1 score for each subtask is shown, along with the difference from the best competitor in that subtask.

Table 2: Subtask results of HSD4I PG

SubTask               HSD4I PG   Distance from best
HaSpeeDe-FB           0.7841     0.0447
HaSpeeDe-TW           0.7744     0.0249
Cross-HaSpeeDe-FB     0.6279     0.0262
Cross-HaSpeeDe-TW     0.5545     0.1440

Table 2 shows that HSD4I PG achieved results comparable to those of the best competitors, except in the Cross-HaSpeeDe-TW task. The complete results for all the tasks are available in (Bosco et al., 2018).

Besides, Tables 3 and 4 provide three additional rows corresponding to the executions A, B, C previously discussed (and performed after the official HaSpeeDe evaluation). Interestingly, the results in Table 4 show that HSD4I PG, tuned with different parameter settings, would have ranked 3rd in the HaSpeeDe-TW subtask (see (Bosco et al., 2018)).

Table 3: Additional results in the subtask HaSpeeDe-FB

          Not HS                          HS                              Macro-Avg
     Precision   Recall   F-score   Precision   Recall   F-score    F-score
A    0.7261      0.6811   0.7029    0.8522      0.8774   0.8646     0.7838
B    0.7219      0.6749   0.6976    0.8496      0.8759   0.8625     0.7801
C    0.7166      0.6811   0.6984    0.8514      0.8715   0.8715     0.7799

Table 4: Additional results in the subtask HaSpeeDe-TW

          Not HS                          HS                              Macro-Avg
     Precision   Recall   F-score   Precision   Recall   F-score    F-score
A    0.8489      0.8728   0.8607    0.7180      0.6759   0.6963     0.7785
B    0.8545      0.8950   0.8743    0.7568      0.6821   0.7175     0.7959
C    0.8575      0.8905   0.8737    0.7517      0.6914   0.7203     0.7970

8 Conclusion and Future Work

In this paper we have introduced a system for detecting hate speech in Italian-language social media texts, and we have reported and analyzed the results it obtained in the HaSpeeDe task of the Evalita 2018 campaign.

It is worth pointing out that the results of most participants are very similar and quite far from being fully accurate. This raises the question of whether hate annotation is objective or subjective. A few of the posts in the datasets appear difficult to annotate even for a human being and, indeed, we think that different people can produce different annotations. Therefore, it may be interesting to model the subjective perception of hatefulness and to exploit such information in the detection task, perhaps taking inspiration from recommender system techniques.

References

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep Learning for Hate Speech Detection in Tweets. In Proceedings of the 26th International Conference on World Wide Web Companion - WWW '17 Companion, pages 759–760, New York, New York, USA. ACM Press.

Valerio Basile and Malvina Nissim. 2013. Sentiment Analysis on Italian Tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Atlanta, Georgia, 14 June 2013.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the Evalita 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In CEUR Workshop Proceedings.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering.

Fabian Pedregosa et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

M.F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.

Stefania Spina. 2016. Fiumi di parole. Discorso e grammatica delle conversazioni scritte in Twitter. StreetLib, Loreto, Italy.

Zeerak Waseem and Dirk Hovy. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In Proceedings of the NAACL Student Research Workshop.

Ziqi Zhang and Lei Luo. 2018. Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail on Twitter.