=Paper=
{{Paper
|id=Vol-2263/paper039
|storemode=property
|title=Comparing Different Supervised Approaches to Hate Speech Detection
|pdfUrl=https://ceur-ws.org/Vol-2263/paper039.pdf
|volume=Vol-2263
|authors=Michele Corazza,Stefano Menini,Pinar Arslan,Rachele Sprugnoli,Elena Cabrio,Sara Tonelli,Serena Villata
|dblpUrl=https://dblp.org/rec/conf/evalita/CorazzaMASCTV18
}}
==Comparing Different Supervised Approaches to Hate Speech Detection==
Michele Corazza†, Stefano Menini‡, Pinar Arslan†, Rachele Sprugnoli‡, Elena Cabrio†, Sara Tonelli‡, Serena Villata†
† Université Côte d'Azur, CNRS, Inria, I3S, France
‡ Fondazione Bruno Kessler, Trento, Italy
{michele.corazza,pinar.arslan}@inria.fr, {menini,sprugnoli,satonelli}@fbk.eu, {elena.cabrio,serena.villata}@unice.fr

Abstract

English. This paper reports on the systems the InriaFBK team submitted to the EVALITA 2018 Shared Task on Hate Speech Detection in Italian Twitter and Facebook posts (HaSpeeDe). Our submissions were based on three separate classes of models: a model using a recurrent layer, an ngram-based neural network and a LinearSVC. For the Facebook task and the two cross-domain tasks we used the recurrent model, obtaining promising results, especially in the cross-domain setting. For Twitter, we used the ngram-based neural network and the LinearSVC-based model.

Italiano. This paper describes the InriaFBK team's models for the EVALITA 2018 Shared Task on Hate Speech Detection in Italian Twitter and Facebook posts (HaSpeeDe). Three different classes of models were used: a model with a recurrent layer, an ngram-based neural network and a LinearSVC-based model. For Facebook and the two cross-domain tasks we chose a recurrent model, which obtained good results, especially on the cross-domain tasks. For Twitter, the ngram-based neural network and the LinearSVC-based model were used.

1 Introduction

In this paper, we describe the systems submitted for each of the four subtasks organized within the HaSpeeDe evaluation exercise at EVALITA 2018 (Bosco et al., 2018): hate speech detection on Facebook comments (Task 1: HaSpeeDe-FB), hate speech detection on tweets (Task 2: HaSpeeDe-TW), cross-domain hate speech detection from Facebook to Twitter posts (Task 3.1: Cross-HaSpeeDe FB) and cross-domain hate speech detection from Twitter to Facebook posts (Task 3.2: Cross-HaSpeeDe TW). We built our models for these binary classification subtasks by testing recurrent neural networks, ngram-based neural networks[1] and a LinearSVC (Support Vector Machine) approach[2]. In HaSpeeDe-TW, whose posts are comparatively short with respect to HaSpeeDe-FB, an ngram-based neural network and a LinearSVC model were used, while for HaSpeeDe-FB and the two cross-domain tasks recurrent models were used.

[1] https://gitlab.com/ashmikuz/creep-cyberbullying-classifier
[2] https://github.com/0707pinar/Hate-Speech-Detection/

2 System Description

We adopt a supervised approach and, to select the best model for each task, we perform a grid search over different machine learning classifiers such as Neural Networks (NN), Support Vector Machines (SVM) and Logistic Regression (LR). Both ngram-based (unigram and bigram) and recurrent models using embeddings were tested, but only the ones that were submitted for the tasks are described here. A LinearSVC model from scikit-learn (Pedregosa et al., 2011a) was also tested, and it showed good performance on the Twitter dataset. In order to perform a grid search over the parameters and models, the training set released by the task organisers was partitioned in three: 60% of it was used for training, 20% for validation and 20% for testing.[3]

[3] To split the data we use the scikit-learn train_test_split function, with 42 as seed value.
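As a minimal sketch of this 60/20/20 partition: the footnote only specifies scikit-learn's train_test_split with seed 42, so the two-stage call and the function and argument names below are illustrative choices, not taken from the paper.

<pre>
from sklearn.model_selection import train_test_split

def split_60_20_20(posts, labels):
    # Hold out 40% of the data first, then split that portion evenly into
    # validation and test, yielding the 60/20/20 partition described above.
    X_train, X_rest, y_train, y_rest = train_test_split(
        posts, labels, test_size=0.4, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=42)
    return X_train, X_val, X_test, y_train, y_val, y_test
</pre>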
2.1 Preprocessing

Since misspellings, neologisms, acronyms and jargon are common in social media interactions, it was necessary to carefully preprocess the data in order to normalize it without losing information. For this reason, we first replace URLs with the word "url" and "@" user mentions with "username" by using regular expressions.

Since hashtags often provide important semantic content, they are normalized by splitting them into the words composing them. To this end, we adapted the Ekphrasis tool (Baziotis et al., 2017) to Italian, using as ngram model the Italian Google ngrams starting from year 2000. In addition to the aforementioned normalizations, for the LinearSVC model we also stemmed Italian words via the Snowball Stemmer (Bird and Loper, 2004) and removed stopwords.

2.2 Feature Description

We used the following text-derived features:

• Word Embeddings: Italian fastText embeddings (Bojanowski et al., 2016)[4] employed in the recurrent models (Section 2.3);
• Ngrams: unigrams and bigrams, used for the ngram-based neural network and the LinearSVC (Sections 2.4, 2.5);
• Social-network specific features: the number of hashtags and mentions, the number of exclamation and question marks, the number of emojis, and the number of words written in uppercase (see the sketch after this list);
• Sentiment and Emotion features: the word-level emotion and sentiment tags for Italian words extracted from the EmoLex resource (Mohammad and Turney, 2013; Mohammad and Turney, 2010).

[4] https://github.com/facebookresearch/fastText
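A rough sketch of the URL/mention normalization from Section 2.1 and of the social-network specific feature counts above. The exact regular expressions and the unicode-range emoji test are assumptions of this sketch; the Ekphrasis hashtag splitting, stemming, embeddings and EmoLex features are omitted.

<pre>
import re

def normalize(text):
    # Replace URLs and "@" user mentions with placeholder words (Section 2.1).
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)
    text = re.sub(r"@\w+", "username", text)
    return text

def social_features(text):
    # Counts described in Section 2.2; the emoji test is a coarse
    # unicode-range heuristic assumed for this sketch.
    return [
        text.count("#"),                                        # hashtags
        text.count("@"),                                        # mentions
        text.count("!"),                                        # exclamation marks
        text.count("?"),                                        # question marks
        sum(1 for ch in text if 0x1F300 <= ord(ch) <= 0x1FAFF), # emojis
        sum(1 for w in text.split() if w.isupper()),            # uppercase words
    ]
</pre>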
2.3 Recurrent Neural Network Model

In order to classify hate speech in social media interactions, we believe that recurrent neural networks are a useful tool, given their ability to remember the sequence of inputs while considering their order, differently from feed-forward models. In the context of our classifier, this allows the model to remember the whole sequence of words in the order they appear.

More specifically, our recurrent models, implemented using Keras (Chollet et al., 2015), combine sequences of word embeddings with the social media features. In order to achieve that, an asymmetric topology is used for the neural network: the sequences of word embeddings are fed to a recurrent layer, whose output is then concatenated with the social features. The concatenated vector is then fed to one or two feed-forward fully connected layers that use the Rectified Linear Unit (ReLU) as their activation function. The output layer is a single neuron with a sigmoid activation, while binary cross-entropy is used as the loss function for the model.

Batch normalization and various kinds of dropout have been tested to reduce the variance of the models. Experimental results suggested that applying batch normalization to the output of the recurrent layer had a negative effect on performance; for this reason, it was applied only to the output of the hidden layers. As for dropout, we tried three different mechanisms. A simple dropout layer (Srivastava et al., 2014) is applied to the output of the hidden layers, as applying dropout to the output of the recurrent layer introduces too much noise and does not improve performance. We also tested a dropout on the embeddings (Gal and Ghahramani, 2016) that effectively skips some of the word embeddings in the sequence: dropping part of the embedding vector causes a loss of information, while dropping entire words can help reduce overfitting. In addition, a recurrent dropout (Gal and Ghahramani, 2016) was also tested.

While evaluating the models, we tested both a Long Short-Term Memory (LSTM) (Gers et al., 1999) and a Gated Recurrent Unit (GRU) (Cho et al., 2014) as recurrent layers. The latter is functionally very similar to an LSTM, but by using fewer weights it can sometimes reduce the variance of the model, improving its performance.
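A minimal Keras sketch of this asymmetric topology, assuming the configuration of the first HaSpeeDe-FB run (one fully connected layer of size 200 and a GRU of size 100 with recurrent dropout 0.2; see Section 3.1). The sequence length, vocabulary size, feature count and hidden-layer dropout value are illustrative, and the embedding matrix is left randomly initialized here, whereas the paper loads pretrained Italian fastText vectors.

<pre>
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM, N_SOCIAL = 50, 20000, 300, 6  # illustrative sizes

# Embedding branch: in the paper the weights come from Italian fastText.
words = layers.Input(shape=(MAX_LEN,))
embedded = layers.Embedding(VOCAB, EMB_DIM)(words)
recurrent = layers.GRU(100, recurrent_dropout=0.2)(embedded)

# Asymmetric topology: the recurrent output is concatenated with the
# social-network specific features before the fully connected layers.
social = layers.Input(shape=(N_SOCIAL,))
hidden = layers.Concatenate()([recurrent, social])
hidden = layers.Dense(200, activation="relu")(hidden)
hidden = layers.BatchNormalization()(hidden)  # applied to hidden output only
hidden = layers.Dropout(0.5)(hidden)          # simple dropout on hidden output

# Single sigmoid neuron trained with binary cross-entropy, as described above.
output = layers.Dense(1, activation="sigmoid")(hidden)
model = Model([words, social], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
</pre>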
2.4 Ngram-based Neural Networks

Ngram-based neural networks are structurally similar to the recurrent models. We first compute the unigrams and bigrams over the lemmatized social media posts. The resulting vector is normalized using tf-idf from scikit-learn and concatenated with the social-network specific features. One or two hidden feed-forward layers are then used, with the same output layer as in the recurrent models. The same dropout and batch normalization techniques used in the recurrent models have been tested for the ngram-based neural networks as well. For the ngram-based run of Task 2: HaSpeeDe-TW (the second submitted run, see Section 3.2), we used unigrams and bigrams along with the required preprocessing steps, based on a tf-idf model (see the sketch after Section 2.5).

2.5 Linear SVC System

We implemented a Linear Support Vector Classification system (LinearSVC) (Fan et al., 2008) based on bag-of-words (i.e., unigrams), using scikit-learn (Pedregosa et al., 2011b), for the first submitted run in Task 2: HaSpeeDe-TW. We chose this system because it scales well to large samples and is efficient for text classification problems. To deal with imbalanced labels, we set the class_weight parameter to "balanced". To mitigate overfitting, the penalty parameter C was set to 0.7.
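Returning to the ngram-based networks of Section 2.4, a Keras sketch assuming the configuration of the second HaSpeeDe-TW run (one hidden layer of size 200 with simple dropout 0.5; see Section 3.2). The input dimensions are illustrative, and the tf-idf vectorization of the unigrams and bigrams is assumed to happen beforehand, e.g. with scikit-learn's TfidfVectorizer.

<pre>
from tensorflow.keras import layers, Model

N_NGRAMS, N_SOCIAL = 10000, 6  # illustrative input sizes

# tf-idf-weighted unigram/bigram vector concatenated with the social features.
ngrams = layers.Input(shape=(N_NGRAMS,))
social = layers.Input(shape=(N_SOCIAL,))
hidden = layers.Concatenate()([ngrams, social])

# One hidden feed-forward layer with simple dropout, and the same
# sigmoid output layer as the recurrent models.
hidden = layers.Dense(200, activation="relu")(hidden)
hidden = layers.Dropout(0.5)(hidden)
output = layers.Dense(1, activation="sigmoid")(hidden)

model = Model([ngrams, social], output)
model.compile(optimizer="adam", loss="binary_crossentropy")
</pre>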
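The LinearSVC system of Section 2.5 can similarly be sketched as a scikit-learn pipeline with the two reported settings (class_weight="balanced" and C=0.7). The choice of CountVectorizer restricted to unigrams is an assumption consistent with the bag-of-words description; the stemming and stopword removal of Section 2.1 are omitted.

<pre>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Unigram bag-of-words, balanced class weights for the label imbalance,
# and C = 0.7 to mitigate overfitting, as reported above.
classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LinearSVC(class_weight="balanced", C=0.7),
)
# Usage with placeholder names: classifier.fit(train_texts, train_labels),
# then classifier.predict(test_texts).
</pre>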
vederle danno il voltastomaco!” is annotated as hate speech while, the almost equivalent, “Appena First Run le ho viste ho vomitato” is considered a non hate Category P R F1 Instances Non Hate 0.493 0.703 0.580 323 speech instance while our models identify it as Hate 0.822 0.656 0.730 677 hate speech. Similarly, an insult like “ridicoli” Macro AVG 0.658 0.679 0.655 1000 is annotated as non hate speech in “CERTO CHE Second Run GLI ONOREVOLI DEL PD SI RICONOSCONO Non Hate 0.537 0.653 0.589 323 Hate 0.815 0.731 0.771 677 A KILOMETRI ... RIDICOLI” but as hate speech Macro AVG 0.676 0.692 0.680 1000 in “Ci vorrebbe anche qua Putin, invece di quei RIDICOLI...PAROLACCE PAROLACCE”. Table 4: Results on Cross-HaSpeeDe TW 5 Conclusions In this paper we presented an overview of the 4 Error Analysis and Discussion runs submitted for the four subtasks of HaSpeeDe evaluation exercise. We implemented a number Although all our runs obtained satisfactory re- of different models, comparing recurrent neural sults in each task, there is still room for improve- networks, ngram-based neural networks and lin- ment. In particular, we noticed that our models ear SVC. While RNNs perform better in three of have problems in classifying social media mes- four tasks, classification on Twitter data achieves sages containing the following specific phenom- a better ranking using the ngram based neural net- ena: (i) dialects (e.g. “un se ponno sentı̀...ma come work. Our system was ranked first among all se fà...”) or bad orthography (e.g. “Io no nesdune the teams in one of the cross-domain task, i.e. delle due.....momti pesanti”); (ii) sarcasm, “Dopo Cross-HaSpeeDe FB. This is probably due to the i campi rom via pure i centri sociali. L’unico fact that considering the whole sequence of inputs problema sarà distinguere gli uni dagli altri”; (iii) with a recurrent neural networks and using a pre- references to world knowledge, typically used for learned representation by using word embeddings an indirect attack not containing an explicit insult help the model to learn some common traits of (e.g. “un certo Adolf sarebbe utile ancora oggi hate speech across different social media. con certi soggetti”); (iv) metaphorical expressions, usually referring to ways to physically eliminate Acknowledgments the targets of hate speech messages (e.g. “Rus- pali”). Part of this work was funded by the CREEP As for false positives, some errors come from project (http://creep-project.eu/), a the misclassification of messages containing the Digital Wellbeing Activity supported by EIT lemmas “terrorista”, “terrorismo”, “immigrato” Digital in 2018. This research was also sup- that are extremely frequent in particular in the ported by the HATEMETER project (http:// Twitter dataset. These lemmas are associated to hatemeter.eu/) within the EU Rights, Equal- the hate speech class even when they appear in ity and Citizenship Programme 2014-2020. References F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Pretten- Christos Baziotis, Nikos Pelekis, and Christos Doulk- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas- eridis. 2017. DataStories at SemEval-2017 Task sos, D. Cournapeau, M. Brucher, M. Perrot, and 4: Deep LSTM with Attention for Message-level E. Duchesnay. 2011a. Scikit-learn: Machine learn- and Topic-based Sentiment Analysis. In Proceed- ing in Python. Journal of Machine Learning Re- ings of the 11th International Workshop on Semantic search, 12:2825–2830. 
References

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747–754, Vancouver, Canada, August. Association for Computational Linguistics.

Steven Bird and Edward Loper. 2004. NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection (HaSpeeDe) Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy, December. CEUR.org.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.

François Chollet et al. 2015. Keras. https://keras.io.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9(Aug):1871–1874.

Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems, pages 1019–1027.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to Forget: Continual Prediction with LSTM.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34. Association for Computational Linguistics.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3):436–465.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011a. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011b. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958.