=Paper=
{{Paper
|id=Vol-2765/175
|storemode=property
|title=CHILab @ HaSpeeDe 2: Enhancing Hate Speech Detection with Part-of-Speech Tagging (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2765/paper175.pdf
|volume=Vol-2765
|authors=Giuseppe Gambino,Roberto Pirrone
|dblpUrl=https://dblp.org/rec/conf/evalita/GambinoP20
}}
==CHILab @ HaSpeeDe 2: Enhancing Hate Speech Detection with Part-of-Speech Tagging (short paper)==
CHILab @ HaSpeeDe 2: Enhancing Hate Speech Detection with Part-of-Speech Tagging

Giuseppe Gambino and Roberto Pirrone
Dipartimento di Ingegneria, Università degli Studi di Palermo
giuseppe.gambino09@community.unipa.it, roberto.pirrone@unipa.it

Abstract

The present paper describes two neural network systems used for Hate Speech Detection tasks that make use not only of the pre-processed text but also of its Part-of-Speech (PoS) tags. The first system uses a Transformer Encoder block, a relatively novel neural network architecture that has emerged as a substitute for recurrent neural networks. The second system uses a Depth-wise Separable Convolutional Neural Network, a type of CNN that has become known in the field of image processing thanks to its computational efficiency. These systems were used for the participation in the HaSpeeDe 2 task of the EVALITA 2020 workshop under the team name CHILab, where our best system, the one based on the Transformer, ranked first in two out of four tasks and third in the other two. The systems have also been tested on the English, Spanish and German languages.

1 Introduction

Hate speech is unfortunately not a new problem in society, but it has recently found fertile ground in social media platforms that enable users to express themselves freely and often anonymously. While the ability to express oneself freely is a human right, inducing and spreading hate towards another group is an abuse of this liberty (MacAvaney et al., 2019).

As such, many online platforms such as Facebook, YouTube, Reddit, and Twitter consider hate speech harmful, and have both policies and instruments to remove hate speech content, which are getting better over time. Due to the societal concern and to how widespread hate speech is becoming on the Internet, there is strong motivation to study its automatic detection. In this way the spread of hateful content can be reduced, making the community's online space safer, and also more attractive for advertising sponsors who do not want their brand to be associated with hateful content. Detecting hate speech is, of course, a challenging task. In case of a wrong classification, for example, a content creator could suffer socio-economic consequences such as the demonetization of one of their contents or a ban from the platform. Therefore, the goal of hate speech detection is not only to identify a text that contains words that at first sight could be negative, but also to distinguish news headlines that report crime news from a text that contains an actual "attack" against a person or a group on the basis of attributes such as race, religion, ethnic origin, national origin, sex, disability, sexual orientation, or gender identity.

The rest of the paper is arranged as follows. Section 2 describes our systems for hate speech detection. Section 3 presents the results obtained in the HaSpeeDe 2 (Sanguinetti et al., 2020) task of the EVALITA 2020 (Basile et al., 2020) conference, together with results obtained on other languages. Results are discussed in Section 4 and conclusions are drawn in Section 5.

2 Description of the Systems

In this section we present the implementation details of all the architectures we used. Both systems share the use of a PoS Tagging technique, which is applied to the pre-processed text and passed as an additional input to the neural network.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2.1 Pre-processing

Before training a model, it is common practice to clean the data, especially when they are retrieved from social media. For this reason we implemented a classic text pre-processing pipeline that consists of: lower-casing the text; removing HTML tags, mentions and symbols; and standardizing words by cutting characters repeated more than two times in a row. We also made some keyword substitutions in all our data sets:

• URLs and the "url" keyword of the HaSpeeDe 2 data set were replaced by the symbol LINKURL

• Happy emoticons like " :) " or " :D " were replaced by the symbol HAPPYEMO

• Angry or sad emoticons like " :@ " or " :( " were replaced by the symbol BADEMO

It is important to note that we have not removed the emojis from the text, as our word embedding treats emojis as plain words.

Tweet | HS
@user useless people like all Muslims | 1
@user no more refugees in Italy please no more | 1
Four bicycles stolen from Milan-Sanremo cyclists: found in a gypsy camp url | 0
TRAGEDY IN PRISON - The nomad Carlo Helt takes his own life url | 0

Table 1: Some examples, translated into English, drawn from the development data set proposed in the HaSpeeDe 2 competition, together with their labels: nominal utterances used in hate speech along with journalistic tweets
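As a minimal sketch, the pre-processing steps described above can be expressed with Python's `re` module. The paper does not list its actual regular expressions, so the patterns below are illustrative assumptions only:

```python
import re

def preprocess(text: str) -> str:
    """Clean a tweet following the pipeline in the paper (patterns are illustrative)."""
    text = text.lower()                                            # lower-casing
    text = re.sub(r"<[^>]+>", " ", text)                           # drop HTML tags
    text = re.sub(r"@\w+", " ", text)                              # drop mentions
    text = re.sub(r"(https?://\S+|\burl\b)", " LINKURL ", text)    # URLs / "url" keyword
    text = re.sub(r"(:\)|:d)", " HAPPYEMO ", text)                 # happy emoticons
    text = re.sub(r"(:@|:\()", " BADEMO ", text)                   # angry or sad emoticons
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                     # cap repeated chars at two
    return re.sub(r"\s+", " ", text).strip()                       # normalize whitespace
```

For example, `preprocess("@user Cooool :) url")` yields `"cool HAPPYEMO LINKURL"`: the mention is stripped, the character run is capped, and the emoticon and URL keyword are mapped to their placeholder symbols.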
2.2 Part-of-Speech Tagging

In this work we use the PoS Tagging technique to provide our networks with more information about the meaning of a sentence, through an explicit classification based on its grammatical structure. This is a crucial point with regard to hate sentences, which tend to have particular structures. One of the most widespread hate sentence structures, for example, is the verbless one, also known as nominal utterance (Comandini et al., 2018). Another example are journalistic tweets (Comandini and Patti, 2019). Starting from a preliminary direct inspection of the development data set proposed in HaSpeeDe 2, we found that a journalistic tweet is usually a short tweet that ends with a URL. Such texts can easily be misclassified due to the presence of some negative words that describe the news. Table 1 reports some examples of these types of statements.

As the HaSpeeDe 2 organizers explicitly required the use of the same system for both tasks A and B, we set up a PoS Tagging model not too biased towards either news headlines or tweets. As a consequence, we enriched the PoS Tagger provided by the Python spaCy library (Honnibal and Montani, 2017). As this model is trained on Wikipedia, we used some regular expressions to add the keywords for emoticons, emojis, hashtags, and URLs to the vocabulary. In this way we injected some parts of speech of the social media language into a standard PoS Tagging model. We were definitely aware that tweet-oriented models such as the UDPipe tool (Straka, 2018) trained on the POSTWITA-UD Treebank (Sanguinetti et al., 2018) would have performed better than our solution on the in-domain data, but our solution guaranteed a more balanced performance. An example of our PoS Tagging is shown in Figure 1.

Figure 1: PoS Tagging example

2.3 Word Embedding

It is well known in the NLP community that word embeddings are one of the features that most affect the performance of a model. For our application we chose fastText (Bojanowski et al., 2016), a word embedding developed by Facebook Research. FastText enriches word vectors with subword information, treating each word as composed of n-grams: each word vector is the sum of the vector representations of each of its n-grams. In this way, two words will have nearby vectors not only if they appear in similar contexts but also if they are similar themselves. This is a great feature for handling the misspellings that occur often in social language.
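The subword idea can be illustrated with a toy sketch. The boundary markers and the 3-to-6 character n-gram range follow the fastText paper's description; the misspelled word below is a made-up example, not taken from the data set:

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set:
    """fastText-style character n-grams, with '<' and '>' boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

# A word and a (hypothetical) misspelling share most of their n-grams,
# so their summed n-gram vectors end up close in the embedding space.
a = char_ngrams("migranti")
b = char_ngrams("migrantii")          # hypothetical misspelling
jaccard = len(a & b) / len(a | b)     # high overlap despite the typo
```

Here the two forms share 22 of their 34 distinct n-grams, which is why vectors built as sums of n-gram vectors remain robust to the kind of typos common in tweets.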
We trained the word embedding for the Italian language from scratch with the Gensim library (Řehůřek and Sojka, 2010) on a 2014 MacBook Pro 13" with 8 GB RAM and the AVX2 FMA CPU extension; training took about 5 hours. The embedding model was trained for 10 epochs on 5 million Italian tweets, with size = 300, window size = 5, and min count = 2. These tweets were extracted from the TWITA 2018 dataset (Basile and Nissim, 2013) and are all related to the words: immigrati, islam, migranti, musulmani, profughi, rom, stranieri, salvini, criminali, africani, terroni, #dallavostraparte, #salvini, #stopinvasione, #piazzapulita, #quintacolonna. For the French, English and German tweets we used pre-trained models (Camacho-Collados et al., 2020). Regarding the PoS Tagging embedding, we applied TensorFlow's Embedding Layer for all the languages considered.

2.4 System 1: The Transformer

Transformers (Vaswani et al., 2017) are the current state-of-the-art models for dealing with sequences. Unlike previous architectures for NLP, such as LSTM and GRU, they have no recurrent connections and thus no real memory of previous states: Transformers get around this lack of memory by perceiving entire sequences simultaneously and treating them with an attention mechanism. In this way, Transformers achieve a degree of parallelism that leads to a significantly shorter training time than recurrent solutions. Attention is a means of selectively weighting different elements in the input data, so that they have an adjusted impact on the hidden states of downstream layers.

The Transformer was conceived as an encoder-decoder model, which is an ideal approach for machine translation tasks and language modeling. In this work we used the Transformer encoder architecture as an alternative to recurrent or convolutional neural networks (CNN) (see Figure 2). We used one Transformer encoder for the text input and one for the PoS input, then we averaged them through max pooling. Finally, we used dropout and a dense layer to get the output probabilities. After testing various combinations of parameters, we found the most efficient ones for this task to be: 12 heads in the Multi-Head attention layer, 768 hidden units, embedding size equal to 300, dropout = 0.2, and batch size equal to 128. Training lasted 3 epochs, about 40 seconds each.

Figure 2: The Transformer System
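The attention mechanism at the heart of each encoder can be sketched in simplified single-head form, leaving out the learned projection matrices of the full Multi-Head layer:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row is a distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens, d_model = 5, 300          # embedding size 300, as in the paper
x = rng.standard_normal((tokens, d_model))
out, attn = attention(x, x, x)    # self-attention over one toy sequence
```

Each output row is a weighted mixture of all token vectors at once, which is what removes the sequential dependency of recurrent networks and enables the parallelism noted above.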
2.5 System 2: Depth-wise Separable Convolutional Neural Network

Depth-wise Separable Convolution (DSC) is a well-known technique in Computer Vision used to dramatically lower the number of parameters in a CNN. DSC decomposes the classical 3D convolution by performing first a depth-wise spatial convolution for each channel, followed by a point-wise convolution which mixes the resulting output channels together. This computational trick mimics the true convolution kernel operation while reducing the size of the model and speeding up the training, with almost the same accuracy.

Our neural network architecture is reported in Figure 3, and takes inspiration from Yoon Kim's well-known architecture (Kim, 2014). We made some changes to take into consideration both the vectorized text and its PoS Tagging. The overall architecture is made of two parallel DSC networks that receive the text and PoS embeddings respectively. The two convolutional blocks are then averaged through max pooling.

Figure 3: The DSC System
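The parameter saving of the decomposition can be checked with a back-of-the-envelope count for a 1D convolution (bias terms ignored). The kernel size and embedding width below match the paper's setup, but the comparison itself is only illustrative:

```python
def conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Standard 1D convolution: one (kernel_size x c_in) filter per output channel."""
    return kernel_size * c_in * c_out

def separable_conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    """Depth-wise pass (one kernel per input channel) + point-wise 1x1 channel mixing."""
    return kernel_size * c_in + c_in * c_out

# With kernel size 2, a 300-dim embedding, and a 64-filter layer:
standard = conv1d_params(2, 300, 64)             # 2 * 300 * 64 = 38400 weights
separable = separable_conv1d_params(2, 300, 64)  # 2 * 300 + 300 * 64 = 19800 weights
```

The separable form needs roughly half the weights here, and the gap widens as kernel size or channel counts grow, which is why the DSC model trains so quickly.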
After testing various combinations of parameters, we found the most efficient setup for this task to be: [16, 32, 64] convolutional filters, kernel size = 2, dropout = 0.3, and batch size = 32. Training lasted 8 epochs, about 5 seconds each.

3 Results

In this Section we describe the HaSpeeDe 2 tasks of the EVALITA 2020 competition, and we present the results we obtained in each of them. To evaluate the degree of generality of our approach, we also tested it on hate speech detection tasks for languages other than Italian, namely English, Spanish and German. The official ranking reported for each run is given in terms of macro-average F-score.

3.1 HaSpeeDe 2 Task A - Hate Speech Detection

This is the main task, and it consists of a binary classification aimed at determining whether a message contains Hate Speech or not. We fine-tuned the parameters for this task and then used the resulting model as-is for the other tasks. We were provided with a labeled training set – made of tweets only – and two unlabeled test sets: one containing in-domain data, i.e. tweets, and the other out-of-domain data, i.e. news headlines. Our results for both Task A test sets are reported in Table 2.

Test data | Model | Rank | F1
news | Transformer | 1/27 | 0.7744
news | DSC | 4/27 | 0.7183
tweets | Transformer | 3/27 | 0.7893
tweets | DSC | 5/27 | 0.7782

Table 2: Results of the HaSpeeDe 2 Task A

3.2 HaSpeeDe 2 Task B - Stereotype Detection

Task B is a binary classification aimed at determining whether a message contains stereotypes or not. The task is motivated by the fact that stereotypes constitute a common source of error in HS identification (Francesconi et al., 2019). Task B data sets are the same as those of Task A. Our results for both the in-domain and out-of-domain test sets are reported in Table 3.

Test data | Model | Rank | F1
news | Transformer | 1/12 | 0.7203
news | DSC | 2/12 | 0.7184
tweets | Transformer | 3/12 | 0.7615
tweets | DSC | 5/12 | 0.7386

Table 3: Results of the HaSpeeDe 2 Task B

3.3 Multilingual Detection of Hate Speech

We also tested our systems against data sets coming from Hate Speech or Offensive Language detection tasks for other languages.

Table 4 reports the results of SemEval 2019 Task 5 (HatEval) (Basile et al., 2019) about the binary detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter.

 | English | Spanish
Min | 0.3500 | 0.4930
Mean | 0.4484 | 0.6821
Median | 0.4500 | 0.7010
Max | 0.6510 | 0.7300
Transformer | 0.6041 | 0.7423
DSC | 0.5823 | 0.7375

Table 4: Results of the HatEval Subtask A

Table 5 shows the results of GermEval 2019 Task 2 - Subtask A (Struß et al., 2019). The purpose of this task is to initiate and foster research on the binary identification of offensive content in German-language micro-posts.

 | German
Min | 0.5487
Mean | 0.7151
Median | 0.7295
Max | 0.7695
Transformer | 0.7384
DSC | 0.7240

Table 5: Results of the GermEval 2019 Task 2

4 Discussion

As can be seen from the results, the Transformer model always outperformed the DSC model: we expected this outcome due to the nature of the DSC model, which is designed to be as light as possible while still performing well. Regarding the results obtained with the Italian language, we are satisfied with our implementations, which achieved excellent ranking positions in all tasks. In particular, the Transformer model outperformed all the participating systems on out-of-domain data, ranking first. This can be seen as evidence of our model's ability to generalize starting from a training data set different from that of the application.
Regarding the results obtained with in-domain data, we performed slightly worse, ranking third. This is probably due to the PoS Tagging model we used: it is a model trained on Wikipedia rather than on social language and, even if slightly modified to handle hashtags, emoticons and URLs, it certainly does not perform as well on social texts as a PoS Tagging model trained purely on social media language.

As regards the results obtained with the other languages, we achieved an excellent result with Spanish, surpassing the first officially ranked system of the HatEval 2019 competition in Spanish. Our models do not achieve equally good results in English and German, even if the Transformer's score is always above the median value. We think that this is due to the nature of the languages: Germanic languages, such as English and German, probably benefit less than Romance ones from the additional use of PoS Tagging, at least in the way we used it. We are still investigating how to get added value from PoS Tagging for the English and German languages.

5 Conclusion

In this paper we have introduced two systems for hate speech detection in social media texts in the Italian, Spanish, English and German languages. The main feature of these models is that they use as input to the neural network not only the pre-processed text, but also its PoS tags. We are satisfied with the results obtained, because the systems implemented are light and well-performing. Furthermore, we have shown that including PoS Tagging as an additional input adds value, reaching the top positions in the task rankings. Our future work will focus on injecting more and more of the grammatical structure of a sentence into a model: in fact, we are planning a language model whose purpose is not only to predict a word given its context, but also to predict the PoS tag of that word.

References

Valerio Basile and Malvina Nissim. 2013. Sentiment analysis on Italian tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 100–107, Atlanta.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63, Minneapolis, Minnesota, USA, June. Association for Computational Linguistics.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Jose Camacho-Collados, Yerai Doval, Eugenio Martínez-Cámara, Luis Espinosa-Anke, Francesco Barbieri, and Steven Schockaert. 2020. Learning Cross-lingual Embeddings from Twitter via Distant Supervision. In Proceedings of ICWSM.

Gloria Comandini and Viviana Patti. 2019. An Impossible Dialogue! Nominal Utterances and Populist Rhetoric in an Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Third Workshop on Abusive Language Online, pages 163–171. Association for Computational Linguistics.

Gloria Comandini, Manuela Speranza, and Bernardo Magnini. 2018. Effective Communication without Verbs? Sure! Identification of Nominal Utterances in Italian Social Media Texts. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10-12, 2018, volume 2253 of CEUR Workshop Proceedings. CEUR.org.

Chiara Francesconi, Cristina Bosco, Fabio Poletto, and M. Sanguinetti. 2019. Error Analysis in a Hate Speech Detection Task: The Case of HaSpeeDe-TW at EVALITA 2018. In CLiC-it.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Sean MacAvaney, Hao-Ren Yao, Eugene Yang, Katina Russell, Nazli Goharian, and Ophir Frieder. 2019. Hate speech detection: Challenges and solutions. PLOS ONE, 14(8):1–16, 08.

Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, Oronzo Antonelli, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May. European Language Resources Association (ELRA).

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Milan Straka. 2018. UDPipe 2.0 Prototype at CoNLL 2018 UD Shared Task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207, Brussels, Belgium, October. Association for Computational Linguistics.

Julia Struß, Melanie Siegel, Josef Ruppenhofer, Michael Wiegand, and Manfred Klenner. 2019. Overview of GermEval Task 2, 2019 Shared Task on the Identification of Offensive Language. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30.