UO @ HaSpeeDe2: Ensemble Model for Italian Hate Speech Detection

Mariano Jason Rodriguez Cisnero, Universidad de Oriente, Santiago de Cuba, Cuba, mjasoncuba@gmail.com
Reynier Ortega Bueno, Universidad de Oriente, Santiago de Cuba, Cuba, reynier@uo.edu.cu

Abstract

English. This document describes our participation in the Hate Speech Detection task at Evalita 2020. Our system is based on deep learning techniques, specifically RNNs and an attention mechanism, combined with transformer representations and linguistic features. In the training process, multi-task learning was used to increase the system's effectiveness. The results show that some of the selected features did not combine well within the model. Nevertheless, the level of generalization achieved yields encouraging results.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Modern societies have found easy and appealing ways of sharing information via social media, and users enjoy the freedom to express themselves through online communication. Even though the ability to express oneself freely is a human right, some users take this opportunity to spread hateful content, and this kind of information carries a dangerous and hurtful potential. Automatically recognizing such content is therefore an interesting topic for researchers.

Creative methods have been proposed to tackle the task of recognizing hate in texts (De la Pena Sarracén et al., 2018; Gambäck and Sikdar, 2017). Some of these works approach the problem with feature extraction (Schmidt and Wiegand, 2017) and classification algorithms such as SVM (Santucci et al., 2018). In recent years, deep learning has become one of the most successful research directions in Natural Language Processing (NLP), with exciting investigations on this topic, such as (Cimino et al., 2018), involving LSTMs (Liu and Guo, 2019) and transformers (Vaswani et al., 2017), which have gained attention in the NLP community due to their results.

We propose a model based on multiple representations learned by means of deep learning techniques and linguistic knowledge: a Long Short Term Memory architecture combined with linguistic features and language model representations given by a particular transformer model, BERT.

The paper is organized as follows. Section 2 gives a brief description of the HaSpeeDe task. Our hate detection system is presented in Section 3. The experiments and results are discussed in Section 4. Finally, Section 5 presents the conclusions and future directions. The code of this work is available on GitHub: https://github.com/mjason98/evalita20_hate

2 HaSpeeDe2 Task

Hate speech and stereotype recognition on social media has become an attractive research area from the computational point of view. In the second edition of HaSpeeDe (Sanguinetti et al., 2020) at Evalita 2020 (Basile et al., 2020), the organizers proposed three subtasks. The main one, subtask A, aims at determining the presence or absence of hateful content in a text; its dataset is composed of 6839 short texts, 2766 labeled as hate speech and 4076 as not hate speech. In this work we focused only on subtask A. Subtask B is a binary classification problem oriented to stereotype detection, and subtask C is a sequence labeling task that aims at recognizing Nominal Utterances in hateful tweets.
3 Our Proposal

We dealt with the hate detection task as a text classification problem with "hateful" and "not hateful" categories. We train a deep learning model based on an attention mechanism and Recurrent Neural Networks, specifically a Bidirectional Long Short Term Memory (Bi-LSTM) (Hochreiter and Schmidhuber, 1997), combined with linguistic features and transformer representations by means of an interpretable multi-source fusion component (Karimi et al., 2018).

Sections 3.1 and 3.2 describe the linguistic features and the transformer representation used in this work. Section 3.3 presents the preprocessing phase. Finally, the neural network model and the feature ensemble are described in Section 3.4.

3.1 Linguistic Features

To build the hate detection model, we start by extracting several sets of linguistic features:

WordNet features: We count the number of verbs, adverbs, nouns and adjectives. Also, for every word, we calculate the average of its similarity with respect to the other words using the path similarity function provided by the WordNet corpus (accessed through the Python nltk library). Furthermore, we consider the degree of lexical ambiguity by counting the number of synsets of each word within the text.

Hurt and sentiment content: HurtLex (Bassignana et al., 2018) is a lexicon of offensive, aggressive, and hateful words in over 50 languages. The words falling into the 17 categories offered by the lexicon are counted and added as linguistic features, jointly with polarity and semantic values obtained from the SenticNet (Cambria et al., 2018) corpus.

Information gain: Information gain (Lewis, 1992) has been a good feature selection measure for text categorization. It takes into account the presence of a term in a category as well as its absence, and can be defined by:

IG(t_k, C_i) = \sum_{C} \sum_{t} p(t, C) \cdot \log_2 \frac{p(t, C)}{p(t) \cdot p(C)}

where C \in \{C_i, \bar{C}_i\} and t \in \{t_k, \bar{t}_k\}. In this formula, probabilities are interpreted over an event space of documents; for example, p(\bar{t}_k, C_i) is the probability that, for a random document d belonging to category C_i, the term t_k does not occur in d. In our case there are two categories, hateful and not hateful, and the term is the word's lemma.

To create the information gain feature (IgF), we calculate the IG of every word and keep the words with the highest values (the top 50). Then, the occurrences of those selected words in the text are counted.

3.2 Italian BERT

Finally, we use a pre-trained BERT (https://huggingface.co/dbmdz/bert-base-italian-cased) to compute a deep representation of the text. One of the most widely used auto-encoding pre-trained Language Models (PLMs) is BERT (Devlin et al., 2018). BERT is trained with the masked language modeling task, which randomly masks some tokens in a text sequence and then independently recovers the masked tokens by conditioning on the encoding vectors produced by a bidirectional Transformer.

Inside BERT, the information is passed forward across transformer layers. In this work, we used the output of one specific layer; this operation can be expressed by:

h_0 = H_{l_0}(text_{tok})
h_i = H_{l_i}(h_{i-1})
h_n = H_{l_n}(h_{n-1})

where text_{tok} is the text after tokenization (the text is represented as a vector of integers using the tokenizer of the BERT model), h_i is the output of the i-th transformer layer H_{l_i}, called its hidden state, and n is the total number of transformer layers in BERT. Then, for a specific i, the vector f_{bert} is computed from the order-2 tensor h_i as a deep representation of the initial text, which acts as the PLM feature:

v = \sum_{k=0} h_i[k, :]        f_{bert} = \frac{v}{||v||}
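As an illustration, the following is a minimal sketch of how an f_{bert} representation of this kind could be computed with the Hugging Face transformers library. The model handle follows the checkpoint cited above; the function name and the choice of layer index are assumptions made for the example, since the paper does not release this snippet.

# Minimal sketch (not the authors' exact code): compute f_bert as the
# L2-normalized sum of the token vectors of one BERT hidden layer.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
bert = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased",
                                 output_hidden_states=True)

def bert_feature(text: str, layer: int = -1) -> torch.Tensor:
    # Tokenize the text into a vector of integer ids (text_tok in the paper).
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**enc)
    # hidden_states[layer] has shape (1, seq_len, hidden_size): the order-2 tensor h_i.
    h_i = out.hidden_states[layer][0]      # (seq_len, hidden_size)
    v = h_i.sum(dim=0)                     # sum over token positions k
    return v / v.norm()                    # f_bert = v / ||v||

Calling bert_feature("testo di esempio") returns a fixed-size vector that can be concatenated or fused with the other feature sources described below.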
3.3 Preprocessing

In the preprocessing step, stopwords are first removed. Then, hashtags composed of several words are split (e.g., #NessunDorma becomes # nessun dorma); we use a regular-expression algorithm to achieve this step (the automaton was built with Python's re library and the words of an Italian corpus). Secondly, using the FreeLing tool (http://nlp.lsi.upc.edu/freeling/index.php) we obtain the lemma of each word, and non-alphanumeric characters are removed. Finally, the remaining words are represented as vectors using a pre-trained word embedding generated with the Word2Vec model (Mikolov et al., 2013).

3.4 The Deep Ensemble Model

The standard LSTM receives, sequentially at each time step, a vector x_t and produces a hidden state h_t. Each hidden state h_t is calculated as follows:

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})
u_t = \sigma(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})
c_t = i_t \otimes u_t + f_t \otimes c_{t-1}
h_t = o_t \otimes \tanh(c_t)        (1)

where all W^{(*)}, U^{(*)} and b^{(*)} are parameters to be learned during training, \sigma is the sigmoid function and \otimes stands for element-wise multiplication.

A Bidirectional LSTM performs the same operations as the standard LSTM but processes the incoming text in left-to-right and right-to-left order in parallel. Thus, its output becomes \hat{h}_t = [\overrightarrow{h}_t, \overleftarrow{h}_t], the concatenation of the two directions.

By adding an attention mechanism, we allow the model to decide which part of the sequence to "attend to". First, let us define the softmax function \pi(v) for a vector v = [v_0, \cdots, v_{n-1}] as:

\pi(v) = \frac{e^{v}}{\sum_{i=0} e^{v_i}}

Then, let I \in R^{N \times L} be the matrix of input vectors, where L is their size and N is the length of the given sequence. We define the attention layer (AttLSTM) as a regular LSTM layer like (1) with the extra operations described as follows:

a_{k,t} = \pi(W_k \cdot h^{T}_{t-1} + b_k)
\alpha_{k,t} = a^{T}_{k,t} \cdot I
\beta_t = [\alpha_{0,t}, \cdots, \alpha_{S-1,t}]
x_t = W_a \cdot \beta_t + b_a        (2)

Here k \in [0, S-1] indexes the attention heads, W_k \in R^{N \times M} where M is the size of the hidden state vector h_t, W_a \in R^{M \times SM}, and b_a and b_k are learnable parameters. (*)^T is the transpose operation, and the output of the layer is O = [h_0, ..., h_t, ..., h_N], the concatenation of the hidden states produced by the AttLSTM at each time step.

As mentioned before, we propose a feature ensemble using an interpretable multi-source fusion component (IMF). The IMF aims to combine features from different sources. A naive way of doing this is to concatenate the vector representations into a single vector; such a scheme treats all sources equally, although one source may be more useful than the others. With the IMF we instead weight the contribution of every feature source via an attention mechanism. The IMF can be expressed by:

r_i = \tanh(W_{p_i} f_i + b_{p_i})

where r_i is a projection of f_i, the i-th feature vector passed to the IMF, ensuring that every r_i has the same size. In this step, all the W_{p_i}, b_{p_i}, W_a and b_a are parameters to be learned during training. Then:

a_i = W_a r_i + b_a        \alpha_i = \pi(a_i)
\beta_i = \alpha_i r_i        z = \sum_{k=0} \beta_k        (3)

where \alpha_i represents the importance of r_i for the final computation of z, the IMF output.
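As an illustration of the fusion step just described, the sketch below implements an IMF-style module in PyTorch under our reading of equation (3): each feature source is projected to a common size with a tanh layer, scored with a shared linear map, and the softmax-normalized scores weight the projected vectors before they are summed. The module name, interface and the sizes in the usage example are assumptions for illustration, not the authors' released code.

# Minimal sketch (not the authors' code) of interpretable multi-source fusion:
# attention over feature sources, following Eq. (3).
import torch
import torch.nn as nn

class IMF(nn.Module):
    def __init__(self, input_sizes, proj_size):
        super().__init__()
        # One projection W_{p_i}, b_{p_i} per feature source f_i -> r_i.
        self.projections = nn.ModuleList(
            [nn.Linear(size, proj_size) for size in input_sizes]
        )
        # Shared scoring map W_a, b_a producing one scalar per source.
        self.score = nn.Linear(proj_size, 1)

    def forward(self, features):
        # features: list of tensors, one per source, each of shape (batch, size_i).
        r = [torch.tanh(p(f)) for p, f in zip(self.projections, features)]
        r = torch.stack(r, dim=1)                 # (batch, n_sources, proj_size)
        a = self.score(r).squeeze(-1)             # (batch, n_sources)
        alpha = torch.softmax(a, dim=1)           # importance of each source
        z = (alpha.unsqueeze(-1) * r).sum(dim=1)  # weighted sum of projections
        return z, alpha

# Hypothetical usage: fuse the sequence-model output with BERT and handcrafted features.
# imf = IMF(input_sizes=[256, 768, 60], proj_size=128)
# z, alpha = imf([ob2, f_bert, f_handcrafted])

Returning alpha alongside z is what makes the component interpretable: the weights indicate how much each source contributed to a given prediction.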
To increase the learning power of our system, we use multi-task learning (Caruana, 1997), predicting the polarity of tweets in parallel with the classes of the hate speech detection subtask. This approach had been explored before (Cimino et al., 2018) in HaSpeeDe at Evalita 2018 (Bosco et al., 2018). The tweets used for the auxiliary polarity task are extracted from the Sentipolc-2016 (Barbieri et al., 2016) challenge.

Finally, we present the composition of the previous layers and features that creates our deep ensemble model:

E = [w_0, w_1, \cdots, w_{N-1}]
ob_1 = BiLSTM(E)        (4)

where E is the vector representation of the text (see Section 3.3). Equation (4) is the first block of our model; the second block can be described as follows:

A = AttLSTM(ob_1)
m_i = \max_{j=0,\cdots,N-1} A_{j,i}
ob_2 = [m_0, \cdots, m_{M-1}]        (5)

The vector ob_2 is the output of a max-pooling layer over the sequence of vectors A. Then:

F = [ob_2, f_{bert}, f_{wn}, f_{hs}, f_{ig}]
ob_3 = IMF(F)
\hat{y} = \sigma(W_h ob_3 + b_h)
\hat{y}_f = \sigma(W_f ob_3 + b_f)        (6)

The third block is described in (6), where W_h, W_f, b_f and b_h are learnable parameters and \hat{y}, \hat{y}_f \in R. The vectors f_{bert}, f_{wn}, f_{hs} and f_{ig} correspond to the BERT, WordNet, Hurt-Sentiment and Information Gain features, respectively. The polarity of a tweet is predicted through \hat{y}_f and its hate value through \hat{y}.

The overall weighted loss of the model is computed with cross-entropy, giving a higher importance to the hate speech predictions than to the polarity predictions. It is calculated according to the following formula:

L_1 = -\sum_i y_i \log(\hat{y}_i)        L_2 = -\sum_i y_{f_i} \log(\hat{y}_{f_i})
loss = \lambda L_1 + (1 - \lambda) L_2        (0 \le \lambda \le 1)        (7)

Here L_1 and L_2 are the cross-entropy losses of the hate predictions and of the sentiment polarity predictions, respectively, \lambda is the main-task importance weight, and y_i and y_{f_i} are the ground-truth hate and polarity labels. The final loss is thus a convex combination of L_1 and L_2.
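To make the two output heads of equation (6) and the weighted loss of equation (7) concrete, the following is a minimal PyTorch sketch of the final block, assuming the fused vector ob_3 and binary labels for both tasks are available; the class name, interface and sizes are illustrative assumptions, with \lambda = 0.75 taken from the hyperparameters reported in Section 4.

# Minimal sketch (not the authors' code) of the two prediction heads and the
# weighted multi-task loss of Eq. (7).
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, fused_size, lam=0.75):
        super().__init__()
        self.hate_head = nn.Linear(fused_size, 1)      # W_h, b_h
        self.polarity_head = nn.Linear(fused_size, 1)  # W_f, b_f
        self.lam = lam
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, ob3, y_hate=None, y_polarity=None):
        y_hat = self.hate_head(ob3).squeeze(-1)        # logit for the hate class
        y_hat_f = self.polarity_head(ob3).squeeze(-1)  # logit for the polarity class
        if y_hate is None:
            # Inference: return the two sigmoid probabilities.
            return torch.sigmoid(y_hat), torch.sigmoid(y_hat_f)
        # Training: loss = lambda * L1 + (1 - lambda) * L2.
        l1 = self.bce(y_hat, y_hate)
        l2 = self.bce(y_hat_f, y_polarity)
        return self.lam * l1 + (1.0 - self.lam) * l2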
4 Experiments and Results

In this section we present and discuss the results of our proposed method on subtask A. The organizers allowed a maximum of two submissions per subtask, and our team ran under the name UO.

The experiments were conducted in two main directions: first, to investigate the impact of the IMF fusion strategy, and second, to evaluate the impact of each proposed single-modal representation on our proposal. The results of these experiments are presented in Table 1 and Table 2.

In those tables, the column heads gives the number of attention heads in the AttLSTM layer; a dash means this layer was not used. The columns bert and ig correspond to the presence or absence of the BERT and IG representations, and the column wn-hs to the presence of the Hurt-Sentiment and WordNet based representations. A cross in a cell means that the corresponding representation was not used in that run. We used 10% of the training dataset for validation and report the accuracy computed on this validation data.

Both tables show that the presence of BERT increases performance, and almost all runs reach higher values with the IMF than without it. Increasing the number of attention heads without the IMF improves the results, but the opposite occurs in the presence of the IMF.

Table 1: Experiment results without IMF.

Name   heads   bert   ig   wn-hs   acc
run1   2                           0.764386
run2   -              ×    ×       0.742690
run3   3                           0.767544
run4   2       ×                   0.713450
run5   2              ×            0.763158
run6   -                           0.757310
run7   -       ×                   0.724152
run8   -              ×            0.755848

Table 2: Experiment results with IMF.

Name   heads   bert   ig   wn-hs   acc
run1   2                           0.795848
run2   -              ×    ×       0.779101
run3   3                           0.764620
run4   2       ×                   0.720760
run5   2              ×            0.774854
run6   -                           0.767544
run7   -       ×                   0.719298
run8   -              ×            0.777778

The pre-trained embeddings have a size of 300, and the number of neurons in both the Bi-LSTM and the AttLSTM is 128. The λ value was set to 0.75, the dropout (Srivastava et al., 2014) after the embedding layer to 0.3, and the whole model was trained with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.01.

From Table 2, run1 and run2 were chosen as the final submissions for the subtask. run1 uses the attention layer described in Section 3.4 and considers all proposed representations. run2 does not use the attention mechanism or the handcrafted features, relying only on the BERT text representation and the rest of the architecture.

Table 3 shows the official results of our system. The evaluation was performed on two distinct corpora: one composed of tweets and the other of news headlines.

Table 3: Official results.

Runs                    macro-F
UO:tweets run1          0.6878
UO:tweets run2          0.7214
BEST RATED:tweets       0.8088
UO:news run1            0.6657
UO:news run2            0.7314
BEST RATED:news         0.7744

These results show that, of our two models, the simpler one obtained the better results: greater complexity does not guarantee better performance with deep learning. They also indicate that some linguistic features decrease the effectiveness of the model, but the similarity between the results on the tweet and news evaluation sets suggests that the system is able to generalize with good performance.

5 Conclusions and Future Work

In this paper we presented an ensemble model for subtask A of the Hate Speech Detection task (HaSpeeDe2) at Evalita 2020. Our proposal combines linguistic features and RNNs with transformer representations through an IMF. In the training phase, we used a multi-task learning approach to recognize hate speech and polarity simultaneously. The achieved results show the ability of this ensemble to generalize the detection of hateful content across different text genres. Nevertheless, some handcrafted features lower its results. Motivated by this, we plan to explore better feature selection, other attention mechanisms and multi-task learning techniques to improve the performance.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 sentiment polarity classification task.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. HurtLex: A multilingual lexicon of words to hurt. In 5th Italian Conference on Computational Linguistics, CLiC-it 2018, volume 2253, pages 1-6. CEUR-WS.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the Evalita 2018 hate speech detection task. In EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, pages 1-9. CEUR.

Erik Cambria, Soujanya Poria, Devamanyu Hazarika, and Kenneth Kwok. 2018. SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41-75.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at Evalita 2018. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), pages 86-95.

Gretel Liz De la Pena Sarracén, Reynaldo Gil Pons, Carlos Enrique Muniz Cuza, and Paolo Rosso. 2018. Hate speech detection using attention-based LSTM. EVALITA Evaluation of NLP and Speech Tools for Italian, 12:235.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Björn Gambäck and Utpal Kumar Sikdar. 2017. Using convolutional neural networks to classify hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pages 85-90.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Hamid Karimi, Proteek Roy, Sari Saba-Sadiya, and Jiliang Tang. 2018. Multi-source multi-class fake news detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1546-1557.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

David D. Lewis. 1992. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-50.

Gang Liu and Jiabao Guo. 2019. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing, 337:325-338.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111-3119.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. Overview of the Evalita 2020 second hate speech detection task (HaSpeeDe 2). In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.
Valentino Santucci, Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari. 2018. Detecting hate speech for Italian language in social media. In EVALITA 2018, co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), volume 2263.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.