Hate Speech Detection using Attention-based LSTM

Gretel Liz De la Peña Sarracén1, Reynaldo Gil Pons2, Carlos Enrique Muñiz Cuza2, Paolo Rosso1
1 PRHLT Research Center, Universitat Politècnica de València, Spain
gredela@posgrado.upv.es, prosso@dsic.upv.es
2 CERPAMID, Cuba
{rey,carlos}@cerpamid.co.cu

Abstract

English. This paper describes the system we developed for EVALITA 2018, the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian, on Hate Speech Detection (HaSpeeDe). The task consists in automatically annotating Italian messages from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence or not of hate speech. We propose an Attention-based Long Short-Term Memory Recurrent Neural Network in which the attention layer helps to calculate the contribution of each part of the text towards targeted hateful messages.

Italiano. In this paper we describe the system we developed for the Hate Speech Detection (HaSpeeDe) task at EVALITA 2018, the sixth evaluation campaign for natural language processing. The task consists in automatically annotating Italian texts from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence or not of hate speech. Our approach uses an attention-based LSTM recurrent neural network, in which the attention layer helps to calculate the contribution of each portion of the text towards targeted hateful messages.

1 Introduction

In recent years, Hate Speech (HS) has become a major issue and a hot topic in the domain of social media. Some key aspects that characterize it (such as virality, or presumed anonymity) distinguish it from offline communication and make it potentially more dangerous and hurtful. Therefore, the identification of HS is an important step in addressing the urgent need for effective counter-measures to this issue.

The evaluation campaign EVALITA 2018 (http://www.evalita.it/2018) launched this year the HaSpeeDe (Hate Speech Detection) task (http://www.di.unito.it/tutreeb/haspeede-evalita18/index.html) (Bosco et al., 2018). It consists in automatically annotating messages from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence (or not) of HS.

Deep neural networks are widely studied due to their flexibility in capturing nonlinear relationships. Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) are among the most used in Natural Language Processing (NLP), as they are able to learn dependencies over considerably long sequences. Moreover, attention models have become an effective mechanism for obtaining better results (Yang et al., 2017; Zhang et al., 2017; Wang et al., 2016; Lin et al., 2017; Rush et al., 2015). In (Yang et al., 2016), the authors use a hierarchical attention network for document classification. The model has two levels of attention mechanisms, applied at the word and sentence level, enabling it to attend differentially to more and less important content when constructing the document representation. The experiments show that the architecture outperforms previous methods by a substantial margin. In this paper, we propose a similar Attention-based LSTM for HaSpeeDe. The attention layer is applied on top of a Bidirectional LSTM to generate a context vector for each word embedding, which is then fed to another LSTM network to detect the presence or not of hate in the text.

The paper is organized as follows. Section 2 describes our system. Experimental results are then discussed in Section 3. Finally, we present our conclusions with a summary of our findings in Section 4.
2 System

2.1 Preprocessing

In the preprocessing step, the text is cleaned. Firstly, emoticons are recognized and replaced by corresponding words that express the sentiment they convey. Also, all links and URLs are removed. Afterwards, the text is morphologically analyzed with FreeLing (Padró and Stanilovsky, 2012), and each resulting token is assigned its lemma. Then, the texts are represented as vectors with a word embedding model; we used pre-trained Italian word vectors from fastText (Bojanowski et al., 2016).

2.2 Method

We propose a model that consists of a Bidirectional LSTM neural network (Bi-LSTM) at the word level, as Figure 1 shows. At each time step t the Bi-LSTM receives as input a word vector x_t carrying syntactic and semantic information, known as a word embedding (Mikolov et al., 2013). Afterwards, an attention layer is applied over each hidden state ĥ_t. The attention weights are learned using the concatenation of the Bi-LSTM hidden states ĥ_t and the past hidden state s_{t-1} of the Post-Attention LSTM (Pos-Att-LSTM). Finally, the presence of hate (or not) in a text is predicted by this final Pos-Att-LSTM network.

[Figure 1: General architecture]

2.3 Bidirectional LSTM

In NLP problems, a standard LSTM receives sequentially (in left-to-right order) at each time step a word embedding x_t and produces a hidden state h_t. Each hidden state h_t is calculated as follows:

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})   (input gate)
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})   (forget gate)
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})   (output gate)
u_t = \sigma(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})   (new memory)
c_t = i_t \otimes u_t + f_t \otimes c_{t-1}             (final memory)
h_t = o_t \otimes \tanh(c_t)

where all W_*, U_* and b_* are parameters to be learned during training, \sigma is the sigmoid function and \otimes stands for element-wise multiplication.

The bidirectional LSTM performs the same operations as the standard LSTM but processes the incoming text in left-to-right and right-to-left order in parallel, so that at each time step it outputs two hidden states, a forward state \overrightarrow{h}_t and a backward state \overleftarrow{h}_t. The proposed method uses a Bidirectional LSTM network in which each new hidden state is the concatenation of these two, \hat{h}_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]. The idea of this Bi-LSTM is to capture long-range and backwards dependencies.
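To make these recurrences concrete, the following is a minimal NumPy sketch of one LSTM step and of the forward/backward concatenation; the function names, toy dimensions and initialization scale are illustrative assumptions rather than the implementation used for the submitted runs.

```python
# Minimal NumPy sketch of the LSTM recurrences above and of the Bi-LSTM
# concatenation; names, initialization scale and dimensions are illustrative
# assumptions, not the authors' code.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, rng):
    """Random parameters for the four gates (i, f, o, u)."""
    gates = "ifou"
    W = {g: 0.1 * rng.standard_normal((n_h, n_x)) for g in gates}
    U = {g: 0.1 * rng.standard_normal((n_h, n_h)) for g in gates}
    b = {g: np.zeros(n_h) for g in gates}
    return W, U, b

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of the recurrences: gates, final memory c_t, hidden state h_t."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    u_t = sigmoid(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # new memory (sigmoid, as in the text)
    c_t = i_t * u_t + f_t * c_prev                           # final memory
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_lstm(X, params, n_h):
    """Run an LSTM left-to-right over a sentence X of shape (T_x, n_x)."""
    W, U, b = params
    h, c = np.zeros(n_h), np.zeros(n_h)
    states = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, W, U, b)
        states.append(h)
    return np.stack(states)                                  # (T_x, n_h)

def bi_lstm(X, fwd_params, bwd_params, n_h):
    """h_hat_t = concatenation of forward and backward hidden states."""
    h_fwd = run_lstm(X, fwd_params, n_h)
    h_bwd = run_lstm(X[::-1], bwd_params, n_h)[::-1]         # right-to-left pass, re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)            # (T_x, 2 * n_h)

# Toy usage: 7 tokens with 300-dimensional embeddings (e.g. fastText), n_h = 64.
rng = np.random.default_rng(0)
X = rng.standard_normal((7, 300))
H_hat = bi_lstm(X, init_params(300, 64, rng), init_params(300, 64, rng), 64)
print(H_hat.shape)   # (7, 128)
```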
2.4 Attention Layer

With an attention mechanism we allow the Bi-LSTM to decide which parts of the sentence it should "attend" to. Importantly, we let the model learn what to attend to on the basis of the input sentence and of what it has produced so far. Figure 2 shows the general attention mechanism.

[Figure 2: Attention layer]

Let H \in R^{2 N_h \times T_x} be the matrix of hidden states [\hat{h}_1, \hat{h}_2, ..., \hat{h}_{T_x}] produced by the Bi-LSTM, where N_h is the size of the hidden state and T_x is the length of the given sentence. The goal is then to derive a context vector c_t that captures relevant information and to feed it as input to the next level (Pos-Att-LSTM). Each c_t is calculated as follows:

c_t = \sum_{t'=1}^{T_x} \alpha_{t,t'} \hat{h}_{t'}

\alpha_{t,t'} = \frac{\beta_{t,t'}}{\sum_{i=1}^{T_x} \beta_{t,i}}

\beta_{t,t'} = \tanh(W_a [\hat{h}_{t'}, s_{t-1}] + b_a)

where W_a and b_a are the trainable attention weights, s_{t-1} is the past hidden state of the Pos-Att-LSTM and \hat{h}_{t'} is the hidden state of the Bi-LSTM at position t'. The idea of the concatenation is to take into account not only the input sentence but also the past hidden state when producing the attention weights.

2.5 Post-Attention LSTM

The goal of the Post-Att-LSTM is to predict whether the text is hateful or not. At each time step this network receives the context vector c_t, which is propagated until the final hidden state s_{T_x}. This vector is a high-level representation of the text and is used in the final softmax layer as follows:

\hat{y} = softmax(W_g s_{T_x} + b_g)

where W_g and b_g are the parameters of the softmax layer. Finally, cross entropy is used as the loss function, which is defined as:

L = - \sum_i y_i \log(\hat{y}_i)

where y_i is the true classification of the i-th text.
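Following the same conventions, the sketch below illustrates how the attention weights, the context vectors and the final prediction of Sections 2.4 and 2.5 can be computed. It reuses lstm_step, init_params and H_hat from the previous sketch; the decoder size and parameter shapes are again illustrative assumptions rather than the settings of the submitted system.

```python
# Rough sketch of the attention layer and Post-Attention LSTM; it reuses
# lstm_step, init_params and H_hat from the previous sketch. Shapes and the
# decoder size n_s are illustrative assumptions.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_vector(H_hat, s_prev, W_a, b_a):
    """c_t = sum over t' of alpha_{t,t'} * h_hat_{t'}, scores from [h_hat_{t'}, s_{t-1}]."""
    beta = np.array([np.tanh(W_a @ np.concatenate([h, s_prev]) + b_a)
                     for h in H_hat])        # one score per input position
    alpha = beta / beta.sum()                # normalization as in the text (a softmax is a common alternative)
    return alpha @ H_hat                     # weighted sum of Bi-LSTM states

def pos_att_lstm(H_hat, dec_params, W_a, b_a, W_g, b_g, n_s):
    """Run the Pos-Att-LSTM over context vectors and classify its final state."""
    W, U, b = dec_params
    s, c = np.zeros(n_s), np.zeros(n_s)
    for _ in range(len(H_hat)):              # one decoder step per input position
        ctx = context_vector(H_hat, s, W_a, b_a)
        s, c = lstm_step(ctx, s, c, W, U, b)
    return softmax(W_g @ s + b_g)            # [P(hate), P(no hate)]

# Toy usage on the H_hat produced above (shape (7, 128)).
rng = np.random.default_rng(1)
n_in, n_s = H_hat.shape[1], 64
y_hat = pos_att_lstm(H_hat,
                     init_params(n_in, n_s, rng),
                     0.1 * rng.standard_normal(n_in + n_s), 0.0,
                     0.1 * rng.standard_normal((2, n_s)), np.zeros(2),
                     n_s)
print(y_hat)   # two-class probabilities summing to 1
```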
3 Results

Table 1 shows the results obtained by the different variants of the proposed method with 5-fold cross-validation on the training set, in terms of F1-score, precision and recall. The models are: M1 - LSTM+Att+LSTM (run1), M2 - LSTM+Att+LSTM (run2), M3 - Bi-LSTM+Att+LSTM (run1) and M4 - Bi-LSTM+Att+LSTM (run2). run2, in M2 and M4, identifies the models that take the dictionaries into account.

         Twitter              Facebook
         F1     P      R      F1     P      R
SVM      0.748  0.772  0.737  0.780  0.787  0.781
M1       0.869  0.881  0.863  0.865  0.872  0.863
M2       0.865  0.867  0.865  0.894  0.895  0.894
M3       0.853  0.860  0.854  0.864  0.873  0.864
M4       0.877  0.891  0.871  0.899  0.903  0.899

Table 1: 5-fold cross-validation results on the training corpus (Twitter and Facebook) in terms of F1-score (F1), Precision (P) and Recall (R). The best results are in bold.

As run1, in M1 and M3, we first evaluated the model described above, which is composed of the Bi-LSTM, the attention layer and the LSTM (Bi-LSTM+Att+LSTM). A variation of this model was also derived in order to analyze the contribution of the Bi-LSTM layer: we substituted the Bi-LSTM with a plain LSTM (LSTM+Att+LSTM).

Then, we processed the training sets to generate resources that we call the hate words dictionaries. For each training set we generated a dictionary of the most common words in the texts labeled as hateful. Using these dictionaries, we added a linguistic feature to each text indicating whether it contains a word from the corresponding dictionary. Thus, run2 of each model is obtained by considering this linguistic feature.
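As a rough illustration of this feature (the whitespace tokenization, the 500-word cut-off and the data format are assumptions, not the exact procedure used for the runs), such a dictionary and its binary indicator could be built as follows:

```python
# Hypothetical sketch of the "hate words dictionary" feature; the whitespace
# tokenization and the top_k cut-off are assumptions.
from collections import Counter

def build_hate_dictionary(texts, labels, top_k=500):
    """Most frequent words in the training texts labeled as hateful (label 1)."""
    counts = Counter()
    for text, label in zip(texts, labels):
        if label == 1:
            counts.update(text.lower().split())
    return {word for word, _ in counts.most_common(top_k)}

def hate_word_feature(text, hate_dict):
    """Binary feature: 1 if the text contains at least one dictionary word."""
    return int(any(tok in hate_dict for tok in text.lower().split()))

# Toy usage
hate_dict = build_hate_dictionary(["odio tutti voi", "che bella giornata"], [1, 0])
print(hate_word_feature("vi odio", hate_dict))   # 1
```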
We used an SVM as a baseline to compare the results of the different variants of the model, and all variants achieved better results than this baseline. The results show that the original model outperforms the variant in which the Bi-LSTM is not used. It is important to note that this occurs for run2, where the linguistic feature is taken into account. In fact, when this feature is not used the results decrease and the original model obtains the worst results in most cases. Therefore, considering run2 of each variant, the results suggest that the best option is to use the Bi-LSTM together with the linguistic feature.

The HaSpeeDe task comprised three sub-tasks, based on the dataset used. First, only the Facebook dataset could be used to classify the Facebook test set (HaSpeeDe-FB); here our system achieved macro-average F1-scores of 0.7147 and 0.7144, reaching the 11th and 10th positions for run1 and run2 of the model respectively. Another sub-task was HaSpeeDe-TW, where only the Twitter dataset could be used to classify the Twitter test set; our system achieved scores of 0.6638 and 0.6567, reaching the 12th and 13th positions for run1 and run2 respectively. Finally, two further sub-tasks consisted in using one of the datasets to train and the other to classify (Cross-HaSpeeDe). Here our system achieved scores of 0.4544 and 0.5436, reaching the 10th and 7th places in Cross-HaSpeeDe-FB, and scores of 0.4451 and 0.318, for the 10th and 12th places in Cross-HaSpeeDe-TW.

We think that these results can be improved with a more careful tuning of the model parameters. In addition, it may be necessary to enrich the system with linguistic resources for the treatment of the Italian language.

4 Conclusion

We propose an Attention-based Long Short-Term Memory Recurrent Neural Network for the EVALITA 2018 task on Hate Speech Detection (HaSpeeDe). The model consists of a bidirectional LSTM neural network with an attention mechanism that estimates the importance of each word; the resulting context vector is then used by another LSTM model to estimate whether a text is hateful or not. The results showed that the use of a linguistic feature based on the occurrence of hateful words in the texts improves the performance of the model. In addition, experiments performed on the training sets with 5-fold cross-validation suggest that the Bi-LSTM layer is important when this linguistic feature is taken into account.

Acknowledgments

The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER).

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy. CEUR.org.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Kai Lin, Dazhen Lin, and Donglin Cao. 2017. Sentiment analysis model based on structure attention mechanism. In UK Workshop on Computational Intelligence, pages 17–27. Springer.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In LREC 2012.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Yequan Wang, Minlie Huang, Li Zhao, et al. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Min Yang, Wenting Tu, Jingxuan Wang, Fei Xu, and Xiaojun Chen. 2017. Attention-based LSTM for target-dependent sentiment classification. In AAAI, pages 5013–5014.

Yu Zhang, Pengyuan Zhang, and Yonghong Yan. 2017. Attention-based LSTM with multi-task learning for distant speech recognition. In Proc. Interspeech 2017, pages 3857–3861.