Hate Speech Detection using Attention-based LSTM

Gretel Liz De la Peña Sarracén1, Reynaldo Gil Pons2, Carlos Enrique Muñiz Cuza2, Paolo Rosso1
1 PRHLT Research Center, Universitat Politècnica de València, Spain
gredela@posgrado.upv.es, prosso@dsic.upv.es
2 CERPAMID, Cuba
{rey,carlos}@cerpamid.co.cu

Abstract

English. This paper describes the system we developed for EVALITA 2018, the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian, on Hate Speech Detection (HaSpeeDe). The task consists in automatically annotating Italian messages from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence or not of hate speech. We propose an Attention-based Long Short-Term Memory Recurrent Neural Network in which the attention layer helps to calculate the contribution of each part of the text towards targeted hateful messages.

Italiano. In this paper we describe the system we developed for the Hate Speech Detection (HaSpeeDe) task at EVALITA 2018, the sixth evaluation campaign for natural language processing. The task consists in automatically annotating Italian texts from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence or not of hate speech. Our approach uses an attention-based LSTM recurrent neural network, in which the attention layer helps to calculate the contribution of each portion of the text towards targeted hateful messages.

1 Introduction

In recent years, Hate Speech (HS) has become a major issue and a hot topic in the domain of social media. Some key aspects that characterize it (such as virality, or presumed anonymity) distinguish it from offline communication and make it potentially more dangerous and hurtful. Therefore, the identification of HS is an important step in addressing the urgent need for effective counter-measures to this issue.

The evaluation campaign EVALITA 2018 (http://www.evalita.it/2018) launched this year the HaSpeeDe (Hate Speech Detection) task (http://www.di.unito.it/tutreeb/haspeede-evalita18/index.html) (Bosco et al., 2018). It consists in automatically annotating messages from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence (or not) of HS.

Deep neural networks are widely studied due to their flexibility in capturing nonlinear relationships. Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) are among the most used in Natural Language Processing (NLP), as they are able to learn dependencies over considerably long sequences. Moreover, attention models have become an effective mechanism for obtaining better results (Yang et al., 2017; Zhang et al., 2017; Wang et al., 2016; Lin et al., 2017; Rush et al., 2015). In (Yang et al., 2016), the authors use a hierarchical attention network for document classification. The model has two levels of attention mechanisms, applied at the word and sentence level, enabling it to attend differentially to more and less important content when constructing the document representation. The experiments show that the architecture outperforms previous methods by a substantial margin. In this paper, we propose a similar Attention-based LSTM for HaSpeeDe. The attention layer is applied on top of a Bidirectional LSTM to generate a context vector for each word embedding, which is then fed to another LSTM network to detect the presence or not of hate in the text.

The paper is organized as follows. Section 2 describes our system. Experimental results are then discussed in Section 3. Finally, we present our conclusions with a summary of our findings in Section 4.
2 System

2.1 Preprocessing

In the preprocessing step, the text is cleaned. Firstly, emoticons are recognized and replaced by corresponding words that express the sentiment they convey. Also, all links and URLs are removed. Afterwards, the text is morphologically analyzed with FreeLing (Padró and Stanilovsky, 2012), and each resulting token is assigned its lemma. Then, the texts are represented as vectors with a word embedding model; we used pre-trained Italian word vectors from fastText (Bojanowski et al., 2016).

2.2 Method

We propose a model that consists of a Bidirectional LSTM neural network (Bi-LSTM) at the word level, as Figure 1 shows. At each time step t the Bi-LSTM receives as input a word vector x_t carrying syntactic and semantic information, known as a word embedding (Mikolov et al., 2013). Afterwards, an attention layer is applied over each hidden state ĥ_t. The attention weights are learned using the concatenation of the Bi-LSTM hidden states ĥ_t and the past hidden state s_{t-1} of the Post-Attention LSTM (Pos-Att-LSTM). Finally, the presence of hate (or not) in a text is predicted by this final Pos-Att-LSTM network.

[Figure 1: General architecture]

2.3 Bidirectional LSTM

In NLP problems, a standard LSTM receives sequentially (in left-to-right order) at each time step a word embedding x_t and produces a hidden state h_t. Each hidden state h_t is calculated as follows:

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})   (input gate)
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})   (forget gate)
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})   (output gate)
u_t = \sigma(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})   (new memory)
c_t = i_t \otimes u_t + f_t \otimes c_{t-1}             (final memory)
h_t = o_t \otimes \tanh(c_t)

where all W_*, U_* and b_* are parameters to be learned during training, \sigma is the sigmoid function and \otimes stands for element-wise multiplication.

The bidirectional LSTM performs the same operations as the standard LSTM but processes the incoming text in left-to-right and right-to-left order in parallel, so that at each time step it outputs two hidden states, a forward state \overrightarrow{h}_t and a backward state \overleftarrow{h}_t. The proposed method uses a Bidirectional LSTM network in which each new hidden state is the concatenation of these two, \hat{h}_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]. The idea of this Bi-LSTM is to capture long-range and backwards dependencies.
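To make these recurrences concrete, the following is a minimal NumPy sketch of one LSTM step and of the forward/backward concatenation; the function names, toy dimensions and initialization scale are illustrative assumptions rather than the implementation used for the submitted runs.

```python
# Minimal NumPy sketch of the LSTM recurrences above and of the Bi-LSTM
# concatenation; names, initialization scale and dimensions are illustrative
# assumptions, not the authors' code.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, rng):
    """Random parameters for the four gates (i, f, o, u)."""
    gates = "ifou"
    W = {g: 0.1 * rng.standard_normal((n_h, n_x)) for g in gates}
    U = {g: 0.1 * rng.standard_normal((n_h, n_h)) for g in gates}
    b = {g: np.zeros(n_h) for g in gates}
    return W, U, b

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of the recurrences: gates, final memory c_t, hidden state h_t."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    u_t = sigmoid(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # new memory (sigmoid, as in the text)
    c_t = i_t * u_t + f_t * c_prev                           # final memory
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_lstm(X, params, n_h):
    """Run an LSTM left-to-right over a sentence X of shape (T_x, n_x)."""
    W, U, b = params
    h, c = np.zeros(n_h), np.zeros(n_h)
    states = []
    for x_t in X:
        h, c = lstm_step(x_t, h, c, W, U, b)
        states.append(h)
    return np.stack(states)                                  # (T_x, n_h)

def bi_lstm(X, fwd_params, bwd_params, n_h):
    """h_hat_t = concatenation of forward and backward hidden states."""
    h_fwd = run_lstm(X, fwd_params, n_h)
    h_bwd = run_lstm(X[::-1], bwd_params, n_h)[::-1]         # right-to-left pass, re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)            # (T_x, 2 * n_h)

# Toy usage: 7 tokens with 300-dimensional embeddings (e.g. fastText), n_h = 64.
rng = np.random.default_rng(0)
X = rng.standard_normal((7, 300))
H_hat = bi_lstm(X, init_params(300, 64, rng), init_params(300, 64, rng), 64)
print(H_hat.shape)   # (7, 128)
```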
2.4 Attention Layer

With an attention mechanism we allow the Bi-LSTM to decide which parts of the sentence it should "attend" to. Importantly, we let the model learn what to attend to on the basis of the input sentence and of what it has produced so far. Figure 2 shows the general attention mechanism.

[Figure 2: Attention layer]

Let H \in R^{2 N_h \times T_x} be the matrix of hidden states [\hat{h}_1, \hat{h}_2, ..., \hat{h}_{T_x}] produced by the Bi-LSTM, where N_h is the size of the hidden state and T_x is the length of the given sentence. The goal is then to derive a context vector c_t that captures relevant information and to feed it as input to the next level (Pos-Att-LSTM). Each c_t is calculated as follows:

c_t = \sum_{t'=1}^{T_x} \alpha_{t,t'} \hat{h}_{t'}

\alpha_{t,t'} = \frac{\beta_{t,t'}}{\sum_{i=1}^{T_x} \beta_{t,i}}

\beta_{t,t'} = \tanh(W_a [\hat{h}_{t'}, s_{t-1}] + b_a)

where W_a and b_a are the trainable attention weights, s_{t-1} is the past hidden state of the Pos-Att-LSTM and \hat{h}_{t'} is the hidden state of the Bi-LSTM at position t'. The idea of the concatenation is to take into account not only the input sentence but also the past hidden state when producing the attention weights.

2.5 Post-Attention LSTM

The goal of the Post-Att-LSTM is to predict whether the text is hateful or not. At each time step this network receives the context vector c_t, which is propagated until the final hidden state s_{T_x}. This vector is a high-level representation of the text and is used in the final softmax layer as follows:

\hat{y} = softmax(W_g s_{T_x} + b_g)

where W_g and b_g are the parameters of the softmax layer. Finally, cross entropy is used as the loss function, which is defined as:

L = - \sum_i y_i \log(\hat{y}_i)

where y_i is the true classification of the i-th text.
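Following the same conventions, the sketch below illustrates how the attention weights, the context vectors and the final prediction of Sections 2.4 and 2.5 can be computed. It reuses lstm_step, init_params and H_hat from the previous sketch; the decoder size and parameter shapes are again illustrative assumptions rather than the settings of the submitted system.

```python
# Rough sketch of the attention layer and Post-Attention LSTM; it reuses
# lstm_step, init_params and H_hat from the previous sketch. Shapes and the
# decoder size n_s are illustrative assumptions.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_vector(H_hat, s_prev, W_a, b_a):
    """c_t = sum over t' of alpha_{t,t'} * h_hat_{t'}, scores from [h_hat_{t'}, s_{t-1}]."""
    beta = np.array([np.tanh(W_a @ np.concatenate([h, s_prev]) + b_a)
                     for h in H_hat])        # one score per input position
    alpha = beta / beta.sum()                # normalization as in the text (a softmax is a common alternative)
    return alpha @ H_hat                     # weighted sum of Bi-LSTM states

def pos_att_lstm(H_hat, dec_params, W_a, b_a, W_g, b_g, n_s):
    """Run the Pos-Att-LSTM over context vectors and classify its final state."""
    W, U, b = dec_params
    s, c = np.zeros(n_s), np.zeros(n_s)
    for _ in range(len(H_hat)):              # one decoder step per input position
        ctx = context_vector(H_hat, s, W_a, b_a)
        s, c = lstm_step(ctx, s, c, W, U, b)
    return softmax(W_g @ s + b_g)            # [P(hate), P(no hate)]

# Toy usage on the H_hat produced above (shape (7, 128)).
rng = np.random.default_rng(1)
n_in, n_s = H_hat.shape[1], 64
y_hat = pos_att_lstm(H_hat,
                     init_params(n_in, n_s, rng),
                     0.1 * rng.standard_normal(n_in + n_s), 0.0,
                     0.1 * rng.standard_normal((2, n_s)), np.zeros(2),
                     n_s)
print(y_hat)   # two-class probabilities summing to 1
```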
3 Results

Table 1 shows the results obtained by the different variants of the proposed method with 5-fold cross-validation on the training set, in terms of F1-score, precision and recall. The models are: M1 - LSTM+Att+LSTM (run1), M2 - LSTM+Att+LSTM (run2), M3 - Bi-LSTM+Att+LSTM (run1) and M4 - Bi-LSTM+Att+LSTM (run2). run2, in M2 and M4, identifies the models that take the dictionaries into account.

         Twitter              Facebook
         F1     P      R      F1     P      R
SVM      0.748  0.772  0.737  0.780  0.787  0.781
M1       0.869  0.881  0.863  0.865  0.872  0.863
M2       0.865  0.867  0.865  0.894  0.895  0.894
M3       0.853  0.860  0.854  0.864  0.873  0.864
M4       0.877  0.891  0.871  0.899  0.903  0.899

Table 1: 5-fold cross-validation results on the training corpus (Twitter and Facebook) in terms of F1-score (F1), Precision (P) and Recall (R). The best results are in bold.

As run1, in M1 and M3, we first evaluated the model described above, which is composed of the Bi-LSTM, the attention layer and the LSTM (Bi-LSTM+Att+LSTM). A variation of this model was also derived in order to analyze the contribution of the Bi-LSTM layer: we substituted the Bi-LSTM with a plain LSTM (LSTM+Att+LSTM).

Then, we processed the training sets to generate resources that we call the hate words dictionaries. For each training set we generated a dictionary of the most common words in the texts labeled as hateful. Using these dictionaries, we added a linguistic feature to each text indicating whether it contains a word from the corresponding dictionary. Thus, run2 of each model is obtained by considering this linguistic feature.
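As a rough illustration of this feature (the whitespace tokenization, the 500-word cut-off and the data format are assumptions, not the exact procedure used for the runs), such a dictionary and its binary indicator could be built as follows:

```python
# Hypothetical sketch of the "hate words dictionary" feature; the whitespace
# tokenization and the top_k cut-off are assumptions.
from collections import Counter

def build_hate_dictionary(texts, labels, top_k=500):
    """Most frequent words in the training texts labeled as hateful (label 1)."""
    counts = Counter()
    for text, label in zip(texts, labels):
        if label == 1:
            counts.update(text.lower().split())
    return {word for word, _ in counts.most_common(top_k)}

def hate_word_feature(text, hate_dict):
    """Binary feature: 1 if the text contains at least one dictionary word."""
    return int(any(tok in hate_dict for tok in text.lower().split()))

# Toy usage
hate_dict = build_hate_dictionary(["odio tutti voi", "che bella giornata"], [1, 0])
print(hate_word_feature("vi odio", hate_dict))   # 1
```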
We used an SVM as a baseline to compare the results of the different variants of the model, and all variants achieved better results than this baseline. The results show that the original model outperforms the variant in which the Bi-LSTM is not used. It is important to note that this occurs for run2, where the linguistic feature is taken into account. In fact, when this feature is not used the results decrease and the original model obtains the worst results in most cases. Therefore, considering run2 of each variant, the results suggest that the best option is to use the Bi-LSTM together with the linguistic feature.

The HaSpeeDe task comprised three sub-tasks, based on the dataset used. First, only the Facebook dataset could be used to classify the Facebook test set (HaSpeeDe-FB); here our system achieved macro-average F1-scores of 0.7147 and 0.7144, reaching the 11th and 10th positions for run1 and run2 of the model respectively. Another sub-task was HaSpeeDe-TW, where only the Twitter dataset could be used to classify the Twitter test set; our system achieved scores of 0.6638 and 0.6567, reaching the 12th and 13th positions for run1 and run2 respectively. Finally, two further sub-tasks consisted in using one of the datasets to train and the other to classify (Cross-HaSpeeDe). Here our system achieved scores of 0.4544 and 0.5436, reaching the 10th and 7th places in Cross-HaSpeeDe-FB, and scores of 0.4451 and 0.318, for the 10th and 12th places in Cross-HaSpeeDe-TW.

We think that these results can be improved with a more careful tuning of the model parameters. In addition, it may be necessary to enrich the system with linguistic resources for the treatment of the Italian language.

4 Conclusion

We propose an Attention-based Long Short-Term Memory Recurrent Neural Network for the EVALITA 2018 task on Hate Speech Detection (HaSpeeDe). The model consists of a bidirectional LSTM neural network with an attention mechanism that estimates the importance of each word; the resulting context vector is then used by another LSTM model to estimate whether a text is hateful or not. The results showed that the use of a linguistic feature based on the occurrence of hateful words in the texts improves the performance of the model. In addition, experiments performed on the training sets with 5-fold cross-validation suggest that the Bi-LSTM layer is important when this linguistic feature is taken into account.

Acknowledgments

The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER).

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy. CEUR.org.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Kai Lin, Dazhen Lin, and Donglin Cao. 2017. Sentiment analysis model based on structure attention mechanism. In UK Workshop on Computational Intelligence, pages 17–27. Springer.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In LREC 2012.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.

Yequan Wang, Minlie Huang, Li Zhao, et al. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Min Yang, Wenting Tu, Jingxuan Wang, Fei Xu, and Xiaojun Chen. 2017. Attention-based LSTM for target-dependent sentiment classification. In AAAI, pages 5013–5014.

Yu Zhang, Pengyuan Zhang, and Yonghong Yan. 2017. Attention-based LSTM with multi-task learning for distant speech recognition. In Proc. Interspeech 2017, pages 3857–3861.