<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hate Speech Detection using Attention-based LSTM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gretel Liz De la Peña Sarracén</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reynaldo Gil Pons</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Enrique Muñiz Cuza</string-name>
          <email>carlosg@cerpamid.co.cu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CERPAMID</institution>
          ,
          <country country="CU">Cuba</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the system we developed for EVALITA 2018, the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian, on Hate Speech Detection (HaSpeeDe). The task consists in automatically annotating Italian messages from two popular microblogging platforms, Twitter and Facebook, with a boolean value indicating the presence (or not) of hate speech. We propose an Attention-based Long Short-Term Memory Recurrent Neural Network, where the attention layer helps to calculate the contribution of each part of the text towards targeted hateful messages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italiano. In questo articolo descriviamo il
sistema che abbiamo sviluppato per il task
di Hate Speech Detection (HaSpeeDe),
presso EVALITA 2018, la sesta campagna
di valutazione dellelaborazione del
linguaggio naturale. Il task consiste
nellannotare automaticamente testi italiani
da due popolari piattaforme di
microblogging, Twitter e Facebook, con un
valore booleano indicando la presenza o
meno di incitamento allodio. Il nostro
approccio usa una rete neurale ricorrente
LSTM attention-based, in cui il layer di
attenzione aiuta a calcolare il contributo
di ciascuna porzione del testo verso
messaggi di odio mirati.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>In recent years, Hate Speech (HS) has become a major issue and a hot topic in the domain of social media. Some key aspects that characterize it (such as virality and presumed anonymity) distinguish it from offline communication and make it potentially more dangerous and hurtful. Therefore, the identification of HS is an important step in addressing the urgent need for effective countermeasures to this issue.</p>
      <p>
        The evaluation campaign EVALITA 2018 (<ext-link ext-link-type="uri" xlink:href="http://www.evalita.it/2018">http://www.evalita.it/2018</ext-link>) launched this year the HaSpeeDe (Hate Speech Detection) task (<ext-link ext-link-type="uri" xlink:href="http://www.di.unito.it/tutreeb/haspeedeevalita18/index.html">http://www.di.unito.it/tutreeb/haspeedeevalita18/index.html</ext-link>)
        <xref ref-type="bibr" rid="ref2">(Bosco et al., 2018)</xref>
        . It consists in
automatically annotating messages from two
popular micro-blogging platforms, Twitter and
Facebook, with a boolean value indicating the presence
(or not) of HS.
      </p>
      <p>
        Deep neural networks are widely studied due to their flexibility in capturing nonlinear relationships. Long Short-Term Memory units (LSTM)
        <xref ref-type="bibr" rid="ref3">(Hochreiter and Schmidhuber, 1997)</xref>
        are among the most used in Natural Language Processing (NLP), since they are able to learn dependencies over considerably long sequences. Moreover, attention models have become an effective mechanism for obtaining better results
        <xref ref-type="bibr" rid="ref10 ref11 ref4 ref7 ref8">(Yang et al., 2017; Zhang et
al., 2017; Wang et al., 2016; Lin et al., 2017;
Rush et al., 2015)</xref>
        . In
        <xref ref-type="bibr" rid="ref8 ref9">(Yang et al., 2016)</xref>
        , the
authors use a hierarchical attention network for
document classification. The model has two levels
of attention mechanisms, applied at the word and
sentence level, enabling it to attend differentially
to more and less important content when
constructing the document representation. The
experiments show that the architecture outperforms
previous methods by a substantial margin. In this
paper, we propose a similar Attention-based LSTM
for HaSpeeDe. The attention layer is applied on top of a Bidirectional LSTM to generate a context vector for each word embedding, which is then fed to another LSTM network to detect the presence (or not) of hate in the text. The paper is
organized as follows. Section 2 describes our system.
Experimental results are then discussed in Section
3. Finally, we present our conclusions with a
summary of our findings in Section 4.
      </p>
      <sec id="sec-2-0">
        <title>2.1 Preprocessing</title>
        <p>In the preprocessing step, the text is cleaned.
Firstly, the emoticons are recognized and replaced
by corresponding words that express the sentiment
they convey. Also, all links and URLs are removed.
Afterwards, text is morphologically analyzed by
FreeLing
        <xref ref-type="bibr" rid="ref6">(Padro´ and Stanilovsky, 2012)</xref>
        . In this way, each resulting token is assigned its lemma. Then, the texts are represented as vectors with a word embedding model. We used pretrained word vectors in Italian from fastText
        <xref ref-type="bibr" rid="ref1 ref8">(Bojanowski et al., 2016)</xref>
        .
      </p>
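        <p>For illustration, a minimal Python sketch of this preprocessing pipeline is given below. The emoticon map, the fastText file name and the lemmatizer stub are assumptions; in the actual system the morphological analysis is performed by FreeLing.</p>
        <preformat>
# A minimal sketch of the preprocessing step, assuming a small
# emoticon-to-word map and gensim for loading the fastText vectors.
import re
from gensim.models import KeyedVectors

# Illustrative emoticon map (assumption; not the resource used here).
EMOTICONS = {":)": "felice", ":(": "triste", ":D": "felice"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean(text):
    """Replace emoticons with sentiment words, drop links/URLs, tokenize."""
    for emo, word in EMOTICONS.items():
        text = text.replace(emo, " " + word + " ")
    return URL_RE.sub(" ", text).lower().split()

def lemmatize(tokens):
    # Placeholder: the real system assigns lemmas with FreeLing;
    # here tokens are returned unchanged.
    return tokens

# Pretrained Italian fastText vectors (file name is an assumption).
# vectors = KeyedVectors.load_word2vec_format("cc.it.300.vec")
# embedded = [vectors[w] for w in lemmatize(clean(msg)) if w in vectors]
        </preformat>
      </sec>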
      <sec id="sec-2-1">
        <title>2.2 Method</title>
        <p>
          We propose a model that consists of a Bidirectional LSTM neural network (Bi-LSTM) at the word level, as Figure 1 shows. At each time step $t$ the Bi-LSTM receives as input a word vector $x_t$ with syntactic and semantic information, known as a word embedding
          <xref ref-type="bibr" rid="ref5">(Mikolov et al., 2013)</xref>
          . Afterwards, an attention layer is applied over each hidden state $\hat{h}_t$. The attention weights are learned using the concatenation of the current hidden state $h_t$ of the Bi-LSTM and the past hidden state $s_{t-1}$ of the Post-Attention LSTM (Pos-Att-LSTM). Finally, the presence (or not) of hate in a text is predicted by this final Pos-Att-LSTM network.
        </p>
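        <p>As a sketch of the full architecture, the following PyTorch module wires together the Bi-LSTM, the attention layer and the Pos-Att-LSTM. The paper does not name a framework; PyTorch, the dimensions and the softmax normalization of the attention scores are our assumptions.</p>
        <preformat>
# Sketch of the Bi-LSTM + attention + Pos-Att-LSTM architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttBiLSTM(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                              batch_first=True)
        # Attention scores from the concatenation [h_hat_t' ; s_{t-1}].
        self.att = nn.Linear(2 * hidden + hidden, 1)
        self.post = nn.LSTMCell(2 * hidden, hidden)
        self.out = nn.Linear(hidden, classes)

    def forward(self, x):                       # x: (batch, T, emb_dim)
        H, _ = self.bilstm(x)                   # (batch, T, 2*hidden)
        B, T, _ = H.shape
        s = H.new_zeros(B, self.post.hidden_size)   # s_0
        c = H.new_zeros(B, self.post.hidden_size)
        for _ in range(T):
            # alpha = tanh(W_a [h_hat ; s_prev] + b_a), normalized over T.
            feats = torch.cat([H, s.unsqueeze(1).expand(B, T, -1)], dim=-1)
            alpha = torch.tanh(self.att(feats)).softmax(dim=1)  # (B, T, 1)
            ctx = (alpha * H).sum(dim=1)        # context vector c_t
            s, c = self.post(ctx, (s, c))       # Pos-Att-LSTM step
        return F.log_softmax(self.out(s), dim=-1)
        </preformat>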
      </sec>
      <sec id="sec-2-2">
        <title>2.3 Bidirectional LSTM</title>
        <p>In NLP problems, a standard LSTM receives sequentially (in left-to-right order) at each time step a word embedding $x_t$ and produces a hidden state $h_t$. Each hidden state $h_t$ is calculated as follows:
$$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}) \quad \text{(input gate)}$$
$$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}) \quad \text{(forget gate)}$$
$$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}) \quad \text{(output gate)}$$
$$u_t = \sigma(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}) \quad \text{(new memory)}$$
$$c_t = i_t \odot u_t + f_t \odot c_{t-1} \quad \text{(final memory)}$$
$$h_t = o_t \odot \tanh(c_t)$$</p>
        <p>where all $W$, $U$ and $b$ are parameters to be learned during training, $\sigma$ is the sigmoid function, and $\odot$ stands for element-wise multiplication.</p>
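        <p>As a worked example, one LSTM step can be written directly from these equations. This is a NumPy sketch; the sizes and initialization are illustrative, and note that many formulations use tanh instead of the sigmoid for the new-memory candidate.</p>
        <preformat>
# One LSTM step following the equations above.
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 3                     # embedding and hidden sizes (assumed)
W = {g: 0.1 * rng.normal(size=(d_h, d_x)) for g in "ifou"}
U = {g: 0.1 * rng.normal(size=(d_h, d_h)) for g in "ifou"}
b = {g: np.zeros(d_h) for g in "ifou"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    u = sigmoid(W["u"] @ x_t + U["u"] @ h_prev + b["u"])  # new memory
    c = i * u + f * c_prev                                # final memory
    h = o * np.tanh(c)
    return h, c

h_t, c_t = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h))
        </preformat>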
        <p>The bidirectional LSTM, on the other hand, performs the same operations as the standard LSTM but processes the incoming text in left-to-right and right-to-left order in parallel. Thus, the output is a pair of hidden states at each time step, $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$.</p>
        <p>The proposed method uses a Bidirectional LSTM network which takes as each new hidden state the concatenation of these two: $\hat{h}_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$. The idea of this Bi-LSTM is to capture long-range and backward dependencies.</p>
      </sec>
      <sec id="sec-2-2b">
        <title>2.4 Attention Layer</title>
        <p>With an attention mechanism we allow the Bi-LSTM to decide which parts of the sentence it should “attend” to. Importantly, we let the model learn what to attend to on the basis of the input sentence and what it has produced so far. Figure 2 shows the general attention mechanism.</p>
        <p>Let $H \in \mathbb{R}^{2 N_h \times T_x}$ be the matrix of hidden states $[\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_{T_x}]$ produced by the Bi-LSTM, where $N_h$ is the size of the hidden state and $T_x$ is the length of the given sentence. The goal is then to derive a context vector $c_t$ that captures relevant information and to feed it as an input to the next level (Pos-Att-LSTM). Each $c_t$ is calculated as follows:
$$c_t = \sum_{t'=1}^{T_x} \alpha_{t,t'} \, \hat{h}_{t'}$$
$$\alpha_{t,t'} = \tanh(W_a [\hat{h}_{t'}; s_{t-1}] + b_a)$$</p>
        <p>where $W_a$ and $b_a$ are the trainable attention weights, $s_{t-1}$ is the past hidden state of the Pos-Att-LSTM and $\hat{h}_{t'}$ is the current hidden state. The idea of the concatenation layer is to take into account not only the input sentence but also the past hidden state when producing the attention weights.</p>
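        <p>A NumPy sketch of this attention step follows; the sizes, the softmax normalization of the weights and the zero initial state are assumptions for illustration:</p>
        <preformat>
# Attention: scores from [h_hat_t' ; s_{t-1}], then a context vector
# as the weighted sum of the Bi-LSTM hidden states.
import numpy as np

rng = np.random.default_rng(1)
T_x, N_h = 5, 3                         # sentence length, hidden size
H = rng.normal(size=(T_x, 2 * N_h))     # h_hat_1 .. h_hat_Tx
s_prev = np.zeros(N_h)                  # past Pos-Att-LSTM state s_{t-1}
W_a = 0.1 * rng.normal(size=(1, 3 * N_h))
b_a = np.zeros(1)

concat = np.hstack([H, np.tile(s_prev, (T_x, 1))])   # (T_x, 3*N_h)
alpha = np.tanh(concat @ W_a.T + b_a).ravel()        # raw attention scores
alpha = np.exp(alpha) / np.exp(alpha).sum()          # normalize (assumption)

c_t = alpha @ H                         # context vector, shape (2*N_h,)
        </preformat>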
      </sec>
      <sec id="sec-2-3">
        <title>2.5 Post-Attention LSTM</title>
        <p>The goal of the Post-Att-LSTM is to predict whether the text is hateful or not. At each time step this network receives the context vector $c_t$, which is propagated until the final hidden state $s_{T_x}$. This vector is a high-level representation of the text and is used in the final softmax layer as follows:
$$\hat{y} = \mathrm{softmax}(W_g s_{T_x} + b_g)$$</p>
        <p>where $W_g$ and $b_g$ are the parameters of the softmax layer. Finally, cross entropy is used as the loss function, which is defined as:
$$L = -\sum_i y_i \log(\hat{y}_i)$$
where $y_i$ is the true classification of the i-th text.</p>
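        <p>The final prediction and loss can be written directly from these formulas (a NumPy sketch; the two-class setup and random values are illustrative):</p>
        <preformat>
# y_hat = softmax(W_g s_Tx + b_g) and the cross-entropy loss.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
N_h, classes = 3, 2
s_Tx = rng.normal(size=N_h)             # final Pos-Att-LSTM hidden state
W_g = 0.1 * rng.normal(size=(classes, N_h))
b_g = np.zeros(classes)

y_hat = softmax(W_g @ s_Tx + b_g)       # predicted distribution
y = np.array([0.0, 1.0])                # one-hot gold label (hateful)
loss = -np.sum(y * np.log(y_hat))       # cross entropy L
        </preformat>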
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Results</title>
      <p>As run1 in M1 and M3, we first evaluated the model described above, which is composed of the Bi-LSTM, the attention layer and the LSTM (Bi-LSTM+Att+LSTM). In addition, a variation of this model was created to analyze the contribution of the Bi-LSTM layer: we substituted the Bi-LSTM with a standard LSTM (LSTM+Att+LSTM).</p>
      <p>Then, we processed the training sets to generate resources that we call hate-word dictionaries. For each training set we generated a dictionary of the most common words in the texts labeled as hateful. Taking these dictionaries into account, we added a linguistic feature to each text which indicates whether it contains a word from the corresponding dictionary. Thus, run2 of the model is obtained by considering this linguistic feature.</p>
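      <p>A small sketch of this feature follows; the cut-off of 100 most common words is an assumption, as the paper does not state the dictionary size:</p>
      <preformat>
# Build a hate-word dictionary from the hateful training texts and
# flag texts that contain at least one dictionary word.
from collections import Counter

def build_hate_dictionary(texts, labels, top_k=100):
    counts = Counter(w for t, y in zip(texts, labels) if y == 1
                     for w in t.split())
    return {w for w, _ in counts.most_common(top_k)}

def hate_feature(text, hate_dict):
    # 1 if the text contains a dictionary word, else 0.
    return int(any(w in hate_dict for w in text.split()))
      </preformat>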
      <p>We used an SVM as a baseline against which to compare the different variants of the model; all variants achieved better results than this baseline.</p>
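      <p>A plausible baseline of this kind is a linear SVM over bag-of-words features (scikit-learn sketch; the paper does not specify the baseline's features or kernel):</p>
      <preformat>
# Linear SVM baseline over tf-idf bag-of-words features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
# baseline.fit(train_texts, train_labels)
# predictions = baseline.predict(test_texts)
      </preformat>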
      <p>The results show that the original model outperforms the variant in which the Bi-LSTM is not used. It is important to note that this occurs for run2, where the linguistic feature is taken into account. In fact, when this feature is not used the results decrease, and the original model obtains the worst results in most cases. Therefore, considering run2 of each variant, the results suggest that the best option is to use the Bi-LSTM together with the linguistic feature.</p>
      <p>The HaSpeeDe task comprised three sub-tasks, based on the dataset used. In the first (HaSpeeDe-FB), only the Facebook dataset could be used to classify the Facebook test set; our system achieved macro-average F1-scores of 0.7147 and 0.7144, reaching the 11th and 10th positions for run1 and run2 of the model respectively. In the second (HaSpeeDe-TW), only the Twitter dataset could be used to classify the Twitter test set; our system achieved scores of 0.6638 and 0.6567, reaching the 12th and 13th positions for run1 and run2 respectively. Finally, the third sub-task (Cross-HaSpeeDe) consisted of using one of the datasets for training and the other for classification. Here our system achieved scores of 0.4544 and 0.5436, reaching 10th and 7th place in Cross-HaSpeeDe-FB, and scores of 0.4451 and 0.318, reaching 10th and 12th place in Cross-HaSpeeDe-TW.</p>
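      <p>The official metric, macro-averaged F1, can be computed with scikit-learn; gold and pred below are placeholder label lists:</p>
      <preformat>
# Macro-averaged F1 over the two classes (hateful / not hateful).
from sklearn.metrics import f1_score

gold = [0, 1, 1, 0]
pred = [0, 1, 0, 0]
print(f1_score(gold, pred, average="macro"))
      </preformat>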
      <p>We believe that these results can be improved with more careful tuning of the model parameters. In addition, it may be necessary to enrich the system with linguistic resources for the treatment of the Italian language.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>We proposed an Attention-based Long Short-Term Memory Recurrent Neural Network for the EVALITA 2018 task on Hate Speech Detection (HaSpeeDe). The model consists of a bidirectional LSTM neural network with an attention mechanism that estimates the importance of each word; the resulting context vector is then used with another LSTM model to estimate whether a text is hateful or not. The results showed that using a linguistic feature based on the occurrence of hateful words in the texts improves the performance of the model. In addition, experiments performed on the training sets with 5-fold cross-validation suggest that the Bi-LSTM layer is important when this linguistic feature is taken into account.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and
          <string-name>
            <given-names>Maurizio</given-names>
            <surname>Tesconi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the Evalita 2018 Hate Speech Detection Task</article-title>
          . In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech tools for Italian (EVALITA</source>
          <year>2018</year>
          ), Turin, Italy. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and Jürgen Schmidhuber.
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kai</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dazhen</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Donglin</given-names>
            <surname>Cao</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Sentiment analysis model based on structure attention mechanism</article-title>
          .
          <source>In UK Workshop on Computational Intelligence</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>27</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Lluís</given-names>
            <surname>Padró</surname>
          </string-name>
          and
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Stanilovsky</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          .
          <source>In LREC 2012</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Alexander M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sumit</given-names>
            <surname>Chopra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A neural attention model for abstractive sentence summarization</article-title>
          .
          <source>arXiv preprint arXiv:1509.00685</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Yequan</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Minlie</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Zhao</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Attention-based LSTM for aspect-level sentiment classification</article-title>
          .
          <source>In Proceedings of the 2016 conference on empirical methods in natural language processing</source>
          , pages
          <fpage>606</fpage>
          -
          <lpage>615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Zichao</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Diyi</given-names>
            <surname>Yang</surname>
          </string-name>
          , Chris Dyer, Xiaodong He,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Smola</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Hierarchical attention networks for document classification</article-title>
          .
          <source>In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>1480</fpage>
          -
          <lpage>1489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Min</given-names>
            <surname>Yang</surname>
          </string-name>
          , Wenting Tu, Jingxuan Wang,
          <string-name>
            <given-names>Fei</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xiaojun</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention-based LSTM for target-dependent sentiment classification</article-title>
          .
          <source>In AAAI</source>
          , pages
          <fpage>5013</fpage>
          -
          <lpage>5014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Yu</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Pengyuan Zhang, and
          <string-name>
            <given-names>Yonghong</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention-based LSTM with multi-task learning for distant speech recognition</article-title>
          .
          <source>Proc. Interspeech</source>
          <year>2017</year>
          , pages
          <fpage>3857</fpage>
          -
          <lpage>3861</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>