UO @ HaSpeeDe2: Ensemble Model for Italian Hate Speech Detection

Mariano Jason Rodriguez Cisnero, Universidad de Oriente, Santiago de Cuba, Cuba, mjasoncuba@gmail.com
Reynier Ortega Bueno, Universidad de Oriente, Santiago de Cuba, Cuba, reynier@uo.edu.cu

Abstract

English. This document describes our participation in the Hate Speech Detection task at Evalita 2020. Our system is based on deep learning techniques, specifically RNNs and an attention mechanism, combined with transformer representations and linguistic features. In the training process, multi-task learning was used to increase the system's effectiveness. The results show that some of the selected features did not combine well within the model. Nevertheless, the level of generalization achieved yields encouraging results.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Modern societies have found easy and appealing ways of sharing information via social media, and users enjoy the freedom to express themselves through online communication. Even though the ability to express oneself freely is a human right, some users take this opportunity to spread hateful content, and this kind of information carries a dangerous and hurtful potential. Automatically recognizing such content is therefore an interesting topic for researchers.

Creative methods have been proposed to tackle the task of recognizing hate in texts (De la Pena Sarracén et al., 2018; Gambäck and Sikdar, 2017). Some of these works approach the problem with feature extraction (Schmidt and Wiegand, 2017) and classification algorithms such as SVM (Santucci et al., 2018). In recent years, deep learning has become one of the most successful research directions in Natural Language Processing (NLP), with exciting investigations on this topic, such as (Cimino et al., 2018), involving LSTMs (Liu and Guo, 2019) and transformers (Vaswani et al., 2017), which have gained attention in the NLP community due to their results.

We propose a model based on multiple representations learned by means of deep learning techniques and linguistic knowledge: a Long Short Term Memory architecture combined with linguistic features and language model representations given by a particular transformer model, BERT.

The paper is organized as follows. Section 2 gives a brief description of the HaSpeeDe task. Our hate detection system is presented in Section 3. The experiments and results are discussed in Section 4. Finally, Section 5 presents the conclusions and future directions. The code of this work is available on GitHub: https://github.com/mjason98/evalita20_hate

2 HaSpeeDe2 Task

Hate speech and stereotype recognition on social media has become an attractive research area from the computational point of view. In the second edition of HaSpeeDe (Sanguinetti et al., 2020) at Evalita 2020 (Basile et al., 2020), the organizers proposed three subtasks. The main one, subtask A, aims at determining the presence or absence of hateful content in a text; its dataset is composed of 6839 short texts, 2766 labeled as hate speech and 4076 as not hate speech. In this work we focused only on subtask A. Subtask B is a binary classification problem oriented to stereotype detection, and subtask C is a sequence labeling task that aims at recognizing Nominal Utterances in hateful tweets.
3 Our Proposal

We dealt with the hate detection task as a text classification problem with "hateful" and "not hateful" categories. We train a deep learning model based on an attention mechanism and Recurrent Neural Networks, specifically a Bidirectional Long Short Term Memory (Bi-LSTM) (Hochreiter and Schmidhuber, 1997), combined with linguistic features and transformer representations by means of an interpretable multi-source fusion component (Karimi et al., 2018).

Sections 3.1 and 3.2 describe the linguistic features and the transformer representation used in this work. Section 3.3 presents the preprocessing phase. Finally, the neural network model and the feature ensemble are described in Section 3.4.

3.1 Linguistic Features

To build the hate detection model, we start by extracting several sets of linguistic features:

WordNet features: We count the number of verbs, adverbs, nouns and adjectives. Also, for every word, we calculate the average of its similarity with respect to the other words using the path similarity function provided by the WordNet corpus (accessed through the Python nltk library). Furthermore, we consider the degree of lexical ambiguity by counting the number of synsets of each word within the text.

Hurt and sentiment content: HurtLex (Bassignana et al., 2018) is a lexicon of offensive, aggressive, and hateful words in over 50 languages. The words falling into the 17 categories offered by the lexicon are counted and added as linguistic features, jointly with polarity and semantic values obtained from the SenticNet (Cambria et al., 2018) corpus.

Information gain: Information gain (Lewis, 1992) has been a good feature selection measure for text categorization. It takes into account the presence of a term in a category as well as its absence, and can be defined by:

IG(t_k, C_i) = \sum_{C} \sum_{t} p(t, C) \cdot \log_2 \frac{p(t, C)}{p(t) \cdot p(C)}

where C \in \{C_i, \bar{C}_i\} and t \in \{t_k, \bar{t}_k\}. In this formula, probabilities are interpreted over an event space of documents; for example, p(\bar{t}_k, C_i) is the probability that, for a random document d belonging to category C_i, the term t_k does not occur in d. In our case there are two categories, hateful and not hateful, and the term is the word's lemma.

To create the information gain feature (IgF), we calculate the IG of every word and keep the words with the highest values (the top 50). Then, the occurrences of those selected words in the text are counted.

3.2 Italian BERT

Finally, we use a pre-trained BERT (https://huggingface.co/dbmdz/bert-base-italian-cased) to compute a deep representation of the text. One of the most widely used auto-encoding pre-trained Language Models (PLMs) is BERT (Devlin et al., 2018). BERT is trained with the masked language modeling task, which randomly masks some tokens in a text sequence and then independently recovers the masked tokens by conditioning on the encoding vectors produced by a bidirectional Transformer.

Inside BERT, the information is passed forward across transformer layers. In this work, we used the output of one specific layer; this operation can be expressed by:

h_0 = H_{l_0}(text_{tok})
h_i = H_{l_i}(h_{i-1})
h_n = H_{l_n}(h_{n-1})

where text_{tok} is the text after tokenization (the text is represented as a vector of integers using the tokenizer of the BERT model), h_i is the output of the i-th transformer layer H_{l_i}, called its hidden state, and n is the total number of transformer layers in BERT. Then, for a specific i, the vector f_{bert} is computed from the order-2 tensor h_i as a deep representation of the initial text, which acts as the PLM feature:

v = \sum_{k=0} h_i[k, :]        f_{bert} = \frac{v}{||v||}
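As an illustration, the following is a minimal sketch of how an f_{bert} representation of this kind could be computed with the Hugging Face transformers library. The model handle follows the checkpoint cited above; the function name and the choice of layer index are assumptions made for the example, since the paper does not release this snippet.

# Minimal sketch (not the authors' exact code): compute f_bert as the
# L2-normalized sum of the token vectors of one BERT hidden layer.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
bert = AutoModel.from_pretrained("dbmdz/bert-base-italian-cased",
                                 output_hidden_states=True)

def bert_feature(text: str, layer: int = -1) -> torch.Tensor:
    # Tokenize the text into a vector of integer ids (text_tok in the paper).
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**enc)
    # hidden_states[layer] has shape (1, seq_len, hidden_size): the order-2 tensor h_i.
    h_i = out.hidden_states[layer][0]      # (seq_len, hidden_size)
    v = h_i.sum(dim=0)                     # sum over token positions k
    return v / v.norm()                    # f_bert = v / ||v||

Calling bert_feature("testo di esempio") returns a fixed-size vector that can be concatenated or fused with the other feature sources described below.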
3.3 Preprocessing

In the preprocessing step, stopwords are first removed. Then, hashtags composed of several words are split (e.g., #NessunDorma becomes # nessun dorma); we use a regular-expression algorithm to achieve this step (the automaton was built with Python's re library and the words of an Italian corpus). Secondly, using the FreeLing tool (http://nlp.lsi.upc.edu/freeling/index.php) we obtain the lemma of each word, and non-alphanumeric characters are removed. Finally, the remaining words are represented as vectors using a pre-trained word embedding generated with the Word2Vec model (Mikolov et al., 2013).

3.4 The Deep Ensemble Model

The standard LSTM receives, sequentially at each time step, a vector x_t and produces a hidden state h_t. Each hidden state h_t is calculated as follows:

i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})
f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})
o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})
u_t = \sigma(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})
c_t = i_t \otimes u_t + f_t \otimes c_{t-1}
h_t = o_t \otimes \tanh(c_t)        (1)

where all W^{(*)}, U^{(*)} and b^{(*)} are parameters to be learned during training, \sigma is the sigmoid function and \otimes stands for element-wise multiplication.

A Bidirectional LSTM performs the same operations as the standard LSTM but processes the incoming text in left-to-right and right-to-left order in parallel. Thus, its output becomes \hat{h}_t = [\overrightarrow{h}_t, \overleftarrow{h}_t], the concatenation of the two directions.

By adding an attention mechanism, we allow the model to decide which part of the sequence to "attend to". First, let us define the softmax function \pi(v) for a vector v = [v_0, \cdots, v_{n-1}] as:

\pi(v) = \frac{e^{v}}{\sum_{i=0} e^{v_i}}

Then, let I \in R^{N \times L} be the matrix of input vectors, where L is their size and N is the length of the given sequence. We define the attention layer (AttLSTM) as a regular LSTM layer like (1) with the extra operations described as follows:

a_{k,t} = \pi(W_k \cdot h^{T}_{t-1} + b_k)
\alpha_{k,t} = a^{T}_{k,t} \cdot I
\beta_t = [\alpha_{0,t}, \cdots, \alpha_{S-1,t}]
x_t = W_a \cdot \beta_t + b_a        (2)

Here k \in [0, S-1] indexes the attention heads, W_k \in R^{N \times M} where M is the size of the hidden state vector h_t, W_a \in R^{M \times SM}, and b_a and b_k are learnable parameters. (*)^T is the transpose operation, and the output of the layer is O = [h_0, ..., h_t, ..., h_N], the concatenation of the hidden states produced by the AttLSTM at each time step.

As mentioned before, we propose a feature ensemble using an interpretable multi-source fusion component (IMF). The IMF aims to combine features from different sources. A naive way of doing this is to concatenate the vector representations into a single vector; such a scheme treats all sources equally, although one source may be more useful than the others. With the IMF we instead weight the contribution of every feature source via an attention mechanism. The IMF can be expressed by:

r_i = \tanh(W_{p_i} f_i + b_{p_i})

where r_i is a projection of f_i, the i-th feature vector passed to the IMF, ensuring that every r_i has the same size. In this step, all the W_{p_i}, b_{p_i}, W_a and b_a are parameters to be learned during training. Then:

a_i = W_a r_i + b_a        \alpha_i = \pi(a_i)
\beta_i = \alpha_i r_i        z = \sum_{k=0} \beta_k        (3)

where \alpha_i represents the importance of r_i for the final computation of z, the IMF output.
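As an illustration of the fusion step just described, the sketch below implements an IMF-style module in PyTorch under our reading of equation (3): each feature source is projected to a common size with a tanh layer, scored with a shared linear map, and the softmax-normalized scores weight the projected vectors before they are summed. The module name, interface and the sizes in the usage example are assumptions for illustration, not the authors' released code.

# Minimal sketch (not the authors' code) of interpretable multi-source fusion:
# attention over feature sources, following Eq. (3).
import torch
import torch.nn as nn

class IMF(nn.Module):
    def __init__(self, input_sizes, proj_size):
        super().__init__()
        # One projection W_{p_i}, b_{p_i} per feature source f_i -> r_i.
        self.projections = nn.ModuleList(
            [nn.Linear(size, proj_size) for size in input_sizes]
        )
        # Shared scoring map W_a, b_a producing one scalar per source.
        self.score = nn.Linear(proj_size, 1)

    def forward(self, features):
        # features: list of tensors, one per source, each of shape (batch, size_i).
        r = [torch.tanh(p(f)) for p, f in zip(self.projections, features)]
        r = torch.stack(r, dim=1)                 # (batch, n_sources, proj_size)
        a = self.score(r).squeeze(-1)             # (batch, n_sources)
        alpha = torch.softmax(a, dim=1)           # importance of each source
        z = (alpha.unsqueeze(-1) * r).sum(dim=1)  # weighted sum of projections
        return z, alpha

# Hypothetical usage: fuse the sequence-model output with BERT and handcrafted features.
# imf = IMF(input_sizes=[256, 768, 60], proj_size=128)
# z, alpha = imf([ob2, f_bert, f_handcrafted])

Returning alpha alongside z is what makes the component interpretable: the weights indicate how much each source contributed to a given prediction.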
To increase the learning power of our system, we use multi-task learning (Caruana, 1997), predicting the polarity of tweets in parallel with the classes of the hate speech detection subtask. This approach had been explored before (Cimino et al., 2018) in HaSpeeDe at Evalita 2018 (Bosco et al., 2018). The tweets used for the auxiliary polarity task are extracted from the Sentipolc-2016 (Barbieri et al., 2016) challenge.

Finally, we present the composition of the previous layers and features that creates our deep ensemble model:

E = [w_0, w_1, \cdots, w_{N-1}]
ob_1 = BiLSTM(E)        (4)

where E is the vector representation of the text (see Section 3.3). Equation (4) is the first block of our model; the second block can be described as follows:

A = AttLSTM(ob_1)
m_i = \max_{j=0,\cdots,N-1} A_{j,i}
ob_2 = [m_0, \cdots, m_{M-1}]        (5)

The vector ob_2 is the output of a max-pooling layer over the sequence of vectors A. Then:

F = [ob_2, f_{bert}, f_{wn}, f_{hs}, f_{ig}]
ob_3 = IMF(F)
\hat{y} = \sigma(W_h ob_3 + b_h)
\hat{y}_f = \sigma(W_f ob_3 + b_f)        (6)

The third block is described in (6), where W_h, W_f, b_f and b_h are learnable parameters and \hat{y}, \hat{y}_f \in R. The vectors f_{bert}, f_{wn}, f_{hs} and f_{ig} correspond to the BERT, WordNet, Hurt-Sentiment and Information Gain features, respectively. The polarity of a tweet is predicted through \hat{y}_f and its hate value through \hat{y}.

The overall weighted loss of the model is computed with cross-entropy, giving a higher importance to the hate speech predictions than to the polarity predictions. It is calculated according to the following formula:

L_1 = -\sum_i y_i \log(\hat{y}_i)        L_2 = -\sum_i y_{f_i} \log(\hat{y}_{f_i})
loss = \lambda L_1 + (1 - \lambda) L_2        (0 \le \lambda \le 1)        (7)

Here L_1 and L_2 are the cross-entropy losses of the hate predictions and of the sentiment polarity predictions, respectively, \lambda is the main-task importance weight, and y_i and y_{f_i} are the ground-truth hate and polarity labels. The final loss is thus a convex combination of L_1 and L_2.
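To make the two output heads of equation (6) and the weighted loss of equation (7) concrete, the following is a minimal PyTorch sketch of the final block, assuming the fused vector ob_3 and binary labels for both tasks are available; the class name, interface and sizes are illustrative assumptions, with \lambda = 0.75 taken from the hyperparameters reported in Section 4.

# Minimal sketch (not the authors' code) of the two prediction heads and the
# weighted multi-task loss of Eq. (7).
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, fused_size, lam=0.75):
        super().__init__()
        self.hate_head = nn.Linear(fused_size, 1)      # W_h, b_h
        self.polarity_head = nn.Linear(fused_size, 1)  # W_f, b_f
        self.lam = lam
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, ob3, y_hate=None, y_polarity=None):
        y_hat = self.hate_head(ob3).squeeze(-1)        # logit for the hate class
        y_hat_f = self.polarity_head(ob3).squeeze(-1)  # logit for the polarity class
        if y_hate is None:
            # Inference: return the two sigmoid probabilities.
            return torch.sigmoid(y_hat), torch.sigmoid(y_hat_f)
        # Training: loss = lambda * L1 + (1 - lambda) * L2.
        l1 = self.bce(y_hat, y_hate)
        l2 = self.bce(y_hat_f, y_polarity)
        return self.lam * l1 + (1.0 - self.lam) * l2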
4 Experiments and Results

In this section we present and discuss the results of our proposed method on subtask A. The organizers allowed a maximum of two submissions per subtask, and our team ran under the name UO.

The experiments were conducted in two main directions: first, to investigate the impact of the IMF fusion strategy, and second, to evaluate the impact of each proposed single-modal representation on our proposal. The results of these experiments are presented in Table 1 and Table 2.

In those tables, the column heads gives the number of attention heads in the AttLSTM layer; a dash means this layer was not used. The columns bert and ig correspond to the presence or absence of the BERT and IG representations, and the column wn-hs to the presence of the Hurt-Sentiment and WordNet based representations. A cross in a cell means that the corresponding representation was not used in that run. We used 10% of the training dataset for validation and report the accuracy computed on this validation data.

Both tables show that the presence of BERT increases performance, and almost all runs reach higher values with the IMF than without it. Increasing the number of attention heads without the IMF improves the results, but the opposite occurs in the presence of the IMF.

Table 1: Experiment results without IMF.

Name   heads   bert   ig   wn-hs   acc
run1   2                           0.764386
run2   -              ×    ×       0.742690
run3   3                           0.767544
run4   2       ×                   0.713450
run5   2              ×            0.763158
run6   -                           0.757310
run7   -       ×                   0.724152
run8   -              ×            0.755848

Table 2: Experiment results with IMF.

Name   heads   bert   ig   wn-hs   acc
run1   2                           0.795848
run2   -              ×    ×       0.779101
run3   3                           0.764620
run4   2       ×                   0.720760
run5   2              ×            0.774854
run6   -                           0.767544
run7   -       ×                   0.719298
run8   -              ×            0.777778

The pre-trained embeddings have a size of 300, and the number of neurons in both the Bi-LSTM and the AttLSTM is 128. The λ value was set to 0.75, the dropout (Srivastava et al., 2014) after the embedding layer to 0.3, and the whole model was trained with the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.01.

From Table 2, run1 and run2 were chosen as the final submissions for the subtask. run1 uses the attention layer described in Section 3.4 and considers all proposed representations. run2 does not use the attention mechanism or the handcrafted features, relying only on the BERT text representation and the rest of the architecture.

Table 3 shows the official results of our system. The evaluation was performed on two distinct corpora: one composed of tweets and the other of news headlines.

Table 3: Official results.

Runs                    macro-F
UO:tweets run1          0.6878
UO:tweets run2          0.7214
BEST RATED:tweets       0.8088
UO:news run1            0.6657
UO:news run2            0.7314
BEST RATED:news         0.7744

These results show that, of our two models, the simpler one obtained the better results: greater complexity does not guarantee better performance with deep learning. They also indicate that some linguistic features decrease the effectiveness of the model, but the similarity between the results on the tweet and news evaluation sets suggests that the system is able to generalize with good performance.

5 Conclusions and Future Work

In this paper we presented an ensemble model for subtask A of the Hate Speech Detection task (HaSpeeDe2) at Evalita 2020. Our proposal combines linguistic features and RNNs with transformer representations through an IMF. In the training phase, we used a multi-task learning approach to recognize hate speech and polarity simultaneously. The achieved results show the ability of this ensemble to generalize the detection of hateful content across different text genres. Nevertheless, some handcrafted features lower its results. Motivated by this, we plan to explore better feature selection, other attention mechanisms and multi-task learning techniques to improve the performance.

References

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 sentiment polarity classification task.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Elisa Bassignana, Valerio Basile, and Viviana Patti. 2018. HurtLex: A multilingual lexicon of words to hurt. In 5th Italian Conference on Computational Linguistics, CLiC-it 2018, volume 2253, pages 1-6. CEUR-WS.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the Evalita 2018 hate speech detection task. In EVALITA 2018 - Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, volume 2263, pages 1-9. CEUR.

Erik Cambria, Soujanya Poria, Devamanyu Hazarika, and Kenneth Kwok. 2018. SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41-75.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at Evalita 2018. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), pages 86-95.

Gretel Liz De la Pena Sarracén, Reynaldo Gil Pons, Carlos Enrique Muniz Cuza, and Paolo Rosso. 2018. Hate speech detection using attention-based LSTM. EVALITA Evaluation of NLP and Speech Tools for Italian, 12:235.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Björn Gambäck and Utpal Kumar Sikdar. 2017. Using convolutional neural networks to classify hate-speech. In Proceedings of the First Workshop on Abusive Language Online, pages 85-90.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Hamid Karimi, Proteek Roy, Sari Saba-Sadiya, and Jiliang Tang. 2018. Multi-source multi-class fake news detection. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1546-1557.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

David D. Lewis. 1992. An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-50.

Gang Liu and Jiabao Guo. 2019. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing, 337:325-338.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111-3119.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. Overview of the Evalita 2020 second hate speech detection task (HaSpeeDe 2). In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020), Online. CEUR.org.
Valentino Santucci, Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari. 2018. Detecting hate speech for Italian language in social media. In EVALITA 2018, co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), volume 2263.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.