    Comparing Different Supervised Approaches to Hate Speech Detection
            Michele Corazza† , Stefano Menini‡ , Pinar Arslan† , Rachele Sprugnoli‡
                         Elena Cabrio† , Sara Tonelli‡ , Serena Villata†
                       † Université Côte d’Azur, CNRS, Inria, I3S, France
                           ‡ Fondazione Bruno Kessler, Trento, Italy
                   {michele.corazza,pinar.arslan}@inria.fr
                      {menini,sprugnoli,satonelli}@fbk.eu
                    {elena.cabrio,serena.villata}@unice.fr

                     Abstract

     English. This paper reports on the systems the InriaFBK Team submitted to the EVALITA 2018 Shared Task on Hate Speech Detection in Italian Twitter and Facebook posts (HaSpeeDe). Our submissions were based on three separate classes of models: a model using a recurrent layer, an ngram-based neural network and a LinearSVC. For the Facebook task and the two cross-domain tasks we used the recurrent model and obtained promising results, especially in the cross-domain setting. For Twitter, we used an ngram-based neural network and the LinearSVC-based model.

     Italiano. This paper describes the InriaFBK team's models for the EVALITA 2018 Shared Task on Hate Speech Detection in Italian Twitter and Facebook posts (HaSpeeDe). Three different classes of models were used: a model with a recurrent layer, an ngram-based neural network and a LinearSVC-based model. For Facebook and the two cross-domain tasks, a recurrent model was chosen and obtained good results, especially on the cross-domain tasks. For Twitter, the ngram-based neural network and the LinearSVC-based model were used.

1 Introduction

In this paper, we describe the submitted systems for each of the four subtasks organized within the HaSpeeDe evaluation exercise at EVALITA 2018 (Bosco et al., 2018): Hate speech detection on Facebook comments (Task 1: HaSpeeDe-FB), Hate speech detection on tweets (Task 2: HaSpeeDe-TW), Cross-domain hate speech detection from Facebook to Twitter posts (Task 3.1: Cross-HaSpeeDe FB) and Cross-domain hate speech detection from Twitter to Facebook posts (Task 3.2: Cross-HaSpeeDe TW). We build our models for these binary classification subtasks testing recurrent neural networks, ngram-based neural networks [1] and a LinearSVC (Support Vector Machine) approach [2]. In HaSpeeDe-TW, which has comparatively short sequences with respect to HaSpeeDe-FB, an ngram-based neural network and a LinearSVC model were used, while for HaSpeeDe-FB and the two cross-domain tasks recurrent models were used.

[1] https://gitlab.com/ashmikuz/creep-cyberbullying-classifier
[2] https://github.com/0707pinar/Hate-Speech-Detection/

2 System Description

We adopt a supervised approach and, to select the best model for each task, we perform a grid search over different machine learning classifiers such as Neural Networks (NN), Support Vector Machines (SVM) and Logistic Regression (LR). Both ngram-based (unigram and bigram) and recurrent models using embeddings were tested, but only the ones that were submitted for the tasks will be described. A LinearSVC model from scikit-learn (Pedregosa et al., 2011a) was also tested, and it showed good performance on the Twitter dataset. In order to perform the grid search over the parameters and models, the training set released by the task organisers was partitioned in three: 60% of it was used for training, 20% for validation and 20% for testing. [3]

[3] To split the data we use the scikit-learn train_test_split function, using 42 as seed value.
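A minimal sketch of this split follows (the variable names and placeholder data are illustrative; only the 60/20/20 proportions and the seed of 42 come from the paper, and whether the split was stratified is not stated):

```python
from sklearn.model_selection import train_test_split

posts = ["post %d" % i for i in range(10)]   # placeholder data
labels = [i % 2 for i in range(10)]

# Carve out 60% for training, then split the remaining 40%
# in half to obtain 20% validation and 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    posts, labels, train_size=0.6, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```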

2.1 Preprocessing

Since misspellings, neologisms, acronyms and jargon are common in social media interactions, it was necessary to carefully preprocess the data, in order to normalize it without losing information. For this reason, we first replace URLs with the word “url” and “@” user mentions with “username” by using regular expressions.
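A minimal sketch of this normalization step (the exact regular expressions are not given in the paper; the patterns below are illustrative):

```python
import re

def normalize(text):
    # Replace URLs with the placeholder word "url".
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)
    # Replace @-mentions with the placeholder word "username".
    text = re.sub(r"@\w+", "username", text)
    return text

print(normalize("@utente guarda https://example.com"))
# -> "username guarda url"
```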
   Since hashtags often provide important semantic content, they are normalized by splitting them into the words composing them. To this end, we adapted to Italian the Ekphrasis tool (Baziotis et al., 2017), using as ngram model the Italian Google ngrams starting from year 2000. In addition to the aforementioned normalizations, for the LinearSVC model we also stemmed Italian words via the Snowball Stemmer (Bird and Loper, 2004) and we removed stopwords.

2.2 Feature Description

We used the following text-derived features:

• Word Embeddings: Italian fastText embeddings (Bojanowski et al., 2016) [4] employed in the recurrent models (Section 2.3);

• Ngrams: unigrams and bigrams, used for the ngram-based neural network and the LinearSVC (Sections 2.4, 2.5);

• Social-network specific features: the number of hashtags and mentions, the number of exclamation and question marks, the number of emojis, and the number of words written in uppercase (a sketch of their extraction is given after this list);

• Sentiment and Emotion features: the word-level emotion and sentiment tags for Italian words extracted from the EmoLex resource (Mohammad and Turney, 2013; Mohammad and Turney, 2010).

[4] https://github.com/facebookresearch/fastText
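The social-network specific features are simple surface counts. A possible extraction is sketched below (illustrative, not the authors' implementation; in particular, the emoji count uses a rough Unicode-range heuristic):

```python
import re

def social_features(text):
    """Count the social-media surface features of a post."""
    return {
        "hashtags": len(re.findall(r"#\w+", text)),
        "mentions": len(re.findall(r"@\w+", text)),
        "exclamations": text.count("!"),
        "questions": text.count("?"),
        # Rough emoji count over the main emoji code-point blocks.
        "emojis": len(re.findall(
            "[\U0001F300-\U0001FAFF\u2600-\u27BF]", text)),
        "uppercase_words": sum(
            1 for w in text.split() if w.isupper() and len(w) > 1),
    }
```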
2.3 Recurrent Neural Network Model

In order to classify hate speech in social media interactions, we believe that recurrent neural networks are a useful tool, given their ability to remember the sequence of inputs while considering their order, differently from feed-forward models. In the context of our classifier, this allows the model to remember the whole sequence of words in the order they appear in.

   More specifically, our recurrent models, implemented using Keras (Chollet et al., 2015), combine both sequences of word embeddings and social media features. In order to achieve that, an asymmetric topology is used for the neural network: the sequences of word embeddings are fed to a recurrent layer, whose output is then concatenated with the social features. The concatenated vector is then fed to one or two feed-forward fully connected layers that use the Rectified Linear Unit (ReLU) as their activation function. The output layer is a single neuron with a sigmoid activation, while binary cross-entropy is used as the loss function for the model.

   Batch normalization and various kinds of dropout have been tested to reduce the variance of the models. Experimental results suggested that applying the former to the output of the recurrent layer had a negative effect on performance. For this reason, batch normalization was applied only to the output of the hidden layers. As for dropout, we tried three different mechanisms. A simple dropout layer (Srivastava et al., 2014) is applied to the output of the hidden layers, as applying dropout to the output of the recurrent layer introduces too much noise and does not improve performance. We also tested a dropout on the embeddings (Gal and Ghahramani, 2016) that effectively skips some of the word embeddings in the sequence: dropping part of the embedding vector causes a loss of information, while dropping entire words can help reduce overfitting. In addition, a recurrent dropout (Gal and Ghahramani, 2016) was also tested. While evaluating the models, we tested both a Long Short-Term Memory (LSTM) (Gers et al., 1999) and a Gated Recurrent Unit (GRU) (Cho et al., 2014) as the recurrent layer. The latter is functionally very similar to an LSTM but, by using fewer weights, it can sometimes reduce the variance of the model, improving its performance.
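A minimal Keras sketch of this asymmetric topology is given below. The layer sizes, the optimizer and the frozen embeddings are illustrative assumptions; the submitted runs varied the recurrent cell, layer sizes and dropout as described in Section 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(vocab_size, embedding_matrix, n_social,
                hidden_size=200, rnn_size=100):
    # Sequence branch: word indices -> pretrained fastText
    # embeddings -> recurrent layer (a GRU here; an LSTM was
    # also tested by the authors).
    words = keras.Input(shape=(None,), dtype="int32")
    emb = layers.Embedding(
        vocab_size, embedding_matrix.shape[1],
        embeddings_initializer=keras.initializers.Constant(
            embedding_matrix),
        trainable=False)(words)
    encoded = layers.GRU(rnn_size)(emb)

    # Asymmetric topology: the recurrent output is concatenated
    # with the social-network specific features.
    social = keras.Input(shape=(n_social,))
    h = layers.Concatenate()([encoded, social])
    h = layers.Dense(hidden_size, activation="relu")(h)
    h = layers.BatchNormalization()(h)  # hidden layers only
    h = layers.Dropout(0.5)(h)          # simple dropout

    # Single sigmoid neuron, binary cross-entropy loss.
    out = layers.Dense(1, activation="sigmoid")(h)
    model = keras.Model([words, social], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```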
2.4 Ngram-based Neural Networks

Ngram-based neural networks are structurally similar to the recurrent models. We first compute the unigrams and bigrams over the lemmatized social media posts. The resulting vector is then normalized by using tf-idf from scikit-learn and concatenated to the social-specific features. One or two hidden feed-forward layers are then used, with the same output layer as in the recurrent models. The same dropout and batch normalization techniques used in the recurrent models have been tested for the ngram-based neural networks as well. For the first submitted run of Task 2: HaSpeeDe-TW, we used unigrams and bigrams along with the required preprocessing steps based on the tf-idf model.
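A sketch of this feature preparation, with toy stand-in data (the real lemmatized posts and social feature matrix come from Sections 2.1 and 2.2):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins: one lemmatized post per row, with its
# social-network feature vector.
lemmatized_posts = ["odiare tutto questo", "bello il tramonto"]
social_matrix = np.array([[1, 0, 2, 0, 0, 0],
                          [0, 0, 0, 1, 1, 0]])

# Tf-idf-normalized unigram and bigram counts.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
ngrams = vectorizer.fit_transform(lemmatized_posts)

# Concatenate the ngram vector with the social features; the
# result feeds the hidden feed-forward layers.
features = np.hstack([ngrams.toarray(), social_matrix])
```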
2.5 Linear SVC System

We implemented a Linear Support Vector Classification system (i.e., LinearSVC) (Fan et al., 2008) based on bag-of-words (i.e., unigrams), using scikit-learn (Pedregosa et al., 2011b), for the first submitted run in Task 2: HaSpeeDe-TW. We chose this system as it scales well to large samples and is efficient on text classification problems. To deal with imbalanced labels, we set the class_weight parameter to “balanced”. To mitigate overfitting, the penalty parameter C was set to 0.7.
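A minimal version of this setup (only class_weight="balanced" and C=0.7 are stated in the paper; the toy data is a placeholder, and the stemming and stopword removal of Section 2.1 are omitted here):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Bag-of-words unigrams feeding a linear SVM; balanced class
# weights compensate for label imbalance, and C=0.7 regularizes
# more strongly than the default C=1.0.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LinearSVC(class_weight="balanced", C=0.7),
)
model.fit(["testo uno", "altro testo"], [0, 1])  # placeholder data
print(model.predict(["testo uno"]))
```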
3 Submitted Runs and Results

In this section we describe the runs submitted for each task and present the results. The official ranking reported for each run is given in terms of macro-average F-score.

3.1 Task 1: HaSpeeDe-FB

For Task 1: HaSpeeDe-FB, two recurrent models were used. The first submitted run used a single fully connected layer of size 200 and a GRU of size 100 as the recurrent layer. Recurrent dropout was applied to the GRU with value 0.2. The second submitted run used two fully connected layers of size 500 and a GRU of size 300 as the recurrent layer. Simple dropout was applied to the output of the feed-forward layers with value 0.5. The first run ranked third and the second ranked fourth out of 18 submissions (Table 1). As shown in Table 1, both runs yield a better performance on the hate speech class.

                     First Run
      Category     P       R      F1    Instances
      Non Hate   0.763   0.687   0.723     323
        Hate     0.858   0.898   0.877     677
     Macro AVG   0.810   0.793   0.800    1000
                    Second Run
      Non Hate   0.716   0.703   0.709     323
        Hate     0.859   0.867   0.863     677
     Macro AVG   0.788   0.785   0.786    1000

           Table 1: Results on HaSpeeDe-FB

3.2 Task 2: HaSpeeDe-TW

In the first submitted run for Task 2: HaSpeeDe-TW, we used the LinearSVC-based model described in subsection 2.5. This run ranked sixth out of 19 submissions. As our second run on Task 2: HaSpeeDe-TW, an ngram-based neural network was used, with a single fully connected hidden layer of size 200. Simple dropout was applied to the hidden layer with value 0.5. This run ranked fourth. Both runs show better performance when classifying the non hate speech class, as displayed in Table 2.

                     First Run
      Category     P       R      F1    Instances
      Non Hate   0.873   0.827   0.850     676
        Hate     0.675   0.750   0.711     324
     Macro AVG   0.774   0.788   0.780    1000
                    Second Run
      Non Hate   0.842   0.899   0.870     676
        Hate     0.755   0.648   0.698     324
     Macro AVG   0.799   0.774   0.784    1000

           Table 2: Results on HaSpeeDe-TW

3.3 Task 3.1: Cross-HaSpeeDe FB

For Task 3.1: Cross-HaSpeeDe FB, two recurrent models were used. In the first submitted run, two hidden layers of size 500 were used. An LSTM of size 200 was adopted as the recurrent layer. Embeddings dropout was applied with value 0.5 and a simple dropout was applied to the output of the feed-forward layers with value 0.5. The recurrent model for the second run had one hidden layer of size 500. A GRU of size 200 was used as the recurrent layer and no dropout was applied. The first run ranked second out of 17 submissions, while the second run registered the best score in Task 3.1: Cross-HaSpeeDe FB. In both runs, the models showed good performance over the non hate speech class, whereas the precision on the hate speech class does not exceed 0.5 (see Table 3).

                     First Run
      Category     P       R      F1    Instances
      Non Hate   0.810   0.675   0.736     676
        Hate     0.497   0.670   0.570     324
     Macro AVG   0.653   0.672   0.653    1000
                    Second Run
      Non Hate   0.818   0.660   0.731     676
        Hate     0.494   0.694   0.580     324
     Macro AVG   0.656   0.677   0.654    1000

         Table 3: Results on Cross-HaSpeeDe FB

3.4 Task 3.2: Cross-HaSpeeDe TW

For Task 3.2: Cross-HaSpeeDe TW, two recurrent models were used. In the first submitted run, two hidden layers of size 500 were used together with a GRU of size 200 as the recurrent layer.
Simple dropout was applied to the output of the feed-forward layers with value 0.2, whereas the recurrent dropout had value 0.2. In the second submitted run, one hidden layer of size 200 was used, adopting an LSTM of size 200 as the recurrent layer. Embeddings dropout was applied with value 0.5. The first run ranked fourth out of 17 submissions, while the other run ranked second. Table 4 shows that in both cases the system showed good performance over the hate speech class, while detecting negative instances proved difficult, in particular in terms of precision over the non hate speech class.

                     First Run
      Category     P       R      F1    Instances
      Non Hate   0.493   0.703   0.580     323
        Hate     0.822   0.656   0.730     677
     Macro AVG   0.658   0.679   0.655    1000
                    Second Run
      Non Hate   0.537   0.653   0.589     323
        Hate     0.815   0.731   0.771     677
     Macro AVG   0.676   0.692   0.680    1000

         Table 4: Results on Cross-HaSpeeDe TW

4 Error Analysis and Discussion

Although all our runs obtained satisfactory results in each task, there is still room for improvement. In particular, we noticed that our models have problems in classifying social media messages containing the following specific phenomena: (i) dialects (e.g. “un se ponno sentì...ma come se fà...”) or bad orthography (e.g. “Io no nesdune delle due.....momti pesanti”); (ii) sarcasm (e.g. “Dopo i campi rom via pure i centri sociali. L’unico problema sarà distinguere gli uni dagli altri”); (iii) references to world knowledge, typically used for an indirect attack not containing an explicit insult (e.g. “un certo Adolf sarebbe utile ancora oggi con certi soggetti”); (iv) metaphorical expressions, usually referring to ways to physically eliminate the targets of hate speech messages (e.g. “Ruspali”).

   As for false positives, some errors come from the misclassification of messages containing the lemmas “terrorista”, “terrorismo” and “immigrato”, which are extremely frequent in particular in the Twitter dataset. These lemmas are associated with the hate speech class even when they appear in messages reporting the title of a news article, e.g. “Il Giappone senza immigrati a corto di forza lavoro”.

   In Task 2: HaSpeeDe-TW, when the classifier relies on sentiment and emotion features, we registered several misclassified instances containing relevant content words not covered by EmoLex. This is due to the fact that, for every English word, EmoLex provides only one translated entry, thus limiting the overall coverage. For instance, “to kill” is translated in Italian as “uccidere”, not considering synonyms such as “ammazzare”, often used in the dataset.

   Finally, we noticed some inconsistencies in the gold standard. For example, the message “Al solo vederle danno il voltastomaco!” is annotated as hate speech while the almost equivalent “Appena le ho viste ho vomitato” is considered a non hate speech instance, although our models identify it as hate speech. Similarly, an insult like “ridicoli” is annotated as non hate speech in “CERTO CHE GLI ONOREVOLI DEL PD SI RICONOSCONO A KILOMETRI ... RIDICOLI” but as hate speech in “Ci vorrebbe anche qua Putin, invece di quei RIDICOLI...PAROLACCE PAROLACCE”.

5 Conclusions

In this paper we presented an overview of the runs submitted for the four subtasks of the HaSpeeDe evaluation exercise. We implemented a number of different models, comparing recurrent neural networks, ngram-based neural networks and a linear SVC. While RNNs perform better in three of the four tasks, classification on Twitter data achieves a better ranking using the ngram-based neural network. Our system ranked first among all the teams in one of the cross-domain tasks, i.e. Cross-HaSpeeDe FB. This is probably due to the fact that considering the whole sequence of inputs with a recurrent neural network and using a pre-learned representation through word embeddings helps the model to learn some common traits of hate speech across different social media.

Acknowledgments

Part of this work was funded by the CREEP project (http://creep-project.eu/), a Digital Wellbeing Activity supported by EIT Digital in 2018. This research was also supported by the HATEMETER project (http://hatemeter.eu/) within the EU Rights, Equality and Citizenship Programme 2014-2020.
References

Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 747-754, Vancouver, Canada, August. Association for Computational Linguistics.

Steven Bird and Edward Loper. 2004. NLTK: The Natural Language Toolkit. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, page 31. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.

Cristina Bosco, Felice Dell’Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection (HaSpeeDe) Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy, December. CEUR.org.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.

François Chollet et al. 2015. Keras. https://keras.io.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9(Aug):1871-1874.

Yarin Gal and Zoubin Ghahramani. 2016. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems, pages 1019-1027.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to Forget: Continual Prediction with LSTM.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26-34. Association for Computational Linguistics.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3):436-465.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011a. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011b. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929-1958.