RuG @ EVALITA 2018: Hate Speech Detection in Italian Social Media

Xiaoyu Bai*, Flavio Merenda*+, Claudia Zaghi*, Tommaso Caselli*, Malvina Nissim*
* Rijksuniversiteit Groningen, Groningen, The Netherlands
+ Università degli Studi di Salerno, Salerno, Italy
f.merenda|t.caselli|m.nissim@rug.nl
x.bai.5|c.zaghi@student.rug.nl

Abstract

English. We describe the systems the RuG Team developed in the context of the Hate Speech Detection Task in Italian Social Media at EVALITA 2018. We submitted a total of eight runs, participating in all four subtasks. The best macro-F1 score in all subtasks was obtained by a Linear SVM using hate-rich embeddings. Our best system obtains competitive results, ranking 6th (out of 14) in HaSpeeDe-FB, 3rd (out of 15) in HaSpeeDe-TW, 8th (out of 13) in Cross-HaSpeeDe FB, and 6th (out of 13) in Cross-HaSpeeDe TW.

Italiano (English translation). We illustrate the details of the two systems that the RuG Team developed within the evaluation exercise on the recognition of hate messages in Italian Social Media texts. We participated in all four subtasks, submitting a total of eight predictions. The best macro-F1 was obtained by an SVM using polarised embeddings, built by exploiting hate-rich content. Our best system obtained competitive results, ranking 6th (out of 14) in HaSpeeDe-FB, 3rd (out of 15) in HaSpeeDe-TW, 8th (out of 13) in Cross-HaSpeeDe FB, and 6th (out of 13) in Cross-HaSpeeDe TW.

1 Introduction

The use of "bad" words and "bad" language has been the battleground for freedom of speech for centuries. The spread of Social Media platforms, and especially of micro-blog platforms (e.g. Facebook and Twitter), has favoured the growth of on-line hate speech. Social media sites and platforms have been urged to deal with and remove offensive and/or abusive content, but the phenomenon is so pervasive that developing systems that automatically detect and classify offensive on-line content has become a pressing need (Bleich, 2014; Nobata et al., 2016; Kennedy et al., 2017).

The Natural Language Processing and Computational Social Science communities have been receptive to such urgency, and the automatic detection of abusive and/or offensive language, trolling, and cyberbullying (Waseem et al., 2017; Schmidt and Wiegand, 2017) has seen a growing interest. This has taken various forms: datasets in multiple languages (http://bit.ly/2RZUlKH), thematic workshops (https://sites.google.com/view/alw2018), and shared evaluation exercises, such as the GermEval 2018 Shared Task (Wiegand et al., 2018), and the SemEval 2019 Task 5: HatEval (http://bit.ly/2EEC7Me) and Task 6: OffensEval (http://bit.ly/2P7pTQ9). The EVALITA 2018 Hate Speech Detection task (haspeede, http://di.unito.it/haspeedeevalita18) (Bosco et al., 2018) also falls in the latter category, and focuses on the automatic identification of hate messages from Facebook comments and tweets in Italian. We participated in this shared task with two different models, exploiting the concept of polarised embeddings (Merenda et al., 2018). The details of our participation are the core of this paper. Code and outputs are available at https://github.com/tommasoc80/evalita2018-rug.
2 Task

The haspeede task derives from the harmonization of originally separate annotation efforts by two research groups, converging onto a uniform label granularity (Del Vigna et al., 2017; Poletto et al., 2017; Sanguinetti et al., 2018). For details on the data, see Section 3.1 and the task overview paper (Bosco et al., 2018).

The hate detection task is articulated in four binary (hate vs non-hate) sub-tasks, two in-domain and two cross-domain. The in-domain sub-tasks require training and test data to belong to the same text type, either Facebook (HaSpeeDe-FB) or Twitter (HaSpeeDe-TW), while the cross-domain sub-tasks require training on one text type and testing on the other: Facebook-Twitter (Cross-HaSpeeDe FB) and Twitter-Facebook (Cross-HaSpeeDe TW).

3 Data and Resources

All of our runs for all subtasks are based on supervised approaches, where data (and features) play a major role in the final results of a system. Furthermore, our contribution adopted a closed-task setting, i.e. we did not include any training data beyond what was provided within the task. We did, however, build enhanced distributed representations of words exploiting additional data (see Section 3.2). This section illustrates the datasets and language resources used in our submissions.

3.1 Resources Provided by the Organisers

The organizers provided a total of 6,000 labeled Italian messages for training, split as follows: 3,000 comments from Facebook and 3,000 messages from Twitter. For test, they subsequently made available 1,000 instances for each text type. Table 1 illustrates the distribution of the classes in the different text types, both in training and test data. Note that the distribution of labels in the test data was unknown at development time.

Table 1: Distribution of the labeled samples in the training and test data per text type.

Text type   Class      Training   Test
Facebook    non-hate   1,618      323
            hate       1,382      677
Twitter     non-hate   2,028      676
            hate       972        324

Although the task organisers balanced the datasets with respect to size and adopted the same annotation granularity (hate vs. non-hate), the two datasets are very different both in terms of class distribution (46.06% of messages labelled as hateful in Facebook vs. 32.40% in Twitter in training) and with regard to their contents. For instance, the Facebook data is concerned with general topics that may contain hateful messages, such as immigration, religion, politics, and gender issues, while the Twitter dataset is focused on specific targets, i.e., categories or groups of individuals who are likely to become victims of hate speech: migrants, Muslims, and Roma (the Romani, Romany, or Roma are an ethnic group of traditionally itinerant people who originated in northern India and are nowadays subject to ethnic discrimination). It is also interesting to note that the label distribution in the Facebook test data is flipped compared to training, with a strong majority of hateful comments.

3.2 Additional Resources: Source-Driven Embeddings

We addressed the task by adopting a closed-task setting. However, as a strategy to potentially increase the generalization capabilities of our systems and tune them towards better recognition of hate content, we developed hate- and offense-sensitive word embeddings. To do so, we scraped comments from a list of selected Facebook pages likely to contain offensive and/or hateful content in the form of comments to posts, extracting over 1M comments. We built word embeddings over the acquired data with the word2vec skip-gram model (Mikolov et al., 2013), using 300 dimensions, a context window of 5, and a minimum frequency of 1. In the remainder of this paper we refer to these representations as "hate-rich embeddings". More details on the creation process, including the complete list of Facebook pages used, and a preliminary evaluation of these specialised representations can be found in (Merenda et al., 2018).
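As an illustration, the reported hyperparameters correspond to the following minimal sketch, which uses gensim's word2vec implementation as a stand-in for the original word2vec tool; the input file name and the whitespace tokenisation are our own assumptions, not the authors' actual pipeline.

    # Training "hate-rich" embeddings: skip-gram, 300 dimensions,
    # context window 5, minimum frequency 1 (as reported above).
    from gensim.models import Word2Vec

    # Hypothetical input file: one scraped Facebook comment per line.
    with open("fb_comments.txt", encoding="utf-8") as f:
        sentences = [line.lower().split() for line in f]

    model = Word2Vec(
        sentences,
        sg=1,             # skip-gram
        vector_size=300,  # 300-dimensional vectors (size= in gensim < 4.0)
        window=5,         # context window of 5
        min_count=1,      # minimum frequency 1: keep every word
    )
    model.wv.save_word2vec_format("hate_rich_embeddings.txt")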
4 Systems and Runs

We detail in this section our final submissions. The models were developed in parallel to our participating systems at the GermEval 2018 Shared Task (Bai et al., 2018), sharing with them some core aspects.

4.1 Run 1: Binary SVM

Our first model is a Linear Support Vector Machine (SVM), built using the LinearSVC scikit-learn implementation (Pedregosa et al., 2011). We performed minimal pre-processing by removing stop words with the Python module stop-words (https://pypi.org/project/stop-words/) and lowercasing the tokens.

We used two groups of surface features, namely: i.) word n-grams in the range 1-3; and ii.) character n-grams in the range 2-4. The sparse vector representation of each (training) instance is then concatenated with its dense vector representation, as follows: for every word w in an instance i, we derive a 300-dimensional representation by means of a look-up in the hate-rich embeddings. We then perform max pooling over these word embeddings to obtain a 300-dimensional representation of the full instance. Words not covered by the hate-oriented embeddings are ignored. Finally, class weights are balanced and the SVM parameters use default values (C = 1).
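The feature construction just described can be sketched as follows; this is a minimal illustration under our own assumptions (file names, whitespace tokenisation, and the omission of the stop-word removal step), not the authors' released code.

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from gensim.models import KeyedVectors

    # Hate-rich embeddings from Section 3.2 (hypothetical file name).
    emb = KeyedVectors.load_word2vec_format("hate_rich_embeddings.txt")

    def max_pool(text, dim=300):
        # Dense instance vector: element-wise max over the embeddings
        # of the covered words; uncovered words are ignored.
        vecs = [emb[w] for w in text.split() if w in emb]
        return np.max(vecs, axis=0) if vecs else np.zeros(dim)

    word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 3))
    char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 4))

    def featurise(texts, fit=False):
        # Sparse word/character n-grams concatenated with the dense
        # max-pooled embedding representation of each instance.
        if fit:
            sparse = hstack([word_ngrams.fit_transform(texts),
                             char_ngrams.fit_transform(texts)])
        else:
            sparse = hstack([word_ngrams.transform(texts),
                             char_ngrams.transform(texts)])
        dense = csr_matrix(np.vstack([max_pool(t) for t in texts]))
        return hstack([sparse, dense])

    clf = LinearSVC(C=1.0, class_weight="balanced")
    # clf.fit(featurise(train_texts, fit=True), train_labels)
    # predictions = clf.predict(featurise(test_texts))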
4.2 Run 2: Binary Ensemble Model

Our second submission uses a binary ensemble model, which combines a Convolutional Neural Network (CNN) system and the linear SVM (Section 4.1), with a logistic regression meta-classifier on top. Predictions on the training data are obtained via ten-fold cross-validation.

In the ensemble model, each input instance to the meta-classifier is represented by the concatenation of four features: a) the class prediction for that instance made by the SVM, b) the prediction of the CNN, and c) two additional surface-level features: the instance's length in characters and the percentage of offensive terms in the instance. This latter feature is obtained via a look-up in a list of offensive terms in Italian derived from the article Le parole per ferire by Tullio De Mauro (https://bit.ly/2J4TPag) and the "bad words" category in the Italian Wiktionary. The feature is expressed as the ratio between the frequency of the instance's tokens contained in the list and the instance's length in tokens. Figure 1 shows the features fed to the ensemble meta-classifier.

Figure 1: Feature representation of each sample fed to the ensemble model. On top, the representation of a training sample; on the bottom, the representation of a test sample.
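A minimal sketch of this feature representation is given below; the lexicon file name, the whitespace tokenisation, and the way the predictions are passed in are assumptions of ours.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Offensive-term list from De Mauro and the Italian Wiktionary
    # (hypothetical file name, one term per line).
    with open("offensive_terms_it.txt", encoding="utf-8") as f:
        lexicon = {line.strip().lower() for line in f}

    def offensive_ratio(text):
        # Frequency of the instance's tokens found in the list,
        # divided by the instance's length in tokens.
        tokens = text.lower().split()
        return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

    def meta_features(texts, svm_preds, cnn_preds):
        # The four features fed to the meta-classifier (cf. Figure 1).
        return np.column_stack([
            svm_preds,                            # a) SVM class predictions
            cnn_preds,                            # b) CNN class predictions
            [len(t) for t in texts],              # c) length in characters
            [offensive_ratio(t) for t in texts],  # c) offensive-term ratio
        ])

    meta_clf = LogisticRegression()
    # On the training data, svm_preds and cnn_preds would come from
    # ten-fold cross-validation (e.g. sklearn's cross_val_predict).
    # meta_clf.fit(meta_features(train_texts, svm_cv, cnn_cv), train_labels)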
The CNN is an adaptation of available architectures for sentence classification (Kim, 2014; Zhang and Wallace, 2015), implemented in Keras (Chollet et al., 2015), and is composed of: i.) a word embeddings input layer using the hate-rich embeddings; ii.) a single convolutional layer; iii.) a single max-pooling layer; iv.) a single fully-connected layer; and v.) a sigmoid output layer. The max-pooling layer output is flattened, concatenated, and fed to the fully-connected layer composed of 50 hidden units with the ReLU activation function. The final output layer with the sigmoid activation function computes the distribution over the two labels. (Other network hyperparameters: number of filters: 6; filter sizes: 3, 5, 8; strides: 1.) We used binary cross-entropy as loss function and Adam as optimiser. In training, we set a batch size of 64 and ran for 10 epochs. We also applied two dropouts: 0.6 between the embeddings and the convolutional layer, and 0.8 between the max-pooling and the fully-connected layer.
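Under the hyperparameters listed above, the architecture can be sketched in Keras as follows; the vocabulary size, sequence length, the parallel-branch reading of the three filter sizes, and the ReLU activation on the convolutions are our assumptions.

    from tensorflow.keras import layers, Model

    MAX_LEN, VOCAB_SIZE, EMB_DIM = 100, 50000, 300  # assumed values

    inp = layers.Input(shape=(MAX_LEN,))
    # In the actual system, the weights of this layer would be
    # initialised from the hate-rich embeddings.
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    emb = layers.Dropout(0.6)(emb)  # dropout between embeddings and convolution

    pooled = []
    for size in (3, 5, 8):  # 6 filters per filter size, stride 1
        conv = layers.Conv1D(filters=6, kernel_size=size, strides=1,
                             activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))

    x = layers.Concatenate()(pooled)  # flattened, concatenated pooling output
    x = layers.Dropout(0.8)(x)        # dropout before the fully-connected layer
    x = layers.Dense(50, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)

    model = Model(inp, out)
    model.compile(loss="binary_crossentropy", optimizer="adam")
    # model.fit(X_train, y_train, batch_size=64, epochs=10)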
5 Results and Ranking

Table 2 reports the results and rankings of our runs for all four subtasks. We also include the scores of the CNN (not submitted to the official competition), marked with an asterisk. Being allowed to submit a maximum of two runs per subtask, we based our choice of models on the results of a 10-fold cross-validation of the three architectures on the training data. The SVM corresponds to run id 1 and the Ensemble model to run id 3 in the official submitted runs; see Submissions-Haspeede in the GitHub repository https://github.com/tommasoc80/evalita2018-rug/tree/master/Submissions-Haspeede.

Table 2: System results and ranking, including the out-of-competition runs for the CNN alone.

Subtask             Model      Rank    Macro F1
HaSpeeDe-FB         SVM        6/14    0.7751
                    Ensemble   9/14    0.7428
                    CNN*       n/a     0.7138
HaSpeeDe-TW         SVM        3/15    0.7934
                    Ensemble   9/15    0.7530
                    CNN*       n/a     0.7363
Cross-HaSpeeDe FB   SVM        8/13    0.5409
                    Ensemble   9/13    0.4845
                    CNN*       n/a     0.4692
Cross-HaSpeeDe TW   SVM        6/13    0.6021
                    Ensemble   7/13    0.5545
                    CNN*       n/a     0.6093

The SVM models obtain, by far, better results than the Ensemble models. It is likely that the Ensemble systems suffer from the lower performance of the CNN. We also observe differences in performance on the two datasets across the subtasks.

Table 3: SVM's performance per class (precision and recall).

                    non-hate           hate
Subtask             P       R          P       R
HaSpeeDe-FB         0.6990  0.6904     0.8531  0.8581
HaSpeeDe-TW         0.8577  0.8831     0.7401  0.6944
Cross-HaSpeeDe FB   0.8318  0.4023     0.3997  0.8302
Cross-HaSpeeDe TW   0.4375  0.6934     0.7971  0.5745

In-domain, in absolute terms, we do better on Twitter (0.7934) than on Facebook (0.7751), and this is even truer in relative terms, as the overall performance in the competition is better on Facebook (best: 0.8288) than on Twitter (best: 0.7993). Our high score on HaSpeeDe-TW comes from high precision and recall on non-hate, while for HaSpeeDe-FB we do well on the hate class. This can be due to the label distribution (hate is always the minority class, but more balanced in Facebook), but also to the fact that we use Facebook-based hate-rich embeddings, which might push towards better hate detection.

Cross-domain, results are globally lower, as expected, with best scores on Cross-HaSpeeDe FB and Cross-HaSpeeDe TW of 0.6541 and 0.6985, respectively (Bosco et al., 2018). Our models experience a more substantial loss when trained on Facebook and tested on Twitter (in Cross-HaSpeeDe FB we lose over 25 percentage points compared to HaSpeeDe-TW, where the Twitter test set is the same) than vice versa (we lose ca. 17 percentage points on the Facebook test set).
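The macro-averaged F1 in Table 2 and the per-class precision and recall in Table 3 correspond to standard scikit-learn metrics; the toy arrays and the 0/1 label encoding below are illustrative assumptions.

    from sklearn.metrics import f1_score, precision_recall_fscore_support

    # Toy predictions; 0 = non-hate, 1 = hate (encoding assumed).
    y_true = [0, 0, 0, 1, 1, 1]
    y_pred = [0, 1, 0, 1, 1, 0]

    print(f1_score(y_true, y_pred, average="macro"))  # macro F1, as in Table 2
    p, r, _, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
    print(p, r)                                       # per class, as in Table 3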
6 Discussion

The drop in performance in the cross-domain settings is likely due to topics and data collection strategies (general topics on Facebook, specific targets on Twitter). In other words, despite the use of hate-rich embeddings as a strategy to make the systems generalize better, our models remain too sensitive to the training data, which is strongly represented as word and character n-grams.

The impact of the hate-rich embeddings is most strongly seen in HaSpeeDe-FB and Cross-HaSpeeDe FB, with recall for the hate class being substantially higher than for the non-hate class. This could be due to the fact that the hate-rich embeddings were generated from comments on Facebook pages, that is, the same text type as the training data in these two tasks, so that possibly some jargon and topics are shared. While this has a positive effect when training and testing on Facebook (HaSpeeDe-FB), it has a detrimental effect when testing on Twitter (Cross-HaSpeeDe FB), since that dataset has a large majority of non-hate instances and we tend to over-predict the hate class (see Table 3).

In HaSpeeDe-TW and Cross-HaSpeeDe TW (training on Twitter), the impact of the hate-rich embeddings is a lot less clear. Indeed, recall for the hate class is always lower than for non-hate, with the large majority of errors (more than 50% in all runs) being hate messages wrongly classified as non-hateful, thus seemingly just following the class imbalance of the Twitter training set.

In both datasets, hate content is expressed either in a direct way, by means of "bad words" or direct insults to the target(s), or more implicitly and subtly. This latter type of hate message is definitely the main source of errors for our systems in all subtasks. Finally, we observe that in some cases the annotation of messages as hateful is subject to disagreement and debate. For instance, all messages containing the word rivoluzione [revolution] are marked as hateful, even though there is a lack of linguistic evidence.

7 Conclusion and Future Work

In developing our systems for the Hate Speech Detection in Italian Social Media task at EVALITA 2018, we focused on the generation of distributed representations of text that could not only enhance the generalisation power of the models, but also better capture the meaning of words in hate-rich contexts of use. We did so by exploiting Facebook on-line communities to generate hate-rich embeddings (Merenda et al., 2018).

A Linear SVM system outperformed a meta-classifier that used predictions from the SVM itself and a CNN, due to the low performance of the CNN component. Major errors of the systems are due to implicit hate messages, where even the hate-rich embeddings fail. A further aspect to consider in this task is the difference in text type and class balance between the two datasets. Both of these aspects have a major impact on system performance in the cross-genre settings.

Finally, to better generalize to unseen data and genres, future work will focus on developing systems able to further abstract from the actual lexical content of the messages by capturing general writing patterns of haters. One avenue to explore in this respect is "bleaching" text (van der Goot et al., 2018), a newly suggested technique used to fade the actual strings into more abstract, signal-preserving representations of tokens.

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG at GermEval: Detecting Offensive Speech in German Social Media. In Josef Ruppenhofer, Melanie Siegel, and Michael Wiegand, editors, Proceedings of the GermEval 2018 Workshop.

Erik Bleich. 2014. Freedom of expression versus racist hate speech: Explaining differences between high court regulations in the USA and Europe. Journal of Ethnic and Migration Studies, 40(2):283-300.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

François Chollet et al. 2015. Keras. https://keras.io.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), Venice, Italy, January 17-20, 2017, pages 86-95.

George Kennedy, Andrew McCollough, Edward Dixon, Alexei Bastidas, John Ryan, Chris Loo, and Saurav Sahay. 2017. Technology solutions to combat online harassment. In Proceedings of the First Workshop on Abusive Language Online, pages 73-77.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. Source-driven Representations for Hate Speech Detection. In Proceedings of the 5th Italian Conference on Computational Linguistics (CLiC-it 2018), Turin, Italy.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145-153. International World Wide Web Conferences Steering Committee.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate speech annotation: Analysis of an Italian Twitter corpus. In CEUR Workshop Proceedings, volume 2006, pages 1-6. CEUR-WS.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA).

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10, Valencia, Spain. Association for Computational Linguistics.

Rob van der Goot, Nikola Ljubešić, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 383-389.

Zeerak Waseem, Thomas Davidson, Dana Warmsley, and Ingmar Weber. 2017. Understanding abuse: A typology of abusive language detection subtasks. In Proceedings of the First Workshop on Abusive Language Online, pages 78-84, Vancouver, BC, Canada. Association for Computational Linguistics.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Josef Ruppenhofer, Melanie Siegel, and Michael Wiegand, editors, Proceedings of the GermEval 2018 Workshop.

Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.