Detecting Hate Speech for Italian Language in Social Media

Valentino Santucci, Stefania Spina
University for Foreigners of Perugia
{valentino.santucci, stefania.spina}@unistrapg.it

Alfredo Milani
University of Perugia
alfredo.milani@unipg.it

Giulio Biondi, Gabriele Di Bari
University of Florence
{giulio.biondi, gabriele.dibari}@unifi.it

Abstract

English. In this report we describe the hate speech detection system for the Italian language developed by a joint team of researchers from the two universities of Perugia (University for Foreigners of Perugia and University of Perugia). The experimental results obtained in the HaSpeeDe task of the Evalita 2018 evaluation campaign are analyzed. Finally, a suggestion for future research directions is provided in the conclusion.

Italiano. In questo documento descriviamo il sistema di hate speech detection per la lingua italiana sviluppato da una squadra di ricercatori dell'Università per Stranieri di Perugia e dell'Università degli Studi di Perugia. I risultati sperimentali ottenuti nel task HaSpeeDe, organizzato nell'ambito di Evalita 2018, sono riportati e analizzati. Infine, una possibile direzione di ricerca è fornita nelle conclusioni.

1 Introduction

In recent years, the exponential growth of social media has revolutionized communication and content publishing. However, social media are also increasingly exploited for the propagation of hate speech. This issue motivates the recent research on hate speech detection systems (Zhang and Luo, 2018; Waseem and Hovy, 2016; Del Vigna et al., 2017; Davidson et al., 2017; Badjatiya et al., 2017; Gitari et al., 2015).

In this paper, we describe our hate speech detection system for the Italian language. The system, namely HSD4I PG, has been developed by a joint team of researchers from the University for Foreigners of Perugia and the University of Perugia. The code of HSD4I PG is available online at https://github.com/Gabriele91/HSD4I_PG.

The rest of the paper is organized as follows. The overall system architecture is presented in Section 2, while the individual software components are described in Sections 3-6. Experimental results are provided in Section 7, and conclusions together with future lines of research are outlined in Section 8.

2 Architecture of the Hate Speech Detector

The hate speech detector we have developed, namely HSD4I PG, is composed of several software components:

• a tokenizer for Italian posts from social media,
• the popular FastText tool (Bojanowski et al., 2016), used to generate a word embedding model,
• a features generator that produces a vector of numeric features for each post to be classified,
• a (trainable) classifier that, for each post, predicts its class label.

Moreover, the following resources have been adopted:

• the Ita Twitter corpus (Spina, 2016), which includes 1,234,865 tweets extracted from the Italian timeline in a time span of seven months (November 2012 - May 2013). The tweets were extracted randomly, 2,000 per day, using the R package TwitteR (https://cran.r-project.org/web/packages/twitteR/);
• the Italian Lexicon of Hate Speech, which was collected on the basis of the Italian monolingual dictionary Il Nuovo De Mauro, also available online (https://dizionario.internazionale.it);
• the Sentix Italian lexicon for sentiment analysis (Basile and Nissim, 2013);
• the training sets of 3,000 Facebook posts and 3,000 tweets made available for the HaSpeeDe task of Evalita 2018.

As any other supervised classification system, HSD4I PG requires a training stage, which is depicted in Figure 1. The word embedding model is trained by FastText using the Ita Twitter corpus. Numeric features are obtained by aggregating the FastText features and by generating some ad-hoc extra-features. These numeric features are finally fed to a Support Vector Machine (SVM) (Cortes and Vapnik, 1995) in order to generate a classifier model.

Figure 1: Training in HSD4I PG

After the SVM classifier has been trained, the prediction of (unlabeled) posts is performed following the scheme depicted in Figure 2.

Figure 2: Classification in HSD4I PG
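To make the training and classification flow of Figures 1 and 2 concrete, the following self-contained Python sketch reproduces it on toy data. It is only an illustration of the architecture described above, not the actual HSD4I PG code: the corpus is a tiny placeholder, a single average aggregator stands in for the tuned feature generator of Section 5, and the extra-features and parameter tuning are omitted.

    # Minimal sketch of the training flow in Figure 1 and the classification
    # flow in Figure 2 (toy corpus and posts; not the actual HSD4I PG code).
    import fasttext
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # A tiny stand-in for the (tokenized) Ita Twitter corpus.
    with open("toy_corpus.txt", "w", encoding="utf8") as f:
        f.write("ciao a tutti\nche bella giornata\nodio questa gente\n" * 50)

    # 1) Word embedding model trained with the skipgram technique (Section 4).
    emb = fasttext.train_unsupervised("toy_corpus.txt", model="skipgram",
                                      dim=10, minCount=1)

    # 2) Fixed-length post representation: here simply the average token vector
    #    (the tuned system concatenates several aggregators plus extra-features).
    def features(post):
        vecs = [emb.get_word_vector(t) for t in post.split()]
        return [sum(v[i] for v in vecs) / len(vecs)
                for i in range(emb.get_dimension())]

    posts = ["che bella giornata", "odio questa gente"]
    labels = [0, 1]
    X = [features(p) for p in posts]

    # 3) Feature standardization and SVM training (Section 6).
    scaler = StandardScaler().fit(X)
    clf = SVC(kernel="rbf").fit(scaler.transform(X), labels)

    # Classification of a new, unlabeled post (Figure 2).
    print(clf.predict(scaler.transform([features("che brutta gente")])))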
3 The Tokenizer

A tokenizer for the Italian language used on social media has been designed by modifying the output produced by the "TweetTokenizer" class of the popular Python library NLTK (Bird et al., 2009).

A variety of corrections have been introduced. The most important ones are:

1. two or more consecutive occurrences of the same vowel have been replaced by a single occurrence (e.g., "ciaooo" is replaced with "ciao"),
2. alternative spellings of some bad words have been normalized (e.g., "vaffa" is replaced with its most popular form),
3. some common misspellings and abbreviations have been corrected (e.g., "cmq" is replaced with "comunque"),
4. hashtags have been split into multiple tokens using the Python library "compound-word-splitter",
5. apostrophes have been considered as token separators,
6. tokens composed of digit characters have been replaced with the token NUM,
7. tokens corresponding to Twitter mentions have been replaced with the token MEN,
8. tokens corresponding to web links have been replaced with the token URL,
9. emojis have been kept as tokens on their own, while other punctuation characters have been removed,
10. all the textual tokens have been replaced with their stemmed form by using the NLTK implementation of the Snowball stemming algorithm for the Italian language (Porter, 1980).

Moreover, in order to provide additional experimental results, we have also tried a lighter variant of the tokenizer that only performs the tasks numbered from 5 to 10; a minimal illustration of this lighter variant is sketched below.
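The following snippet gives a rough, self-contained Python rendering of the lighter variant (rules 5-10); the exact regular expressions and the emoji handling of HSD4I PG may differ, and the function name light_tokenize is purely illustrative.

    # Rough sketch of the lighter tokenizer variant (rules 5-10); the emoji
    # preservation of rule 9 is omitted for brevity.
    import re
    from nltk.tokenize import TweetTokenizer
    from nltk.stem.snowball import SnowballStemmer

    _tokenizer = TweetTokenizer()
    _stemmer = SnowballStemmer("italian")

    def light_tokenize(post):
        tokens = []
        # Rule 5: apostrophes act as token separators.
        for t in _tokenizer.tokenize(post.replace("'", " ").replace("’", " ")):
            if re.fullmatch(r"\d+", t):
                tokens.append("NUM")        # rule 6: digit-only tokens
            elif t.startswith("@"):
                tokens.append("MEN")        # rule 7: Twitter mentions
            elif t.startswith(("http://", "https://", "www.")):
                tokens.append("URL")        # rule 8: web links
            elif re.fullmatch(r"[^\w\s]+", t):
                continue                    # rule 9: drop punctuation tokens
            else:
                tokens.append(_stemmer.stem(t))  # rule 10: Italian Snowball stemming
        return tokens

    print(light_tokenize("@utente ciao!!! guarda https://esempio.it alle 18"))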
4 The Word Embedding Model

A word embedding model is generated by FastText (Bojanowski et al., 2016) using the skipgram technique. Fed with the Ita Twitter corpus, FastText produces a numeric vector representation for every n-gram contained in the corpus' posts, in such a way that the n-grams belonging to tokens appearing in similar contexts are close to each other in the continuous numerical space.

After the model has been generated, a numeric representation for a given token w can be simply computed by summing up the numeric representations of the n-grams that compose w.

Since out-of-vocabulary words are quite common in social media texts, we believe that the subword information contained in the n-grams is particularly useful in our scenario.

5 The Features Generator

The word embedding model allows us to generate a numeric representation for every token. Therefore, in order to produce a (constant length) numeric representation of the whole post, we need to aggregate the vectors corresponding to the tokens of the post. Six different aggregation functions have been considered: average (avg), standard deviation (std), minimum (min), maximum (max), median (med), and sum (sum). Any combination of these aggregators can be adopted, thus the features generator requires an experimental tuning (see Section 7).

Moreover, 20 additional extra-features have been introduced:

• number of hateful tokens, computed using the Italian Lexicon of Hate Speech (Spina, 2016),
• average sentiment polarity and intensity, computed using the Sentix lexicon (Basile and Nissim, 2013),
• number of web links,
• number of mentions,
• a boolean flag indicating whether the post is a reply tweet,
• number of hashtags,
• maximum length of a hashtag (in characters),
• a boolean flag indicating whether the post is a retweet,
• the percentage of capital letters,
• the percentage of tokens whose letters are all in capital case,
• number of exclamation marks,
• number of tokens composed of three or more dots,
• number of punctuation characters,
• number of emojis,
• number of repeated consecutive vowels,
• percentage of tokens representing a correct Italian word,
• post length in number of characters,
• post length in number of tokens.

As an illustrative example, let us consider that FastText has generated numeric vectors of size 300 for every single token w of a post p, and that the combination of the three aggregators sum, min, max has been chosen. Then, the numeric vector representing p has 300 × 3 + 20 = 920 dimensions and is formed by concatenating the three vectors, each one of size 300, produced by the chosen aggregators, together with the 20 extra-features.

Finally, in case the number of features is too large for the classifier, during the training phase we are able to reduce the dimensionality to a given number k by selecting the features with the largest mutual information with respect to the class labels.

6 The Classifier

After some preliminary experiments, we have decided to adopt a Support Vector Machine (SVM) classifier (Cortes and Vapnik, 1995). SVM is a supervised technique for training a classifier model by efficiently computing a separating hyperplane (between the two classes to be predicted) in an (implicitly) higher dimensional space (with respect to the features dimensionality). The SVM implementation of the Python library Scikit-Learn (Pedregosa et al., 2011) has been used.

Compared to the popular neural network models, the SVM technique has fewer parameters to be tuned, is computationally more efficient, and generally obtains comparable performances.

Finally, it is important to note that, before the training phase, all the training features have been standardized in such a way that their means and variances, across all the training instances, are, respectively, 0 and 1.
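The snippet below sketches how the feature construction of Section 5 and the SVM training of Section 6 can be wired together with Scikit-Learn. It uses random toy vectors in place of the real FastText embeddings and extra-features; the selected number of features k is only a placeholder, and the mutual-information selection step is shown for illustration even though, with 920 features, the filtering described above would not actually be triggered.

    # Sketch of feature aggregation, mutual-information selection, feature
    # standardization and SVM training (toy random data; k is a placeholder).
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def aggregate(token_vectors, extra_features):
        """Concatenate the sum, min and max aggregators with the extra-features."""
        m = np.vstack(token_vectors)
        return np.concatenate([m.sum(axis=0), m.min(axis=0), m.max(axis=0),
                               extra_features])

    # Toy data: 40 posts, each with five 300-dimensional token vectors and
    # 20 extra-features, i.e. 300 * 3 + 20 = 920 features per post.
    rng = np.random.default_rng(0)
    X = np.array([aggregate(rng.normal(size=(5, 300)), rng.normal(size=20))
                  for _ in range(40)])
    y = np.array([0, 1] * 20)

    model = make_pipeline(SelectKBest(mutual_info_classif, k=500),
                          StandardScaler(),
                          SVC(kernel="rbf", C=2.2, gamma="auto",
                              class_weight="balanced"))
    model.fit(X, y)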
7 Experiments

7.1 Experimental Setting

The parameters of the different software components of HSD4I PG have been tuned using a grid search approach and a 10-fold cross-validation scheme.

The FastText parameters have been chosen in the following ranges: number of epochs epoch ∈ {5, 20, 50, 100}, initial learning rate lr ∈ {0.05, 0.1}, negative sampling neg ∈ {5, 20, 50}, and window size ws ∈ {5, 10}. Moreover, the skipgram model has been considered, while the other FastText parameters have been set to constant values: dim = 300, minCount = 1, minn = 3, and maxn = 6.

Regarding the features generator (see Section 5), a combination of the six aggregators has to be chosen. Importantly, for combinations resulting in more than 1,000 features, the filtering procedure described at the end of Section 5 is performed.

After some preliminary experiments, we have decided to use the following ranges in order to tune the SVM parameters: kernel ∈ {rbf, linear} and C ∈ {1.8, 2, 2.2, 2.4}. Moreover, the gamma and class_weight parameters have been set to, respectively, auto and balanced.

The best parameter setting resulting from the experimental tuning is provided in Table 1.

Table 1: Tuned parameter setting

Component            Parameter     Value
FastText             epoch         50
                     lr            0.05
                     neg           50
                     ws            5
Features Generator   aggregators   sum, min, max
SVM                  kernel        rbf
                     C             2.2

This setting has been used to generate the results submitted as "run 2" to the HaSpeeDe task of Evalita 2018 by the team "Perugia1". By mistake, we submitted a wrong file as "run 1". Nevertheless, in the following section we also provide the results of three additional executions of HSD4I PG, all performed after the official HaSpeeDe evaluation:

Execution A) the same setting of Table 1, except that C = 2;

Execution B) the same setting of Table 1, except that the lighter variant of the tokenizer (see Section 3) has been adopted;

Execution C) the same setting of Table 1, except that C = 2 and the lighter variant of the tokenizer (see Section 3) has been adopted.
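The snippet below sketches how the SVM part of this grid search can be set up with Scikit-Learn. The feature matrix is a random toy placeholder, and the use of macro-averaged F1 as the tuning score is an assumption made here for illustration; the FastText and aggregator parameters are tuned in the same grid-search fashion but are omitted from the sketch.

    # Sketch of the SVM grid search with 10-fold cross-validation (toy data;
    # the f1_macro tuning score is an assumption, not stated in the text).
    import numpy as np
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 920))   # placeholder feature matrix (Section 5)
    y_train = np.array([0, 1] * 50)         # placeholder binary labels

    param_grid = {"kernel": ["rbf", "linear"], "C": [1.8, 2.0, 2.2, 2.4]}
    search = GridSearchCV(SVC(gamma="auto", class_weight="balanced"),
                          param_grid,
                          scoring="f1_macro",
                          cv=StratifiedKFold(n_splits=10, shuffle=True,
                                             random_state=0))
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)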
7.2 Experimental Results

Table 2 provides the results obtained by HSD4I PG in the four proposed subtasks. In particular, the Macro-Average F1 score for each subtask is shown, along with the difference from the best competitor in that subtask.

Table 2: Subtask results of HSD4I PG

SubTask               HSD4I PG   Distance from best
HaSpeeDe-FB           0.7841     0.0447
HaSpeeDe-TW           0.7744     0.0249
Cross-HaSpeeDe-FB     0.6279     0.0262
Cross-HaSpeeDe-TW     0.5545     0.1440

Table 2 shows that HSD4I PG achieved results comparable to those of the best competitors, except in the Cross-HaSpeeDe-TW task. The complete results for all the tasks are available in (Bosco et al., 2018).

Besides, Tables 3 and 4 provide three additional rows corresponding to the executions A, B, C previously discussed (and performed after the official HaSpeeDe evaluation). Interestingly, the results in Table 4 show that HSD4I PG, tuned with different parameter settings, would have ranked 3rd in the HaSpeeDe-TW subtask (see (Bosco et al., 2018)).

Table 3: Additional results in the subtask HaSpeeDe-FB

          Not HS                          HS                              Macro-Avg
     Precision   Recall   F-score   Precision   Recall   F-score    F-score
A    0.7261      0.6811   0.7029    0.8522      0.8774   0.8646     0.7838
B    0.7219      0.6749   0.6976    0.8496      0.8759   0.8625     0.7801
C    0.7166      0.6811   0.6984    0.8514      0.8715   0.8715     0.7799

Table 4: Additional results in the subtask HaSpeeDe-TW

          Not HS                          HS                              Macro-Avg
     Precision   Recall   F-score   Precision   Recall   F-score    F-score
A    0.8489      0.8728   0.8607    0.7180      0.6759   0.6963     0.7785
B    0.8545      0.8950   0.8743    0.7568      0.6821   0.7175     0.7959
C    0.8575      0.8905   0.8737    0.7517      0.6914   0.7203     0.7970

8 Conclusion and Future Work

In this paper we have introduced a system for detecting hate speech in Italian-language social media texts, and we have reported and analyzed the results it obtained in the HaSpeeDe task of the Evalita 2018 campaign.

It is worth pointing out that the results of most participants are very similar and quite far from being fully accurate. This raises the question of whether hate annotation is objective or subjective. A few of the posts in the datasets appear difficult to annotate even for a human being and, indeed, we think that different people can produce different annotations. Therefore, it may be interesting to model the subjective perception of hatefulness and to exploit such information in the detection task, perhaps taking inspiration from recommender system techniques.

References

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep Learning for Hate Speech Detection in Tweets. In Proceedings of the 26th International Conference on World Wide Web Companion - WWW '17 Companion, pages 759–760, New York, New York, USA. ACM Press.

Valerio Basile and Malvina Nissim. 2013. Sentiment Analysis on Italian Tweets. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Atlanta, Georgia, 14 June 2013.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc., 1st edition.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the Evalita 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In CEUR Workshop Proceedings.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering.

Fabian Pedregosa et al. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

M.F. Porter. 1980. An algorithm for suffix stripping. Program, 14(3):130–137.

Stefania Spina. 2016. Fiumi di parole. Discorso e grammatica delle conversazioni scritte in Twitter. StreetLib, Loreto, Italy.

Zeerak Waseem and Dirk Hovy. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. In Proceedings of the NAACL Student Research Workshop.

Ziqi Zhang and Lei Luo. 2018. Hate Speech Detection: A Solved Problem? The Challenging Case of Long Tail on Twitter.