Text analysis for hate speech detection in Italian messages on Twitter and Facebook

Giulio Bianchini, Lorenzo Ferri, Tommaso Giorni
University of Perugia, Italy
giulio.bianchini@studenti.unipg.it, lorenzo.ferri@studenti.unipg.it, tommaso.giorni@studenti.unipg.it

Abstract

In this paper we present a system able to classify hate speech in Italian messages from the Facebook and Twitter platforms. The system combines several typical Natural Language Processing techniques with a classifier based on an Artificial Neural Network. It has been trained and tested on a corpus of 3000 messages from the Twitter platform and 3000 messages from the Facebook platform. The system was submitted to the HaSpeeDe task of the EVALITA 2018 competition, and the experimental results obtained in the evaluation phase of the competition are presented and discussed.

1 Introduction

In recent years, social networks have radically revolutionized the world of communication and the publication of content. However, if on the one hand social networks represent an instrument of freedom of expression and connection, on the other hand they are used for the propagation of, and incitement to, hatred. For this reason, many software systems and technologies have recently been developed to reduce this phenomenon (Zhang and Luo, 2018; Waseem and Hovy, 2016; Del Vigna et al., 2017; Davidson et al., 2017; Badjatiya et al., 2017; Gitari et al., 2015). In particular, approaches based on machine learning and deep learning are used by large companies to stem this widespread phenomenon. Despite the efforts spent on producing systems for the English language, there are very few resources for Italian (Del Vigna et al., 2017). In order to bridge this gap, a specific task (Bosco et al., 2018) for the detection of hateful content has been proposed within the context of EVALITA 2018, the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian. The EVALITA team provided the participants with the initial data sets, each consisting of 3000 classified comments taken from Facebook and Twitter pages respectively. The objective of the competition is to produce systems able to automatically annotate messages with boolean values (1 for a message containing hate speech, 0 otherwise).

In this paper we describe the system submitted by the Vulpecula team. The system works in four phases: preprocessing of the initial dataset; encoding of the preprocessed dataset; training of the machine learning model; and testing of the trained model. In the first phase, the comments were cleaned by applying text analysis techniques and some features were extracted from them; in the second phase, using a trained Word2Vec model (Mikolov et al., 2013), the comments were encoded into vectors of 256 real numbers.
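The four phases described above can be sketched as a simple pipeline. The function names and the phase implementations below are hypothetical placeholders for illustration only, not the team's actual code:

```python
# Hypothetical sketch of the four-phase pipeline (preprocess -> encode ->
# train -> classify); all implementations here are dummy stand-ins.

def preprocess(comments):
    # Phase 1: clean the raw comments (stands in for the full set of
    # text-analysis steps applied by the system).
    return [c.lower().strip() for c in comments]

def encode(comments):
    # Phase 2: map each comment to a fixed-length numeric vector
    # (the real system uses a trained Word2Vec model).
    return [[len(c), c.count(" ") + 1] for c in comments]

def train(vectors, labels):
    # Phase 3: fit a classifier (the real system trains an ANN);
    # this dummy model just remembers the majority label.
    majority = max(set(labels), key=labels.count)
    return lambda vector: majority

def classify(model, vectors):
    # Phase 4: annotate messages with 1 (hate speech) or 0 (otherwise).
    return [model(v) for v in vectors]

training = ["Primo commento", "Secondo commento", "Terzo commento"]
model = train(encode(preprocess(training)), [0, 1, 1])
print(classify(model, encode(preprocess(["Nuovo messaggio"]))))  # -> [1]
```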
In the third phase, an artificial neural network model was trained using the encoded comments as input, along with their respective extra features. In order to obtain better and more reliable results, cross-validation was used to train and evaluate the model; in addition to accuracy and the evaluation of the error, the F-measure was used to evaluate the quality of the model. Finally, in the fourth and last phase, the test set comments provided by EVALITA were classified.

The rest of the paper is organized as follows. A system overview is provided in Section 2, while details about the system components and the external tools used are provided in Section 3. Experimental results are shown and discussed in Section 7, while conclusions and ideas for future work are presented in Section 8.

2 System overview

The system has a structure similar to (Castellini et al., 2017); it is organized into four main phases: preprocessing, encoding, training and testing, as also shown in Figure 1.

Figure 1: System architecture.

• In the first phase the corpus of 3000 Facebook comments and the corpus of 3000 Twitter comments are cleaned and prepared to be encoded. In parallel with the cleaning, we extract some interesting features from each comment. This phase is explained in detail in Sections 4 and 5.

• In the second phase we trained a Word2Vec model, starting from 200k comments we downloaded from Facebook pages known to contain hate messages. Each comment of the initial data set has been encoded into a vector of real values by submitting it to the Word2Vec model. This phase is explained in detail in Section 5.

• In the third phase we trained a multi-layer feed-forward neural network using the 3000 encoded comments and the respective features extracted in the first phase. The description of the ANN is in Section 6.

• In the last phase the test set comments provided by EVALITA were classified and we joined the competition. This phase is explained in detail in Section 7.

The source code of the project is available online1.

3 Tools Used

The entire project was developed using the Python programming language, for which several libraries are available and usable for the purposes of the project. Specifically, the following libraries were used for the preprocessing phase of the dataset:

• nltk: toolkit for natural language processing;

• unicode emoji: library for the recognition and translation of emoticons;

• treetaggerwrapper: library for lemmatization and word tagging;

• textblob: another library for natural language processing;

• gensim: library that contains Word2Vec;

• sequence matcher: library for calculating the spelling distance between words.

For the training phase of the ML model the following libraries were used:

• keras (Chollet and others, 2015): high-level neural network API;

• sklearn (Pedregosa et al., 2011): simple and efficient tools for data mining and data analysis.

Finally, some corpora have been used:

• SentiWordNet (Baccianella et al., 2010);

• a dataset of bad words, provided by Prof. Spina and the research group of the University for Foreigners of Perugia;

• a dataset of Italian words;

• a dataset of 220k comments downloaded from Facebook pages (Italia agli italiani stop ai clandestini, matteo renzi official, matteo salvini official, noiconsalvini, politici corrotti).

1 https://github.com/VulpeculaTeam/Hate-Speech-Detection
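The "sequence matcher" entry above refers to Python's standard-library `difflib.SequenceMatcher`. A minimal sketch of how a spelling distance between words can be computed with it, as used later for the spelling-error replacement step (the toy vocabulary below is illustrative; the system uses a full Italian word list):

```python
from difflib import SequenceMatcher

def most_similar(word, vocabulary):
    # Rank vocabulary entries by SequenceMatcher's similarity ratio
    # (1.0 = identical strings) and return the closest one.
    return max(vocabulary, key=lambda w: SequenceMatcher(None, word, w).ratio())

# Toy Italian vocabulary; the real system loads a full word-list corpus.
vocabulary = ["cane", "pane", "carne", "gatto"]
print(most_similar("cqne", vocabulary))  # -> cane
```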
4 Preprocessing

In the Knowledge Discovery in Databases (KDD) process, one of the crucial phases is data preparation. In this project this phase was tackled by processing and preparing the comments given to us. Specific text analysis techniques have been applied in order to prepare the data in the best possible way and to extract the most important information from it. All the operations performed for the data cleaning and for the extra-feature extraction are listed below. Each operation is iterated over all the 3000 comments of the data set.

• Extraction of the first feature: length of the comment.

• Extraction of the second feature: percentage of words written in CAPS-LOCK inside the comment, calculated as the number of words written in CAPS-LOCK divided by the number of words in the comment.

• Removal of all the links from the comment.

• Replacement of the characters '&' and '@' with the letters 'e' and 'a' respectively.

• Editing of each word in the comment as follows: removal of adjacent equal vowels, and removal of adjacent equal consonants if there are more than two of them. Examples: from "caaaaane" to "cane", from "gallllina" to "gallina".

• Extraction of the third feature: number of sentences inside the comment. By sentence we mean a list of words that ends with '.', '?' or '!'.

• Extraction of the fourth feature: number of '?' or '!' inside the comment.

• Extraction of the fifth feature: number of '.' or ',' inside the comment.

• Punctuation removal.

• Translation of emoticons (for Twitter messages). Given the large presence of emoticons in Twitter messages, it was decided to translate the emoticons into their respective English meanings. To do this, each sentence is scanned and, if there are emoticons, these are translated into their corresponding meaning in English using the unicode emoji library.

• Emoticon removal.

• Replacement of abbreviations with the respective words, using a list of abbreviations created by ourselves.

• Conversion of disguised bad words. An interesting function added to the preprocessing is the recognition of censored bad words, i.e. bad words in which some of the middle letters are replaced by special characters (symbols, punctuation, ...) to keep them recognizable by a human but not by a computer. For this purpose we do not use a large vocabulary but rather a simple list of the most commonly censored bad words (because only a small group of bad words is commonly censored). The Python function receives an entire sentence and splits it by spaces into a list of words. We scan the list of words and check whether the first and last characters of each word are letters rather than numbers or symbols. We then take the word without its first and last letters and check whether this middle sub-word is formed by special symbols/punctuation or by the letter x (because "x" is often used for hiding bad words). If so, the middle sub-word is deleted from the censored bad word, keeping the initial and final parts formed by letters. Finally, we scan the list of bad words and check whether these initial and final parts match one of the scanned bad words; if they do, the censored word is replaced by the real word.

• Hashtag splitting. One of the most difficult cleaning phases is hashtag splitting. For this we used a large dictionary of Italian words in .csv format. First, we scan every word in this file and check whether it occurs in the hashtag (for convenience we skip words of length 2), saving it in a list. In this phase useless words, not contextualized to the hashtag, are also collected, so we need to filter them out. To do so, we first sort all the found words by decreasing length and scan the list; starting from the first word, we delete each word from the hashtag. In this way the useless words contained in larger words are found, saved in another list, and deleted from the initial list containing all the words (both useful and useless) of the hashtag. In the final phase, for each word in the resulting list we find its position within the hashtag and use it to recreate the real sentence, separating the words with spaces.
• Removal of articles, pronouns, prepositions, conjunctions and numbers.

• Removal of laughter expressions.

• Replacement of accented characters with their unaccented equivalents.

• Lemmatization of each comment with the treetaggerwrapper library.

• Extraction of the sixth feature: polarity of the message. This feature is computed using the SentiWordNet corpus and its APIs. Since SentiWordNet was created to find the polarity of sentences in English, each message is first translated into English using TextBlob and its polarity is then calculated.

• Extraction of the seventh feature: percentage of spelling errors in the comment. To detect a spelling error, each word is compared with all the words of the Italian vocabulary corpus; if the word is not present in the corpus, it is a spelling error. The feature is calculated as the number of spelling errors divided by the number of words in the comment.

• Replacement of spelling errors. In parallel with the previous step, every spelling error is replaced with the most similar word in the Italian vocabulary corpus. The similarity between the wrong word and all the others is calculated using a function of the SequenceMatcher library, and the wrong word is replaced with the most similar word in the corpus.

• Extraction of the eighth feature: number of bad words in the comment. Every word in the comment is compared with all the words in the bad words corpus; if the word is in the corpus, it is a bad word.

• Extraction of the ninth feature: percentage of bad words, calculated as the number of bad words divided by the number of words in the comment.

• Extraction of the tenth feature: polarity according to TextBlob. This value is computed using a TextBlob function that calculates the polarity; in this case too, the message is first translated into English.

• Extraction of the final feature: subjectivity according to TextBlob. Another value computed with a TextBlob function.

5 Word Embeddings with Word2Vec

Very briefly, word embedding turns text into numbers. This transformation is necessary because many machine learning algorithms do not work with plain text but require vectors of continuous values. Word embedding has fundamental advantages: in particular, it is a more efficient representation (dimensionality reduction) and a more expressive representation (contextual similarity). We therefore created a Word2Vec model for word embedding. For the training of the model, 200k messages were downloaded from several Facebook pages. These messages were preprocessed as explained in Section 4 and (together with the messages provided by EVALITA's team) were used to train the Word2Vec model. The trained model encodes each word into a vector of 128 real numbers. Each sentence is instead encoded with a vector of 256 real numbers divided into two components of 128 elements: the first component is the vector sum of the encodings of the words in the sentence, while the second component is their arithmetic mean. At this point each of the 3000 comments of the starting training set is a vector of 265 reals: 256 for the encoding of the sentence and 9 for the previously calculated features.
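The sentence encoding just described (vector sum plus arithmetic mean of the 128-dimensional word vectors, concatenated with the extra features) can be sketched as follows. The dictionary of random vectors below is a stand-in for the trained Word2Vec model:

```python
import numpy as np

EMB_DIM = 128  # dimensionality of the Word2Vec word vectors

def encode_sentence(words, word_vectors, features):
    # Sum and mean of the word vectors give 2 * 128 = 256 values;
    # appending the 9 hand-crafted features yields 265 values in total.
    vectors = np.array([word_vectors[w] for w in words if w in word_vectors])
    return np.concatenate([vectors.sum(axis=0), vectors.mean(axis=0), features])

# Stand-in for the trained Word2Vec model: random vectors for two words.
rng = np.random.default_rng(0)
word_vectors = {w: rng.standard_normal(EMB_DIM) for w in ["primo", "commento"]}
features = np.zeros(9)  # the 9 extra features computed during preprocessing

encoded = encode_sentence(["primo", "commento"], word_vectors, features)
print(encoded.shape)  # -> (265,)
```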
6 Model training

The vectors obtained by the process described in Section 5 were used as input for training an Artificial Neural Network (ANN) (Russell and Norvig, 2016). An Artificial Neural Network is a mathematical model composed of artificial "neurons", vaguely inspired by a simplification of biological neural networks. There are different types of ANN; the one used in this research is feed-forward, meaning that the connections between nodes do not form cycles, as opposed to recurrent neural networks. In this kind of network the information moves in one direction only, forward, from the input nodes, through the hidden nodes (if any), to the output nodes. The multilayer perceptron belongs to the class of feed-forward networks. The network we built is made up of two hidden layers, the first consisting of 128 nodes and the second of 56; the last layer is the output one and is formed by 2 nodes. The activation functions of the respective layers are sigmoid, relu and softmax, the chosen optimizer is Adagrad, and each layer has a dropout of 0.45. These parameters were chosen because, after trying numerous configurations, they gave the best results during the training phase. In particular, all the possible combinations of the following parameters were tried:

• number of nodes of the hidden layers: 56, 128, 256, 512;

• activation function of the hidden layers: sigmoid, relu, tanh, softplus;

• optimizer: Adagrad, RMSProp, Adam.

Furthermore, dropout was essential to prevent over-fitting. Dropout consists in not considering, during each training update, a set of neurons chosen randomly. The dropout rate is set to 45%, meaning that 45% of the inputs will be randomly excluded from each update cycle.

As an estimation method, cross-validation was used, partitioning the data into 10 disjoint subsets. As performance metrics, the goodness of the model was analyzed by calculating true positives, true negatives, false positives and false negatives; from these, the cost-sensitive measures precision, recall and F-score were calculated. These are the best results achieved with the training dataset of Facebook comments during cross-validation:

• Accuracy: 83.73%;

• Standard deviation: 1.09;

• True Positives: 1455;

• True Negatives: 1057;

• False Positives: 163;

• False Negatives: 325;

• Precision: 0.899;

• Recall: 0.817;

• F1-score: 0.856;

• Macro F1-score: 0.856.

7 Experimental Results

After the release of the unlabelled test set, the 2000 new messages (1000 from Facebook and 1000 from Twitter) were cleaned as explained in Section 4 and the respective features were extracted. Then, these new comments were added to the pool of comments used to create the Word2Vec model, and a new Word2Vec model was trained on the enlarged pool. Finally, the 2000 comments were encoded, as explained in Section 5, into the 265-component vectors, and these were the input of the neural network that classified them. From the training phase, two neural network models were built: one trained with the dataset of 3000 Facebook messages and the other trained with the dataset of 3000 Twitter messages. We call the first model VTfb and the second one VTtw. EVALITA's task consisted of four sub-tasks:

• HaSpeeDe-FB: test VTfb on the 1000 messages taken from Facebook;

• HaSpeeDe-TW: test VTtw on the 1000 messages taken from Twitter;

• Cross-HaSpeeDe-FB: test VTfb on the 1000 messages taken from Twitter;

• Cross-HaSpeeDe-TW: test VTtw on the 1000 messages taken from Facebook.

Sub-task             Model  F1      Distance
HaSpeeDe-FB          VTfb   0.7554  0.0734
HaSpeeDe-TW          VTtw   0.7783  0.021
Cross-HaSpeeDe-FB    VTfb   0.6189  0.0089
Cross-HaSpeeDe-TW    VTtw   0.6547  0.0438

Table 1: Team results in the HaSpeeDe sub-tasks.

In Table 1 we report the macro-average F1 score for each sub-task, together with the difference from the best result obtained in the competition (column "Distance" in the table). Compared with the results we had in the training phase (Section 6), we would have expected better results in the HaSpeeDe-FB task. However, our system appears to be more general and not specifically targeted at one platform; in fact, the differences in the other tasks are minimal.
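The precision, recall and F1 values reported in Section 6 can be re-derived from the confusion-matrix counts listed there (TP = 1455, TN = 1057, FP = 163, FN = 325); a quick sanity check:

```python
# Recompute the cross-validation metrics from the reported counts.
TP, TN, FP, FN = 1455, 1057, 163, 325

precision = TP / (TP + FP)            # 1455 / 1618
recall = TP / (TP + FN)               # 1455 / 1780
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(round(precision, 3))  # -> 0.899
print(round(recall, 3))     # -> 0.817
print(round(f1, 3))         # -> 0.856
print(round(accuracy, 4))   # -> 0.8373
```

Note that the four counts sum to 3000, the size of the training dataset, and the recomputed accuracy matches the reported 83.73%.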
8 Conclusion and Future Work

In this paper we presented a system based on neural networks for hate speech detection in Italian social media messages. Recognizing negative comments is not easy, as the concept of negativity is often subjective. Nevertheless, good results have been achieved, not far from those obtained by the best systems in the competition. The proposed system can certainly be improved; one idea is to use clustering techniques to categorize the messages (cleaned and with the related features) into two subgroups (positive and negative) and then, for each comment, calculate how similar it is to the negative and to the positive comments and add this as a feature.

Acknowledgments

The authors would like to thank Prof. Valentina Poggioni, who helped and supported us in the development of the whole project. A special thanks to Manuela Sanguinetti, our shepherd in the EVALITA competition, for all the support she has given to us.

References

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of LREC 2010, pages 2200–2204.

Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 759–760. International World Wide Web Conferences Steering Committee.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), Turin, Italy. CEUR.org.

Jacopo Castellini, Valentina Poggioni, and Giulia Sorbi. 2017. Fake Twitter followers detection by denoising autoencoder. In Proceedings of the International Conference on Web Intelligence, WI '17, pages 195–202.

François Chollet et al. 2015. Keras. https://keras.io.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. arXiv preprint arXiv:1703.04009.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate me not: Hate speech detection on Facebook. In ITASEC, volume 1816 of CEUR Workshop Proceedings, pages 86–95. CEUR-WS.org.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4):215–230.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Stuart J. Russell and Peter Norvig. 2016. Artificial Intelligence: A Modern Approach. Pearson Education Limited.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, pages 88–93.

Ziqi Zhang and Lei Luo. 2018. Hate speech detection: A solved problem? The challenging case of long tail on Twitter. arXiv preprint arXiv:1803.03662.