               HanSEL: Italian Hate Speech detection through
               Ensemble Learning and Deep Neural Networks
             Marco Polignano                              Pierpaolo Basile
        University of Bari Aldo Moro                University of Bari Aldo Moro
          Dept. Computer Science                      Dept. Computer Science
     via E. Orabona 4, 70125 Bari, Italy         via E. Orabona 4, 70125 Bari, Italy
     marco.polignano@uniba.it                   pierpaolo.basile@uniba.it


                 Abstract

English. The detection of hate speech on social media and online forums is a relevant task for the research area of natural language processing. This interest is motivated by the complexity of the task and by the social impact of its use in real scenarios. The task solution proposed in this work is based on an ensemble of three classification strategies, mediated by a majority vote algorithm: Support Vector Machine (Hearst et al., 1998) (SVM with RBF kernel), Random Forest (Breiman, 2001), and Deep Multilayer Perceptron (Kolmogorov, 1992) (MLP). Each classifier has been tuned using a greedy strategy of hyper-parameter optimization over the "F1" score calculated on a 5-fold random subdivision of the training set. Each sentence has been pre-processed to transform it into word embeddings and a TF-IDF bag of words. The results obtained by cross-validation over the training sets show an F1 value of 0.8034 for Facebook sentences and 0.7102 for Twitter. The code of the proposed system can be downloaded from GitHub: https://github.com/marcopoli/haspeede_hate_detect

Italiano. The detection of hate speech on social media and online forums is a relevant challenge for the research area concerned with natural language processing. This interest is motivated by the complexity of the process and by the social impact of its use in real scenarios. The solution proposed in this work is based on an ensemble of three classification strategies, mediated by a majority vote algorithm: Support Vector Machine (Hearst et al., 1998) (SVM with RBF kernel), Random Forest (Breiman, 2001), and Deep Multilayer Perceptron (Kolmogorov, 1992) (MLP). Each classifier was configured using a greedy strategy of hyper-parameter optimization considering the "F1" value calculated on a random 5-fold subdivision of the training set. Each sentence was pre-processed so as to be transformed into word embeddings and TF-IDF format. The results obtained through cross-validation on the training set showed an F1 value of 0.8034 for the sentences extracted from Facebook and 0.7102 for those from Twitter. The source code of the proposed system can be downloaded from GitHub: https://github.com/marcopoli/haspeede_hate_detect

1   Introduction and background

In the current digital era, characterized by widespread use of the Internet, it is common to interact with others through chats, forums, and social networks, and just as common to express opinions on public pages and in online squares. These places of discussion are frequently transformed into "fight clubs" where people use insults and strong words to support their ideas. The anonymity of the writer is used as an excuse to feel free of the consequences of attacking people merely for their gender, race, or sexual orientation. A general absence of automatic moderation of contents can encourage the diffusion of this phenomenon. In particular, the consequences for the final user can include psychological problems such as depression, relational disorders and, in the most critical situations, suicidal tendencies.
   A recent survey of state-of-the-art approaches for hate speech detection is provided by (Schmidt and Wiegand, 2017). The most common hate speech detection systems are based on text classification algorithms that use a representation of contents based on "surface features" such as those available in a bag of words (BOW) (Chen et al., 2012; Xu et al., 2012; Warner and Hirschberg, 2012; Sood et al., 2012). A solution based on BOW is efficient and accurate, especially when the n-grams are extended with semantic aspects derived from the analysis of the text. (Chen et al., 2012) describe an increase in classification performance when features such as the number of URLs, punctuation marks and non-English words are added to the vectorial representation of the sentence. (Van Hee et al., 2015) proposed, instead, to add as a feature the number of positive, negative and neutral words found in the sentence. This idea demonstrated that the polarity of sentences positively supports the classification task. These approaches suffer from the lack of generalization of the words contained in the bag of words, especially when it is created from a limited training set. In particular, terms found in the test sentences are often missing from the bag. More recent works have proposed word embeddings (Le and Mikolov, 2014) as a distributional representation able to overcome this problem. This representation has the advantage of transforming semantically similar words into similar numerical vectors. Word embeddings are consequently used by classification strategies such as Support Vector Machines and, more recently, by deep learning approaches such as deep recurrent neural networks (Mehdad and Tetreault, 2016). The solution proposed in this work reuses the findings of (Chen et al., 2012; Mehdad and Tetreault, 2016) to create an ensemble of classifiers, including a deep neural network, that works with a combined representation of word embeddings and a bag of words.

2   Task and datasets description

The hate speech detection strategy proposed in HanSEL has been developed for the HaSpeeDe (Hate Speech Detection) task organized within EVALITA 2018 (Caselli et al., 2018), held in Turin, Italy, on December 12th-13th, 2018 (Bosco et al., 2018). HaSpeeDe consists in the annotation of messages from social networks (Twitter and Facebook) with a boolean label (0;1) that indicates the presence or absence of hate speech. The task is organized into three sub-tasks, based on the dataset used for training and testing the participants' systems:

   • Task 1: HaSpeeDe-FB, where only the Facebook dataset can be used to classify the Facebook test set

   • Task 2: HaSpeeDe-TW, where only the Twitter dataset can be used to classify the Twitter test set

   • Task 3: Cross-HaSpeeDe, which can be further subdivided into two sub-tasks:

        1. Task 3.1: Cross-HaSpeeDe FB, where only the Facebook dataset can be used to classify the Twitter test set

        2. Task 3.2: Cross-HaSpeeDe TW, where only the Twitter dataset can be used to classify the Facebook test set

The Facebook and Twitter datasets released for the task each consist of a total amount of 4,000 comments/tweets, randomly split into a development and a test set of 3,000 and 1,000 messages respectively. Data are encoded in UTF-8 with three tab-separated columns, each one representing the sentence id, the text and the class (Table 1).

     id     text                              hs
     8      Io votero NO NO E NO              0
     36     Matteo serve un colpo di stato.   1

        Table 1: Examples of annotated sentences.
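As a concrete reference for the format just described, the following minimal sketch loads one of the released files. It is an illustration only: the file name is a placeholder, and the snippet assumes the plain three-column tab-separated layout of Table 1.

    import csv

    # Illustrative loader for the three-column TSV format of Table 1.
    # The file name below is a placeholder, not the official release name.
    def load_dataset(path):
        ids, texts, labels = [], [], []
        with open(path, encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                sentence_id, text, hs = row
                ids.append(sentence_id)
                texts.append(text)
                labels.append(int(hs))  # 0 = not hate speech, 1 = hate speech
        return ids, texts, labels

    ids, texts, labels = load_dataset("haspeede_FB-train.tsv")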
3   Description of the system

The system proposed in this work is HanSEL: a system of Hate Speech detection through Ensemble Learning and Deep Neural Networks. We decided to approach the problem using a classic natural language processing pipeline with a final step of sentence classification into two exclusive classes: hate speech and not hate speech. The data provided by the task organizers were obtained by crawling social networks, in particular Facebook and Twitter. The analysis of the two data sources revealed many possible difficulties to face when using approaches based on Italian lexicons of hate speech.
In particular, we identified the following issues:

   • Repeated characters: many words include characters repeated many times to emphasize the semantic meaning of the word. As an example, the words "nooooo", "Grandeeeee" and "ccanaleeeeeeeeeeeeeeee" are found in the training set of Facebook messages.

   • Emoji: sentences are often characterized by emoji, such as hearts and smiley faces, that are often missing from external lexicons.

   • Presence of links, hashtags and mentions: these particular elements are typical of the social network language and can introduce noise into the data processing task.

   • Length of the sentences: many sentences are composed of only one word or are in general very short. Consequently, they are not expressive of any semantic meaning.

The complexity of the writing style used in hate speech sentences led us to avoid an approach based on standard lexicons and to prefer supervised learning strategies on the dataset provided by the task organizers.

Sentence processing.
We decided to represent each sentence as the concatenation of a 500-feature word embedding vector and a bag of words of size 7,349 for Facebook messages and 24,866 for Twitter messages. In particular, the word-embedding procedure used is word2vec, introduced by Mikolov (Mikolov et al., 2013). This model learns a vector representation for each word using a neural network language model and can be trained efficiently on billions of words. Word2vec provides a very efficient data representation for text classification thanks to its capability to create very similar vectors for strongly semantically related words. The Italian word embeddings used in this work are provided by Tripodi (Tripodi and Li Pira, 2017). The author trained the model on a dump of the Italian Wikipedia (dated 2017.05.01), from which only the body text of each article is used. The corpus consists of 994,949 sentences that result in 470,400,914 tokens. The word embeddings were created with the CBOW strategy, with the size of the vectors equal to 500, the window size of the word contexts set to 5, the minimum number of word occurrences equal to 5 and the number of negative samples set to 10.
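For reference, this configuration corresponds to a word2vec training setup along the following lines. This is a sketch only: the actual vectors are the pre-trained ones released by Tripodi and Li Pira, and the tiny toy corpus below merely stands in for the tokenized Wikipedia dump so that the snippet runs as-is.

    from gensim.models import Word2Vec

    # Stand-in for the tokenized Italian Wikipedia dump (994,949 sentences);
    # repeated so every word reaches the min_count threshold in this sketch.
    wikipedia_sentences = [["il", "gatto", "dorme", "sul", "divano"]] * 10

    model = Word2Vec(
        sentences=wikipedia_sentences,
        sg=0,              # CBOW strategy
        vector_size=500,   # size of the vectors ("size" in older gensim versions)
        window=5,          # window size of the word contexts
        min_count=5,       # minimum number of word occurrences
        negative=10,       # number of negative samples
    )
    word_vectors = model.wv  # token -> 500-dimensional vector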
   We follow the same pre-processing steps applied by Tripodi (Tripodi and Li Pira, 2017) to transform the sentences of the task datasets into word embeddings. In particular, we applied the following natural language processing pipeline:

   • Reduction of repeated characters: we scan each sentence of the datasets (both training and test). For each sentence, we obtain words by merely splitting it by spaces. Each word is analyzed, and characters repeated three times or more are reduced to only two symbols, trying to keep intact words that naturally include doubles (a sketch of this step follows the list).

   • Data cleansing: we transformed the words into lowercase and subsequently removed links, hashtags, entities, and emoji from each sentence.
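The character-reduction step can be read as a single regular-expression rule; the following is a minimal sketch of that interpretation:

    import re

    # Characters repeated three or more times are reduced to two, so that
    # "nooooo" becomes "noo" while natural doubles ("anno") stay intact.
    REPEATS = re.compile(r"(.)\1{2,}")

    def reduce_repeated_chars(sentence):
        words = sentence.split(" ")  # words obtained by splitting on spaces
        return " ".join(REPEATS.sub(r"\1\1", word) for word in words)

    print(reduce_repeated_chars("Grandeeeee nooooo"))  # -> "Grandee noo"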
The normalized sentences are then tokenized using the TweetTokenizer of the NLTK library 1. For each sentence we averaged the word2vec vectors corresponding to its tokens, subtracting during each sum the centroid of the whole distributional space. This technique is used to mitigate the loss of information caused by averaging the semantic vectors.
   The two bags of words (Facebook and Twitter) are, instead, created directly on the sentences without any pre-processing, although during the tuning of the architecture we tried some configurations that include bags of words without stop words, with lowercase letters, and processed by the Snowball stemmer algorithm 2, without obtaining breaking results. The n-gram sizes considered for the construction of the bag range from 1 to 3. The final representation of each sentence of the dataset is consequently obtained by concatenating the word2vec vector and the corresponding bag of words. Sentences too short to be transformed into word2vec, as a consequence of the absence of all their tokens from the embedding vocabulary, have been classified using only the bag-of-words representation.

   1 https://www.nltk.org/data.html
   2 http://snowball.tartarus.org/texts/quickintro.html
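The construction of this combined representation can be sketched as follows. It is a minimal outline of the description above, not the released implementation: "w2v" and "centroid" stand for the pre-trained embedding table and the mean vector of the whole space, and the toy values at the end are placeholders.

    import numpy as np
    from nltk.tokenize import TweetTokenizer
    from sklearn.feature_extraction.text import TfidfVectorizer

    tokenizer = TweetTokenizer()

    def sentence_vector(sentence, w2v, centroid, dim=500):
        # Average the word2vec vectors of the tokens, subtracting the
        # centroid of the whole distributional space during each sum.
        tokens = [t for t in tokenizer.tokenize(sentence) if t in w2v]
        if not tokens:       # no known token: the bag-of-words part remains
            return np.zeros(dim)
        return np.mean([w2v[t] - centroid for t in tokens], axis=0)

    def build_features(sentences, w2v, centroid, dim=500):
        # TF-IDF bag of words built directly on the raw sentences
        # (n-grams 1 to 3), concatenated with the averaged word2vec vector.
        bow = TfidfVectorizer(ngram_range=(1, 3))
        x_bow = bow.fit_transform(sentences).toarray()
        x_w2v = np.array([sentence_vector(s, w2v, centroid, dim)
                          for s in sentences])
        return np.hstack([x_w2v, x_bow]), bow

    # Toy stand-in for the pre-trained Italian vectors and their centroid.
    dim = 500
    w2v = {"ciao": np.random.rand(dim), "no": np.random.rand(dim)}
    centroid = np.mean(list(w2v.values()), axis=0)
    X, vectorizer = build_features(["ciao a tutti", "no no e no"],
                                   w2v, centroid, dim)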
Classification strategy.
HanSEL is based on a classification process that uses three different classification strategies mediated by a hard majority vote algorithm. A stacking of classifiers with a Random Forest blender has also been considered during the design of the HanSEL architecture but, in internal evaluation runs with 5-fold cross-validation on the training set, this approach showed low performance. The analysis is not detailed further in this work due to page limitations. In order to design the ensemble, we analyzed the performances of some of the most popular classification algorithms for the text categorization task. In particular, we considered:

   • Logistic regression with stochastic gradient descent training (SGD). It has the advantage of being very efficient with large datasets, considering during the training one instance at a time, independently of the others. It uses gradient descent as the optimization function for learning the optimal weights of the separation function of the distributional space of items. In the literature, it has been successfully used for text classification tasks, especially for binary classification problems.

   • C-Support Vector Classification (SVC). It is the standard Support Vector Machine algorithm applied to the classification task. It is a powerful approach that supports linear and non-linear classification functions. Moreover, through the C parameter it is possible to decide how large the classification margin should be and, consequently, how sensitive the classifier is to outliers. The implementation is based on libsvm, and we evaluated different configurations of the algorithm: polynomial kernels of degree 2 and 3, an RBF kernel, and different values of the C parameter.

   • K-nearest neighbors vote (KNN). This classic and versatile algorithm is based on the concept of similarity among items according to a distance metric. In particular, for an unseen item, the k most similar items of the training set are retrieved, and the class provided as output is obtained by the majority vote of the neighbors. Despite its simplicity, the algorithm is often used in text classification tasks.

   • Decision tree classifier (DT). This approach is another popular classification strategy, used especially when it is required to visualize the model. The DT algorithm splits items into different paths of the tree according to their feature values. In order to classify an unseen item, the tree is navigated until a leaf is reached, and then the ratio of training items of class k in that leaf is used as a class probability.

   • Random forest classifier (RF). It is an ensemble of Decision Trees trained on different batches of the dataset that uses averaging to improve the predictive accuracy and control over-fitting. A typical parameter is the number of trees, which balances the precision of the algorithm against the randomness needed to obtain a good level of generalization of the model.

   • Multi-layer Perceptron classifier (MLP). This model is a classical deep neural network architecture. It is composed of one layer of inputs, one layer of linear threshold units (LTU) as output, and a number of hidden layers, each with an arbitrary number of LTUs plus one bias neuron, fully connected with each other. The weights learned by each neuron (perceptron) are updated through back-propagation using a gradient descent strategy. Important parameters to configure are the number of hidden layers, the number of training epochs and the L2 penalty (regularization term) parameter.

We evaluated the performance of the algorithms just described using a default configuration and 5-fold cross-validation over the Facebook training set. Moreover, we set the random seed equal to 42 in order to obtain the same fold subdivision at each run.
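This preliminary comparison can be reproduced along the following lines. It is a sketch under stated assumptions: X and y are the feature matrix and label vector built as in the earlier sketches, and the default parameters of each estimator vary across scikit-learn versions.

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier

    # Fixed seed so every algorithm is evaluated on the same 5 folds.
    folds = KFold(n_splits=5, shuffle=True, random_state=42)
    candidates = {
        # logistic regression via SGD (the loss is named "log" in
        # scikit-learn 0.20, "log_loss" in recent versions)
        "LR (SGD)": SGDClassifier(loss="log_loss", random_state=42),
        "SVC-rbf C=1": SVC(C=1, kernel="rbf"),
        "KNN-3": KNeighborsClassifier(n_neighbors=3),
        "DT": DecisionTreeClassifier(random_state=42),
        "RF-300": RandomForestClassifier(n_estimators=300, random_state=42),
        "MLP-2000": MLPClassifier(hidden_layer_sizes=(2000,), random_state=42),
    }
    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=folds, scoring="f1_macro")
        print(name, scores.mean())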
   Tab. 3 shows the results obtained by the different classification algorithms in this preliminary analysis, considering the macro F1 score as in the task specifications. The values obtained do not point out significant statistical differences among the approaches, but we decided to investigate further the three top-scoring algorithms: SVM with an RBF kernel, Random Forest with 300 trees, and an MLP with a hidden layer of 2,000 units. In general, we observed that linear algorithms obtain high scores for the task, supporting our idea that linguistic features are enough to define a clear separation between hate and non-hate sentences. In order to identify an optimal configuration of the algorithms, we trained our models using a greedy search approach. For each algorithm, we performed 100 training runs with parameters randomly selected from ranges of values defined in advance. Each run has been evaluated, considering the macro F1 score, on the training set using the same cross-validation strategy already described. At the end of the 100 runs, the model that achieved the best result was stored and later used in our final ensemble of classifiers.

        Algorithm           Macro F1 score
        LR                  0.780444109
        SVC-rbf - C=1       0.789384136
        SVC-poly 2 C=1      0.758599844
        SVC-poly 3 C=1      0.667374386
        KNN - 3             0.705064332
        KNN - 5             0.703990867
        KNN - 10            0.687719117
        KNN - 20            0.663451598
        DT                  0.68099986
        RF-50               0.75219596
        RF-100              0.764247578
        RF-300              0.787778421
        RF-500              0.768494151
        MLP-1000            0.766835616
        MLP-2000            0.791230474
        MLP-3000            0.76952709

Table 3: Classification algorithms on the Facebook training set using 5-fold cross-validation.
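One possible reading of this 100-run procedure, shown here for the SVC only, is a randomized search over sampled parameter values. The distributions below are illustrative assumptions, not the ranges actually used by the authors.

    from scipy.stats import loguniform
    from sklearn.model_selection import KFold, RandomizedSearchCV
    from sklearn.svm import SVC

    folds = KFold(n_splits=5, shuffle=True, random_state=42)
    search = RandomizedSearchCV(
        SVC(kernel="rbf"),
        param_distributions={
            "C": loguniform(1e-2, 1e2),      # illustrative range
            "gamma": loguniform(1e-4, 1e0),  # illustrative range
        },
        n_iter=100,           # 100 training runs with random parameters
        scoring="f1_macro",   # macro F1, as in the task specifications
        cv=folds,
        random_state=42,
    )
    search.fit(X, y)
    best_svc = search.best_estimator_  # stored for the final ensemble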
The final configurations obtained for the three strategies are the following:

   • SVC(C=1.76800710431488, gamma=0.1949764030136127, kernel='rbf')

   • RandomForestClassifier(bootstrap=False, max_depth=30, max_features='sqrt', min_samples_leaf=2, min_samples_split=2, n_estimators=200, warm_start=False)

   • MLPClassifier(alpha=0.5521952082781035, early_stopping=False, hidden_layer_sizes=2220, learning_rate_init=0.001, max_iter=184, solver='adam', warm_start=False)

The models are consequently used in a voting classifier configured to use a hard majority vote algorithm. The ensemble obtains an F1 value of 0.8034 for Facebook sentences and 0.7102 for Twitter using the 5-fold subdivision of the training sets. The implementation of the system has been realized in Python using the scikit-learn 0.20 machine learning library 3.

   3 http://scikit-learn.org/stable/
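Putting the pieces together, the ensemble corresponds to a hard-voting combination of the three configurations listed above. This is a sketch: X_train, y_train and X_test stand for feature matrices and labels built as in the earlier snippets.

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # The three tuned models with the configurations reported above.
    svc = SVC(C=1.76800710431488, gamma=0.1949764030136127, kernel="rbf")
    rf = RandomForestClassifier(bootstrap=False, max_depth=30,
                                max_features="sqrt", min_samples_leaf=2,
                                min_samples_split=2, n_estimators=200,
                                warm_start=False)
    mlp = MLPClassifier(alpha=0.5521952082781035, early_stopping=False,
                        hidden_layer_sizes=2220, learning_rate_init=0.001,
                        max_iter=184, solver="adam", warm_start=False)

    # Hard majority vote: each of the three models casts one vote per sentence.
    ensemble = VotingClassifier(
        estimators=[("svc", svc), ("rf", rf), ("mlp", mlp)],
        voting="hard",
    )
    ensemble.fit(X_train, y_train)
    predictions = ensemble.predict(X_test)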
4   Results and discussion

HanSEL has been used to classify the data provided as the test set for each of the three specialized tasks of the HaSpeeDe competition. Tab. 2 shows the final scores obtained by our system in the challenge.

                        Not HS                               HS
            Precision   Recall   F1-score   Precision   Recall   F1-score   Macro F1   Pos.
  Task 1     0.6981     0.6873    0.6926     0.8519     0.8581    0.855      0.7738      7
  Task 2     0.7541     0.8801    0.8122     0.6161     0.4012    0.4859     0.6491     14
 Task 3.1    0.7835     0.2677    0.3991     0.3563     0.8456    0.5013     0.4502     11
 Task 3.2    0.3674     0.8235    0.5081     0.7934     0.3234    0.4596     0.4838      8

          Table 2: Final scores obtained during the HaSpeeDe challenge.

It is possible to observe that the system performed well on Task 1 and Task 3.2, which involve the classification of Facebook messages. In particular, it emerges that HanSEL performs better on hate speech sentences than on non-hate-speech ones, probably as a consequence of the many unambiguous hate words, such as "sfigati" and "bugiardo", found in that category of sentences. A symmetrical situation is observed for Task 2 and Task 3.1, which involve Twitter messages. In this scenario, the significant use of specific hashtags, irony, and entities instead of clear hate words made the identification of hate speech difficult. The cross-classification tasks have, moreover, stressed the generalization capability of the system. It has been observed that the writing style of the two social networks strongly influences the classification performance, especially when the models are trained on a small training set, as in our case. Finally, the optimization of the models inside the ensemble was driven more by the Facebook dataset, consequently overfitting on the characteristics of that type of message. The outcomes achieved in the challenge allow us to draw important considerations for further development of the system. In particular, we consider it essential to mix the two datasets in order to allow the models to generalize better across the two different sources of data. Moreover, extra features regarding hashtags, entities, and links could be helpful for obtaining better results on Twitter messages.

5   Conclusion

The HaSpeeDe competition has been a perfect scenario for developing and testing solutions to the social problem of hate speech on social media and, in particular, hate speech in the Italian language. In our work, we presented HanSEL, a system based on an ensemble of classifiers that includes the Support Vector Machine algorithm, Random Forests, and a Multilayer Perceptron deep neural network. We formalize messages as a concatenation of word2vec sentence vectors and a TF-IDF bag of words. Results showed the efficacy of the solution in a scenario that uses clear offensive words, such as Facebook messages. On the contrary, there is a large margin for improvement in the classification of tweets. Future work will investigate the use of more data and semantic features to allow the classification methods to create a more general model.

References

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection (HaSpeeDe) Task. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5-32.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), Turin, Italy. CEUR.org.

Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adolescent online safety. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 71-80. IEEE.

Marti A. Hearst, Susan T. Dumais, Edgar Osuna, John Platt, and Bernhard Scholkopf. 1998. Support vector machines. IEEE Intelligent Systems and their Applications, 13(4):18-28.

Věra Kolmogorov. 1992. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):501-506.

Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188-1196.

Yashar Mehdad and Joel Tetreault. 2016. Do characters abuse more than words? In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 299-303.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1-10.

Sara Sood, Judd Antin, and Elizabeth Churchill. 2012. Profanity use in online communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1481-1490. ACM.

Rocco Tripodi and Stefano Li Pira. 2017. Analysis of Italian word embeddings. arXiv preprint arXiv:1707.08783.

Cynthia Van Hee, Els Lefever, Ben Verhoeven, Julie Mennes, Bart Desmet, Guy De Pauw, Walter Daelemans, and Véronique Hoste. 2015. Detection and fine-grained classification of cyberbullying events. In International Conference Recent Advances in Natural Language Processing (RANLP), pages 672-680.

William Warner and Julia Hirschberg. 2012. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, pages 19-26. Association for Computational Linguistics.

Jun-Ming Xu, Kwang-Sung Jun, Xiaojin Zhu, and Amy Bellmore. 2012. Learning from bullying traces in social media. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 656-666. Association for Computational Linguistics.