No Place For Hate Speech @ HaSpeeDe 2: Ensemble to Identify Hate Speech in Italian

Adriano dos S.R. da Silva
School of Arts, Sciences and Humanities – University of Sao Paulo
Sao Paulo - Brazil
adriano.santos.silva@usp.br

Norton T. Roman
School of Arts, Sciences and Humanities – University of Sao Paulo
Sao Paulo - Brazil
norton@usp.br

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. In this article, we present the results of applying a Stacking Ensemble method to the problem of hate speech classification proposed in the main task of HaSpeeDe 2 at EVALITA 2020. The model was then compared to a Logistic Regression classifier, along with two other benchmarks defined by the competition's organising committee (an SVM with a linear kernel and a majority class classifier). Results showed our Ensemble to outperform the benchmarks to various degrees, both when testing in the same domain as training and in a different domain.

Italiano. In questo articolo presentiamo i risultati dell'applicazione di un modello di Stacking Ensemble al problema della classificazione dei discorsi di incitamento all'odio nel compito A di EVALITA (HaSpeeDe 2). Il modello è stato quindi confrontato con un modello di regressione logistica, insieme ad altri due benchmark definiti dal comitato organizzatore della competizione (un SVM con un kernel lineare e un classificatore di classe maggioritaria). I risultati hanno mostrato che il nostro Ensemble supera i benchmark a vari livelli, sia durante i test nello stesso dominio di sviluppo che in un dominio diverso.

1 Introduction

Social networks are already part of people's lives, generating thousands of publications on a daily basis. Even though most of this material presents no real harm to other people, some of it bears discriminating discourse, not rarely filled with hate for minorities or people with different viewpoints. Defined as "language which attacks or demeans a group based on race, ethnic origin, religion, gender, age, disability, or sexual orientation/gender identity" (Nobata et al., 2016), hate speech represents a problem that cannot be allowed to grow, under the risk of having it lead to more concrete actions, by some people, with truly undesired results.

This is so much of an issue that some companies have already decided to stop advertising on Facebook[1], for example, as a way to pressure the company into facing this problem. Some initiatives have also emerged in order to monitor and combat this type of content, such as the code of conduct signed by some companies (YouTube, Facebook, Twitter) so that this type of publication can be monitored and removed within 24 hours[2].

Due to the large volume of data, machine learning techniques, along with natural language processing, are being used to automate this activity and identify this type of speech more accurately. Other initiatives include the setting up of competitions aimed at developing and testing different ways to tackle the problem.

One such competition is the evaluation campaign of Natural Language Processing and Speech Tools for Italian (EVALITA), which started in 2007 aiming at promoting the development and dissemination of language resources for Italian. In its 2018 edition, a task (HaSpeeDe) was proposed to identify hate speech on Facebook and Twitter (Bosco et al., 2018).

[1] https://www.nytimes.com/2020/08/01/business/media/facebook-boycott.html
[2] https://ec.europa.eu/info/policies/justice-and-fundamental-rights/combatting-discrimination/racism-and-xenophobia/eu-code-conduct-countering-illegal-hate-speech-online_en
HaSpeeDe attracted the participation of several teams, and the promising results presented there stimulated the development of a second edition of the event (HaSpeeDe 2) at EVALITA 2020 (Sanguinetti et al., 2020; Basile et al., 2020). In this work, we describe our attempt to deal with the hate speech identification problem of HaSpeeDe 2 by developing a stacking ensemble of three machine learning models for this task. The weak classifiers used in the ensemble were an SVM with RBF kernel, a Bernoulli Naïve Bayes (NB), and a Random Forest (RF) model, with a Logistic Regression (LR) model serving as meta-classifier.

For the sake of comparison, and as a way to define some benchmarks for our model, we also developed and tested a Logistic Regression classifier with L2 regularisation, along with both models suggested by the HaSpeeDe 2 organising committee, to wit, an SVM model with a linear kernel and a majority class classifier. As will be made clearer in the forthcoming sections, with a Macro F1-score of 0.749 our ensemble outperforms all benchmarks, for both in-domain and out-of-domain test sets, even though sometimes the differences were not high.

The rest of this article is organized as follows. Section 2 presents some related work aiming at identifying hate speech. Section 3, in turn, gives an overview of the HaSpeeDe 2 task. Next, in Sections 4 and 5 we explain the preprocessing we made, along with the classifiers we built for this task. Section 6, in turn, presents our results, which are further discussed in Section 7. Finally, Section 8 presents our final considerations on this work.

2 Related Work

Several strategies have been used to identify hate speech. Some classic algorithms, like Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), and ensembles of these techniques, have shown good results (e.g. (Basile et al., 2019; Saha et al., 2018; Malmasi and Zampieri, 2018)).

An SVM with RBF kernel, for example, was used to identify hate speech against immigrants and women in tweets written in English. Achieving a macro-averaged F1 score of 0.65, this model was the winner at SemEval 2019 (Basile et al., 2019).

Logistic Regression was another classic model applied to hate speech identification in English, in this case focusing on hate speech towards women, with a reported accuracy of 0.70 (Saha et al., 2018). Delivering an accuracy of 79.8%, an ensemble associated with a meta-classifier was also found to perform well in the task (Malmasi and Zampieri, 2018).

With an overall performance of F1 = 0.749, our ensemble method looks competitive when compared to these models. Even though one cannot really make a true comparison between them, we believe it to be an alternative worth considering.

3 Task

HaSpeeDe 2 Task A consists of a binary classification to identify the presence or absence of hate speech in tweets written in Italian. The competition's organising committee provides participants with a data set for training and testing competing models. This data set is slightly imbalanced, with approximately 40% of tweets presenting hate speech language, as shown in Table 1.

Table 1: Data set class distribution

Hate Speech   Not Hate Speech   Total
2766          4073              6839

This data set is supposed to be used by the competition participants to train and test their models. Competing models are then evaluated on a separate data set, which consists of in-domain and out-of-domain data, defined by the competition's organisation.

4 Preprocessing

As a preprocessing step, we removed stopwords using the NLTK (Natural Language Toolkit[3]) library. For each tweet in the corpus, we also added the following new features:

• The number of words in the tweet;
• The number of exclamation points ('!') present in the tweet; and
• The presence or not of a question mark ('?') in the tweet.

As a final measure, all features related to the tweet's text were normalised in the range between 0 and 1.

[3] https://www.nltk.org/
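As an illustration only, the sketch below shows how this kind of preprocessing could be implemented in Python with NLTK and scikit-learn. It is not the authors' code: the use of NLTK's Italian stopword list, whitespace tokenisation, and min-max scaling are assumptions, and the names `preprocess` and `extra_features` are made up for this example.

```python
import nltk
from nltk.corpus import stopwords
from sklearn.preprocessing import MinMaxScaler

nltk.download("stopwords", quiet=True)
ITALIAN_STOPWORDS = set(stopwords.words("italian"))  # assumption: NLTK's Italian list


def preprocess(tweet: str) -> str:
    """Remove stopwords from a tweet (illustrative whitespace tokenisation)."""
    return " ".join(t for t in tweet.split() if t.lower() not in ITALIAN_STOPWORDS)


def extra_features(tweet: str):
    """The three additional per-tweet features listed above."""
    return [
        len(tweet.split()),   # number of words in the tweet
        tweet.count("!"),     # number of exclamation points
        int("?" in tweet),    # presence (1) or absence (0) of a question mark
    ]


# Normalise the text-related features to the [0, 1] range (min-max scaling assumed)
tweets = ["Non ci posso credere!!!", "Sei serio?"]
extra = MinMaxScaler().fit_transform([extra_features(t) for t in tweets])
```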
5 Classifiers

Three individual classifiers were developed using the Python Sklearn[4] library. These were a Naïve Bayes (NB) with Bernoulli distribution, a Logistic Regression (LR) with L2 regularization, and a Random Forest (RF) with 150 trees. Each classifier was tested with N-gram representations (N ranging from 3 to 5), with and without term frequency-inverse document frequency (TF-IDF) (Rajaraman and Ullman, 2011) normalisation, and with and without preprocessing of the training and test sets.

We then chose the two best models to compose the ensemble to be used at the competition. As will be shown in the next section, these were Random Forest and Naïve Bayes. We also added an SVM classifier, with RBF kernel and C = 2 penalty, to the ensemble, making Logistic Regression our meta-classifier.

The training set was divided into 90% for training/validation and 10% for a test set. Models were trained on the training/validation set using 10-fold cross-validation (Han et al., 2011).

[4] https://scikit-learn.org/stable/
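To make the setup above concrete, here is a minimal sketch of such a stacking ensemble with scikit-learn. It is only an approximation of the configuration described in this section: `load_tweets` is a hypothetical placeholder for reading the HaSpeeDe 2 training data, the word n-gram range and any hyper-parameters not reported above are assumptions, and scikit-learn's `StackingClassifier` stands in for whatever stacking implementation was actually used.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts, labels = load_tweets()  # hypothetical loader: raw tweets and 0/1 hate-speech labels

# One of the tested representations: word n-grams (up to length 4) with TF-IDF weighting
vectorizer = TfidfVectorizer(ngram_range=(1, 4))

# Weak classifiers: Bernoulli NB, Random Forest (150 trees), SVM (RBF kernel, C = 2)
base_learners = [
    ("nb", BernoulliNB()),
    ("rf", RandomForestClassifier(n_estimators=150)),
    ("svm", SVC(kernel="rbf", C=2)),
]
# Logistic Regression with L2 regularisation (scikit-learn's default) as meta-classifier
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(penalty="l2"))
model = make_pipeline(vectorizer, stack)

# 90% training/validation and 10% held-out test, with 10-fold cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, stratify=labels, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=10, scoring="f1_macro")
model.fit(X_train, y_train)
```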
6 Results

Tables 2 and 3 show the performance and settings of each classifier on the training/validation and test sets, respectively. During training, the best results were observed without preprocessing for RF and LR, whereas NB showed better results with preprocessing. These results, however, were very close to each other, ranging from F1 = 0.69 to F1 = 0.71. Regarding the language model, the best results were observed with 5-grams for RF and LR, and with 4-grams for LR and NB.

Table 2: Results of the classifiers in the training stage in terms of F1

Classifier  Lang. Model   Without Preprocessing      With Preprocessing
                          No Norm.   TF-IDF          No Norm.   TF-IDF
RF          3-Gram        0.662      0.657           0.6687     0.667
RF          4-Gram        0.683      0.694           0.690      0.689
RF          5-Gram        0.701      0.701           0.687      0.686
LR          3-Gram        0.681      0.703           0.676      0.696
LR          4-Gram        0.711      0.701           0.706      0.697
LR          5-Gram        0.711      0.673           0.708      0.673
NB          3-Gram        0.679      0.679           0.681      0.681
NB          4-Gram        0.689      0.689           0.694      0.694
NB          5-Gram        0.654      0.654           0.668      0.668

Table 3: Results of the classifiers in the test stage in terms of F1

Classifier  Lang. Model   Without Preprocessing      With Preprocessing
                          No Norm.   TF-IDF          No Norm.   TF-IDF
RF          3-Gram        0.650      0.668           0.650      0.674
RF          4-Gram        0.693      0.694           0.710      0.696
RF          5-Gram        0.707      0.709           0.703      0.700
LR          3-Gram        0.675      0.701           0.675      0.709
LR          4-Gram        0.684      0.696           0.685      0.710
LR          5-Gram        0.669      0.665           0.707      0.680
NB          3-Gram        0.696      0.696           0.707      0.707
NB          4-Gram        0.718      0.718           0.740      0.740
NB          5-Gram        0.658      0.658           0.687      0.687

On the test set, the best results for all methods were observed with preprocessing of the data. Normalising the vectors does not seem, however, to have influenced results when preprocessing is used. All best values were obtained with 4-grams. Overall, the best result was achieved with Naïve Bayes (F1 = 0.74), with preprocessing, using a 4-gram language model, both with and without TF-IDF normalisation.

The ensemble model was tested with only one configuration: 4-gram, with normalisation, and without preprocessing. This configuration resulted in F1 = 0.729 on the training set (a 2.5% increase over the best model on this set) and F1 = 0.751 on the test set, corresponding to a 1.5% improvement over the best model on this set. As it turns out, especially on the test set, the differences between the ensemble and its best constituent method do not seem so high.
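The F1 values reported here and in the next section appear to be macro-averaged F1 scores (the paper calls its 0.749 a "Macro F1-score"), i.e. the unweighted mean of the per-class F1 scores. As a small, purely illustrative reminder of how such a score can be computed (the label vectors below are toy values, not HaSpeeDe 2 data):

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 0, 1, 0, 1, 0]   # 1 = hate speech, 0 = not hate speech (toy labels)
y_pred = [1, 0, 1, 1, 0, 0, 0]

# Macro F1: compute F1 for each class separately, then take their unweighted mean
print(f1_score(y_true, y_pred, average="macro"))
```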
7 Discussion

The competition rules allow only two models to be sent by each team. Although our Naïve Bayes model showed good performance on the test set we had at hand, we chose not to send it to HaSpeeDe 2 because it would also be tested on an out-of-domain data set. Since this classifier can be very sensitive to domain changes, especially regarding null-frequency words, which might bring the whole model down to multiplying smoothing values, we thought we would be better off not sending it. Still, it remained as one of the weak classifiers in the Ensemble we sent, so it was not completely put aside.

The organisation of the competition presented F1 results corresponding to two classifiers, run on the same data set distributed to all participants in the competition. These were supposed to be taken as baselines by all competing teams. The first consisted of a majority class classifier (Baseline-MC), which always chooses the majority class to label new examples. The second classifier, in turn, consisted of an SVM with linear kernel, running with TF-IDF normalisation (Baseline-SVM).

Table 4 shows the results of these two baseline classifiers, along with the classifiers we submitted to the competition (i.e. our Ensemble model and its constituent Logistic Regression classifier).

Table 4: Results of the baselines and final performance of our classifiers in Task A in terms of F1

Classifier     Out-of-domain   In-domain
Baseline-MC    0.3894          0.3366
Baseline-SVM   0.621           0.7212
Ensemble       0.632           0.749
LR             0.621           0.705

As it turns out, for the within-domain task, only our Ensemble was superior to the baselines (3.9% over the baseline SVM and almost 123% over the majority class baseline). When moving to the out-of-domain test set, this difference dropped to only 1.8% over the SVM model and 62.3% over the majority class, still outscoring both baselines.

Regarding our Logistic Regression model, when run on the within-domain test set it outscored only the majority class baseline (109% better), being however outscored by the baseline SVM by 2.3%. As for the out-of-domain test set, our Logistic Regression model presented the same result as the baseline SVM, outscoring the majority class baseline by 59.5%. Interestingly, both the Ensemble and Logistic Regression models scored similarly in this set.

8 Conclusion

In this article we reported on the results obtained by two models submitted to EVALITA's HaSpeeDe 2 task. Even though our Ensemble model outscored both benchmarks, we believe it could do better, should other choices regarding the language model be made.

Since the best results were obtained with longer word sequences (in our case, 4-grams), it might be the case that other language models, such as GloVe or CBOW, for example, which make use of context words on both sides of the target word, could come up as better alternatives to the 4-gram model we used. BERT could also be a possibility to test.

Our best results were also obtained, at least during testing, with preprocessing of the data. We thus believe this is something to be kept. Regarding the normalisation of feature vectors, we could not observe great differences between using it or not, at least when it comes to TF-IDF normalisation.

Another direction to be followed might be to test other models as weak classifiers in the Ensemble, or even ensemble strategies other than stacking. This is something we leave for future work.

References

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, USA, June.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18).

Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques. Elsevier.

Shervin Malmasi and Marcos Zampieri. 2018. Challenges in discriminating profanity from hate speech. Journal of Experimental & Theoretical Artificial Intelligence, 30(2):187–202.

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web.

Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge.

Punyajoy Saha, Binny Mathew, Pawan Goyal, and Animesh Mukherjee. 2018. Hateminers: Detecting hate speech against women. CoRR, abs/1812.06700.

Manuela Sanguinetti, Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and Irene Russo. 2020. HaSpeeDe 2 @ EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.