<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>No Place For Hate Speech @ HaSpeeDe 2: Ensemble to Identify Hate Speech in Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adriano dos S.R. da Silva</string-name>
          <email>adriano.santos.silva@usp.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norton T. Roman</string-name>
          <email>norton@usp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Arts, Sciences and Humanities, University of Sao Paulo</institution>
          ,
          <addr-line>Sao Paulo</addr-line>
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Arts, Sciences and Humanities, University of Sao Paulo</institution>
          ,
          <addr-line>Sao Paulo</addr-line>
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>In this article, we present the results of applying a Stacking Ensemble method to the problem of hate speech classification proposed in the main task of HaSpeeDe 2 at EVALITA 2020. The model was compared to a Logistic Regression classifier, along with two other benchmarks defined by the competition's organising committee (an SVM with a linear kernel and a majority class classifier). Results showed our Ensemble to outperform the benchmarks to various degrees, both when testing in the same domain as training and in a different domain.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>Social networks are already part of people’s lives, generating thousands of publications on a daily basis. Even though most of this material presents no real harm to other people, some of it bears discriminating discourse, not rarely filled with hate for minorities or people with different viewpoints.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Defined as “language which attacks or demeans
a group based on race, ethnic origin, religion,
gender, age, disability, or sexual orientation/gender
identity”
        <xref ref-type="bibr" rid="ref5">(Nobata et al., 2016)</xref>
        , hate speech
represents a problem that cannot be allowed to grow,
under the risk of having it lead to more concrete
actions, by some people, with truly undesired
results.
      </p>
      <p>This is so much of an issue that some companies have already decided to stop advertising on Facebook1, for example, as a way to pressure the company into facing the problem. Initiatives have also emerged to monitor and combat this type of content, such as the code of conduct signed by some companies (YouTube, Facebook, Twitter) so that such publications can be monitored and removed within 24 hours2.</p>
      <p>Due to the large volume of data, machine
learning techniques, along with natural language
processing, are being used to automate this activity
and identify this type of speech more accurately.
Other initiatives include the setting up of
competitions, aimed at developing and testing different
ways to tackle the problem.</p>
      <p>
        One such competition is the evaluation campaign of Natural Language Processing and Speech Tools for Italian (EVALITA), which started in 2007 with the aim of promoting the development and dissemination of language resources for Italian. In its 2018 edition, a task (HaSpeeDe) was proposed to identify hate speech on Facebook and Twitter (Bosco et al., 2018). HaSpeeDe had the participation of several teams, and promising results were presented that stimulated the development of the second edition of the event (HaSpeeDe 2) at EVALITA 2020
        <xref ref-type="bibr" rid="ref2 ref8">(Sanguinetti et al., 2020; Basile
et al., 2020)</xref>
        . In this work, we describe our attempt to deal with the hate speech identification problem of HaSpeeDe 2 by developing a stacking ensemble of three machine learning models. The weak classifiers used in the ensemble were an SVM with RBF kernel, a Bernoulli Naïve Bayes (NB), and a Random Forest (RF) model, with a Logistic Regression (LR) model serving as meta-classifier.
1https://www.nytimes.com/2020/08/01/business/media/facebook-boycott.html
2https://ec.europa.eu/info/policies/justice-and-fundamental-rights/combatting-discrimination/racism-and-xenophobia/eu-code-conduct-countering-illegal-hate-speech-online_en
      </p>
      <p>For the sake of comparison, and as a way to define some benchmarks for our model, we also developed and tested a Logistic Regression classifier with L2 regularisation, along with both models suggested by the HaSpeeDe 2 organising committee, to wit, an SVM model with a linear kernel and a majority class classifier. As will be made clearer in the forthcoming sections, with a macro F1-score of 0.749 our ensemble outperforms all benchmarks, for both in-domain and out-of-domain test sets, even though differences were sometimes not large.</p>
      <p>The rest of this article is organised as follows. Section 2 presents some related work aimed at identifying hate speech. Section 3, in turn, gives an overview of the HaSpeeDe 2 task. Next, in Sections 4 and 5, we explain the preprocessing we made, along with the classifiers we built for this task. Section 6 presents our results, which are further discussed in Section 7. Finally, Section 8 presents our final considerations on this work.</p>
    </sec>
    <sec id="sec-3">
      <title>2 Related Work</title>
      <p>
        Several strategies have been used to identify hate speech. Some classic algorithms, like Support Vector Machines (SVM), Naïve Bayes (NB) and Logistic Regression (LR), as well as ensembles of these techniques, have shown good results (e.g.
        <xref ref-type="bibr" rid="ref1 ref4 ref7 ref7">(Basile et al., 2019; Saha et al., 2018;
Malmasi and Zampieri, 2018)</xref>
        ).
      </p>
      <p>
        An SVM with RBF kernel, for example, was used to identify hate speech against immigrants and women in tweets written in English. Achieving a macro-averaged F1 score of 0.65, this model was the winner at SemEval 2019
        <xref ref-type="bibr" rid="ref1">(Basile et al.,
2019)</xref>
        .
      </p>
      <p>
        Logistic Regression was another classic model applied to hate speech identification in English, in this case focusing on hate speech towards women, with a reported accuracy of 0.70
        <xref ref-type="bibr" rid="ref7">(Saha et
al., 2018)</xref>
        . Delivering an accuracy of 79.8, an ensemble associated with a meta-classifier was also found to perform well in the task
        <xref ref-type="bibr" rid="ref4 ref7">(Malmasi
and Zampieri, 2018)</xref>
        .
      </p>
      <p>With an overall performance of F1 = 0.749, our ensemble method looks competitive when compared to these models. Even though one cannot really make a true comparison between them, we believe it to be an alternative worth considering.</p>
    </sec>
    <sec id="sec-4">
      <title>3 Task</title>
      <p>HaSpeeDe 2 Task A consists of a binary classification task to identify the presence or absence of hate speech in tweets written in Italian. The competition’s organising committee provides participants with a data set for training and testing competing models. This data set is slightly imbalanced, with approximately 40% of tweets presenting hate speech language, as shown in Table 1.</p>
      <p>This data set is supposed to be used by the competition participants to train and test their models. Competing models are then evaluated on a separate data set, consisting of in-domain and out-of-domain data, defined by the competition’s organisation.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Preprocessing</title>
      <p>As a preprocessing step, we removed stopwords using the NLTK (Natural Language Toolkit3) library. For each tweet in the corpus, we also added the following new features:
• The number of words in the tweet;
• The number of exclamation points (‘!’) present in the tweet; and
• The presence or absence of a question mark (‘?’) in the tweet.</p>
      <p>As a final measure, all features related to the
tweet’s text were normalised in the range between
0 and 1.</p>
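      <p>The steps above can be sketched as follows. This is an illustrative reconstruction, not the authors’ code: the function and field names are our own, and a tiny inline stopword list stands in for NLTK’s Italian list to keep the sketch self-contained.</p>
      <preformat>
```python
# Illustrative sketch of the preprocessing described above; the inline
# stopword list is a stand-in for NLTK's Italian stopwords.
ITALIAN_STOPWORDS = {"di", "a", "da", "in", "che", "e", "il", "la", "ma", "un"}

def extract_features(tweet: str) -> dict:
    tokens = tweet.split()
    kept = [t for t in tokens if t.lower() not in ITALIAN_STOPWORDS]
    return {
        "text": " ".join(kept),            # tweet text with stopwords removed
        "n_words": len(tokens),            # number of words in the tweet
        "n_exclam": tweet.count("!"),      # number of exclamation points
        "has_question": int("?" in tweet), # presence of a question mark
    }

# The text-derived features would then be scaled to [0, 1], e.g. with
# sklearn.preprocessing.MinMaxScaler.
```
      </preformat>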
      <p>3https://www.nltk.org/</p>
      <sec id="sec-5-2">
        <title>5 Classifiers</title>
        <p>
          Next, three individual classifiers were developed using the Python Sklearn4 library. These were a Naïve Bayes (NB) with Bernoulli distribution, a Logistic Regression (LR) with L2 regularisation, and a Random Forest (RF) with 150 trees. Each classifier was tested with N-gram representations (N ranging from 3 to 5), with and without term frequency-inverse document frequency (TF-IDF)
          <xref ref-type="bibr" rid="ref3 ref6">(Rajaraman and
Ullman, 2011)</xref>
          normalisation, and with and without preprocessing of the training and test sets.
        </p>
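        <p>The configuration grid just described can be sketched with scikit-learn (the paper’s Sklearn). Hyperparameter values follow the text; the pipeline layout, the toy data, and the choice to include lower-order N-grams (the paper does not say) are our assumptions.</p>
        <preformat>
```python
# Sketch of the configuration grid described above, under the stated
# assumptions; hyperparameters follow the text, the rest is illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

classifiers = {
    "NB": BernoulliNB(),                             # Bernoulli Naive Bayes
    "LR": LogisticRegression(penalty="l2"),          # L2-regularised LR
    "RF": RandomForestClassifier(n_estimators=150),  # 150 trees
}

pipelines = {}
for n in (3, 4, 5):                  # N-gram sizes tested
    for use_tfidf in (False, True):  # with/without TF-IDF weighting
        Vec = TfidfVectorizer if use_tfidf else CountVectorizer
        vec = Vec(ngram_range=(1, n))  # assumption: lower orders included
        for name, clf in classifiers.items():
            pipelines[(name, n, use_tfidf)] = make_pipeline(vec, clf)

print(len(pipelines))  # 3 classifiers x 3 N values x 2 weightings = 18
```
        </preformat>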
        <p>We then chose the two best models to compose the ensemble used at the competition. As will be shown in the next section, these were Random Forest and Naïve Bayes. We subsequently added an SVM classifier, with RBF kernel and penalty C = 2, to the ensemble, making Logistic Regression our meta-classifier.</p>
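        <p>One way to realise such a stacking ensemble is scikit-learn’s StackingClassifier, sketched below. Base-model hyperparameters follow the text; the vectorisation step is omitted, the toy data is ours, and this is not necessarily the authors’ exact implementation.</p>
        <preformat>
```python
# A minimal sketch of the stacking ensemble described above, using
# scikit-learn's StackingClassifier; setup is illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=150, random_state=0)),
        ("nb", BernoulliNB()),
        ("svm", SVC(kernel="rbf", C=2.0)),
    ],
    final_estimator=LogisticRegression(),  # meta-classifier
)

# Toy data, just to show the interface
X = np.random.RandomState(0).rand(40, 5)
y = (X[:, 0] > 0.5).astype(int)
ensemble.fit(X, y)
```
        </preformat>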
        <p>
          The training set was divided into 90% for training/validation and 10% for testing. Models were trained on the training/validation set using 10-fold cross-validation
          <xref ref-type="bibr" rid="ref3">(Han et al., 2011)</xref>
          .
4https://scikit-learn.org/stable/
        </p>
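        <p>This evaluation protocol can be sketched as follows. The 90/10 split and 10-fold cross-validation follow the text; the toy data, the model used, and the random seed are our own illustration.</p>
        <preformat>
```python
# Sketch of the evaluation protocol above, on toy data (scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(42)
X = rng.rand(200, 8)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# 90% for training/validation, 10% held out for testing
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# 10-fold cross-validation on the training/validation portion
scores = cross_val_score(LogisticRegression(), X_trval, y_trval,
                         cv=10, scoring="f1_macro")
print(round(scores.mean(), 2))
```
        </preformat>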
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Results</title>
      <p>Tables 2 and 3 show the performance and settings of each classifier on the training/validation and test sets, respectively. During training, the best results were observed without preprocessing for RF and LR, whereas NB showed better results with preprocessing. These results, however, were very close to each other, ranging from F1 = 0.69 to F1 = 0.71. Regarding the language model, the best results were observed with 5-grams for RF and LR, and 4-grams for LR and NB.</p>
      <p>On the test set, the best results for all methods were observed when preprocessing the data. Normalising the vectors does not seem, however, to have influenced results when preprocessing is used. All best values were obtained with 4-grams. Overall, the best result was achieved with Naïve Bayes (F1 = 0.74), with preprocessing, using a 4-gram language model, both with and without TF-IDF normalisation.</p>
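      <p>The macro-averaged F1 used throughout is the unweighted mean of the per-class F1 scores. A minimal illustration, on made-up predictions rather than the paper’s data:</p>
      <preformat>
```python
# Macro-averaged F1 (the score reported above) on made-up predictions.
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of per-class F1
print(macro_f1)  # 0.75 here: both classes have precision = recall = 0.75
```
      </preformat>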
      <p>The ensemble model was tested with only one configuration: 4-grams, with normalisation, and without preprocessing. This configuration resulted in F1 = 0.729 on the training set (a 2.5% increase over the best model on this set) and F1 = 0.751 on the test set, corresponding to a 1.5% improvement over the best model on this set. As it turns out, especially on the test set, differences between the ensemble and its best constituent method do not seem large.</p>
    </sec>
    <sec id="sec-7">
      <title>7 Discussion</title>
      <p>The competition rules allow only two models to be sent by each team. Although our Naïve Bayes model showed good performance on the test set we had at hand, we chose not to send it to HaSpeeDe 2, since it would also be tested on an out-of-domain data set.</p>
      <p>Since this classifier can be very sensitive to domain changes, especially regarding zero-frequency words, which might reduce the whole model to multiplying smoothing values, we thought we would be better off not sending it. Still, it remained one of the weak classifiers in the ensemble we sent, so it was not completely put aside.</p>
      <p>The competition's organisation presented F1 results for two classifiers, run on the same data set distributed to all participants. These were supposed to be taken as baselines by all competing teams. The first consisted of a majority class classifier (Baseline-MC), which always chooses the majority class to label new examples. The second, in turn, consisted of an SVM with linear kernel, running with TF-IDF normalisation (Baseline-SVM).</p>
      <p>Table 4 shows the results of these two baseline classifiers, along with the classifiers we submitted to the competition (i.e. our ensemble model and its constituent Logistic Regression classifier). As it turns out, for the within-domain task, only our ensemble was superior to the baselines (3.9% over the baseline SVM and almost 123% over the majority class baseline). When moving to the out-of-domain test set, this difference dropped to only 1.8% over the SVM model and 62.3% over the majority class, still outscoring both baselines.</p>
      <p>Regarding our Logistic Regression model, when run on the within-domain test set, it outscored only the majority class baseline (109% better), being however outscored by the baseline SVM by 2.3%. As for the out-of-domain test set, our Logistic Regression model presented the same result as the baseline SVM, outscoring the majority class baseline by 59.5%. Interestingly, both the ensemble and the Logistic Regression model scored similarly on this set.</p>
    </sec>
    <sec id="sec-8">
      <title>8 Conclusion</title>
      <p>In this article, we reported on the results obtained by two models submitted to EVALITA’s HaSpeeDe 2 task. Even though our ensemble model outscored both benchmarks, we believe it could do better, should other choices regarding the language model be made.</p>
      <p>Since the best results were obtained with longer word sequences (in our case, 4-grams), it might be the case that other language models, such as GloVe or CBOW, for example, which make use of context words on both sides of the target word, could come up as better alternatives to the 4-gram model we used. BERT could also be a possibility to test.</p>
      <p>Our best results were also obtained, at least during testing, when preprocessing the data. We thus believe this is something to be kept. Regarding the normalisation of feature vectors, we could not observe great differences between using it or not, at least when it comes to TF-IDF normalisation.</p>
      <p>Another direction to be followed might be to
test other models as weak classifiers in the
Ensemble, or even ensemble strategies other than
stacking. This is something we leave for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter</article-title>
          .
          <source>In Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , Minneapolis, USA, June.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <given-names>Lucia C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian</article-title>
          .
          <source>In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020).</source>
        </mixed-citation>
      </ref>
      <ref id="ref2b">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and
          <string-name>
            <given-names>Maurizio</given-names>
            <surname>Tesconi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 Hate Speech Detection Task</article-title>
          .
          <source>In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18).</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Jiawei</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jian</given-names>
            <surname>Pei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Micheline</given-names>
            <surname>Kamber</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Data mining: concepts and techniques</article-title>
          . Elsevier.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Challenges in discriminating profanity from hate speech</article-title>
          .
          <source>Journal of Experimental &amp; Theoretical Artificial Intelligence</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ):
          <fpage>187</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Chikashi</given-names>
            <surname>Nobata</surname>
          </string-name>
          , Joel Tetreault, Achint Thomas,
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Mehdad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yi</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Abusive language detection in online user content</article-title>
          .
          <source>In Proceedings of the 25th international conference on world wide web.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Anand</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          and Jeffrey David Ullman.
          <year>2011</year>
          .
          <article-title>Mining of massive datasets</article-title>
          . Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Punyajoy</given-names>
            <surname>Saha</surname>
          </string-name>
          , Binny Mathew, Pawan Goyal, and
          <string-name>
            <given-names>Animesh</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hateminers : Detecting hate speech against women</article-title>
          .
          <source>CoRR</source>
          , abs/1812.06700
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and
          <string-name>
            <given-names>Irene</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          Online. CEUR.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>