<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Montanti @ HaSpeeDe2 EVALITA 2020: Hate Speech Detection in Online Contents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elia Bisconti</string-name>
          <email>eliabisconti@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Montagnani</string-name>
          <email>matteo.montagnani8@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pisa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report describes an approach to the task of identifying hate content and stereotypes within tweets. Two models are presented, both submitted to the HaSpeeDe competition proposed by EVALITA 2020. They are based on a Logistic Regression model that takes different types of embeddings as input. The best system ranked third on the out-of-domain test set in both subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The use of bad words and offensive language has always been a subject of debate. The spread of social media platforms, such as Twitter and Facebook, has fostered the growth of hate speech online. These sites have been urged to moderate and remove offensive content, but the phenomenon is so pervasive that manually filtering out hateful tweets is not enough. For that reason, the development of automatic recognition systems is increasingly important. To date, the use of Natural Language Processing
        <xref ref-type="bibr" rid="ref3">(Bird et al., 2009)</xref>
        is fundamental in this field. Most of the systems proposed so far are based on manual feature extraction
        <xref ref-type="bibr" rid="ref5">(Joulin et al., 2016)</xref>
        , although in recent years some approaches based on Deep Learning techniques
        <xref ref-type="bibr" rid="ref1">(Badjatiya et al., 2017)</xref>
        have been proposed. EVALITA organized the second edition of an NLP competition for Hate Speech Detection
        <xref ref-type="bibr" rid="ref2">(Basile et al., 2020)</xref>
        , intending to analyze various techniques for automatic recognition systems. The main goal was to classify a sentence as containing hate speech or a stereotype. The organizers provided an in-domain dataset for training and testing, and an additional out-of-domain test set. In this report, we present a classical supervised approach that aims at good results on the out-of-domain test.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>2 Tasks Description</title>
      <p>
        The task proposed in the competition
        <xref ref-type="bibr" rid="ref8">(Sanguinetti et al., 2020)</xref>
        consists of three parts, but only the first two are examined in this article. They correspond to the following subtasks:
      </p>
      <p>• Subtask A - Hate Speech Detection: a binary classification task aimed at determining the presence or absence of hateful content in the text towards a given target.</p>
      <p>• Subtask B - Stereotype Detection: a binary classification task aimed at determining the presence or absence of a stereotype, that is, an oversimplified opinion, prejudiced attitude, or uncritical judgment, towards a given target. This subtask aims to boost the investigation of stereotype occurrences, especially in a hateful context.</p>
      <p>The performance of the participating systems is evaluated on a corpus of Italian tweets, as in the previous edition, and also on a set of mixed text genres, such as newspapers, comments and headlines.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Dataset</title>
      <p>The dataset used is the one provided by the
competition organizers. In particular, the entire dataset
is split into one Training Set composed of tweets
and two test sets: an in-domain (based on tweets)
and a smaller out-of-domain (based on newspaper
phrases) test set. Overall, the Training Set includes
6,839 Italian tweets distributed as in Tables 1 and
2.</p>
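      <sec id="sec-3-1">
        <title>TR Set</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Class</th><th>Tweets</th></tr>
            </thead>
            <tbody>
              <tr><td>Hate Speech</td><td>2766</td></tr>
              <tr><td>Not Hate Speech</td><td>4073</td></tr>
            </tbody>
          </table>
        </table-wrap>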
        <p>As we can see, the data are not evenly distributed. In the Hate Speech Training Set, about sixty percent of the data are classified as not hate speech (4,073 of 6,839 tweets). The Stereotype Training Set is also slightly unbalanced, with fifty-five percent of the data classified as non-stereotype.</p>
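        <p>As a quick sanity check on these proportions, the class shares can be recomputed from the counts reported for the Hate Speech Training Set. The following Python sketch uses only the two figures given above; the variable names are illustrative.</p>
        <preformat><![CDATA[
```python
# Class counts reported for the Hate Speech Training Set (TR Set).
counts = {"Hate Speech": 2766, "Not Hate Speech": 4073}

total = sum(counts.values())  # number of tweets in the Training Set

# Share of each class as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(shares)
```
]]></preformat>
        <p>The total is 6839, and the shares come out at 40.4 percent for Hate Speech and 59.6 percent for Not Hate Speech.</p>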
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Proposed Approach</title>
      <p>This section describes the proposed approaches, focusing on the preprocessing of the data and on the embeddings and models used. Some decisions regarding the choice of models and the extraction of features were based on the results obtained in related work.</p>
      <sec id="sec-4-1">
        <title>4.1 Preprocessing</title>
        <p>A Tweet is a text message with a maximum length
of 280 characters. It may contain elements such
as hashtags, mentions, links and emoticons.
An example of a tweet extracted from the dataset
is shown below:</p>
        <p>“@user La società multirazziale... #migranti #profughi #rom URL”</p>
        <p>As the example shows, the dataset provided has already been partially preprocessed: user names and URLs have been replaced with placeholders, probably for privacy reasons. Our preprocessing phase applies a series of functions that remove uninformative elements from a tweet and standardize it. Punctuation, emoji and any other symbols are removed, and the tweet is converted to lower case, as shown:</p>
        <p>“la società multirazziale migranti profughi rom”</p>
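        <p>The normalisation described here can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the submitted runs: the function name preprocess and the regular expressions are assumptions, chosen so that the example tweet is mapped to the lower-case form shown above.</p>
        <preformat><![CDATA[
```python
import re


def preprocess(tweet: str) -> str:
    """Normalise a tweet: drop placeholders and symbols, then lower-case it."""
    text = re.sub(r"@user\b", " ", tweet)   # anonymised mentions
    text = re.sub(r"\bURL\b", " ", text)    # anonymised links
    text = text.lower()
    # Replace anything that is neither a word character nor whitespace:
    # punctuation, emoji, the '#' of hashtags and other symbols.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())           # collapse repeated whitespace


print(preprocess("@user La società multirazziale... #migranti #profughi #rom URL"))
```
]]></preformat>
        <p>Hashtag words are kept while the hash sign is dropped, which matches the example. Underscores count as word characters in this sketch and are therefore preserved.</p>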
        <p>In this phase we also experimented with reducing each word from its inflected form to its root or canonical form, through stemming and lemmatization respectively. We tried to take these variants into account during the feature selection phase. However, these attempts are not discussed further, as they did not produce relevant results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Feature vectors</title>
        <p>The preprocessed tweets were used to generate the features used for classification. Both tasks were addressed with the same types of representation and the same models.</p>
        <p>
          • TF-IDF vector
          <xref ref-type="bibr" rid="ref7">(Qaiser and Ali, 2018)</xref>
          : the idea behind this weighting is to give more importance to less frequent but relevant words. The vectors were generated using the TfidfVectorizer class of the scikit-learn library.
        </p>
        <p>
          • DistilBert
          <xref ref-type="bibr" rid="ref9">(Sanh et al., 2019)</xref>
          : this is a pre-trained model. For each tweet we consider a single output vector of size 768, namely the output at the first input position, which holds the special token [CLS] used for sentence-level classification.
        </p>
        <p>
          • GloVe
          <xref ref-type="bibr" rid="ref6">(Pennington et al., 2014)</xref>
          : we used a pre-trained model that returns a vector representation of words. The corpus it was trained on, extracted from Twitter, includes more than 2 billion tweets, amounting to about 27 billion tokens.
        </p>
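        <p>To make the TF-IDF weighting concrete, the following pure-Python sketch computes it by hand on three toy sentences. It follows the smoothed formula that scikit-learn's TfidfVectorizer applies with its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, but it is only an illustration: the function name tfidf, the whitespace tokenisation and the toy documents are assumptions, not part of the submitted system.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)  # raw term counts
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Toy documents, for illustration only.
docs = [
    "la societa multirazziale migranti profughi rom",
    "migranti e profughi in italia",
    "la partita di calcio in italia",
]
vectors = tfidf(docs)

# "rom" occurs in one document, "migranti" in two: the rarer term weighs more.
print(vectors[0]["rom"] > vectors[0]["migranti"])
```
]]></preformat>
        <p>In the first document every term occurs once, so the difference between the weights comes entirely from the inverse document frequency.</p>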
        <p>These three types of features were used both individually and in combination with each other by concatenation. To reduce the size of these vectors and to speed up the training phase, a feature selection phase is also performed using a Random Forest Classifier.</p>
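        <p>One possible way to realise concatenation followed by forest-based selection with scikit-learn is sketched below. The text does not specify how the Random Forest was used to select features, so the use of SelectFromModel with a mean-importance threshold is an assumption, and the two random feature blocks are hypothetical stand-ins for real TF-IDF and embedding matrices.</p>
        <preformat><![CDATA[
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
n_samples = 60
y = rng.integers(0, 2, size=n_samples)  # synthetic binary labels

# Hypothetical stand-ins for two feature blocks (e.g. TF-IDF and embeddings).
block_a = rng.normal(size=(n_samples, 20))
block_b = rng.normal(size=(n_samples, 8))
block_a[:, 0] += 3.0 * y  # make one column informative so selection can find it

# Combination by concatenation: stack the blocks column-wise.
X = np.hstack([block_a, block_b])

# Assumed selection rule: keep columns whose importance exceeds the mean.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)
```
]]></preformat>
        <p>With sparse TF-IDF matrices, scipy.sparse.hstack would play the role of np.hstack, so that the concatenated matrix does not have to be stored densely.</p>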
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Systems and Results</title>
      <p>For both tasks, we tried an SVM Classifier with an RBF kernel, a Logistic Regression and a Random Forest. As already mentioned, each of these models received various concatenations of the feature vectors described above as input.</p>
      <p>We tested each model using 3-fold cross-validation and performed a grid search to iterate over the models and all their parameters.</p>
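      <p>A search of this kind can be expressed with scikit-learn's GridSearchCV by treating the classifier itself as a pipeline parameter. The sketch below is illustrative only: the parameter grids, the macro-F1 scoring choice and the tiny toy corpus (neutral sentences standing in for the two classes) are assumptions, not the configuration used for the submitted runs.</p>
      <preformat><![CDATA[
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus, for illustration only: six sentences per class.
texts = [
    "che brutta giornata orribile", "servizio pessimo e orribile",
    "pessimo film davvero brutto", "orribile esperienza pessimo posto",
    "brutto tempo giornata pessima", "film orribile e brutto",
    "che bella giornata splendida", "servizio ottimo e gentile",
    "ottimo film davvero bello", "splendida esperienza ottimo posto",
    "bel tempo giornata ottima", "film splendido e bello",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# One grid per model family; the values are hypothetical.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1, 10]},
    {"clf": [SVC(kernel="rbf")], "clf__C": [0.1, 1, 10], "clf__gamma": ["scale", 0.1]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [50, 100]},
]

# cv=3 gives stratified 3-fold cross-validation for classifiers.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

best_model = type(search.best_estimator_.named_steps["clf"]).__name__
print(best_model, round(search.best_score_, 3))
```
]]></preformat>
      <p>Passing a list of grids lets each model family keep its own hyperparameters while all candidates are still ranked by the same cross-validated score.</p>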
      <p>
        As a result of this search, the best final model was the Logistic Regression, which has also performed well in previous work
        <xref ref-type="bibr" rid="ref4">(Davidson et al., 2017)</xref>
        . As for the input features, we expected the concatenation of the features extracted with the different techniques described above to lead to the best results. Unexpectedly, the best results in the validation phase were obtained with TF-IDF only. The second-best results were obtained with TF-IDF concatenated with the DistilBert vectors. These two systems are the two runs submitted to the competition. Overall, the difference between the results of the first and the second model is considerable; therefore, the following table shows only the F1 values obtained with the best run, for tasks A and B, respectively.
      </p>
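      <sec id="sec-5-3">
        <title>Task A</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Test Set</th><th>F-score NoHS</th><th>F-score HS</th><th>M-F1</th></tr>
            </thead>
            <tbody>
              <tr><td>Tweets TS</td><td>0.750</td><td>0.735</td><td>0.7432</td></tr>
              <tr><td>News TS</td><td>0.835</td><td>0.615</td><td>0.7256</td></tr>
            </tbody>
          </table>
        </table-wrap>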
        <p>Beyond the macro-F1 values obtained, it is interesting to note the behavior of the model on the out-of-domain Test Set in both tasks. In particular, the F-scores are worse for the class of sentences that actually contain hate speech or stereotypes. This is due to low Recall values (about 0.51 for both tasks), probably because the model was trained on a different type of data.</p>
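        <p>The relation between these quantities can be checked with a short Python sketch. It assumes the usual definitions, with F1 as the harmonic mean of precision and recall and macro-F1 as the unweighted mean of the per-class F-scores; the function names are illustrative, and the inputs are the Task A per-class F-scores reported above.</p>
        <preformat><![CDATA[
```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F-scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Per-class F-scores reported for Task A, in the order (NoHS, HS).
tweets = macro_f1([0.750, 0.735])
news = macro_f1([0.835, 0.615])
print(round(tweets, 4), round(news, 4))

# Upper bound on F1 when recall is 0.51, reached only with perfect precision.
print(round(f1(1.0, 0.51), 3))
```
]]></preformat>
        <p>The averages come out at 0.7425 and 0.725, close to the reported macro-F1 values of 0.7432 and 0.7256. The last line shows that with a recall of about 0.51 the F-score of the positive class cannot exceed roughly 0.675, whatever the precision, which is consistent with the lower scores observed on the out-of-domain data.</p>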
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Discussion</title>
      <p>Observing the results on the in-domain Test Set, our best models ranked 15/27 and 6/12 for tasks A and B, respectively. On the out-of-domain Test Set, they obtained the third-best score in both tasks. The result on the first Test Set confirms that the proposed approach is too simplistic. However, it is interesting to notice that such a simple system achieved a good placement on the out-of-domain Test Set. A possible explanation is the way the Training Set was preprocessed: each tweet was transformed into plain text, without taking into consideration any characteristic of ‘social’ language. This may have positively influenced the model in predicting the out-of-domain classification.</p>
      <p>
        A further observation about the dataset concerns the lack of correlation between the use of bad words and the presence of hateful content in a phrase. This shows that Offensive Language Detection and Hate Speech Detection are related topics, but remain two distinct tasks
        <xref ref-type="bibr" rid="ref4">(Davidson et al., 2017)</xref>
        . Moreover, such bad words are probably often used ironically or for emphasis, especially in Italian.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion</title>
      <p>Our participation in the Hate Speech Detection 2020 competition proposed by EVALITA was motivated by purely academic purposes.</p>
      <p>We focused on using different types and combinations of embeddings. Surprisingly, the best results were obtained with TF-IDF alone rather than with a combination including more sophisticated embeddings such as GloVe and DistilBert. After a feature selection phase carried out through a Random Forest, the results obtained through a Linear SVM and a Logistic Regression were compared. The latter performed best.</p>
      <p>We are aware that the presented system does not
introduce new elements with respect to the state
of the art of current technologies. Despite this, it
was interesting to observe the different results
obtained in relation to the composition of the Test
Set.</p>
      <p>The project was developed entirely in Python, and the code is publicly available at the following link:</p>
      <p>https://github.com/eliabisconti/haspeede</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Pinkesh</given-names>
            <surname>Badjatiya</surname>
          </string-name>
          , Shashank Gupta, Manish Gupta, and
          <string-name>
            <given-names>Vasudeva</given-names>
            <surname>Varma</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Deep learning for hate speech detection in tweets</article-title>
          .
          <source>In Proceedings of the 26th International Conference on World Wide Web Companion</source>
          , pages
          <fpage>759</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <given-names>Lucia C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)</source>
          , Online. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural language processing with Python: analyzing text with the natural language toolkit</article-title>
          .
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Davidson</surname>
          </string-name>
          , Dana Warmsley,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Macy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ingmar</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automated hate speech detection and the problem of offensive language</article-title>
          .
          <source>arXiv preprint arXiv:1703.04009</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Armand</given-names>
            <surname>Joulin</surname>
          </string-name>
          , Edouard Grave, Piotr Bojanowski, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>CoRR, abs/1607.01759</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <volume>14</volume>
          :
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Shahzad</given-names>
            <surname>Qaiser</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ramsha</given-names>
            <surname>Ali</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Text mining: Use of tf-idf to examine the relevance of words to documents</article-title>
          .
          <source>International Journal of Computer Applications</source>
          ,
          <volume>181</volume>
          ,
          <fpage>07</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and
          <string-name>
            <given-names>Irene</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)</source>
          , Online. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Victor</given-names>
            <surname>Sanh</surname>
          </string-name>
          , Lysandre Debut, Julien Chaumond, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>