<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DH-FBK @ HaSpeeDe2: Italian Hate Speech Detection via Self-Training and Oversampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Leonardelli</string-name>
          <email>eleonardelli@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Menini</string-name>
          <email>menini@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Tonelli</string-name>
          <email>satonelli@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we describe the system submitted by the DH-FBK team to the HaSpeeDe evaluation task, dealing with Italian hate speech detection (Task A). While we adopt a standard approach for fine-tuning AlBERTo, the Italian BERT model trained on tweets, we propose to improve the final classification performance through two additional steps, i.e. self-training and oversampling. Indeed, we extend the initial training data with additional silver data, carefully sampled from domain-specific tweets and obtained after first training our system only with the task training data. Then, we retrain the classifier by merging silver and task training data but oversampling the latter, so that the obtained model is more robust to possible inconsistencies in the silver data. With this configuration, we obtain a macro-averaged F1 of 0.753 on tweets and 0.702 on news headlines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Although hate speech detection may seem a solved task for English, with more than 60 systems participating in the last OffensEval edition reaching an F1 &gt; 0.90
        <xref ref-type="bibr" rid="ref22">(Zampieri et al., 2020)</xref>
        , this goal has not been reached when moving to other languages and settings. For example, at the last HaSpeeDe shared task on Italian
        <xref ref-type="bibr" rid="ref5">(Bosco et al., 2018)</xref>
        the best systems reached 0.83 F1 on Facebook data and 0.80 on Twitter data
        <xref ref-type="bibr" rid="ref6">(Cimino et al., 2018)</xref>
        , but performance dropped below 0.70 F1 in a cross-domain setting, i.e. training on Facebook and testing on Twitter
        <xref ref-type="bibr" rid="ref6">(Cimino et al., 2018)</xref>
        and vice versa
        <xref ref-type="bibr" rid="ref7">(Corazza et al., 2018)</xref>
        . Other recent studies confirmed that detecting hate speech on different social media platforms would require a platform-specific setting, and that simply merging all training data coming from different sources does not always improve performance, in particular when testing on Twitter
        <xref ref-type="bibr" rid="ref8">(Corazza et al., 2019)</xref>
        .
      </p>
      <p>Developing hate speech detection systems that remain robust across different sources, or on data that vary over time, is however still understudied. Therefore, the task of out-of-domain classification introduced this year at HaSpeeDe is particularly important and will hopefully foster the development and evaluation of classifiers with good generalisation capabilities.</p>
      <p>
        Concerning our classification approach, we build a standard pipeline based on AlBERTo
        <xref ref-type="bibr" rid="ref17">(Polignano et al., 2019b)</xref>
        , the Italian transformer-based model trained on Twitter data, since BERT-like models represent the state of the art for hate speech detection
        <xref ref-type="bibr" rid="ref22">(Zampieri et al., 2020)</xref>
        . We extend it in two ways: first, we use self-training, i.e. we build a first classifier with the task training data and use it to annotate a large set of tweets collected via Islam- and immigrant-specific hashtags. The silver data and the task training set are then merged to train a second, possibly more robust classifier, which we use to classify the test set. When retraining, we introduce oversampling in one of the two runs submitted by our team, i.e. we repeat the task training data five times so that they are balanced with respect to the silver data. This, together with self-training, proved to be effective when evaluated in a six-fold fashion on the training set, outperforming a standard approach based only on fine-tuning AlBERTo.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        While most approaches to hate speech detection have been proposed for English, systems have recently been developed to deal with a number of other languages, including Turkish, Arabic and Danish
        <xref ref-type="bibr" rid="ref22">(Zampieri et al., 2020)</xref>
        , German
        <xref ref-type="bibr" rid="ref21">(Wiegand et al., 2018)</xref>
        and Spanish
        <xref ref-type="bibr" rid="ref3">(Basile et al., 2019)</xref>
        . The first Hate Speech Detection task for Italian (HaSpeeDe) was organized at EVALITA 2018
        <xref ref-type="bibr" rid="ref5">(Bosco et al., 2018)</xref>
        . The task consisted in automatically annotating messages from Twitter and Facebook with a boolean value indicating the presence (or absence) of hate speech. The participating systems adopted a wide range of approaches, including bi-LSTMs
        <xref ref-type="bibr" rid="ref14">(De la Peña Sarracén et al., 2018)</xref>
        , SVMs
        <xref ref-type="bibr" rid="ref19">(Santucci et al., 2018)</xref>
        , ensemble classifiers
        <xref ref-type="bibr" rid="ref15 ref1">(Polignano and Basile, 2018; Bai et al., 2018)</xref>
        , RNNs
        <xref ref-type="bibr" rid="ref12">(Fortuna et al., 2018)</xref>
        , CNNs and GRUs
        <xref ref-type="bibr" rid="ref20">(von Grünigen et al., 2018)</xref>
        . The authors of the best-performing system, ItaliaNLP
        <xref ref-type="bibr" rid="ref6">(Cimino et al., 2018)</xref>
        , experimented with three different classification models: one based on a linear SVM, another one based on a 1-layer BiLSTM, and a newly-introduced one based on a 2-layer BiLSTM exploiting multi-task learning with additional data from the 2016 SENTIPOLC task
        <xref ref-type="bibr" rid="ref2">(Barbieri et al., 2016)</xref>
        . The training and test sets released for HaSpeeDe have also been used recently for other types of evaluation, for example to compare classifier performance and settings across different languages
        <xref ref-type="bibr" rid="ref9">(Corazza et al., 2020)</xref>
        , confirming the importance of domain-specific language models and the effectiveness of deep learning approaches (in this case, LSTM + fastText embeddings). Since their development, however, BERT-like transformer-based models have become the state-of-the-art approach in several NLP tasks. This includes hate speech detection for Italian, where the BERT model AlBERTo
        <xref ref-type="bibr" rid="ref17">(Polignano et al., 2019b)</xref>
        has recently achieved top scores in two out of three HaSpeeDe 2018 tasks
        <xref ref-type="bibr" rid="ref16">(Polignano et al., 2019a)</xref>
        . For this reason, we decided to develop a classifier using the same model and the same approach.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Task Description</title>
      <p>
        For the 2020 edition of EVALITA
        <xref ref-type="bibr" rid="ref4">(Basile et al.,
2020)</xref>
        , the HaSpeeDe task
        <xref ref-type="bibr" rid="ref18">(Sanguinetti et al.,
2020)</xref>
        has focused on three main phenomena
relevant to online hate speech detection by proposing
three different tasks:
      </p>
      <p>Task A (main task): binary classification task aimed at determining whether a message contains hate speech or not.</p>
      <p>Task B: binary classification task aimed at determining whether a message contains stereotypes or not.</p>
      <p>Task C: sequence labeling task aimed at recognizing nominal utterances in hateful tweets.</p>
      <p>We participate in Task A, which in 2020 also has the goal of investigating how hate speech detection is affected by variation in language and in time. To this purpose, the training set contains Twitter data, accompanied by a test set including both in-domain and out-of-domain data (tweets + news headlines), as well as data from different time periods.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Data</title>
      <p>In our experiments we use two types of data: the HaSpeeDe2 dataset provided by the task organisers, and domain-specific data collected from Twitter, which we include as silver data. The two datasets are described below.</p>
      <sec id="sec-4-1">
        <title>4.1 HaSpeeDe2 Dataset</title>
        <p>This dataset contains the training data provided by the organizers. These data specifically focus on the presence or absence of hateful content towards immigrants, Muslims or Roma people. It consists of 6,839 annotated tweets, with 2,766 messages annotated as hateful and 4,073 as non-hateful.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Silver data description</title>
        <p>
          Since the task is focused on hate speech against immigrants and minorities, we decided to exploit a set of tweets in Italian that covers similar topics and that was collected within the European project Hatemeter (http://hatemeter.eu/)
          <xref ref-type="bibr" rid="ref11">(Ferret et al., 2019)</xref>
          . For this project, conducted between February 2018 and January 2020, we downloaded tweets using hashtags expressing hate towards the Islamic community, for example #nomoschee, #stopIslam, etc. Even if the dataset mainly covers Islam, references to other minorities such as Roma people or immigrants in general are also present. To ensure that these other minorities are also well represented, we randomly select from this dataset tweets containing the most common words chosen from the training data provided by the task organizers, i.e. Rom, nomade, migrante, straniero, profugo, islam, mussulmano (musulmano), terrorista (EN: Roma, nomad, migrant, foreigner, refugee, Islam, Muslim, terrorist). Overall, around 20,400 additional tweets were selected. We then perform a first round of classification of these "new" tweets, using the data provided by the organizers as training set. This results in a silver dataset composed of 11,129 hate and 9,254 non-hate tweets. This additional dataset is then merged with the task gold data and used to re-train the classifier. Details are reported in the following section.
        </p>
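        <p>A minimal sketch of this keyword-based sampling step is given below; the candidate tweets are assumed to be a plain list of strings, and all names are our own illustration rather than the code actually used for the submission.</p>
        <preformat>import random

# Keywords chosen from the task training data (see above); the candidate
# tweets are assumed to be a list of raw tweet texts from Hatemeter.
KEYWORDS = ["rom", "nomade", "migrante", "straniero",
            "profugo", "islam", "mussulmano", "musulmano", "terrorista"]

def sample_silver_candidates(tweets, n=20400, seed=42):
    """Randomly select tweets mentioning at least one target keyword."""
    matching = [t for t in tweets
                if any(k in t.lower() for k in KEYWORDS)]
    random.seed(seed)
    return random.sample(matching, min(n, len(matching)))</preformat>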
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 System Description</title>
      <p>
        The classifier developed for both runs submitted by our team is based on the Italian BERT model trained on tweets, called AlBERTo
        <xref ref-type="bibr" rid="ref17">(Polignano et al., 2019b)</xref>
        . After fine-tuning it on the task training data, we use the obtained classifier to automatically annotate the additional dataset described in Section 4.2. These silver data are then merged with the task training data and used to fine-tune AlBERTo a second time. For one of the two submitted runs, we also experiment with oversampling, as follows:
      </p>
      <p>Run1: we add the silver data to the tweets provided by the organizers for training, keeping 500 of the released tweets for validation. In this setting, the training set size is about 27,000 tweets, including 20,400 silver instances.</p>
      <p>Run2: we add the silver data to the tweets provided by the organizers as in Run1, but the tweets from the organizers are oversampled by repeating them five times (and shuffling) in the training set, while tweets from the silver dataset occur only once. In this setting, the training set includes about 52,000 tweets, with 39% of them being silver data.</p>
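      <p>As an illustration, the two training sets can be assembled as in the following sketch, where gold and silver are lists of (text, label) pairs; this is our own reconstruction of the procedure described above, not the original code.</p>
      <preformat>import random

gold = [("esempio di tweet", 1)]    # placeholder: task training data
silver = [("altro tweet", 0)]       # placeholder: self-labelled data

def build_training_set(gold, silver, oversample=False, seed=42):
    """Merge gold and silver data; with oversampling (Run2), the gold
    tweets are repeated five times to balance them against the silver."""
    data = gold * 5 if oversample else list(gold)
    data = data + list(silver)
    random.seed(seed)
    random.shuffle(data)
    return data

run1 = build_training_set(gold, silver)                   # ~27,000 tweets
run2 = build_training_set(gold, silver, oversample=True)  # ~52,000 tweets</preformat>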
      <p>We also tested the option of automatically assigning a tag to each tweet marking the presence of a certain topic (immigrants/Roma people/Islam), using a keyword-based approach. However, with this additional information the classifier performed worse than without any topic indicator, so we removed it from the final runs. Below we report a detailed description of the process to select the best classification model, and of the preprocessing steps.</p>
      <sec id="sec-5-1">
        <title>5.1 Model selection</title>
        <p>
          The best performance in a wide variety of NLP tasks is currently obtained with approaches based on BERT
          <xref ref-type="bibr" rid="ref10">(Devlin et al., 2019)</xref>
          , a pre-trained transformer-based language model that can be fine-tuned and adapted to specific tasks by adding just one additional output layer to the neural network. As different BERT models exist, we first evaluated whether to use a multilingual version of BERT or the Italian version trained on Twitter data, called AlBERTo
          <xref ref-type="bibr" rid="ref17">(Polignano et al., 2019b)</xref>
          .
        </p>
        <p>The comparison and evaluation of the different models and approaches is done with a 6-fold cross-validation on the task training set. In each fold, about 1,000 tweets are used as test set, while the remaining ones are used for training and validation (500 tweets). The performance score is obtained as the average over the six folds, so that the final evaluation is as unbiased and independent as possible from the specific split into train, validation and test sets.</p>
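        <p>A minimal sketch of this splitting scheme, assuming the tweets are held in a list (the helper below is illustrative, not the original code):</p>
        <preformat>from sklearn.model_selection import KFold

def six_fold_splits(tweets, val_size=500, seed=42):
    """Yield (train, validation, test) index arrays: each fold holds out
    about one sixth of the tweets for testing and 500 of the remaining
    ones for validation."""
    kf = KFold(n_splits=6, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(tweets):
        yield train_idx[val_size:], train_idx[:val_size], test_idx</preformat>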
        <p>In our setup we tested two models: first Multilingual BERT, covering 104 languages including Italian2, and then AlBERTo, which was trained with the official BERT source code on 200M tweets in the Italian language. For the fine-tuning of AlBERTo we run it for 15 epochs, using a learning rate of 2e-5 with 1,000 steps per loop on batches of 64 examples. Since AlBERTo performed better than Multilingual BERT on each fold, it was included in the final system configuration for the task. The cross-validation over 6 folds using only the task training set with AlBERTo resulted in an average macro-F1 of 83.12 for Run1 and 82.15 for Run2.</p>
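        <p>A configuration along these lines can be expressed with the HuggingFace transformers library as sketched below. The checkpoint name is the AlBERTo model published on the Model Hub by its authors and is an assumption on our side, since our experiments were run with the official BERT source code; the dataset variables are placeholders for tokenized versions of the splits described above.</p>
        <preformat>from transformers import (AutoModelForSequenceClassification,
                          AutoTokenizer, Trainer, TrainingArguments)

# Assumed Model Hub checkpoint for AlBERTo (not used in our actual runs).
MODEL = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL,
                                                           num_labels=2)
args = TrainingArguments(
    output_dir="alberto-haspeede",
    num_train_epochs=15,             # hyperparameters from the text above
    learning_rate=2e-5,
    per_device_train_batch_size=64,
)
# train_dataset / eval_dataset: torch Datasets of tokenized (text, label)
# pairs built from the splits above (placeholders here).
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()</preformat>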
      </sec>
      <sec id="sec-5-2">
        <title>5.2 Data Preprocessing</title>
        <p>
          The data, both from the dataset provided by the organisers and from the silver one, are preprocessed as follows. First, we split hashtags by adapting to Italian the Ekphrasis tool
          <xref ref-type="bibr" rid="ref13">(Gimpel et al., 2010)</xref>
          , which recognises the tokens in a hashtag based on Google n-grams. With the same tool we also normalise the text, replacing all mentions of users and all URLs with &lt;user&gt; and &lt;url&gt; respectively. We also replace with a dedicated tag all instances of "money", "time", "date" and, in general, any "number". The emojis are replaced with their descriptions3 in order to have a textual representation that can be used with AlBERTo.
        </p>
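        <p>The sketch below shows this kind of pipeline with the ekphrasis library and the emoji package; it relies on the English ekphrasis resources and English emoji descriptions, whereas our system used an Italian adaptation, so it should be read as an illustration only.</p>
        <preformat>from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
import emoji

text_processor = TextPreProcessor(
    # Replace user mentions, URLs, money, time, date and number tokens
    # with dedicated tags such as &lt;user&gt; and &lt;url&gt;.
    normalize=['url', 'user', 'money', 'time', 'date', 'number'],
    unpack_hashtags=True,      # split hashtags into their tokens
    segmenter="twitter",       # n-gram statistics used for the splitting
    tokenizer=SocialTokenizer(lowercase=False).tokenize,
)

def preprocess(tweet):
    tokens = text_processor.pre_process_doc(tweet)
    # Replace each emoji with a textual description, e.g. ":thumbs_up:".
    return emoji.demojize(" ".join(tokens))</preformat>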
        <p>2 With 12 layers, 768 hidden units, 12 heads, 110M parameters.
3 Manually translated to Italian from the English descriptions at https://unicode.org/emoji/charts/full-emoji-list.html.</p>
        <sec id="sec-5-2-1">
          <title>DocType. System Precision</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Hate class</title>
        <p>Recall</p>
      </sec>
      <sec id="sec-5-4">
        <title>Non-hate class</title>
        <p>Precision Recall
We submitted two runs each for the in-domain
(tweets) and out-of-domain (news headlines) text
types in Task A. The results obtained on the test
set are reported in Table 1 and compared with two
baselines provided by the task organisers, one
obtained by always assigning the most frequent
label (i.e. non-hateful), and the other by training
an SVM classifier with unigrams, char-grams and
TF-IDF representation as features. We also
compare our results with the top-ranked system in each
subtask (additional details on such systems have
not been disclosed at the moment of writing).</p>
      <p>As expected, on out-of-domain data (news headlines) we obtain lower results than on tweets, since the training set is retrieved exclusively from Twitter. Furthermore, our approach does not include any specific tuning aimed at treating news headlines differently from tweets; on top of that, the additional data used for self-training are all gathered from Twitter, which may negatively affect performance on out-of-domain data.</p>
      <p>On both document types, Run2 performs better than Run1, showing that our oversampling strategy to reduce the weight of silver data is effective. However, the results obtained with 6-fold cross-validation only on the training set were significantly higher, both with macro-F1 &gt; 0.80. This may be explained by the fact that, as pointed out by the task organisers, the tweets in the test set were collected in a different time period than those in the training set, which likely makes the two sets different in terms of topics.</p>
        <sec id="sec-5-4-1">
          <title>Predicted</title>
        </sec>
        <sec id="sec-5-4-2">
          <title>Predicted</title>
          <p>Run 1
Run 2
We report in Table 2 and 3 the confusion
matrix showing the number of true positives and
negatives, and false positives and negatives obtained
with the two runs on tweets and news headlines.
While on tweets the performance on the hate class
is overall better, in particular concerning recall,
this does not apply to news headlines, with a low
recall for the hate class. The reason for this low
score lies in the different linguistic expressions
connected with hate between tweets and
headlines: while in tweets they are more direct, and
more frequently connected with profanities that a
classifier can easily recognise, hateful content in
news headlines is usually expressed in more subtle
ways. As an example, we report below two
headlines misclassified by our system. The first one
(i) was classified as non-hateful, even if it conveys
hateful content. The second one (ii) was instead
classified as hateful, although it is not:</p>
      <p>ii) Matera, Salvini contestato durante il comizio. E lui risponde: “Bravi, avete vinto dieci immigrati da mantenere” (EN: Matera, Salvini challenged at a rally, and he replies: “Congratulations, you won ten migrants to pay for”)</p>
      <p>Both examples have a similar structure, are written in standard Italian and mention migrants. Furthermore, the second example reports hateful direct speech, but the fact that such speech is reported does not mean that the journalist agrees with what was said by the politician Matteo Salvini.</p>
    </sec>
    <sec id="sec-6">
      <title>7 Conclusions</title>
      <p>In this paper we described the system developed by the DH-FBK team to participate in Task A of the HaSpeeDe shared task. We submitted two runs, both based on AlBERTo and using in-domain silver data as additional training data in a self-training framework. The only difference between the two configurations is that, for Run2, the task training data were repeated five times, to balance the weight of the silver data.</p>
      <p>Our evaluation shows that, both in a cross-validation setting and on the task test set, oversampling has a positive effect on the classification results. As expected, performance on in-domain data (i.e. training and testing on tweets) is better than on out-of-domain data (i.e. training on tweets and testing on news headlines). In the future, we may address this issue by including news headlines in the silver data as well, so that the specificity of this text type is also taken into account. To improve data quality, it may be useful to select only the silver instances that have been automatically classified with high confidence.</p>
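      <p>Such a filtering step could look like the following sketch, which assumes access to the classifier's output logits; the 0.9 threshold is purely illustrative.</p>
      <preformat>import torch

def filter_by_confidence(texts, logits, threshold=0.9):
    """Keep only the silver instances whose predicted class
    probability exceeds the (illustrative) threshold."""
    probs = torch.softmax(logits, dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf &gt; threshold
    return [(t, int(l)) for t, l, k in zip(texts, labels, keep) if k]</preformat>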
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Xiaoyu</given-names>
            <surname>Bai</surname>
          </string-name>
          , Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2018</year>
          . Rug @ EVALITA 2018:
          <article-title>Hate speech detection in italian social media</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and
          <string-name>
            <given-names>Viviana</given-names>
            <surname>Patti</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the evalita 2016 sentiment polarity classification task</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter</article-title>
          .
          <source>In Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro
          .
          <year>2020</year>
          .
          <article-title>Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          <source>In Valerio Basile</source>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi.
          <year>2018</year>
          .
          <article-title>Overview of the evalita 2018 hate speech detection task</article-title>
          .
          <source>In EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</source>
          , volume
          <volume>2263</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Cimino</surname>
          </string-name>
          , Lorenzo De Mattei, and Felice Dell'Orletta.
          <year>2018</year>
          .
          <article-title>Multi-task learning in deep neural networks at EVALITA 2018</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Michele</given-names>
            <surname>Corazza</surname>
          </string-name>
          , Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Comparing different supervised approaches to hate speech detection</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Michele</given-names>
            <surname>Corazza</surname>
          </string-name>
          , Stefano Menini, Elena Cabrio, Sara Tonelli, and
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Cross-platform evaluation for italian hate speech detection</article-title>
          .
          <source>In Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          , Bari, Italy,
          <source>November 13-15</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Michele</given-names>
            <surname>Corazza</surname>
          </string-name>
          , Stefano Menini, Elena Cabrio, Sara Tonelli, and
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A multilingual evaluation for online hate speech detection</article-title>
          .
          <source>ACM Transactions on Internet Technology</source>
          ,
          <volume>20</volume>
          (
          <issue>2</issue>
          ):
          <volume>10</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          :
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          , Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Minneapolis, Minnesota, June.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          Jérôme Ferret, Mario Laurent, Daniela Andreatta, Andrea Di Nicola, Elisa Martini, M. Guerini, S. Tonelli, Georgios Antonopoulos, and Parisa Diba
          .
          <year>2019</year>
          .
          <article-title>Hatemeter D18: Training module A for academics and research organisations</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Paula</given-names>
            <surname>Fortuna</surname>
          </string-name>
          , Ilaria Bonavita, and Sérgio Nunes.
          <year>2018</year>
          .
          <article-title>Merging datasets for hate speech classification in italian</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy,
          <source>December 12-13</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith.
          <year>2010</year>
          .
          <article-title>Part-of-speech tagging for twitter: Annotation, features, and experiments</article-title>
          .
          <source>Technical report</source>
          , Carnegie-Mellon University, School of Computer Science.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          Gretel Liz De la Peña Sarracén, Reynaldo Gil Pons, Carlos Enrique Muñiz-Cuza, and Paolo Rosso.
          <year>2018</year>
          .
          <article-title>Hate speech detection using attention-based LSTM</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hansel: Italian hate speech detection through ensemble learning and deep neural networks</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Pierpaolo Basile, Marco de Gemmis, and
          <string-name>
            <given-names>Giovanni</given-names>
            <surname>Semeraro</surname>
          </string-name>
          . 2019a.
          <article-title>Hate speech detection through alberto italian language understanding model</article-title>
          . In Mehwish Alam, Valerio Basile, Felice Dell'Orletta,
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          , and Nicole Novielli, editors,
          <source>Proceedings of the 3rd Workshop on Natural Language for Artificial Intelligence colocated with the 18th International Conference of the Italian Association for Artificial Intelligence (AIIA</source>
          <year>2019</year>
          ), Rende, Italy,
          <source>November 19th-22nd</source>
          ,
          <year>2019</year>
          , volume
          <volume>2521</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          . 2019b.
          <article-title>Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets</article-title>
          .
          <source>In Raffaella Bernardi</source>
          , Roberto Navigli, and Giovanni Semeraro, editors,
          <source>Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          , Bari, Italy,
          <source>November 13-15</source>
          ,
          <year>2019</year>
          , volume
          <volume>2481</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and
          <string-name>
            <given-names>Irene</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Valentino</given-names>
            <surname>Santucci</surname>
          </string-name>
          , Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari.
          <year>2018</year>
          .
          <article-title>Detecting hate speech for italian language in social media</article-title>
          .
          <source>In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius von Däniken, and Mark Cieliebak.
          <year>2018</year>
          .
          <article-title>spmmmp at germeval 2018 shared task: Classification of offensive content in tweets using convolutional neural networks and gated recurrent units</article-title>
          .
          <source>In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wiegand</surname>
          </string-name>
          , Melanie Siegel, and
          <string-name>
            <given-names>Josef</given-names>
            <surname>Ruppenhofer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the germeval 2018 shared task on the identification of offensive language</article-title>
          .
          <source>In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS</source>
          <year>2018</year>
          ), pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , Vienna, Austria.
          <source>Austrian Academy of Sciences.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin.
          <year>2020</year>
          .
          <article-title>SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)</article-title>
          .
          <source>In Proceedings of the 14th International Workshop on Semantic Evaluation</source>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>