<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fontana-Unipi @ HaSpeeDe2: Ensemble of Transformers for the Hate Speech Task at Evalita</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michele Fontana</string-name>
          <email>m.fontana12@studenti.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Attardi</string-name>
          <email>attardi@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica, Università di Pisa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe our approach and experiments for Task A of the second edition of HaSpeeDe, within the Evalita 2020 evaluation campaign. The proposed model is an ensemble of classifiers built from three variants of a common neural architecture. Each classifier uses contextual representations from transformers pretrained on Italian texts and fine-tuned on the training set of the challenge. We tested the proposed model on the two official test sets: the in-domain test set, containing only tweets, and the out-of-domain one, which also includes news headlines. Our submissions ranked 4th on the tweets test set and 17th on the second test set.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The spread of hateful messages on social media has become a serious issue; techniques for hate speech detection have therefore become quite relevant. The goal of the Hate Speech Detection task
        <xref ref-type="bibr" rid="ref12">(Sanguinetti et al., 2020)</xref>
        at Evalita 2020
        <xref ref-type="bibr" rid="ref2">(Basile et al., 2020)</xref>
        is to improve the automatic detection of hate messages in Italian tweets. The organizers provided the participants with the HaSpeeDe2 dataset, which consists of 6,837 Italian tweets containing, besides the raw text, also hashtags and emojis. Task A can be cast as a binary classification task: the model has to predict whether a given message contains hate speech or not.
      </p>
      <p>
        Approaches based on transformer models have become quite popular recently and have proved effective in reaching state-of-the-art scores on major NLP tasks such as those of the GLUE benchmark
        <xref ref-type="bibr" rid="ref14">(Wang et al., 2018)</xref>
        . With our experiments we try to assess the effectiveness of transformers trained on Italian documents in a task involving Italian texts from different sources. We experiment with both a transformer model trained specifically on Italian tweets and one trained on generic web documents.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>We combine several instances of classifiers based on these transformers in order to address the problem of overfitting due to the small size of the training set.</p>
      <p>For this edition of the Evalita HaSpeeDe task,
the organizers released two test sets, an in-domain
one consisting of tweets and an out-of-domain one
containing also news headlines.</p>
      <p>The ensemble model of our official submission
achieved a competitive score of 78.03 Macro-F1
on the in-domain test set but did not perform as
well on the second test set.</p>
      <p>We make the source code of our experiments available as Open Source at https://github.com/mikelefonty/Haspeede2.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        The first edition of HaSpeeDe was held in 2018, and the results produced during that contest were the starting point of our research. As described in
        <xref ref-type="bibr" rid="ref3">(Bosco et al., 2018)</xref>
        , most of the systems were based on neural networks and used word embeddings, such as FastText
        <xref ref-type="bibr" rid="ref6">(Grave et al., 2018)</xref>
        or word2vec
        <xref ref-type="bibr" rid="ref9">(Polignano and Basile, 2018)</xref>
        , in the first layer of their architecture. The embedding layer was usually followed by a recurrent or a convolutional neural network to obtain an internal representation of the input text. This hidden representation was fed to a series of dense layers to produce the final classification result.
      </p>
      <p>
        Over the last couple of years, the trend in approaches to language analysis has changed considerably, as can be seen by examining the models used in competitions like OffensEval 2 at SemEval 2020
        <xref ref-type="bibr" rid="ref15">(Zampieri et al., 2020)</xref>
        . In these new models, to get a better text representation, the embedding layer is often replaced by a Transformer
        <xref ref-type="bibr" rid="ref13">(Vaswani et al., 2017)</xref>
        such as BERT
        <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
        , RoBERTa
        <xref ref-type="bibr" rid="ref7">(Liu et al., 2019)</xref>
        , or Multilingual BERT
        <xref ref-type="bibr" rid="ref5">(Devlin et al., 2019)</xref>
        .
      </p>
      <p>We followed this trend, but we also focused our attention on the problem raised by the small size of the dataset. As
        <xref ref-type="bibr" rid="ref11">Risch and Krestel (2020)</xref>
        mention, transformer models tend to have a high variance with respect to the input dataset, which often leads to overfitting. The authors therefore suggest implementing an ensemble of classifiers to reduce the variance and consequently improve the generalization capabilities of the trained model.</p>
      <p>
        In the following, we describe a similar approach based on the bagging technique
        <xref ref-type="bibr" rid="ref4">(Breiman, 1996)</xref>
        , where we use three different transformer-based classifiers to populate the ensemble and to obtain the final prediction.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 System Architecture</title>
      <p>
        During the design phase of our classifier, we looked for a transformer trained directly on a sufficiently large collection of Italian texts, and particularly on Italian tweets, in order to compensate for the small size of the training data. We found two suitable models based on BERT: AlBERTo
        <xref ref-type="bibr" rid="ref10">(Polignano et al., 2019)</xref>
        1 and DBMDZ 2. The former is trained on TWITA
        <xref ref-type="bibr" rid="ref1">(Basile et al., 2018)</xref>
        , a 191 GB collection of Italian tweets gathered by the authors, and was tested on the SENTIPOLC task of the EVALITA 2016 campaign, where it achieved state-of-the-art accuracy in subjectivity, polarity, and irony detection on Italian tweets. We considered this model suitable for hate speech detection, since its source texts are Italian tweets and the SENTIPOLC task is a classification task similar to ours. DBMDZ, instead, is trained on a more general domain, from a 13 GB dataset that includes a dump of the Italian Wikipedia and texts from web pages selected from the Opus Corpora. 3 We decided to test both transformer models, assessing their performance through a validation phase on a development set.
      </p>
      <p>These transformers were used in the input stage of all our architectures, providing contextual embeddings for sentences, and were fine-tuned during training. We designed three architecture variants, which were employed as the basic building blocks to construct the ensembles:
• ALB-SINGLE: a first layer provided by the AlBERTo transformer, followed by a single neuron with a sigmoid activation function.
• DB-SINGLE: the same structure as ALB-SINGLE, with DBMDZ replacing AlBERTo in the first layer.
• DB-MLP: compared to DB-SINGLE, it adds a dense layer with a ReLU activation function between the transformer and the output neuron.
1 https://github.com/marcopoli/AlBERTo-it
2 https://huggingface.co/dbmdz/bert-base-italian-uncase
3 http://opus.nlpl.eu/</p>
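      <p>To make the three variants concrete, here is a minimal PyTorch sketch of the classification head placed on top of the transformer (sizes and names are our own illustrative assumptions; the pooled vector stands in for the AlBERTo or DBMDZ output):

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Head of the DB-MLP variant: a dense ReLU layer between the
    transformer's pooled output and a single sigmoid output neuron.
    ALB-SINGLE and DB-SINGLE correspond to use_hidden=False."""

    def __init__(self, encoder_dim=768, hidden_dim=50, use_hidden=True):
        super().__init__()
        layers = []
        if use_hidden:  # the extra dense layer of DB-MLP
            layers += [nn.Linear(encoder_dim, hidden_dim), nn.ReLU()]
            encoder_dim = hidden_dim
        layers.append(nn.Linear(encoder_dim, 1))  # single output neuron
        self.net = nn.Sequential(*layers)

    def forward(self, pooled):
        # pooled: (batch, encoder_dim) sentence representation
        return torch.sigmoid(self.net(pooled)).squeeze(-1)
```

A sigmoid output above 0.5 would be read as the hate class.</p>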
      <p>The final model is an ensemble consisting of a number of instances of each of the above architectures. For each architecture, e.g. ALB-SINGLE, we construct instances in the following way. After initializing the weights randomly within a given interval and generating the training data by applying the bootstrap technique to the original dataset, we train the model; when that phase is over, we insert the resulting model into the ensemble. We repeat this process several times with different random weight initializations. Note that, due to the random initialization, no two classifiers in the ensemble are identical. More formally, the model consists of N elements,</p>
      <p>N = NAL + NDB + NMLP,
where NAL, NDB, and NMLP represent, respectively, the number of instances of the ALB-SINGLE, DB-SINGLE, and DB-MLP classifiers.</p>
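      <p>The construction loop described above can be sketched as follows (a minimal sketch; train_fn is a hypothetical stand-in for fine-tuning one classifier, with its own random weight initialization, on a bootstrap replica):

```python
import random

def bootstrap_sample(dataset):
    """Resample the training set with replacement (bagging)."""
    return [random.choice(dataset) for _ in range(len(dataset))]

def build_ensemble(train_fn, dataset, n_instances):
    """Train n_instances models, each on its own bootstrap replica."""
    return [train_fn(bootstrap_sample(dataset)) for _ in range(n_instances)]
```

Running this loop once per architecture yields the NAL, NDB, and NMLP instances that populate the ensemble.</p>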
      <p>In retrospect, it might have been worthwhile to consider instances obtained by varying the architectures more thoroughly than just in their initial weights, for example by changing the hyper-parameters or the number of layers.</p>
      <p>Our classification algorithm is a slight generalization of the classical one, which collects results from each member of the ensemble and outputs the class that gets the majority of predictions over all iterations. The process, described by Algorithm 1, performs nrun iterations. During the ith iteration, the algorithm starts by randomly sampling from the ensemble a given number of instances of each type of classifier (lines 3-5) and by initializing to 0 the variable class1, which contains the total number of votes that the hate class</p>
      <sec id="sec-3-1">
        <title>Algorithm 1 Classification Algorithm</title>
        <p>Input: t: the tweet to classify.</p>
        <p>Input: (nAL, nDB, nMLP): number of classifiers of each type to be sampled.</p>
        <p>Input: (NAL, NDB, NMLP): number of classifiers of each type in the ensemble.</p>
        <p>Input: nrun: number of desired iterations.</p>
        <p>Output: cfinal: predicted class
1: preds = []
2: for run = 1 to nrun do
3:   albs = sample_al(nAL, NAL)
4:   dbs = sample_db(nDB, NDB)
5:   mlps = sample_ml(nMLP, NMLP)
6:   sampled_classif = albs ∪ dbs ∪ mlps
7:   class1 = 0 // votes for class 1
8:   for cl in sampled_classif do
9:     class1 += cl(t) // cl's classification
10:  end for
11:  preds[run] = [class1 ≥ ⌈(nAL + nDB + nMLP)/2⌉]
12: end for
13: cfinal = [Σi preds[i] ≥ ⌈nrun/2⌉]
14: return cfinal</p>
        <p>receives during the iteration (line 7). It then collects the predictions of the selected models on the tweet t (lines 8-10). cl(t) ∈ {0, 1} represents the prediction of classifier cl for the tweet t; in particular, cl(t) = 1 if and only if cl classifies t as hateful. The output of iteration i is the most predicted class (line 11), where [·] denotes the indicator function. The final result of the algorithm is the class cfinal ∈ {0, 1} which obtained the most votes over all the nrun iterations (lines 13-14). If cfinal = 1, the tweet t has been classified as hateful.</p>
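        <p>As an illustration, Algorithm 1 can be rendered in Python roughly as follows (a sketch under our own naming; each classifier is modeled as a callable returning 0 or 1, not an actual trained model):

```python
import random

def classify(t, ensembles, n_sample, n_runs=11):
    """Majority-of-majorities vote over sampled sub-ensembles.

    ensembles: dict mapping architecture name to a list of classifiers,
               each a callable t -> 0 or 1 (hypothetical stand-ins).
    n_sample:  dict mapping architecture name to how many instances
               to sample per run (nAL, nDB, nMLP).
    """
    votes_for_hate = 0
    for _ in range(n_runs):
        # Sample n_sample[arch] classifiers from each sub-ensemble.
        sampled = []
        for arch, pool in ensembles.items():
            sampled.extend(random.sample(pool, n_sample[arch]))
        # Count votes for the hate class (class 1) in this run.
        class1 = sum(cl(t) for cl in sampled)
        # The run's prediction is the majority class among sampled voters.
        if 2 * class1 >= len(sampled):
            votes_for_hate += 1
    # Final class: majority over all runs.
    return 1 if 2 * votes_for_hate >= n_runs else 0
```

The integer test 2 * class1 >= len(sampled) is equivalent to the ⌈·/2⌉ majority threshold of line 11.</p>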
        <p>A simpler variant of the algorithm would be to just add up the votes for each class from all classifiers in all iterations and return the class with the highest count. We plan to compare these two approaches in future work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <p>In this section we describe the experiments we performed to tune the hyper-parameters of our model. We focus on the search for the best values of nDB, nAL, and nMLP, that is, how many instances to select at each iteration of the classification algorithm.</p>
      <p>Before starting the experiments, we divided the dataset into two disjoint subsets, a development set and an internal test set, in the proportion of 80% and 20%, respectively. The split was performed by means of stratified sampling, according to the distribution of the target variable hs. We applied the stratified 3-fold cross-validation technique to validate our model. Given that we are solving a binary classification problem, we picked binary cross-entropy as our loss. We chose AdamW as our optimizer and set the first 10% of the total steps as warmup steps. We conducted the experiments on a GPU offered by Google Colab 4. Our models are implemented in PyTorch
        <xref ref-type="bibr" rid="ref8">(Paszke et al., 2019)</xref>
        . To extract as much information as possible from the input texts, we preprocessed them through hashtag segmentation by means of Tweet Preprocessor. 5 We also converted emojis into their Italian description by using the emoji 6 and Google Translate 7 libraries.
4 https://colab.research.google.com/
5 https://pypi.org/project/tweet-preprocessor/
6 https://pypi.org/project/emoji
7 https://pypi.org/project/googletrans/</p>
      <p>We analyzed the behaviour of the three baseline architectures we planned to include in the ensemble. We trained each model for a maximum of 4 epochs, using a batch size of 16 and setting the maximum text length to 100. A grid search revealed that the optimal learning rate is 5·10^-5 for DB-MLP and 6·10^-5 for the remaining models. The optimal number of neurons in the hidden layer of DB-MLP is 50.</p>
      <p>Table 1 reports the macro-F1 of the three baseline classifiers (ALB-SINGLE, DB-SINGLE, DB-MLP) and highlights the following aspect: DB-SINGLE achieves better performance than ALB-SINGLE, even though the dataset used to train AlBERTo was composed of a large collection of tweets. The obtained macro-F1 values are the baselines of our work.</p>
      <p>We then describe the results obtained with the ensemble model. To build the classifier, we trained 30 instances of each architecture, keeping the same hyper-parameters obtained from the previous grid search. We thus set
NAL = NDB = NMLP = 30.</p>
      <p>We noted that the generalization capability of the ensemble is strictly related to the triple (nDB, nMLP, nAL), so we performed another grid search, looking for the optimal combination of the three parameters. Table 2 shows the five best configurations found by this search. The optimal values for the triple, (20, 25, 30), allow the ensemble to achieve an F1-score of 80.0%, a gain of about 2 points with respect to the score of a single DB-MLP (see Table 1).</p>
      <p>We also analyzed the contribution of each architecture individually to the ensemble combination. As shown in Table 3, the best results are obtained with instances of all three architectures. Nevertheless, the results presented in Table 2 show that a more balanced combination achieves better accuracy.</p>
      <p>We picked the first configuration from Table 2 for our final model and tested it on the internal test set, obtaining the results shown in Table 4.</p>
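      <p>The 80/20 stratified split and the stratified 3-fold cross-validation described above can be sketched with scikit-learn (toy data, not the actual HaSpeeDe2 dataset):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy stand-in for the tweets and their binary hs labels.
texts = ["tweet %d" % i for i in range(10)]
labels = [0, 1] * 5

# 80/20 split, stratified on the target variable hs.
dev_x, test_x, dev_y, test_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)

# Stratified 3-fold cross-validation over the development set.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(dev_x, dev_y):
    pass  # fine-tune on train_idx, validate on val_idx
```

The stratify argument keeps the hate/non-hate proportions identical across the two subsets.</p>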
    </sec>
    <sec id="sec-5">
      <title>5 Results and Discussion</title>
      <p>The results of our final model on the two official test sets of the competition are shown in Table 5. The model performs quite well on the in-domain dataset, reaching the 4th position in the rankings. However, it did not rank as well in detecting hate speech on the out-of-domain dataset, obtaining an F1-score of just 65.46. The low recall for the hate class highlights that the model too often fails to identify news headlines containing some form of hate speech. In comparison with the official top rankings, listed in Table 6, our model scored about 12 points below the top score of 77.44% F1.</p>
      <p>Surprised by this fact, we investigated more deeply, looking for an explanation for such a poor result on the out-of-domain dataset.</p>
      <p>We randomly sampled from the test set some
hateful headlines missed by the model, some of
which are shown in Table 7.</p>
      <p>In these headlines, the hateful content is implicit and harder to recognize, since it seems due more to the presence of stereotypes (nomads, asylum seekers, Muslims, foreigners) than to the presence of explicit hate expressions.</p>
      <p>Broadly speaking, we identified some possible reasons for the difference in performance across the two test sets:
• Linguistic register: tweets often exhibit a more informal and colloquial language, while headlines employ a more formal lexicon and a more objective tone. This is a crucial difference when identifying hateful messages: while in tweets the feeling of hatred transpires clearly and directly, in headlines the message is conveyed in a more subtle way, often alluding to concepts from political propaganda or to common stereotypes. Prior knowledge about the subject and inference might be necessary to decipher the presence of hate. Examining the entire body of the article might have been helpful.
• Length of text: tweets are usually longer than news headlines. Thus, the model has fewer elements to exploit to correctly classify a piece of news.</p>
      <p>Table 5 (fragment), NOT HATE class precision/recall: 81.93/72.85 and 71.88/99.37.</p>
      <p>Table 7 (examples of hateful headlines missed by the model):
• anziana rapinata sull'autobus, i due nomadi in fuga si rifugiano al campo di via Candoni (elderly woman robbed on the bus, the two fleeing nomads take refuge at the camp on via Candoni)
• Expo: Bordonali, richiedenti asilo in campo base simbolo fallimento governo. (Expo: Bordonali, asylum seekers in base camp a symbol of government failure.)
• Il cardinale Müller: "non possiamo pregare come o con i musulmani" (Cardinal Müller: "we cannot pray like or with Muslims")
• Salvini: "Il calcio? Rimpiango i tre stranieri in campo" (Salvini: "Soccer? I regret the three foreigners on the field")</p>
      <p>These difficulties seem to be shared by the other submissions, which all obtained lower scores on the out-of-domain dataset. We expected pretrained contextual embeddings to be more effective in addressing the domain adaptation issue. Further experiments would be needed to improve the resilience of our model.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusions</title>
      <p>We described an ensemble of neural classifiers, relying on contextual embeddings from transformers, for the automated detection of hateful content in Italian texts. We presented the general architecture of our base classification models and how they were combined into an ensemble through a bagging technique. We performed extensive experiments to tune our models and the ensemble on a validation set. The results achieved by our ensemble model on the in-domain test set confirm its ability to detect hateful tweets; however, the same model performed poorly on the out-of-domain dataset, showing in particular an inability to adapt to news headlines. We plan to investigate this issue in future research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Mirko Lai, and
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Long-term social media data collection at the university of turin</article-title>
          . In Elena Cabrio, Alessandro Mazzei, and Fabio Tamburini, editors,
          <source>Proceedings of the Fifth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2018</year>
          ), Torino, Italy,
          <source>December 10-12</source>
          ,
          <year>2018</year>
          , volume
          <volume>2253</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <given-names>Lucia C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for italian</article-title>
          .
          <source>In Valerio Basile</source>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and
          <string-name>
            <given-names>Maurizio</given-names>
            <surname>Tesconi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 hate speech detection task</article-title>
          .
          <source>In Tommaso Caselli</source>
          , Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy,
          <source>December 12-13</source>
          ,
          <year>2018</year>
          , volume
          <volume>2263</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <year>1996</year>
          .
          <article-title>Bagging predictors</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>24</volume>
          :
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Jill Burstein</source>
          , Christy Doran, and Thamar Solorio, editors,
          <source>Proceedings of the</source>
          <year>2019</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          , Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learning word vectors for 157 languages</article-title>
          .
          <source>In Proceedings of the International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Roberta: A robustly optimized BERT pretraining approach</article-title>
          . CoRR, abs/1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Adam</given-names>
            <surname>Paszke</surname>
          </string-name>
          , Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang,
          <string-name>
            <surname>Zachary</surname>
            <given-names>DeVito</given-names>
          </string-name>
          , Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and
          <string-name>
            <given-names>Soumith</given-names>
            <surname>Chintala</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Pytorch: An imperative style, high-performance deep learning library</article-title>
          . In H. Wallach,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beygelzimer</surname>
          </string-name>
          , F. d'Alche´-Buc, E. Fox, and R. Garnett, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          , pages
          <fpage>8024</fpage>
          -
          <lpage>8035</lpage>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hansel: Italian hate speech detection through ensemble learning and deep neural networks</article-title>
          .
          <source>In Tommaso Caselli</source>
          , Nicole Novielli, Viviana Patti, and Paolo Rosso, editors,
          <source>Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          )
          <article-title>co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2018</year>
          ), Turin, Italy,
          <source>December 12-13</source>
          ,
          <year>2018</year>
          , volume
          <volume>2263</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Pierpaolo Basile, Marco de Gemmis, Giovanni Semeraro, and
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets</article-title>
          .
          <source>In Raffaella Bernardi</source>
          , Roberto Navigli, and Giovanni Semeraro, editors,
          <source>Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          , Bari, Italy,
          <source>November 13-15</source>
          ,
          <year>2019</year>
          , volume
          <volume>2481</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Julian</given-names>
            <surname>Risch</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Krestel</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Bagging BERT models for robust aggression identification</article-title>
          . In Ritesh Kumar,
          <string-name>
            <given-names>Atul</given-names>
            <surname>Kr</surname>
          </string-name>
          . Ojha, Bornini Lahiri, Marcos Zampieri, Shervin Malmasi, Vanessa Murdock, and Daniel Kadar, editors,
          <source>Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying</source>
          ,
          <source>TRAC@LREC</source>
          <year>2020</year>
          , Marseille, France, May
          <year>2020</year>
          , pages
          <fpage>55</fpage>
          -
          <lpage>61</lpage>
          .
          European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Gloria Comandini, Elisa Di Nuovo, Simona Frenda, Marco Stranisci, Cristina Bosco, Tommaso Caselli, Viviana Patti, and
          <string-name>
            <given-names>Irene</given-names>
            <surname>Russo</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>HaSpeeDe 2@EVALITA2020: Overview of the EVALITA 2020 Hate Speech Detection Task</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020)</source>
          , Online. CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          . In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio,
          <string-name>
            <given-names>Hanna M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          , Rob Fergus,
          <string-name>
            <given-names>S. V. N.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and Roman Garnett, editors,
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          , 4-9 December 2017, Long Beach, CA, USA, pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amanpreet</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Julian</given-names>
            <surname>Michael</surname>
          </string-name>
          , Felix Hill,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>GLUE: A multi-task benchmark and analysis platform for natural language understanding</article-title>
          .
          In
          <source>Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , pages
          <fpage>353</fpage>
          -
          <lpage>355</lpage>
          , Brussels, Belgium, November. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Marcos</given-names>
            <surname>Zampieri</surname>
          </string-name>
          , Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çağrı Çöltekin.
          <year>2020</year>
          .
          <article-title>SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020)</article-title>
          .
          <source>CoRR</source>
          , abs/2006.07235.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>