Overview of the EVALITA 2018 Hate Speech Detection Task

Cristina Bosco (University of Torino, Italy) bosco@di.unito.it
Felice Dell'Orletta (ILC-CNR, Pisa, Italy) felice.dellorletta@ilc.cnr.it
Fabio Poletto (Acmos, Torino, Italy) fabio.poletto@edu.unito.it
Manuela Sanguinetti (University of Torino, Italy) msanguin@di.unito.it
Maurizio Tesconi (IIT-CNR, Pisa, Italy) maurizio.tesconi@iit.cnr.it

Abstract

English. The Hate Speech Detection (HaSpeeDe) task is a shared task on Italian social media (Facebook and Twitter) for the detection of hateful content, and it has been proposed for the first time at EVALITA 2018. Providing two datasets from two online social platforms that differ from the linguistic and communicative point of view, we organized the task into three sub-tasks in which systems must be trained and tested on the same resource, or trained on one resource and tested on the other: HaSpeeDe-FB, HaSpeeDe-TW and Cross-HaSpeeDe (further subdivided into the Cross-HaSpeeDe FB and Cross-HaSpeeDe TW sub-tasks). Overall, 9 teams participated in the task, and the best system achieved a macro F1-score of 0.8288 for HaSpeeDe-FB, 0.7993 for HaSpeeDe-TW, 0.6541 for Cross-HaSpeeDe FB and 0.6985 for Cross-HaSpeeDe TW. In this report, we describe the datasets released and the evaluation measures, and we discuss the results.

Italiano. HaSpeeDe is the first evaluation campaign for systems that automatically identify hate speech on social media (Facebook and Twitter) in Italian, proposed within EVALITA 2018. Providing participants with two datasets drawn from two platforms that differ from a linguistic and communicative point of view, we organized HaSpeeDe into three tasks in which systems are trained and tested on the same type of data, or trained on one type and tested on the other: HaSpeeDe-FB, HaSpeeDe-TW and Cross-HaSpeeDe (in turn subdivided into Cross-HaSpeeDe FB and Cross-HaSpeeDe TW). Overall, 9 teams participated in the campaign, and the best system achieved a macro F1 score of 0.8288 in HaSpeeDe-FB, 0.7993 in HaSpeeDe-TW, 0.6541 in Cross-HaSpeeDe FB and 0.6985 in Cross-HaSpeeDe TW. The paper describes the released datasets and the evaluation procedure, and discusses the results obtained.

1 Introduction and Motivations

Online hateful content, or Hate Speech (HS), is characterized by some key aspects (such as virality, or presumed anonymity) which distinguish it from offline communication and make it potentially more dangerous and hurtful. Therefore, its identification becomes a crucial mission in many fields.

The task we have proposed for this edition of EVALITA consists in automatically annotating messages from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence (or not) of HS.

HS can be defined as any expression "that is abusive, insulting, intimidating, harassing, and/or incites to violence, hatred, or discrimination. It is directed against people on the basis of their race, ethnic origin, religion, gender, age, physical condition, disability, sexual orientation, political conviction, and so forth" (Erjavec and Kovačič, 2012).
Although definitions of and approaches to HS vary a lot and depend on the juridical tradition of each country, many agree that what is identified as such cannot fall under the protection granted by the right to freedom of expression, and must be prohibited. Also in order to put the Code of Conduct of the European Union [1] into practice, online platforms like Twitter, Facebook or YouTube discourage hateful content, but its removal mainly relies on reports by users and trusted flaggers, and lacks systematic control.

[1] On May 31, 2016, the EU Commission presented, together with Facebook, Microsoft, Twitter and YouTube, a "Code of conduct on countering illegal hate speech online".

Although HS analysis and identification require a multidisciplinary approach that includes knowledge from different fields (psychology, law and social sciences, among others), NLP plays a fundamental role in this respect. Therefore, the development of high-accuracy automatic tools able to identify HS assumes the utmost relevance not only for NLP – and Italian NLP in particular – but also for all the practical applications such a task lends itself to. Furthermore, as also suggested in Schmidt and Wiegand (2017), the community would considerably benefit from a benchmark dataset for HS detection underlying a commonly accepted definition of the task.

As regards the state of the art, a large number of contributions have been proposed on this topic, ranging from lexicon-based approaches (Gitari et al., 2015) to various machine learning techniques: naïve Bayes classifiers (Kwok and Wang, 2013), Logistic Regression and Support Vector Machines (Burnap and Williams, 2015; Davidson et al., 2017), and the more recent Recurrent and Convolutional Neural Networks (Mehdad and Tetreault, 2016; Gambäck and Sikdar, 2017). However, there exist no comparative studies which would allow a judgement on the most effective learning method (Schmidt and Wiegand, 2017).
Furthermore, a large number of academic events and shared tasks have taken place in the recent past, thus reflecting the interest of the NLP community in HS and HS-related topics; to name a few, the first and second edition of the Workshop on Abusive Language [2] (Waseem et al., 2017), the First Workshop on Trolling, Aggression and Cyberbullying (Kumar et al., 2018), which also included a shared task on aggression identification, the tracks on Automatic Misogyny Identification (AMI) (Fersini et al., 2018b) and on authorship and aggressiveness analysis (MEX-A3T) (Carmona et al., 2018) proposed at the 2018 edition of IberEval, the GermEval Shared Task on the Identification of Offensive Language (Wiegand et al., 2018), the Automatic Misogyny Identification task at EVALITA 2018 (Fersini et al., 2018a), and finally the SemEval shared task on hate speech detection against immigrants and women (HatEval), which is still ongoing at the time of writing [3].

[2] https://sites.google.com/view/alw2018/
[3] https://competitions.codalab.org/competitions/19935

On the other hand, such contributions and events mainly concern other languages (English, for the most part), while very few of them deal with Italian (Del Vigna et al., 2017; Musto et al., 2016; Pelosi et al., 2017). Precisely for this reason, the Hate Speech Detection (HaSpeeDe) [4] task has been conceived and proposed within the EVALITA context (Caselli et al., 2018); its purpose is to encourage and promote the participation of several research groups, both from academia and industry, by making a shared dataset available, in order to allow an advancement of the state of the art in this field for Italian as well.

[4] http://www.di.unito.it/~tutreeb/haspeede-evalita18/

2 Task Organization

Considering the linguistic, as well as meta-linguistic, features that distinguish Twitter and Facebook posts, namely due to the differences in use between the two platforms and the character limitations imposed on their messages (especially on Twitter), the task has been further organized into three sub-tasks, based on the dataset used (see Section 3):

• Task 1: HaSpeeDe-FB, where only the Facebook dataset could be used to classify the Facebook test set
• Task 2: HaSpeeDe-TW, where only the Twitter dataset could be used to classify the Twitter test set
• Task 3: Cross-HaSpeeDe, which has been further subdivided into two sub-tasks:
  – Task 3.1: Cross-HaSpeeDe FB, where only the Facebook dataset could be used to classify the Twitter test set
  – Task 3.2: Cross-HaSpeeDe TW, where, conversely, only the Twitter dataset could be used to classify the Facebook test set

Cross-HaSpeeDe, in particular, has been proposed as an out-of-domain task that specifically aimed, on the one hand, at highlighting the challenging aspects of using social media data for classification purposes and, on the other, at enhancing the systems' ability to generalize their predictions across different datasets.

3 Datasets and Format

The datasets proposed for this task are the result of a joint effort of two research groups on harmonizing the annotation previously applied to two different datasets, in order to allow their exploitation in the task.
The first dataset is a collection of Facebook posts developed by the group from Pisa and created in 2016 (Del Vigna et al., 2017), while the other one is a Twitter corpus developed in 2017-2018 by the Turin group (Sanguinetti et al., 2018). Sections 3.1 and 3.2 briefly introduce the original datasets, while Section 3.3 describes the unified annotation scheme adopted in both corpora for the purposes of this task.

3.1 Facebook Dataset

This is a corpus of comments retrieved from the Facebook public pages of Italian newspapers, politicians, artists, and groups. Those pages were selected because they typically host discussions spanning a variety of topics.
The comments collected were related to a series of web pages and groups, chosen as being suspected to possibly contain hateful content: salviniofficial, matteorenziufficiale, lazanzarar24, jenusdinazareth, sinistracazzateliberta2, ilfattoquotidiano, emosocazzi, noiconsalviniufficiale.
Overall, 17,567 Facebook comments were collected from 99 posts crawled from the selected pages. Five bachelor students were asked to annotate the comments; in particular, 3,685 comments received at least 3 annotations. The annotators were asked to assign one class to each post, where the classes span over the following levels of hate: No hate, Weak hate, Strong hate.
Hateful messages were then divided into distinct categories: Religion, Physical and/or mental handicap, Socio-economical status, Politics, Race, Sex and Gender issues, and Other.

3.2 Twitter Dataset

The Twitter dataset released for the competition is a subset of a larger hate speech corpus developed at the University of Turin. The corpus is indeed part of the Hate Speech Monitoring program [5], coordinated by the Computer Science Department with the aim of detecting, analyzing and countering HS with an inter-disciplinary approach (Bosco et al., 2017). Its preliminary stage of development has been described in Poletto et al. (2017), while the fully developed corpus is described in Sanguinetti et al. (2018).

[5] http://hatespeech.di.unito.it/

The collection includes Twitter posts gathered with a classical keyword-based approach, more specifically by filtering the corpus using neutral keywords related to three social groups deemed as potential HS targets in the Italian context: immigrants, Muslims and Roma.
After a first annotation step that resulted in a collection of around 1,800 tweets, the corpus was further expanded by adding newly annotated data. The newly introduced tweets were annotated partly by experts and partly by CrowdFlower (now Figure Eight) contributors. The final version of the corpus consists of 6,928 tweets.
The main feature of this corpus is its annotation scheme, specifically designed to properly encode the multiplicity of factors that can contribute to the definition of hate speech, and to offer a broader tagset capable of better representing all those factors which may increase, or rather mitigate, the impact of the message. This resulted in a scheme that includes, besides the HS tag (no-yes), also its intensity degree (from 1 through 4 if HS is present, and 0 otherwise), the presence of aggressiveness (no-weak-strong) and offensiveness (no-weak-strong), as well as irony and stereotype (no-yes).
In addition, given that irony has been included as an annotation category in the scheme, part of this hate speech corpus (i.e. the tweets annotated as ironic) has also been used in another task proposed in this edition of EVALITA, namely the one on irony detection in Italian tweets (IronITA) [6] (Cignarella et al., 2018). More precisely, the overlapping tweets in the IronITA datasets are 781 in the training set and just 96 in the test set.

[6] http://www.di.unito.it/~tutreeb/ironita-evalita18/
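As a purely illustrative sketch (not part of the official task material), the Turin annotation scheme described above can be thought of as a small per-tweet record from which the binary HaSpeeDe label is then derived; all class and field names below are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TweetAnnotation:
        # One tweet of the Turin corpus, following the scheme sketched in Section 3.2;
        # field names are illustrative, not the official column names of the corpus.
        tweet_id: str
        text: str
        hs: bool                # hate speech present: yes/no
        intensity: int          # 1 through 4 if hs is True, 0 otherwise
        aggressiveness: str     # "no" | "weak" | "strong"
        offensiveness: str      # "no" | "weak" | "strong"
        irony: bool             # yes/no
        stereotype: bool        # yes/no

        def haspeede_label(self) -> int:
            # Simplified binary label released for HaSpeeDe: 1 if HS, 0 otherwise.
            return int(self.hs)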
3.3 Format and Data in HaSpeeDe

The annotation format provided for the task is the same for both datasets described above, and it consists of a simplified version of the schemes adopted in the two corpora introduced in Sections 3.1 and 3.2.
The data have been encoded in UTF-8 plain-text files with three tab-separated columns, each one representing the following information:

1. the ID of the Facebook comment or tweet [7],
2. the text,
3. the class: 1 if the text contains HS, and 0 otherwise (see Tables 1 and 2 for a few examples).

[7] In order to meet the GDPR requirements, texts have been pseudonymized by replacing all original IDs in both datasets with newly-generated ones.

id    text                                                                                                 hs
8     Io voterò NO NO E NO                                                                                 0
36    Matteo serve un colpo di stato. Qua tra poco dovremo andare in giro tutti armati come in America.    1

Table 1: Annotation examples from the Facebook dataset.

id       text                                                                             hs
1,783    Corriere: Mafia Capitale, 4 patteggiamenti Gli appalti truccati dei campi rom    0
3,290    altro che profughi? sono zavorre e tutti uomini                                  1

Table 2: Annotation examples from the Twitter dataset.

Both the Facebook and the Twitter dataset consist of a total amount of 4,000 comments/tweets retrieved from the main corpora introduced in Sections 3.1 and 3.2. The data were randomly split into a development and a test set, of 3,000 and 1,000 messages respectively.
The distribution of the labels expressing the presence or not of HS in both datasets is summarized in Tables 3 and 4.

        0       1
Train   1,618   1,382
Test    323     677
Total   1,941   2,059

Table 3: Label distribution in the Facebook dataset.

        0       1
Train   2,028   972
Test    676     324
Total   2,704   1,296

Table 4: Label distribution in the Twitter dataset.
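A minimal loading sketch, assuming only the format described above (UTF-8 plain text, three tab-separated columns) and a hypothetical file name; the actual names of the distributed files may differ:

    import csv

    def load_haspeede(path):
        # Read one HaSpeeDe file: UTF-8 plain text, three tab-separated columns
        # (id, text, class), where class is 1 for hate speech and 0 otherwise.
        ids, texts, labels = [], [], []
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                if len(row) < 3:
                    continue  # skip blank or malformed lines, if any
                ids.append(row[0])
                texts.append(row[1])
                labels.append(int(row[2]))
        return ids, texts, labels

    # hypothetical file name, used for illustration only
    train_ids, train_texts, train_labels = load_haspeede("haspeede_FB-train.tsv")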
4 Evaluation

Participants were allowed to submit up to 2 runs for each task, and a separate official ranking has been provided for each task.
The evaluation has been performed according to the standard metrics known in the literature, i.e. Precision, Recall and F1-score. However, given the imbalanced distribution of hateful vs. not hateful messages, and in order to get more useful insights into a system's performance on a given class, the scores have been computed for each class separately; the F1-score has then been macro-averaged, so as to obtain the overall results.
For all tasks, the baseline score has been computed as the performance of a classifier that always assigns the most frequent class.
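The following is a minimal sketch of this scoring procedure as we read it (per-class Precision, Recall and F1, macro-averaged F1, and a most-frequent-class baseline estimated on the training labels), written with scikit-learn; it is not the official evaluation script.

    from collections import Counter
    from sklearn.metrics import f1_score, precision_recall_fscore_support

    def evaluate(gold, predicted):
        # Precision, recall and F1 for each class (0 and 1), plus macro-averaged F1.
        p, r, f1, _ = precision_recall_fscore_support(
            gold, predicted, labels=[0, 1], zero_division=0)
        macro = f1_score(gold, predicted, labels=[0, 1],
                         average="macro", zero_division=0)
        return {"class_0": (p[0], r[0], f1[0]),
                "class_1": (p[1], r[1], f1[1]),
                "macro_f1": macro}

    def most_frequent_class_baseline(train_labels, gold):
        # Always predict the class that is most frequent in the training data.
        majority = Counter(train_labels).most_common(1)[0][0]
        return evaluate(gold, [majority] * len(gold))

With the label distributions in Tables 3 and 4, always predicting class 0 yields a macro F1 of about 0.24 on the Facebook test set and about 0.40 on the Twitter test set, consistent with the baseline rows reported in the result tables below.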
5 Overview of the Task: Participation and Results

5.1 Task Participants and Submissions

A total amount of 9 teams [8] participated in at least one of the three HaSpeeDe main tasks. Table 5 provides an overview of the teams and their affiliations.
Except for one case, where a single run was sent for HaSpeeDe-TW only, all teams submitted at least one run for all the tasks.

[8] In fact, 11 teams submitted their results, but one team withdrew its submissions, and another one's submissions were removed from the official rankings by the task organizers.

Team             Affiliation
GRCP             Univ. Politècnica de València + CERPAMID, Cuba
InriaFBK         Univ. Côte d'Azur, CNRS, Inria + FBK, Trento
ItaliaNLP        ILC-CNR, Pisa + Univ. of Pisa
Perugia          Univ. for Foreigners of Perugia + Univ. of Perugia + Univ. of Florence
RuG              University of Groningen + Univ. degli Studi di Salerno
sbMMP            Zurich Univ. of Applied Sciences
StopPropagHate   INESC TEC + Univ. of Porto + Eurecat, Centre Tecn. de Catalunya
HanSEL           University of Bari Aldo Moro
VulpeculaTeam    University of Perugia

Table 5: Participants overview.

5.2 Systems

As participants were allowed to submit up to 2 runs for each task, several training options were adopted in order to properly classify the texts.
Furthermore, unlike other tasks, we chose not to establish any distinction between constrained and unconstrained runs, and to allow participants to use all the additional resources they deemed useful for the task (other annotated resources, lexicons, pre-trained word embeddings, etc.), on the sole condition that these were explicitly mentioned in their final report.
Table 6 summarizes the external resources (if any) used by participants to enhance their systems' performance, while the remainder of this section offers a brief overview of the teams' systems and the core methods adopted to participate in the task.

Team             External Resources
GRCP             pre-trained word embeddings
InriaFBK         emotion lexicon
ItaliaNLP Lab    polarity and subjectivity lexicons + 2 word-embedding lexicons
Perugia          Twitter corpus + hate speech lexicon + polarity lexicon
RuG              pre-trained word embeddings + bad/offensive word lists
sbMMP            pre-trained word embeddings
StopPropagHate   –
HanSEL           pre-trained word embeddings
VulpeculaTeam    polarity lexicon + lists of bad words + pre-trained word embeddings

Table 6: Overview of the additional resources used by participants, besides the datasets provided by the task organizers.

GRCP (De la Peña Sarracén et al., 2018). The authors proposed a bidirectional Long Short-Term Memory Recurrent Neural Network with an attention-based mechanism that estimates the importance of each word; this context vector is then used with another LSTM model to estimate whether a text is hateful or not.

HanSEL (Polignano and Basile, 2018). The proposed system is based on an ensemble of three classification strategies, mediated by a majority vote algorithm: Support Vector Machine with RBF kernel, Random Forest and Deep Multilayer Perceptron. The input social media text is represented as a concatenation of word2vec sentence vectors and a TF-IDF bag of words.

InriaFBK (Corazza et al., 2018). The authors implemented three different classifier models, based on recurrent neural networks, n-gram-based models and a linear SVC.

ItaliaNLP (Cimino et al., 2018). The participants tested three different classification models: one based on a linear SVM, another one based on a 1-layer BiLSTM, and a newly-introduced one based on a 2-layer BiLSTM which exploits multi-task learning with additional data from the 2016 SENTIPOLC task (Barbieri et al., 2016).

Perugia (Santucci et al., 2018). The participants' system uses a document classifier based on an SVM algorithm. The features used by the system are a combination of features extracted through mathematical operations on FastText word embeddings and another 20 features extracted from the raw text.

RuG (Bai et al., 2018). The authors proposed two different classifiers: an SVM with a linear kernel and an ensemble system composed of an SVM classifier and a Convolutional Neural Network combined by a logistic regression meta-classifier. The features of each classifier are algorithm-dependent and exploit word embeddings, raw text features and features from lexical resources.

sbMMMP. The authors tested two different systems, in a similar fashion to what is described in von Grünigen et al. (2018). The first one is based on an ensemble of Convolutional Neural Networks (CNN), whose outputs are then used as features by a meta-classifier for the final prediction. The second system uses a combination of a CNN and a Gated Recurrent Unit (GRU) together with a transfer-learning approach based on pre-training with a large, automatically-translated dataset.

StopPropagHate (Fortuna et al., 2018). The authors use a classifier based on Recurrent Neural Networks with binary cross-entropy as the loss function. In their system, each input word is represented by a 10,000-dimensional one-hot vector.

VulpeculaTeam (Bianchini et al., 2018). According to the description provided by the participants, a neural network with three hidden layers was used, with word embeddings trained on a set of previously extracted Facebook comments.
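Purely as an illustration of the ensemble-with-majority-vote strategy mentioned in several of the descriptions above (and not a reproduction of any team's actual implementation), a scikit-learn sketch over word n-gram TF-IDF features might look as follows; the actual systems also relied on word embeddings and lexical resources.

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Hard (majority) voting over three heterogeneous classifiers.
    ensemble = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        VotingClassifier(
            estimators=[
                ("svm_rbf", SVC(kernel="rbf", gamma="scale")),
                ("random_forest", RandomForestClassifier(n_estimators=200)),
                ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)),
            ],
            voting="hard",
        ),
    )

    # train_texts/train_labels as loaded in the sketch after Section 3.3;
    # test_texts would be read from the test file in the same way.
    ensemble.fit(train_texts, train_labels)
    predictions = ensemble.predict(test_texts)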
5.3 Results and Discussion

In Tables 7, 8, 9 and 10 we report the final results of HaSpeeDe, separated according to the respective sub-task and ranked by macro F1-score (as described in Section 4) [9]. In case of multiple runs, the suffixes "_1" and "_2" have been appended to each team name, in order to distinguish the run number of the submitted file. Furthermore, some of the runs in the tables have been marked with *: this means that they were re-submitted because of file incompatibility with the evaluation script or other minor issues that did not affect the evaluation process.

[9] Due to space constraints, the complete evaluation for all classes has been made available here: https://goo.gl/xPyPRW

Team                 Macro F1-score
baseline             0.2441
ItaliaNLP_2          0.8288
ItaliaNLP_1          0.8106
InriaFBK_1           0.8002
InriaFBK_2           0.7863
Perugia_2            0.7841
RuG_1                0.7751
HanSEL               0.7738
VulpeculaTeam*       0.7554
RuG_2                0.7428
GRCP_2               0.7147
GRCP_1               0.7144
StopPropagHate_2*    0.6532
StopPropagHate_1*    0.6419
Perugia_1            0.2424

Table 7: Results of the HaSpeeDe-FB task.

Team                 Macro F1-score
baseline             0.4033
ItaliaNLP_2          0.7993
ItaliaNLP_1          0.7982
RuG_1                0.7934
InriaFBK_2           0.7837
sbMMMP               0.7809
InriaFBK_1           0.78
VulpeculaTeam*       0.7783
Perugia_2            0.7744
RuG_2                0.753
StopPropagHate_2*    0.7426
StopPropagHate_1*    0.7203
GRCP_1               0.6638
GRCP_2               0.6567
HanSEL               0.6491
Perugia_1            0.4033

Table 8: Results of the HaSpeeDe-TW task.

Team                 Macro F1-score
baseline             0.4033
InriaFBK_2           0.6541
InriaFBK_1           0.6531
VulpeculaTeam        0.6542
Perugia_2            0.6279
ItaliaNLP_1          0.6068
ItaliaNLP_2          0.5848
GRCP_2               0.5436
RuG_1                0.5409
RuG_2                0.4845
GRCP_1               0.4544
HanSEL               0.4502
StopPropagHate       0.443
Perugia_1            0.4033

Table 9: Results of the Cross-HaSpeeDe FB sub-task.

Team                 Macro F1-score
baseline             0.2441
ItaliaNLP_2          0.6985
InriaFBK_2           0.6802
ItaliaNLP_1          0.6693
InriaFBK_1           0.6547
VulpeculaTeam*       0.6189
RuG_1                0.6021
RuG_2                0.5545
HanSEL               0.4838
Perugia_2            0.4594
GRCP_1               0.4451
StopPropagHate*      0.4378
GRCP_2               0.318
Perugia_1            0.2441

Table 10: Results of the Cross-HaSpeeDe TW sub-task.

In absolute terms, i.e. based on the score of the first-ranked team, the best results have been achieved in the HaSpeeDe-FB task, with a macro F1 of 0.8288, followed by HaSpeeDe-TW (0.7993), Cross-HaSpeeDe TW (0.6985) and Cross-HaSpeeDe FB (0.6541).
The robustness of an approach benefiting from a polarity and subjectivity lexicon is confirmed by the fact that the best-ranking team in both HaSpeeDe-FB and HaSpeeDe-TW, i.e. ItaliaNLP, also achieved valuable results in the cross-domain sub-tasks, ranking at fifth and first position in Cross-HaSpeeDe FB and Cross-HaSpeeDe TW, respectively. But these results may also depend on the combination of the polarity and subjectivity lexicon with word embeddings, which alone did not allow the achievement of particularly high results.
Furthermore, it is not surprising that the best results have been obtained on HaSpeeDe-FB, given that messages posted on this platform are longer and better formed than those on Twitter, allowing systems (and humans too) to find more, and clearer, indications of the presence of HS.
The coarse granularity of the annotation scheme, which is a simplification of the schemes originally proposed for the datasets and was merged specifically for the purpose of this task, probably influenced the scores, which are indeed very promising and high with respect to other tasks in the sentiment analysis area.
As regards the Cross-HaSpeeDe FB and Cross-HaSpeeDe TW sub-tasks, the lower results with respect to the in-domain tasks can be attributed to several factors, among which – as expected – the different distribution of the HS and not-HS classes in the Facebook and Twitter datasets. As a matter of fact, the percentage of HS in the Facebook training and test sets is around 46% and 68%, respectively, while in the Twitter dataset it is around 32% in both sets. Such an imbalanced distribution is reflected in the overall system outputs in the two sub-tasks: in Cross-HaSpeeDe FB, where systems have been evaluated against the Twitter test set, most of the labels predicted as HS were not classified as such in the gold standard; conversely, in Cross-HaSpeeDe TW, the majority of labels predicted as not HS were actually considered as HS in the gold corpus.
Another feature that distinguishes the Facebook dataset from the Twitter one is the wider range of hate categories in the former, compared to the latter (see Sections 3.1 and 3.2). Especially in Cross-HaSpeeDe TW, the identification of hateful messages may have been made even more difficult by the reduced number of potential hate targets in the training set with respect to the test set.
Overall, the heterogeneous nature of the datasets provided for the task – both in terms of class distribution and data composition – together with their quite small size, made the whole task even more challenging; nonetheless, this did not prevent participants from finding appropriate solutions, thus improving the state of the art for HS identification in the Italian language as well.

6 Closing Remarks

This paper describes the HaSpeeDe task for the detection of HS in Italian texts from Facebook and Twitter. The novelty of the task mainly consists in allowing the comparison between the results obtained on the two platforms, and experiments on training on one typology of texts and testing on the other. The results confirmed the difficulty of cross-platform HS detection, but also produced very promising scores in the tasks where data from the same social network were exploited both for training and testing.
Future work can be devoted to an in-depth analysis of errors and to the observation of the contribution that different resources can give to systems performing this task.

Acknowledgments

The work of Cristina Bosco and Manuela Sanguinetti is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01).

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate Speech Detection in Italian Social Media. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).
Giulio Bianchini, Lorenzo Ferri, and Tommaso Giorni. 2018. Text Analysis for Hate Speech Detection in Italian Messages on Twitter and Facebook. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Cristina Bosco, Viviana Patti, Marcello Bogetti, Michelangelo Conoscenti, Giancarlo Ruffo, Rossano Schifanella, and Marco Stranisci. 2017. Tools and Resources for Detecting Hate and Prejudice Against Immigrants in Social Media. In Proceedings of First Symposium on Social Interactions in Complex Intelligent Systems (SICIS), AISB Convention 2017, AI and Society.

Pete Burnap and Matthew L. Williams. 2015. Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making. Policy & Internet, 7(2).

Miguel Ángel Álvarez Carmona, Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor Pineda, Verónica Reyes-Meza, and Antonio Rico Sulayes. 2018. Overview of MEX-A3T at IberEval 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets. In IberEval@SEPLN. CEUR-WS.org.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the Evalita 2018 Task on Irony Detection in Italian Tweets (IronITA). In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task Learning in Deep Neural Networks at EVALITA 2018. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing Different Supervised Approaches to Hate Speech Detection. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. CoRR, abs/1703.04009.

Gretel Liz De la Peña Sarracén, Reynaldo Gil Pons, Carlos Enrique Muñiz Cuza, and Paolo Rosso. 2018. Hate Speech Detection Using Attention-based LSTM. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate Me, Hate Me Not: Hate Speech Detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17).

Karmen Erjavec and Melita Poler Kovačič. 2012. "You Don't Understand, This is a New War!" Analysis of Hate Speech in News Web Sites' Comments. Mass Communication and Society, 15(6).

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the EVALITA 2018 Task on Automatic Misogyny Identification (AMI). In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In IberEval@SEPLN. CEUR-WS.org.

Paula Fortuna, Ilaria Bonavita, and Sérgio Nunes. 2018. Merging Datasets for Hate Speech Classification in Italian. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Björn Gambäck and Utpal Kumar Sikdar. 2017. Using Convolutional Neural Networks to Classify Hate-Speech. In Proceedings of the First Workshop on Abusive Language.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A Lexicon-based Approach for Hate Speech Detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4).
Ritesh Kumar, Atul Kr. Ojha, Marcos Zampieri, and Shervin Malmasi, editors. 2018. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). Association for Computational Linguistics.

Irene Kwok and Yuzhou Wang. 2013. Locate the Hate: Detecting Tweets Against Blacks. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press.

Yashar Mehdad and Joel Tetreault. 2016. Do Characters Abuse More Than Words? In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, and Pasquale Lops. 2016. Modeling Community Behavior through Semantic Analysis of Social Data: The Italian Hate Map Experience. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, UMAP 2016.

Serena Pelosi, Alessandro Maisto, Pierluigi Vitale, and Simonetta Vietri. 2017. Mining Offensive Language on Social Media. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017).

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.

Marco Polignano and Pierpaolo Basile. 2018. HanSEL: Italian Hate Speech Detection through Ensemble Learning and Deep Neural Networks. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference 2018.

Valentino Santucci, Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari. 2018. Detecting Hate Speech for Italian Language in Social Media. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius Von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault, editors. 2017. Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).