Error Analysis in a Hate Speech Detection Task: the Case of HaSpeeDe-TW at EVALITA 2018

Chiara Francesconi
Dipartimento di Lingue e Letterature Straniere e Culture Moderne, University of Turin
chiara.francesconi@edu.unito.it

Cristina Bosco, Fabio Poletto, Manuela Sanguinetti
Dipartimento di Informatica, University of Turin
{bosco,poletto,msanguin}@di.unito.it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Taking as a case study the Hate Speech Detection task at EVALITA 2018, the paper discusses the distribution and typology of the errors made by the five best-scoring systems. The focus is on the sub-task where Twitter data was used both for training and testing (HaSpeeDe-TW). In order to highlight the complexity of hate speech and the reasons behind the failures in its automatic detection, the annotation provided for the task is enriched with orthogonal categories annotated in the original reference corpus, such as aggressiveness, offensiveness, irony and the presence of stereotypes.

1 Introduction

The field of Natural Language Processing witnesses an ever-growing number of automated systems trained on annotated data and built to solve, with remarkable results, the most diverse tasks. As performances increase, the resources, settings and features that contributed to the improvement are (understandably) emphasized, but sometimes little or no room is given to an analysis of the factors that caused the system to misclassify some items. This paper aims to draw attention to the importance of a thorough error analysis of the performance of supervised systems, as a means to produce advancement in the field. The errors made by a system may reveal not only the weakness of the system itself, but also the sparseness of the training data, the failure of the annotation scheme to describe the observed phenomena, or the inherent ambiguity of the data. The presence of the same errors in the results of several systems involved in a shared task may also provide interesting hints about the directions to be followed in the improvement of both data and systems.

As a case study for carrying out error analysis, data from a shared task have been used in this paper. Shared tasks offer clean, high-quality annotated datasets on which different systems are trained and tested. Although researchers often omit to reflect on what caused a system to fail (Nissim et al., 2017), shared tasks are an ideal ground for sharing negative results and encouraging reflection on "what did not work": an excellent opportunity to carry out a comparative error analysis and search for patterns that may, in turn, suggest improvements in both the dataset and the systems.

Here we analyze the case of the Hate Speech Detection (HaSpeeDe) task (Bosco et al., 2018) presented at EVALITA 2018, the Evaluation Campaign for NLP and Speech Tools for Italian (Caselli et al., 2018). HS detection is a really complex task, starting from the definition of the notion on which it is centered. Considering the growing attention it is gaining (see, e.g., the variety of resources and tasks for HS developed in the last few years), we believe that error analysis could be especially interesting and useful in this case, as well as in other tasks where the outcome of systems meaningfully depends on the resources exploited for training and testing.

The paper outlines the background and motivations behind this research (Section 2), describes the sub-task on which the study is based (Section 3), reports on the error analysis process (Section 4), discusses its results (Section 5), and presents some conclusive remarks (Section 6).

2 Background and Motivations

There are several issues connected to the identification of HS: its juridical definition, the subjectivity of its perception, the need to remove potentially illegal content from the web without unjustly removing legal content, and a list of linguistic phenomena that partly overlap with HS but need to be kept apart.

Many works have recently contributed to the field by releasing novel annotated resources or presenting automated classifiers. Two surveys on HS detection were recently published by Schmidt and Wiegand (2017) and Fortuna and Nunes (2018). Since 2016, shared tasks on the detection of HS or related phenomena (such as abusive language or misogyny) have been organized, effectively enhancing advancements in resource building and system development. These include HatEval at SemEval 2019 (Basile et al., 2019), AMI at IberEval 2018 (Fersini et al., 2018), HaSpeeDe at EVALITA 2018 (Bosco et al., 2018) and more. Nevertheless, the growing interest in HS detection suggests that the task is far from being solved: improving the quality and interoperability of resources, designing suitable annotation schemes and reducing biases in the annotation are still as needed as work on system engineering. Establishing standards and good practices in error analysis can enhance these processes and push towards the development of effective classifiers for HS.

While the academic literature is rich with works on human annotation and evaluation metrics, it is not as easy to find works dedicated to the error analysis of automated classification systems. Such an analysis is more often found as a section of papers describing a system (see, e.g., Mohammad et al. (2018)); this section, however, is not always present. Examining the errors made by a system, classifying them and searching for linguistic patterns appear to be a somewhat undervalued job, especially when the system had an overall good performance. Yet, it is crucial to understand why a system proved to be a weak solution to certain instances of a problem, even while being excellent for other instances.

In the context of COLING 2018, error analysis emerged as one of the most relevant features to be addressed in NLP research (https://coling2018.org/error-analysis-in-research-and-writing/). This attention to error analysis encouraged authors to submit papers with a dedicated section, with Yang et al. (2018) winning the award for the best error analysis, and is a step towards establishing good practices in the NLP community.

In the wake of this awareness, we apply linguistic insights to one of the annotated corpora used within the HaSpeeDe shared task, namely the HaSpeeDe-TW sub-task dataset (described in Section 3). The characteristics of this dataset make it ideal for our purpose: each tweet is connected to a target and is annotated not only for the presence of HS but for four other parameters. While a comparative analysis of two corpora presenting different textual genres (HaSpeeDe-TW and HaSpeeDe-FB) might have offered interesting perspectives, the lack of such characteristics in the FB dataset prevents a thorough comparison. Furthermore, among the in-domain HaSpeeDe sub-tasks, HaSpeeDe-TW is the one where systems achieved the lower F1-scores, thus providing more material for our analysis.

3 HaSpeeDe-TW at EVALITA 2018: A Brief Overview

While a description of the HaSpeeDe task as a whole has been provided in the organizers' overview (Bosco et al., 2018), here we focus on HaSpeeDe-TW, one of the three sub-tasks into which the competition was structured (the other two being HaSpeeDe-FB, where Facebook data were used both for training and testing the systems, and Cross-HaSpeeDe, further subdivided into Cross-HaSpeeDe-FB and Cross-HaSpeeDe-TW, where systems were trained using Facebook data and tested against Twitter data in the former, and the opposite in the latter). The sub-task consisted in a binary classification of hateful vs non-hateful tweets. The training set and test set contain 3,000 and 1,000 tweets respectively, labeled with 1 or 0 for the presence of HS, with a distribution, in both sets, of around 1/3 hateful against 2/3 non-hateful tweets. Data are drawn from an already existing HS corpus (Poletto et al., 2017), whose original annotation scheme was simplified for the purposes of the task (see Section 4).

Nine teams participated in the task, submitting fifteen runs. The five best scores, submitted by the teams ItaliaNLP (whose runs ranked 1st and 2nd) (Cimino and De Mattei, 2018), RuG (Bai et al., 2018), InriaFBK (Corazza et al., 2018) and sbMMP (von Grünigen et al., 2018), ranged from 0.7993 to 0.7809 in terms of macro-averaged F1-score (all official ranks are available at https://goo.gl/xPyPRW). They applied both classical machine learning approaches, in particular Linear Support Vector Machines (ItaliaNLP, RuG), and more recent deep learning algorithms, such as Convolutional Neural Networks (sbMMP) or Bi-LSTMs (ItaliaNLP, who adopted a multi-task learning approach exploiting the SENTIPOLC 2016 (Barbieri et al., 2016) dataset as well).
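Since the official ranking is based on the macro-averaged F1-score, which gives the minority (hateful) class the same weight as the majority class, it may help to recall how the metric is computed. The following is our own minimal illustration under the standard definition, not the official evaluation script:

```python
def prf(gold, pred, cls):
    """Precision, recall and F1 of `pred` w.r.t. `gold` for class `cls`."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1-scores (classes 0 and 1)."""
    return sum(prf(gold, pred, c)[2] for c in (0, 1)) / 2

# toy example: one hateful tweet missed (a false negative)
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

Because the mean is unweighted, missing hateful tweets is penalized heavily even though they make up only about a third of the data.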
Learning architectures resorted both to surface features, such as word and character n-grams (RuG), and to linguistic information, such as Part of Speech (ItaliaNLP).

In the next section, we provide a description of the errors collected from these five best runs, put in relation with the specific factors we chose to analyze in this study, encompassing and merging qualitative and quantitative observations. Our analysis is strictly based on the results provided by those systems. An analysis focused on the features of the systems that determined the errors is unfortunately beyond the scope of this work, since HaSpeeDe participants were only requested to provide their results after training their systems.

4 Error Analysis

Error analysis can be used in between runs to improve results or to test different feature settings. With the aim of weaving a broader reflection on the especially hard linguistic patterns within a HS detection task, here it is performed a posteriori and on the aggregated results of the five systems on the HaSpeeDe-TW test set (1,000 tweets). We focus on the answers given by the majority of the five best systems because we believe they provide a faithful representation of the errors, without the noise due to the presence of the worst runs.

Even though only the annotation concerning the presence of HS was distributed to the teams, the corpus from which the training and test set of HaSpeeDe-TW were extracted was provided with additional labels (Poletto et al., 2017; Sanguinetti et al., 2018). These labels (see Table 1) were meant to mark the user's intention to be aggressive (aggressiveness), the potentially hurtful effect of a tweet (offensiveness), the use of ironic devices to possibly mitigate a hateful message (irony), and whether the tweet contains any implicit or explicit reference to negative beliefs about the targeted group (stereotype).

label            values
aggressiveness   no, weak, strong
offensiveness    no, weak, strong
irony            yes, no
stereotype       yes, no

Table 1: The original annotation scheme of the HS corpus that was (partially) used in HaSpeeDe-TW.

These labels were conceived with the aim of identifying some particular aspects that may intersect HS but occur independently. As a matter of fact, hateful content towards a given target might be expressed using aggressive tones or offensive/stereotypical slurs, but also in much subtler forms. At the same time, aggressive or offensive content, though addressed to a potential HS target, does not necessarily imply the presence of HS. Our assumption while carrying out this study was that such a close, but at times misleading, relation between HS on one side and these phenomena on the other could be a source of error for the automatic systems.

In addition, other aspects of both a linguistic and extra-linguistic nature were taken into account, so as to complement the analysis. We thus considered the tweets' targets, i.e. Roma, Immigrants and Muslims (also information available from the original HS corpus). Finally, we selected three features that are typical of computer-mediated communication and of social platforms such as Twitter: the presence of links, multi-word hashtags, and the use of capitalized words.

The test set was composed of 32.4% hateful and 67.6% non-hateful tweets. As the first step of our analysis, we compared the gold label assigned to each tweet in the test set with the one attributed by the majority of the five runs considered for the task. An error was considered to occur when the label assigned by the majority of the systems was different from the gold label. If we extend our analysis to all the fifteen submitted runs, 156 out of 1,000 tweets have been misclassified by the majority of them. However, this number increases to 172 if only the five best runs are taken into account.

Regardless of the correct label, agreement among the five best runs is higher than that among all runs, and among any other set of runs: those systems which have best modeled the phenomenon on the data provided appear to have made similar mistakes. This supports our hypothesis that errors mostly depend on data-dependent features rather than on the systems, which are all different in approach and feature setting.

As for the method adopted, the percentage of errors for the gold positives and the gold negatives in the whole test set was calculated. First, the rates were calculated considering the two labels (hateful and non-hateful) separately, in order to balance their different distribution in the test set; then the results were halved to represent the whole corpus in percentage and to maintain the proportion between the results of the tags. All the percentages correlating two different tags were calculated this way, so that the results could be easily compared. The percentages of mistakes for each label of the categories were determined and compared to the general result, to understand whether they influenced it positively or negatively. Table 2 summarizes the results for each label, showing the distribution of the false negatives (FN), false positives (FP), true positives (TP) and true negatives (TN). The error percentages higher than the general result are in bold font.

5 Results and Discussion

In order to find some answers to our research questions, and evidence of the influence of the annotated features on the systems' results, we provide in this section an analysis driven by the categories described in the previous section.

Aggressiveness and Offensiveness. The different degrees of aggressiveness did not affect the systems' recall, but we measured more FPs when weak or strong aggressiveness is involved (more than thrice as many as in the overall results when strong aggressiveness is present). Offensiveness seems to hold a similar but heavier influence on performance, causing better recall but worse precision: FPs are more than doubled when strong offensiveness is present.

The presence of offensiveness is often associated with slurs or vulgar terms: these are not a consistent presence in the dataset (the most vulgar tweets are probably quickly removed by the platform), and mostly appear in tweets classified as HS. However, about half of the non-hateful tweets containing offensive words were wrongly classified as hateful, proving that offensiveness can be misleading for systems. In these cases, a lexicon-based approach can fail, while attention to the context could be crucial: in the most common instances of false positives, in fact, the offensive words did not refer to the targets.

HS Targets. Analyzing the three targets of HS allowed us to understand how the systems reacted to different ways of expressing hate. Most of the errors were caused by the target Roma: few hateful tweets were recognized, and FNs are more than 30%. Results for the target Immigrants are similar to the overall performance, only with a slightly higher number of FPs. The target Muslims caused a low number of FNs but almost twice as many FPs as in the general performance.

The systems seem to struggle to recognize hateful content against Roma: this may be caused by an imbalance in the test set (only 6.3% of the tweets with the target Roma are labelled as HS, while the targets Immigrants and Muslims have 12.6% and 13.4% of hateful tweets respectively) or by biases in the annotation. The poor results achieved in classifying messages with target Roma can also be explained by the subtler ways of expressing HS when this target is involved, more heavily based on stereotypes than with the other targets. The hate against the other two targets, in particular Muslims, was instead very explicit. See the following examples extracted from the test set.

2235. Roma, colpisce una pecora con il pallone: bambino rom accecato da un pastore https://t.co/KsSAS3fUx9 @ilmessaggeroit HA DIFESO I SUOI AVERI! ("Rome, Roma child hits a sheep with a ball: blinded by a shepherd @ilmessaggeroit HE DEFENDED HIS PROPERTY!") [FN, strong aggressiveness, target: Roma]

4749. @Corriere Uccidere gli islamici, prima di tutto. ("@Corriere Kill the Muslims, first of all.") [TP, strong aggressiveness, target: religion]

Other features. Some other features were considered in our analysis. The presence of stereotype was more frequent in hateful tweets, which caused a slight increase in FPs; conversely, cases of HS without stereotype posed no issues to the systems. Moreover, as expected, the presence of irony slightly increased the error rate in both hateful and non-hateful tweets.

The presence of Twitter's linguistic devices also negatively influenced the results, probably because of the difficulty encountered by systems when some semantic content assumes non-standard forms, e.g. links, multi-word hashtags and capitalized words.

URLs frequently occur in the data, but mostly in non-hateful tweets (although this may be a peculiarity of this dataset). Systems appear to have trouble recognizing hateful tweets that contain URLs (errors increased by 14%). Conversely, the absence of URLs caused an increase in FPs. This feature is unlikely to be directly connected to hateful language: we rather believe that it could somehow affect predictions regardless of the actual content.

                        FN    FP    TP    TN    Gold HS   Gold Not-HS
general                 15%    6%   35%   44%    32.3%     67.7%
no aggressiveness       15%    4%   35%   46%    13.5%     56.8%
weak aggressiveness     15%   10%   35%   40%    11.2%     10.1%
strong aggressiveness   15%   19%   35%   31%     7.6%      0.8%
no offensiveness        20%    5%   30%   45%    10.9%     60%
weak offensiveness      13%   11%   37%   39%    14.6%      4.9%
strong offensiveness    12%   16%   38%   34%     6.8%      2.8%
no irony                15%    5%   35%   45%    27.8%     59%
yes irony               18%    9%   32%   41%     4.5%      8.7%
no stereotype           15%    5%   35%   45%    11.6%     49.7%
yes stereotype          15%    8%   35%   42%    20.7%     18%
Immigrants              15%    9%   35%   41%    12.6%     22.4%
Muslims                  8%   11%   42%   39%    13.4%     12.2%
Roma                    31%    1%   19%   49%     6.3%     33.1%
no link                 11%   13%   37%   39%    25.4%     24.4%
yes link                29%    1%   21%   49%     7%       43.2%
multi hashtags          23%    8%   27%   42%     3%        1.9%
no capitalized words    15%    5%   35%   45%    29.1%     64.1%
yes capitalized words   14%    9%   36%   41%     3.3%      3.5%

Table 2: Percentage of correct (TPs and TNs) and erroneous (FPs and FNs) results in relation to the features considered in the analysis, along with the actual distribution of these features in the test set.
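The aggregation and normalization that produce the figures in Table 2 can be sketched in a few lines of Python (an illustrative reconstruction on toy data; the function and variable names are ours, not the authors' code): the label assigned by the majority of the five runs is compared with the gold label, FN and TP rates are computed over the gold positives, FP and TN rates over the gold negatives, and each rate is then halved so that the four cells of a row sum to 100%.

```python
def majority(runs):
    """Majority label per tweet across runs (each run: a list of 0/1 labels)."""
    return [int(sum(run[i] for run in runs) * 2 > len(runs))
            for i in range(len(runs[0]))]

def halved_rates(gold, pred):
    """FN/TP rates over the gold positives, FP/TN rates over the gold
    negatives, each halved so that the four values sum to 1 (cf. Table 2)."""
    pos = [p for g, p in zip(gold, pred) if g == 1]
    neg = [p for g, p in zip(gold, pred) if g == 0]
    return {"FN": pos.count(0) / len(pos) / 2,
            "TP": pos.count(1) / len(pos) / 2,
            "FP": neg.count(1) / len(neg) / 2,
            "TN": neg.count(0) / len(neg) / 2}

# toy data: 5 runs over 4 tweets
runs = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0]]
gold = [1, 1, 0, 0]
maj = majority(runs)            # [1, 0, 1, 0]: one FN, one FP
print(halved_rates(gold, maj))  # {'FN': 0.25, 'TP': 0.25, 'FP': 0.25, 'TN': 0.25}
```

With the real class distribution (about one third hateful), this normalization is what makes, e.g., the general row of Table 2 read FN 15% + TP 35% = 50% and FP 6% + TN 44% = 50%.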
Also multi-word hashtags influenced the results, especially for hateful content: their presence increased FNs by 8%. The reason for this kind of error might lie in the fact that our dataset contains some cases where the crucial element in a hateful tweet is precisely the hashtag, as in the example below:

2149. Quando vedremo lo stessa tema portato in piazza con la stessa forza e determinazione? Mai credo. #stopislam https://t.co/dDYLZB1BlJ ("When will we see people fighting for the same issue with the same strength and determination? Never, I believe.") [multi-word hashtag, FN]

The text of this tweet is not hateful, but an element of hatred is conveyed by the hashtag "#stopislam". The ability to separate multi-word hashtags into the words composing them would improve the performance of the systems: tweets with a multi-word hashtag clarifying the text would have a better chance of being correctly identified.

Finally, some capitalized words were found in the dataset, mostly in hateful tweets, which again caused an increase in FPs. Despite their small number, we noticed that, in non-hateful tweets, a higher percentage of the capitalized words are named entities (names of places, people, newspapers, etc.), while in hateful tweets capitalized words are more often used to intensify opinions or feelings.

Among all the features taken into account, offensiveness seems to have affected the performance in various ways: its absence led systems to classify as non-hateful tweets that are indeed hateful, while its presence caused the inverse error. A possible explanation for this is that, as shown in Sanguinetti et al. (2018), offensiveness does not correlate with HS even though it can be one of its features. The systems might have taken offensive terms as indicators of HS, as humans also tend to do (see, for example, Bohra et al. (2018)), but this is a false assumption that systems should be trained to avoid. Aggressiveness also caused a certain degree of errors, but only affecting precision.

6 Lessons Learned and Conclusion

This paper presents a detailed error analysis of the results obtained within the context of a shared task for HS detection. In our study, we took into account two types of data: content information, provided by the gold standard labels assigned to each tweet, and metadata information, namely the presence of URLs, hashtags and capitalized words. The results prove the importance of considering other categories related to the one on which the task was centered.

The analysis of performances in relation to URLs poses a controversial result. There are two reasons why tweets collected via Twitter's API may contain a URL: the tweet may have been cut off and a URL automatically generated as a link to the complete tweet, or the URL may be part of the original tweet and lead to an external page. In both cases, unless the URL is followed, the tweet is likely to be harder to understand than a tweet that contains no URL. This may cause lower agreement among human judges, and it is a very complicated issue for automated systems to deal with, especially when the meaning of the tweet is unintelligible without first opening the URL. Tweets containing URLs are, for the time being, less reliable as training data and pose a tougher challenge for Sentiment Analysis tasks at large; we encourage an effort towards solving this issue.

As for capitalized words, future work may include investigating how they affect human annotation, as some judges may show a bias towards associating capitalized words with HS or other categories. Furthermore, improvements may come from considering the PoS tags of such words, or the number of consecutive capitalized words.

Multi-word hashtags also need to be treated with care, as they may affect and even overturn the meaning of the whole tweet. Moreover, a hashtag might require syntactic, semantic and world-knowledge processing in order to be fully understood: for example, by comparing the phrase "stop Islam" with, e.g., "stop harassment", we can see that the word "stop" is not necessarily negative, and becomes so only because it is followed by the name of a religion whose members are, nowadays and in Western society, particularly subject to discrimination.

Overall, our analysis suggests that the systems' failures are motivated by the difficulty of dealing with cases where HS is less directly expressed, and paves the way for future work on, e.g., the development of tools that perform a more careful analysis of the text.

Acknowledgments

The work of C. Bosco and M. Sanguinetti is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01), while that of F. Poletto is funded by Fondazione Giovanni Goria and Fondazione CRT (Talenti della Società Civile 2018).

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate Speech Detection in Italian Social Media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR.org.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63.

Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 36–41.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Andrea Cimino and Lorenzo De Mattei. 2018. Multi-task Learning in Deep Neural Networks for Hate Speech Detection in Facebook and Twitter. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing Different Supervised Approaches to Hate Speech Detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), pages 214–228. CEUR-WS.org.

Paula Fortuna and Sérgio Nunes. 2018. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR), 51(4):85.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17.

Malvina Nissim, Lasha Abzianidze, Kilian Evang, Rob van der Goot, Hessel Haagsma, Barbara Plank, and Martijn Wieling. 2017. Sharing Is Caring: The Future of Shared Tasks. Computational Linguistics, 43(4):897–904.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.org.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018).

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius Von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence Generation Model for Multi-Label Classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.