<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Error Analysis in a Hate Speech Detection Task: the Case of HaSpeeDe-TW at EVALITA 2018</article-title>
      </title-group>
      <abstract>
        <p>Taking as a case study the Hate Speech Detection task at EVALITA 2018, the paper discusses the distribution and typology of the errors made by the five best-scoring systems. The focus is on the subtask where Twitter data was used both for training and testing (HaSpeeDe-TW). In order to highlight the complexity of hate speech and the reasons behind the failures in its automatic detection, the annotation provided for the task is enriched with orthogonal categories annotated in the original reference corpus, such as aggressiveness, offensiveness, irony and the presence of stereotypes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The field of Natural Language Processing
witnesses an ever-growing number of automated
systems trained on annotated data and built to solve,
with remarkable results, the most diverse tasks.
As performances increase, resources, settings and
features that contributed to the improvement are
(understandably) emphasized, but sometimes little
or no room is given to an analysis of the factors
that caused the system to misclassify some items.</p>
      <p>This paper aims to draw attention to the importance of a thorough error analysis of the performance of supervised systems, as a means to advance the field. Errors made by a system may reveal not only weaknesses of the system itself, but also sparseness of the training data, failure of the annotation scheme to describe the observed phenomena, or inherent ambiguity in the data. The presence of the same
errors in the results of several systems involved in a shared task may also yield more interesting hints about the directions to be followed in the improvement of both data and systems.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        As a case study to carry out error analysis, data
from a shared task have been used in this paper.
Shared tasks offer clean, high-quality annotated
datasets on which different systems are trained and
tested. Although researchers often omit to reflect on what caused a system to fail on certain items
        <xref ref-type="bibr" rid="ref12">(Nissim et al., 2017)</xref>
        , shared tasks are an ideal ground for sharing negative results and encouraging reflection on "what did not work": an excellent opportunity to carry out a comparative error analysis and search for patterns that may, in turn, suggest improvements in both the dataset and the systems.
      </p>
      <p>
        Here we analyze the case of the Hate Speech
Detection (HaSpeeDe) task
        <xref ref-type="bibr" rid="ref14 ref5">(Bosco et al., 2018)</xref>
        presented at EVALITA 2018, the Evaluation
Campaign for NLP and Speech Tools for Italian
        <xref ref-type="bibr" rid="ref6">(Caselli et al., 2018)</xref>
        . HS detection is a highly complex task, starting from the very definition of the notion on which it is centered. Considering the growing
attention it is gaining, see e.g. the variety of
resources and tasks for HS developed in the last few
years, we believe that error analysis could be
especially interesting and useful for this case, as well
as in other tasks where the outcome of systems
meaningfully depends on resources exploited for
training and testing.
      </p>
      <p>The paper outlines the background and
motivations behind this research (Section 2), describes
the sub-task on which the study is based (Section
3), reports on the error analysis process (Section 4)
and discusses its results (Section 5), and presents
some conclusive remarks (Section 6).
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Background and Motivations</title>
      <p>There are several issues connected to the
identification of HS: its juridical definition, the
subjectivity of its perception, the need to remove potentially
illegal content from the web without unjustly
removing legal content, and a list of linguistic phenomena that partly overlap with HS but need to be kept apart.</p>
      <p>
        Many works have recently contributed to the
field by releasing novel annotated resources or
presenting automated classifiers. Two reviews on
HS detection were recently published by Schmidt
and Wiegand (2017) and Fortuna and Nunes
(2018). Since 2016, shared tasks on the detection
of HS or related phenomena (such as abusive
language or misogyny) have been organized,
effectively enhancing advancements in resource
building and system development. These include
HatEval at SemEval 2019
        <xref ref-type="bibr" rid="ref3">(Basile et al., 2019)</xref>
        , AMI
at IberEval 2018
        <xref ref-type="bibr" rid="ref9">(Fersini et al., 2018)</xref>
        , HaSpeeDe
at EVALITA 2018
        <xref ref-type="bibr" rid="ref14 ref5">(Bosco et al., 2018)</xref>
        and more.
Nevertheless, the growing interest in HS detection
suggests that the task is far from solved: improving the quality and interoperability of resources, designing suitable annotation schemes and reducing biases in the annotation are still as necessary as work on system engineering. Establishing
standards and good practices in error analysis can
enhance these processes and push towards the
development of effective classifiers for HS.
      </p>
      <p>
        While academic literature is rich with works on
human annotation and evaluation metrics, it is not
as easy to find works dedicated to error analysis
of automated classification systems. This is rather
more often found as a section of papers
describing a system (see, e.g.,
        <xref ref-type="bibr" rid="ref11">(Mohammad et al., 2018)</xref>
        ).
This section, however, is not always present. Examining the errors made by a system, classifying them and searching for linguistic patterns appears to be a somewhat undervalued job, especially when the system had an overall good performance. Yet, it is crucial to understand why a system proved to be a weak solution to certain instances of a problem, even while being excellent for others.
      </p>
      <p>In the context of COLING 2018, error analysis emerged as one of the most relevant issues to be addressed in NLP research (https://coling2018.org/error-analysis-in-research-and-writing/). This attention to error analysis encouraged authors to submit papers with a dedicated section, with Yang et al. (2018) winning the award for the best error analysis, and is a step towards establishing good practices in the NLP community.</p>
      <p>In the wake of this awareness, we apply linguistic insights to one of the annotated corpora used within the HaSpeeDe shared task, namely
the HaSpeeDe-TW sub-task dataset (described in
Section 3). Characteristics of this dataset make
it ideal for our purpose: each tweet is connected
to a target and is annotated not only for the
presence of HS but also for four other parameters. While a comparative analysis of two corpora representing different textual genres (HaSpeeDe-TW and HaSpeeDe-FB) might have offered interesting perspectives, the absence of these additional annotations in the FB dataset prevents a thorough comparison. Furthermore, among the in-domain HaSpeeDe sub-tasks, HaSpeeDe-TW is the one where systems achieved the lower F1-scores, thus providing more material for our analysis.</p>
    </sec>
    <sec id="sec-3">
      <title>3 HaSpeeDe-TW at EVALITA 2018: A Brief Overview</title>
      <p>
        While a description of the HaSpeeDe task as
a whole has been provided in the organizers’
overview
        <xref ref-type="bibr" rid="ref14 ref5">(Bosco et al., 2018)</xref>
        , here we focus on
HaSpeeDe-TW, one of the three sub-tasks into
which the competition was structured. The
subtask consisted in a binary classification of hateful
vs non-hateful tweets. Training set and test set
contain 3,000 and 1,000 tweets respectively,
labeled with 1 or 0 for the presence of HS, and with
a distribution, in both sets, of around 1/3 hateful against 2/3 non-hateful tweets. Data are drawn
from an already existing HS corpus
        <xref ref-type="bibr" rid="ref13">(Poletto et al.,
2017)</xref>
        , whose original annotation scheme was
simplified for the purposes of the task (see Section 4).
      </p>
      <p>
        Nine teams participated in the task, submitting
fifteen runs. The five best scores, submitted by
the teams ItaliaNLP (whose runs ranked 1st and
2nd)
        <xref ref-type="bibr" rid="ref10 ref6 ref7 ref9">(Cimino and De Mattei, 2018)</xref>
        , RuG
        <xref ref-type="bibr" rid="ref1">(Bai et
al., 2018)</xref>
        , InriaFBK
        <xref ref-type="bibr" rid="ref8">(Corazza et al., 2018)</xref>
        and
sbMMP
        <xref ref-type="bibr" rid="ref16">(von Grünigen et al., 2018)</xref>
        , ranged from
0.7993 to 0.7809 in terms of macro-averaged F1-score (all official ranks are available at https://goo.gl/xPyPRW). They applied both classical machine learning approaches, Linear Support Vector Machines in particular (ItaliaNLP, RuG), and more recent deep learning algorithms, such as Convolutional Neural Networks (sbMMP) or Bi-LSTMs (ItaliaNLP, who adopted a multi-task learning approach exploiting the SENTIPOLC 2016
        <xref ref-type="bibr" rid="ref2">(Barbieri et al., 2016)</xref>
        dataset as well).
      </p>
      <p>The other two sub-tasks were HaSpeeDe-FB, where Facebook data were used both for training and testing the systems, and Cross-HaSpeeDe, further subdivided into Cross-HaSpeeDe-FB and Cross-HaSpeeDe-TW, where systems were trained on Facebook data and tested on Twitter data in the former, and the opposite in the latter.
      </p>
      <p>
        Learning architectures
resorted to both surface features such as word and
character n-grams (RuG) and linguistic
information such as Part of Speech (ItaliaNLP).
      </p>
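      <p>The task's official metric, macro-averaged F1, is the unweighted mean of the per-class F1 of the hateful (1) and non-hateful (0) classes. A minimal sketch of its computation follows; the toy gold and predicted label vectors are hypothetical, not actual task data.</p>

```python
def f1(gold, pred, cls):
    """F1 of a single class, from parallel gold and predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    # Unweighted mean of the per-class F1 scores (1 = hateful, 0 = non-hateful)
    return (f1(gold, pred, 1) + f1(gold, pred, 0)) / 2

gold = [1, 1, 0, 0, 0]
pred = [1, 0, 0, 0, 1]
print(round(macro_f1(gold, pred), 4))  # 0.5833
```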
      <p>In the next section, we provide a description of
the errors collected from these five best runs in relation to the specific factors we chose to analyze in this study, merging qualitative and quantitative observations. Our analysis is strictly based on the results provided by those systems. An analysis focused on the features of the systems that determined the errors is unfortunately beyond the scope of this work, since HaSpeeDe participants were only requested to provide their results after training their systems.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Error Analysis</title>
      <p>Error analysis can be used in between runs to
improve results or test different feature settings. With the aim of weaving a broader reflection on the especially hard linguistic patterns within a HS detection task, here we perform it a posteriori, on the aggregated results of five systems on the HaSpeeDe-TW test set (1,000 tweets). We
focus on the answers given by the majority of the
five best systems because we believe they provide
a faithful representation of the errors without the
noise due to the presence of the worst runs.</p>
      <p>The test set was composed of 32.4% hateful and 67.6% non-hateful tweets. As the first
step of our analysis, we compared the gold label
assigned to each tweet in the test set with the one
attributed by the majority of the five runs
considered for the task. An error was considered to occur
when the label assigned by the majority of the
systems was different from the gold label. If we
extend our analysis to all the fifteen submitted runs,
156 out of 1,000 tweets have been misclassified
by the majority of them. However, this number
increases to 172 if only the five best runs are taken
into account.</p>
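      <p>The majority-vote error criterion described above can be sketched as follows; the toy gold labels and per-run predictions are hypothetical stand-ins for the actual task data.</p>

```python
from collections import Counter

def majority_errors(gold, runs):
    """Return indices of items misclassified by the majority of runs.

    gold: list of gold labels (1 = hateful, 0 = non-hateful)
    runs: list of prediction lists, one per system run
    """
    errors = []
    for i, g in enumerate(gold):
        votes = Counter(run[i] for run in runs)
        majority_label = votes.most_common(1)[0][0]
        if majority_label != g:
            errors.append(i)
    return errors

# Toy example with 4 tweets and 5 runs (hypothetical data)
gold = [1, 0, 1, 0]
runs = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]
print(majority_errors(gold, runs))  # [2]: tweet 2 is a majority false negative
```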
      <p>Regardless of the correct label, agreement
among the five best runs is higher than that
among all runs and among any other set of runs:
those systems which have best modeled the
phenomenon on the data provided appear to have
made similar mistakes. This supports our
hypothesis that errors mostly depend on data-dependent
features rather than on systems, which are all
different in approach and feature setting.</p>
      <p>
        Even though only the annotation concerning the
presence of HS was distributed to the teams, the
corpus from which the training and test set of
HaSpeeDe-TW were extracted was provided with
additional labels
        <xref ref-type="bibr" rid="ref13 ref14">(Poletto et al., 2017; Sanguinetti
et al., 2018)</xref>
        . These labels (see Table 1) were
meant to mark the user’s intention to be
aggressive (aggressiveness), the potentially hurtful effect
of a tweet (offensiveness), the use of ironic devices
to possibly mitigate a hateful message (irony), and
whether the tweet contains any implicit or explicit
reference to negative beliefs about the targeted
group (stereotype).
      </p>
      <p>Table 1. Labels and their values: aggressiveness (no, weak, strong); offensiveness (no, weak, strong); irony (yes, no); stereotype (yes, no).</p>
      <p>These labels were conceived with the aim of
identifying some particular aspects that may
intersect HS but occur independently. As a
matter of fact, hateful contents towards a given target
might be expressed using aggressive tones or
offensive/stereotypical slurs, but also in much
subtler forms. At the same time, aggressive or
offensive content, though addressed to a potential HS
target, does not necessarily imply the presence of
HS. Our assumption while carrying out this study
was that such close, but at times misleading,
relation between HS on one side and these phenomena
on the other could be considered a source of error
for the automatic systems.</p>
      <p>In addition, other aspects of both linguistic and
extra-linguistic nature were taken into account, so
as to complement the analysis. We thus considered the tweets' targets, i.e. Roma, immigrants and Muslims (information also available from the original HS corpus). Finally, we selected three
features that are typical of computer-mediated
communication and social platforms such as
Twitter, in particular, the presence of links, multi-word
hashtags, and the use of capitalized words.</p>
      <p>As for the method adopted, the percentage of
errors for the gold positives and the gold negatives
in the whole test set was calculated. First, the rates
were calculated considering the two labels -
hateful and non-hateful - separately, in order to
balance their different distribution in the test set; then
the results were halved to represent the whole
corpus in percentage and to maintain the proportion
between the results of the tags. All the
percentages correlating two different tags were calculated
this way, so that the results could be easily
compared. The percentages of mistakes for each
label of the categories were determined and
compared to the general result to understand whether
they influenced it positively or negatively. Table
2 summarizes the results for each label showing
the distribution of the false negatives (FN), false
positives (FP), true positives (TP) and true
negatives (TN). The error percentages higher than the
general result are in bold font.
</p>
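      <p>The balancing procedure described above (computing rates separately over the gold positives and the gold negatives, then halving them so the two classes contribute equally) can be sketched as follows; the toy data is hypothetical, not the actual task results.</p>

```python
def balanced_rates(items):
    """Compute class-balanced FN and FP rates.

    items: list of (gold, predicted) label pairs, 1 = hateful, 0 = non-hateful.
    Rates are computed separately over gold positives and gold negatives,
    then halved so that each class contributes equally to the overall figure.
    """
    pos = [(g, p) for g, p in items if g == 1]
    neg = [(g, p) for g, p in items if g == 0]
    fn_rate = sum(1 for g, p in pos if p == 0) / len(pos)
    fp_rate = sum(1 for g, p in neg if p == 1) / len(neg)
    return fn_rate / 2, fp_rate / 2

# Toy data: 2 gold-positive and 4 gold-negative items
items = [(1, 0), (1, 1), (0, 0), (0, 0), (0, 1), (0, 0)]
fn, fp = balanced_rates(items)
print(f"FN {fn:.1%}  FP {fp:.1%}")
```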
    </sec>
    <sec id="sec-6">
      <title>5 Results and Discussion</title>
      <p>In order to find some answers to our research
questions and evidence of the influence of the
annotated features on the systems’ results, we provide
in this section an analysis driven by the categories
we described in the previous section.</p>
    </sec>
    <sec id="sec-7">
      <title>Aggressiveness and Offensiveness</title>
      <p>The different degrees of aggressiveness did not affect the systems' recall, but we measured more FPs when
weak or strong aggressiveness is involved (more
than thrice as many as in the overall results when
strong aggressiveness is present).</p>
      <p>Offensiveness seems to hold a similar but heavier
influence on performance, causing better recall but
worse precision: FPs are more than doubled when
strong offensiveness is present.</p>
      <p>The presence of offensiveness is often associated with slurs or vulgar terms: these are not a
consistent presence in the dataset (the most vulgar
tweets are probably quickly removed by the
platform), and mostly appear in tweets classified as
HS. However, about half of the non-hateful tweets
containing offensive words were wrongly classified as hateful, showing that offensiveness can be misleading for systems. In these cases, a lexicon-based approach can fail, while attention to the context could be crucial: in the most common
instances of false positives, in fact, offensive words
did not refer to the targets.</p>
      <p>HS Targets. Analyzing the three targets of HS allowed us to understand how the systems reacted to different ways of expressing hate.</p>
      <p>Most of the errors were caused by the target
Roma: few hateful tweets were recognized, and
FNs are more than 30%. Results for the target
Immigrants are similar to the overall performance,
only with a slightly higher number of FPs. The
target Muslims caused a low number of FNs but
almost twice as many FPs as in the general
performance.</p>
      <p>The systems seem to struggle to recognize
hateful content against Roma: this may be caused by
an imbalance in the test set (only 6.3% of tweets
with the target Roma are labelled as HS, while the
targets Immigrants and Muslims have 12.6% and
13.4% of hateful tweets respectively) or by biases
in the annotation.</p>
      <p>The poor results achieved in classifying
messages with target Roma can also be explained by
the subtler ways of expressing HS when this target is involved, which rely more heavily on stereotypes than is the case with the other targets. The hate against the other two targets, in particular Muslims, was instead very explicit. See the following examples extracted from the test set.</p>
      <p>2235. Roma, colpisce una pecora con il pallone: bambino rom accecato da un pastore https://t.co/KsSAS3fUx9 @ilmessaggeroit HA DIFESO I SUOI AVERI! (“Rome, Roma child hits a sheep with a ball: blinded by a shepherd https://t.co/KsSAS3fUx9 @ilmessaggeroit HE DEFENDED HIS PROPERTY!”) [FN, strong aggressiveness, target: Roma]
4749. @Corriere Uccidere gli islamici, prima di tutto. (“@Corriere Kill the Muslims, first of all.”) [TP, strong aggressiveness, target: religion]
Other features. Some other features were
considered in our analysis. The presence of
stereotype was more frequent in hateful tweets, which
caused a slight increase in FPs; conversely, cases
of HS without stereotype posed no issues to the
systems. Moreover, as expected, the presence of irony slightly increased the error rate in both hateful and non-hateful tweets.</p>
      <p>The presence of Twitter’s linguistic devices
also negatively influenced the results, probably
because of the difficulty encountered by
systems when some semantic content assumes
nonstandard forms, e.g. links, multi-word hashtags
and capitalized words.</p>
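      <p>A minimal sketch of how such surface cues might be extracted follows; the regular expressions and heuristics are our own illustrative choices, not those of the participating systems.</p>

```python
import re

def twitter_features(text):
    """Extract the three surface cues discussed above (illustrative heuristics)."""
    tokens = text.split()
    return {
        # Any http(s) URL, including Twitter's t.co shortener
        "has_url": bool(re.search(r"https?://\S+", text)),
        # CamelCase hashtags are easy to spot; all-lowercase multi-word
        # hashtags such as #stopislam require dictionary-based segmentation
        "has_camelcase_hashtag": bool(re.search(r"#[A-Za-z]*[a-z][A-Z]", text)),
        # Fully capitalized tokens of length > 1, often used for emphasis
        "has_capitalized_word": any(t.isupper() and len(t) > 1 for t in tokens),
    }

print(twitter_features("HA DIFESO I SUOI AVERI! https://t.co/x #StopIslam"))
```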
      <p>Table 2 (FN rates per label): general 15%; no aggressiveness 15%; weak aggressiveness 15%; strong aggressiveness 15%; no offensiveness 20%; weak offensiveness 13%; strong offensiveness 12%; no irony 15%; yes irony 18%; no stereotype 15%; yes stereotype 15%; Immigrants 15%; Muslims 8%; Roma 31%; no link 11%; yes link 29%; multi hashtags 23%; no capitalized words 15%; yes capitalized words 14%.</p>
      <p>URLs frequently occur in the data, but mostly in non-hateful tweets (although this may be a peculiarity of this dataset). Systems appear to have trouble recognizing hateful tweets that contain URLs (errors increased by 14%). Conversely, the absence of URLs caused an increase in FPs. This feature is unlikely to be directly connected to hateful language: we rather believe that it could somehow affect predictions regardless of the actual content.</p>
      <p>Multi-word hashtags also influenced the results, especially for hateful content: their presence increased FNs by 8%. The reason for this kind of
error might lie in the fact that our dataset contains
some cases where the crucial element in a hateful
tweet is precisely the hashtag, as in the example
below:
2149. Quando vedremo lo stessa tema portato in piazza con la stessa forza e determinazione? Mai credo. #stopislam https://t.co/dDYLZB1BlJ (“When will we see people fighting for the same issue with the same strength and determination? Never, I believe.”) [multi-word hashtag, FN]</p>
      <p>The text in this tweet is not hateful, but an element of hatred is conveyed by the hashtag #stopislam.</p>
      <p>The ability to separate multi-word hashtags into their component words would improve the performance of the systems. Tweets with a multi-word hashtag clarifying the text would have a better chance of being correctly classified.</p>
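      <p>Splitting an all-lowercase multi-word hashtag into its component words can be approached with a small dictionary-based dynamic program; the mini-lexicon below is a hypothetical stand-in for a real word list.</p>

```python
def segment(hashtag, vocab):
    """Split a lowercase hashtag body into dictionary words via dynamic programming.

    Returns a list of words, or None if no full segmentation exists.
    """
    s = hashtag.lstrip("#").lower()
    n = len(s)
    best = [None] * (n + 1)  # best[i] holds a segmentation of s[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and s[j:i] in vocab:
                best[i] = best[j] + [s[j:i]]
                break  # keep the first split found for this prefix
    return best[n]

# Hypothetical mini-lexicon; a real system would use a full word list
vocab = {"stop", "islam", "no", "more", "hate"}
print(segment("#stopislam", vocab))  # ['stop', 'islam']
```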
      <p>Finally, some capitalized words were found in the dataset, mostly in hateful tweets,
which again caused an increase in FPs. Despite
their small number, we noticed that, in non-hateful
tweets, a higher percentage of capitalized words
are named entities (nouns of places, people,
newspapers, etc.), while in hateful tweets capitalized
words are more often used to intensify opinions
or feelings.</p>
      <p>Among all the features taken into account,
offensiveness seems to have affected the
performance in various ways: its absence led systems to
classify as non-hateful tweets that are indeed
hateful, while its presence caused the inverse error. A
possible explanation for this is that, as shown in
Sanguinetti et al. (2018), offensiveness does not
correlate with HS even though it can be one of its
features. The systems might have taken offensive
terms as indicators of HS, as humans also tend to do (see for example Bohra et al. (2018)), but this is
a false assumption that systems should be trained
to avoid. Aggressiveness also caused a certain
degree of errors, but only affecting precision.</p>
    </sec>
    <sec id="sec-8">
      <title>6 Lessons Learned and Conclusion</title>
      <p>This paper presents a detailed error analysis of
the results obtained within the context of a shared
task for HS detection. In our study, we took into
account two types of data: content information,
provided by gold standard labels assigned to each
tweet; and metadata information, namely the
presence of URLs, hashtags and capitalized words.
Results show the importance of considering categories other than the one on which the task was centered.</p>
      <p>The analysis of performances in relation to
URLs poses a controversial result. There are two
reasons why tweets collected via Twitter’s API
may contain a URL: the tweet may have been cut
off and a URL automatically generated as a link
to the complete tweet, or the URL may be part of
the original tweet and lead to an external page. In
both cases, unless the URL is followed, the tweet
is likely to be harder to understand compared to a
tweet that contains no URL. This may cause lower
agreement among human judges, and it is a very
complicated issue for automated systems to deal
with, especially when the meaning of the tweet
is unintelligible without first opening the URL.
Tweets containing URLs are, for the time being,
less reliable as training data and pose a tougher
challenge for Sentiment Analysis tasks at large;
we encourage an effort towards solving this issue.</p>
      <p>As for capitalized words, future work may
include investigating how they affect human
annotation, as some judges may show a bias towards
associating capitalized words to HS or other
categories. Furthermore, improvements may come
from considering the PoS tags of such words, or
the number of consecutive capitalized words.</p>
      <p>Multi-word hashtags as well need to be treated
with care, as they may affect and even overturn
the meaning of the whole tweet. Yet, it happens
that a hashtag might require syntactic, semantic
and world-knowledge processing in order to be
fully understood: for example, by comparing the
phrase ”stop Islam” with, e.g., ”stop harassment”,
we can see that the word ”stop” is not necessarily
negative, and it becomes so only because it is
followed by the name of a religion whose members
are, nowadays and in Western society, particularly
subject to discrimination.</p>
      <p>Overall, our analysis suggests that the systems' failures stem from the difficulty of dealing with cases where HS is expressed less directly, and paves the way for future work on, e.g., the development of tools that perform a more careful analysis of the text.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The work of C. Bosco and M. Sanguinetti is
partially funded by Progetto di Ateneo/CSP 2016
(Immigrants, Hate and Prejudice in Social Media,
S1618 L2 BOSC 01), while that of F. Poletto is
funded by Fondazione Giovanni Goria and Fondazione CRT (Talenti della Società Civile 2018).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Xiaoyu</given-names>
            <surname>Bai</surname>
          </string-name>
          , Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2018</year>
          . RuG @ EVALITA 2018:
          <article-title>Hate Speech Detection In Italian Social Media</article-title>
          .
          <source>In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and
          <string-name>
            <given-names>Viviana</given-names>
            <surname>Patti</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the Evalita 2016 SENTIment POLarity Classification Task</article-title>
          .
          <source>In Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2016</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter</article-title>
          .
          <source>In Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Bohra</surname>
          </string-name>
          , Deepanshu Vijay, Vinay Singh, Syed Sarfaraz Akhtar, and
          <string-name>
            <given-names>Manish</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A dataset of Hindi-English code-mixed social media text for hate speech detection</article-title>
          .
          <source>In Proceedings of the Second Workshop on Computational Modeling of Peoples Opinions</source>
          , Personality, and Emotions in Social Media, pages
          <fpage>36</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <surname>Dell'Orletta Felice</surname>
            , Fabio Poletto, Manuela Sanguinetti, and
            <given-names>Tesconi</given-names>
          </string-name>
          <string-name>
            <surname>Maurizio</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 hate speech detection task</article-title>
          .
          <source>In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Caselli</surname>
          </string-name>
          , Nicole Novielli, Viviana Patti, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Cimino and Lorenzo De Mattei</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Multitask Learning in Deep Neural Networks for Hate Speech Detection in Facebook and Twitter</article-title>
          .
          <source>In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Michele</given-names>
            <surname>Corazza</surname>
          </string-name>
          , Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and
          <string-name>
            <given-names>Serena</given-names>
            <surname>Villata</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Comparing Different Supervised Approaches to Hate Speech Detection</article-title>
          .
          <source>In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2018</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Elisabetta</given-names>
            <surname>Fersini</surname>
          </string-name>
          , Paolo Rosso, and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Anzovino</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the Task on Automatic Misogyny Identification at IberEval 2018</article-title>
          .
          <source>In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval</source>
          <year>2018</year>
          ), co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN
          <year>2018</year>
          ), pages
          <fpage>214</fpage>
          -
          <lpage>228</lpage>
          . CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Paula</given-names>
            <surname>Fortuna</surname>
          </string-name>
          and Sérgio Nunes.
          <year>2018</year>
          .
          <article-title>A survey on automatic detection of hate speech in text</article-title>
          .
          <source>ACM Computing Surveys (CSUR)</source>
          ,
          <volume>51</volume>
          (
          <issue>4</issue>
          ):
          <fpage>85</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Saif</given-names>
            <surname>Mohammad</surname>
          </string-name>
          , Felipe Bravo-Marquez,
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Salameh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Kiritchenko</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SemEval-2018 Task 1: Affect in Tweets</article-title>
          .
          <source>In Proceedings of The 12th International Workshop on Semantic Evaluation</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          , Lasha Abzianidze, Kilian Evang, Rob van der Goot, Hessel Haagsma, Barbara Plank, and
          <string-name>
            <given-names>Martijn</given-names>
            <surname>Wieling</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Sharing is caring: The future of shared tasks</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>43</volume>
          (
          <issue>4</issue>
          ):
          <fpage>897</fpage>
          -
          <lpage>904</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Poletto</surname>
          </string-name>
          , Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hate Speech Annotation: Analysis of an Italian Twitter Corpus</article-title>
          .
          <source>In Proceedings of the Fourth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2017</year>
          ). CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Fabio Poletto, Cristina Bosco, Viviana Patti, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Stranisci</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An Italian Twitter Corpus of Hate Speech against Immigrants</article-title>
          .
          <source>In Proceedings of the 11th Language Resources and Evaluation Conference (LREC</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Anna</given-names>
            <surname>Schmidt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wiegand</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Survey on Hate Speech Detection using Natural Language Processing</article-title>
          .
          <source>In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Dirk</given-names>
            <surname>von Grünigen</surname>
          </string-name>
          , Ralf Grubenmann, Fernando Benites, Pius von Däniken, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units</article-title>
          .
          <source>In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Pengcheng</given-names>
            <surname>Yang</surname>
          </string-name>
          , Xu Sun,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          , Shuming Ma, Wei Wu, and
          <string-name>
            <given-names>Houfeng</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SGM: Sequence Generation Model for Multi-Label Classification</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>3915</fpage>
          -
          <lpage>3926</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>