=Paper=
{{Paper
|id=Vol-2481/paper32
|storemode=property
|title=Error Analysis in a Hate Speech Detection Task: The Case of HaSpeeDe-TW at EVALITA 2018
|pdfUrl=https://ceur-ws.org/Vol-2481/paper32.pdf
|volume=Vol-2481
|authors=Chiara Francesconi,Cristina Bosco,Fabio Poletto,Manuela Sanguinetti
|dblpUrl=https://dblp.org/rec/conf/clic-it/FrancesconiBPS19
}}
==Error Analysis in a Hate Speech Detection Task: The Case of HaSpeeDe-TW at EVALITA 2018==
Chiara Francesconi (Dipartimento di Lingue e Letterature Straniere e Culture Moderne, University of Turin) chiara.francesconi@edu.unito.it
Cristina Bosco, Fabio Poletto, Manuela Sanguinetti (Dipartimento di Informatica, University of Turin) {bosco,poletto,msanguin}@di.unito.it

Abstract

Taking as a case study the Hate Speech Detection task at EVALITA 2018, the paper discusses the distribution and typology of the errors made by the five best-scoring systems. The focus is on the sub-task where Twitter data was used both for training and testing (HaSpeeDe-TW). In order to highlight the complexity of hate speech and the reasons behind the failures in its automatic detection, the annotation provided for the task is enriched with orthogonal categories annotated in the original reference corpus, such as aggressiveness, offensiveness, irony and the presence of stereotypes.

1 Introduction

The field of Natural Language Processing witnesses an ever-growing number of automated systems trained on annotated data and built to solve, with remarkable results, the most diverse tasks. As performances increase, the resources, settings and features that contributed to the improvement are (understandably) emphasized, but sometimes little or no room is given to an analysis of the factors that caused the system to misclassify some items. This paper draws attention to the importance of a thorough error analysis of the performance of supervised systems, as a means to produce advancement in the field. Errors made by a system may reveal not only weaknesses of the system itself, but also the sparseness of the data used in training, the failure of the annotation scheme in describing the observed phenomena, or an inherent ambiguity of the data. The presence of the same errors in the results of several systems involved in a shared task may also yield more interesting hints about the directions to be followed in the improvement of both data and systems.

As a case study for error analysis, this paper uses data from a shared task. Shared tasks offer clean, high-quality annotated datasets on which different systems are trained and tested. Although researchers often omit to reflect on what caused a system to fail (Nissim et al., 2017), shared tasks are an ideal ground for sharing negative results and encouraging reflection on "what did not work": an excellent opportunity to carry out a comparative error analysis and search for patterns that may, in turn, suggest improvements in both the dataset and the systems.

Here we analyze the case of the Hate Speech Detection (HaSpeeDe) task (Bosco et al., 2018) presented at EVALITA 2018, the Evaluation Campaign for NLP and Speech Tools for Italian (Caselli et al., 2018). HS detection is a very complex task, starting from the definition of the notion on which it is centered. Considering the growing attention it is gaining (see, e.g., the variety of resources and tasks for HS developed in the last few years), we believe that error analysis could be especially interesting and useful in this case, as well as in other tasks where the outcome of systems meaningfully depends on the resources exploited for training and testing.

The paper outlines the background and motivations behind this research (Section 2), describes the sub-task on which the study is based (Section 3), reports on the error analysis process (Section 4), discusses its results (Section 5), and presents some conclusive remarks (Section 6).

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Background and Motivations

There are several issues connected to the identification of HS: its juridical definition, the subjectivity of its perception, the need to remove potentially illegal content from the web without unjustly removing legal content, and a list of linguistic phenomena that partly overlap with HS but need to be kept apart.

Many works have recently contributed to the field by releasing novel annotated resources or presenting automated classifiers. Two surveys on HS detection were recently published by Schmidt and Wiegand (2017) and Fortuna and Nunes (2018). Since 2016, shared tasks on the detection of HS or related phenomena (such as abusive language or misogyny) have been organized, effectively enhancing advancements in resource building and system development. These include HatEval at SemEval 2019 (Basile et al., 2019), AMI at IberEval 2018 (Fersini et al., 2018), HaSpeeDe at EVALITA 2018 (Bosco et al., 2018) and more. Nevertheless, the growing interest in HS detection suggests that the task is far from being solved: improving the quality and interoperability of resources, designing suitable annotation schemes and reducing biases in the annotation are still as necessary as work on system engineering. Establishing standards and good practices in error analysis can enhance these processes and push towards the development of effective classifiers for HS.

While the academic literature is rich with works on human annotation and evaluation metrics, it is not as easy to find works dedicated to the error analysis of automated classification systems. This is more often found as a section of papers describing a system (see, e.g., Mohammad et al. (2018)); such a section, however, is not always present. Examining the errors made by a system, classifying them and searching for linguistic patterns appear to be a somewhat undervalued job, especially when the system had an overall good performance. Yet, it is crucial to understand why a system proved to be a weak solution for certain instances of a problem, even while being excellent for other instances.

In the context of COLING 2018, error analysis emerged as one of the most relevant features to be addressed in NLP research (see https://coling2018.org/error-analysis-in-research-and-writing/). This attention to error analysis encouraged authors to submit papers with a dedicated section, with Yang et al. (2018) winning the award for the best error analysis, and is a step towards establishing good practices in the NLP community.

In the wake of this awareness, we apply linguistic insights to one of the annotated corpora used within the HaSpeeDe shared task, namely the HaSpeeDe-TW sub-task dataset (described in Section 3). The characteristics of this dataset make it ideal for our purpose: each tweet is connected to a target and is annotated not only for the presence of HS but also for four other parameters. While a comparative analysis of two corpora presenting different textual genres (HaSpeeDe-TW and HaSpeeDe-FB) might have offered interesting perspectives, the lack of such annotations in the FB dataset prevents a thorough comparison. Furthermore, among the in-domain HaSpeeDe sub-tasks, HaSpeeDe-TW is the one where systems achieved the lowest F1-scores, thus providing more material for our analysis.

3 HaSpeeDe-TW at EVALITA 2018: A Brief Overview

While a description of the HaSpeeDe task as a whole has been provided in the organizers' overview (Bosco et al., 2018), here we focus on HaSpeeDe-TW, one of the three sub-tasks into which the competition was structured (the other two being HaSpeeDe-FB, where Facebook data were used both for training and testing the systems, and Cross-HaSpeeDe, further subdivided into Cross-HaSpeeDe-FB and Cross-HaSpeeDe-TW, where systems were trained on Facebook data and tested on Twitter data in the former, and the opposite in the latter). The sub-task consisted in a binary classification of hateful vs non-hateful tweets. Training set and test set contain 3,000 and 1,000 tweets respectively, labeled with 1 or 0 for the presence of HS, with a distribution, in both sets, of around 1/3 hateful against 2/3 non-hateful tweets. Data are drawn from an already existing HS corpus (Poletto et al., 2017), whose original annotation scheme was simplified for the purposes of the task (see Section 4).

Nine teams participated in the task, submitting fifteen runs. The five best scores, submitted by the teams ItaliaNLP (whose runs ranked 1st and 2nd) (Cimino and De Mattei, 2018), RuG (Bai et al., 2018), InriaFBK (Corazza et al., 2018) and sbMMP (von Grünigen et al., 2018), ranged from 0.7993 to 0.7809 in terms of macro-averaged F1-score (all official ranks are available at https://goo.gl/xPyPRW). They applied both classical machine learning approaches, in particular Linear Support Vector Machines (ItaliaNLP, RuG), and more recent deep learning algorithms, such as Convolutional Neural Networks (sbMMP) or Bi-LSTMs (ItaliaNLP, who adopted a multi-task learning approach also exploiting the SENTIPOLC 2016 dataset (Barbieri et al., 2016)). Learning architectures resorted both to surface features, such as word and character n-grams (RuG), and to linguistic information, such as Part of Speech (ItaliaNLP).

In the next section, we provide a description of the errors collected from these five best runs, put in relation with the specific factors we chose to analyze in this study, encompassing and merging qualitative and quantitative observations. Our analysis is strictly based on the results provided by those systems. An analysis focused on the features of the systems that determined the errors is unfortunately beyond the scope of this work, as HaSpeeDe participants were only requested to provide the results after training their systems.

4 Error Analysis

Error analysis can be used in between runs to improve results or test different feature settings. With the aim of weaving a broader reflection on the especially hard linguistic patterns within a HS detection task, here it is performed a posteriori and on the aggregated results of five systems on the HaSpeeDe-TW test set (1,000 tweets). We focus on the answers given by the majority of the five best systems because we believe they provide a faithful representation of the errors without the noise due to the presence of the worst runs.

The test set was composed of 32.4% hateful tweets and 67.6% non-hateful tweets. As the first step of our analysis, we compared the gold label assigned to each tweet in the test set with the one attributed by the majority of the five runs considered for the task. An error was considered to occur when the label assigned by the majority of the systems was different from the gold label. If we extend our analysis to all fifteen submitted runs, 156 out of 1,000 tweets were misclassified by the majority of them. However, this number increases to 172 if only the five best runs are taken into account.

Regardless of the correct label, agreement among the five best runs is higher than that among all runs and among any other set of runs: the systems which best modeled the phenomenon on the data provided appear to have made similar mistakes. This supports our hypothesis that errors mostly depend on data-dependent features rather than on the systems, which are all different in approach and feature setting.

Even though only the annotation concerning the presence of HS was distributed to the teams, the corpus from which the training and test sets of HaSpeeDe-TW were extracted was provided with additional labels (Poletto et al., 2017; Sanguinetti et al., 2018). These labels (see Table 1) were meant to mark the user's intention to be aggressive (aggressiveness), the potentially hurtful effect of a tweet (offensiveness), the use of ironic devices to possibly mitigate a hateful message (irony), and whether the tweet contains any implicit or explicit reference to negative beliefs about the targeted group (stereotype).

label          | values
aggressiveness | no, weak, strong
offensiveness  | no, weak, strong
irony          | yes, no
stereotype     | yes, no

Table 1: The original annotation scheme of the HS corpus that was (partially) used in HaSpeeDe-TW.

These labels were conceived with the aim of identifying some particular aspects that may intersect HS but occur independently. As a matter of fact, hateful content towards a given target might be expressed using aggressive tones or offensive/stereotypical slurs, but also in much subtler forms. At the same time, aggressive or offensive content, though addressed to a potential HS target, does not necessarily imply the presence of HS. Our assumption while carrying out this study was that such a close, but at times misleading, relation between HS on one side and these phenomena on the other could be a source of error for the automatic systems.

In addition, other aspects of both a linguistic and extra-linguistic nature were taken into account, so as to complement the analysis. We thus considered the tweets' targets, i.e. Roma, immigrants and Muslims (also information available from the original HS corpus). Finally, we selected three features that are typical of computer-mediated communication and social platforms such as Twitter: the presence of links, multi-word hashtags, and the use of capitalized words.

As for the method adopted, the percentage of errors for the gold positives and the gold negatives in the whole test set was calculated. First, the rates were calculated considering the two labels (hateful and non-hateful) separately, in order to balance their different distribution in the test set; then the results were halved to represent the whole corpus in percentage and to maintain the proportion between the results of the tags. All the percentages correlating two different tags were calculated this way, so that the results could be easily compared. The percentages of mistakes for each label of the categories were determined and compared to the general result to understand whether they influenced it positively or negatively. Table 2 summarizes the results for each label, showing the distribution of false negatives (FN), false positives (FP), true positives (TP) and true negatives (TN). Error percentages higher than the general result are marked with an asterisk (*).

                          FN     FP     TP     TN     Gold HS   Gold Not-HS
general                   15%    6%     35%    44%    32.3%     67.7%
no aggressiveness         15%    4%     35%    46%    13.5%     56.8%
weak aggressiveness       15%    10%*   35%    40%    11.2%     10.1%
strong aggressiveness     15%    19%*   35%    31%    7.6%      0.8%
no offensiveness          20%*   5%     30%    45%    10.9%     60%
weak offensiveness        13%    11%*   37%    39%    14.6%     4.9%
strong offensiveness      12%    16%*   38%    34%    6.8%      2.8%
no irony                  15%    5%     35%    45%    27.8%     59%
yes irony                 18%*   9%*    32%    41%    4.5%      8.7%
no stereotype             15%    5%     35%    45%    11.6%     49.7%
yes stereotype            15%    8%*    35%    42%    20.7%     18%
Immigrants                15%    9%*    35%    41%    12.6%     22.4%
Muslims                   8%     11%*   42%    39%    13.4%     12.2%
Roma                      31%*   1%     19%    49%    6.3%      33.1%
no link                   11%    13%*   37%    39%    25.4%     24.4%
yes link                  29%*   1%     21%    49%    7%        43.2%
multi hashtags            23%*   8%*    27%    42%    3%        1.9%
no capitalized words      15%    5%     35%    45%    29.1%     64.1%
yes capitalized words     14%    9%*    36%    41%    3.3%      3.5%

Table 2: Percentage of correct (TPs and TNs) and erroneous (FPs and FNs) results in relation to the features considered in the analysis, along with the actual distribution of these features in the test set.

5 Results and Discussion

In order to find some answers to our research questions and evidence of the influence of the annotated features on the systems' results, we provide in this section an analysis driven by the categories described in the previous section.

Aggressiveness and Offensiveness. The different degrees of aggressiveness did not affect the systems' recall, but we measured more FPs when weak or strong aggressiveness is involved (more than three times as many as in the overall results when strong aggressiveness is present). Offensiveness seems to hold a similar but heavier influence on performance, causing better recall but worse precision: FPs are more than doubled when strong offensiveness is present.

The presence of offensiveness is often associated with slurs or vulgar terms: these are not a consistent presence in the dataset (the most vulgar tweets are probably quickly removed by the platform), and mostly appear in tweets classified as HS. However, about half of the non-hateful tweets containing offensive words were wrongly classified as hateful, proving that offensiveness can be misleading for systems. In these cases, a lexicon-based approach can fail, while attention to the context could be crucial: in the most common instances of false positives, in fact, the offensive words did not refer to the targets.

HS Targets. Analyzing the three targets of HS allowed us to understand how the systems reacted to different ways of expressing hate. Most of the errors were caused by the target Roma: few hateful tweets were recognized, and FNs are more than 30%. Results for the target Immigrants are similar to the overall performance, only with a slightly higher number of FPs. The target Muslims caused a low number of FNs but almost twice as many FPs as in the general performance.

The systems seem to struggle to recognize hateful content against Roma: this may be caused by an imbalance in the test set (only 6.3% of tweets with the target Roma are labelled as HS, while the targets Immigrants and Muslims have 12.6% and 13.4% of hateful tweets respectively) or by biases in the annotation. The poor results achieved in classifying messages with target Roma can also be explained by the subtler ways of expressing HS when this target is involved, more heavily based on stereotypes than with the other targets. The hate against the other two targets, in particular Muslims, was instead very explicit. See the following examples extracted from the test set.

2235. Roma, colpisce una pecora con il pallone: bambino rom accecato da un pastore https://t.co/KsSAS3fUx9 @ilmessaggeroit HA DIFESO I SUOI AVERI! ("Rome, Roma child hits a sheep with a ball: blinded by a shepherd https://t.co/KsSAS3fUx9 @ilmessaggeroit HE DEFENDED HIS PROPERTY!") [FN, strong aggressiveness, target: Roma]

4749. @Corriere Uccidere gli islamici, prima di tutto. ("@Corriere Kill the Muslims, first of all.") [TP, strong aggressiveness, target: religion]

Other features. Some other features were considered in our analysis. The presence of stereotype was more frequent in hateful tweets, which caused a slight increase in FPs; conversely, cases of HS without stereotype posed no issues to the systems. Moreover, as expected, the presence of irony slightly increased the error rate both in hateful and non-hateful tweets.

The presence of Twitter's linguistic devices also negatively influenced the results, probably because of the difficulty encountered by systems when some semantic content assumes non-standard forms, e.g. links, multi-word hashtags and capitalized words. URLs frequently occur in the data, but mostly in non-hateful tweets (although this may be a peculiarity of this dataset).
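The per-feature breakdown reported in Table 2 can be reproduced with a short script. The sketch below is not the authors' code: it assumes hypothetical `(gold, majority_prediction, feature_value)` records and mirrors the balancing step described in Section 4 (rates computed separately over gold positives and gold negatives, then halved so the four cells sum to 100%).

```python
# Minimal sketch of the per-feature confusion breakdown behind Table 2.
# Records are hypothetical: (gold_label, majority_prediction, feature_value).

def breakdown(records, feature_value):
    """FN/FP/TP/TN rates for tweets carrying a given feature value.

    Rates are computed separately within gold positives and gold negatives
    (to balance their different distribution in the test set), then halved,
    following the method described in the paper.
    """
    subset = [(g, p) for g, p, f in records if f == feature_value]
    pos = [(g, p) for g, p in subset if g == 1]  # gold hateful
    neg = [(g, p) for g, p in subset if g == 0]  # gold non-hateful
    fn = sum(1 for g, p in pos if p == 0) / len(pos) / 2 if pos else 0.0
    tp = sum(1 for g, p in pos if p == 1) / len(pos) / 2 if pos else 0.0
    fp = sum(1 for g, p in neg if p == 1) / len(neg) / 2 if neg else 0.0
    tn = sum(1 for g, p in neg if p == 0) / len(neg) / 2 if neg else 0.0
    return {"FN": fn, "FP": fp, "TP": tp, "TN": tn}

# Toy example: 4 gold-hateful and 4 gold-non-hateful tweets with irony = "yes".
toy = [(1, 1, "yes"), (1, 0, "yes"), (1, 1, "yes"), (1, 1, "yes"),
       (0, 0, "yes"), (0, 0, "yes"), (0, 1, "yes"), (0, 0, "yes")]
print(breakdown(toy, "yes"))
# {'FN': 0.125, 'FP': 0.125, 'TP': 0.375, 'TN': 0.375}
```

Because each gold class is normalized before halving, the four cells always sum to 1 regardless of the class imbalance, which is what makes the rows of Table 2 comparable to each other.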
Systems appear to have trouble recognizing hateful tweets that contain URLs (errors increased by 14%). Conversely, the absence of URLs caused an increase in FPs. This feature is unlikely to be directly connected to hateful language: we rather believe that it could somehow affect predictions regardless of the actual content.

Multi-word hashtags also influenced results, especially for hateful content: their presence increased FNs by 8%. The reason for this kind of error might lie in the fact that our dataset contains some cases where the crucial element in a hateful tweet is precisely the hashtag, as in the example below:

2149. Quando vedremo lo stessa tema portato in piazza con la stessa forza e determinazione? Mai credo. #stopislam https://t.co/dDYLZB1BlJ ("When will we see people fighting for the same issue with the same strength and determination? Never, I believe.") [multi-word hashtag, FN]

The text in this tweet is not hateful, but an element of hatred is conveyed by the hashtag "#stopislam". The ability to separate multi-word hashtags into their component words would improve the performance of the systems: tweets with a multi-word hashtag clarifying the text would have a better chance of being correctly identified.

Finally, some capitalized words were found in the dataset, mostly in hateful tweets, which again caused an increase in FPs. Despite their small number, we noticed that, in non-hateful tweets, a higher percentage of capitalized words are named entities (names of places, people, newspapers, etc.), while in hateful tweets capitalized words are more often used to intensify opinions or feelings.

Among all the features taken into account, offensiveness seems to have affected the performance in various ways: its absence led systems to classify as non-hateful tweets that are indeed hateful, while its presence caused the inverse error. A possible explanation for this is that, as shown in Sanguinetti et al. (2018), offensiveness does not correlate with HS even though it can be one of its features. The systems might have taken offensive terms as indicators of HS, as humans also tend to do (see for example Bohra et al. (2018)), but this is a false assumption that systems should be trained to avoid. Aggressiveness also caused a certain degree of errors, but only affected precision.
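The hashtag segmentation discussed above can be prototyped with a simple dictionary-based dynamic-programming splitter. This is only an illustrative sketch, not a component of the evaluated systems; the tiny `VOCAB` set is a hypothetical stand-in for a real word list.

```python
# Sketch of a dictionary-based splitter for multi-word hashtags
# such as "#stopislam". VOCAB is a hypothetical stand-in vocabulary.

VOCAB = {"stop", "islam", "no", "more", "hate"}

def split_hashtag(tag, vocab=VOCAB):
    """Split a hashtag into known words, preferring fewer (longer) words.

    best[i] holds the best segmentation of the first i characters;
    returns None when no full segmentation exists.
    """
    word = tag.lstrip("#").lower()
    best = [None] * (len(word) + 1)
    best[0] = []
    for i in range(1, len(word) + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in vocab:
                cand = best[j] + [word[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[len(word)]

print(split_hashtag("#stopislam"))   # ['stop', 'islam']
print(split_hashtag("#nomorehate"))  # ['no', 'more', 'hate']
```

Feeding the recovered words ("stop islam") back into the tweet text, instead of the opaque token "#stopislam", would let n-gram and embedding-based features see the hateful content that the hashtag otherwise hides.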
6 Lessons Learned and Conclusion

This paper presents a detailed error analysis of the results obtained within the context of a shared task for HS detection. In our study, we took into account two types of data: content information, provided by the gold standard labels assigned to each tweet, and metadata information, namely the presence of URLs, hashtags and capitalized words. The results prove the importance of considering categories other than the one on which the task was centered.

The analysis of performances in relation to URLs yields a controversial result. There are two reasons why tweets collected via Twitter's API may contain a URL: the tweet may have been cut off and a URL automatically generated as a link to the complete tweet, or the URL may be part of the original tweet and lead to an external page. In both cases, unless the URL is followed, the tweet is likely to be harder to understand compared to a tweet that contains no URL. This may cause lower agreement among human judges, and it is a very complicated issue for automated systems to deal with, especially when the meaning of the tweet is unintelligible without first opening the URL. Tweets containing URLs are, for the time being, less reliable as training data and pose a tougher challenge for Sentiment Analysis tasks at large; we encourage an effort towards solving this issue.

As for capitalized words, future work may include investigating how they affect human annotation, as some judges may show a bias towards associating capitalized words with HS or other categories. Furthermore, improvements may come from considering the PoS tags of such words, or the number of consecutive capitalized words.

Multi-word hashtags as well need to be treated with care, as they may affect and even overturn the meaning of the whole tweet. Moreover, a hashtag might require syntactic, semantic and world-knowledge processing in order to be fully understood: for example, by comparing the phrase "stop Islam" with, e.g., "stop harassment", we can see that the word "stop" is not necessarily negative, and it becomes so only because it is followed by the name of a religion whose members are, nowadays and in Western society, particularly subject to discrimination.

Overall, our analysis suggests that system failures stem from the difficulty of dealing with cases where HS is less directly expressed, and it paves the way for future work on, e.g., the development of tools that perform a more careful analysis of the text.

Acknowledgments

The work of C. Bosco and M. Sanguinetti is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01), while that of F. Poletto is funded by Fondazione Giovanni Goria and Fondazione CRT (Talenti della Società Civile 2018).

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate Speech Detection In Italian Social Media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR.org.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63.

Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 36–41.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Andrea Cimino and Lorenzo De Mattei. 2018. Multi-task Learning in Deep Neural Networks for Hate Speech Detection in Facebook and Twitter. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing Different Supervised Approaches to Hate Speech Detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), pages 214–228. CEUR-WS.org.

Paula Fortuna and Sérgio Nunes. 2018. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR), 51(4):85.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 1–17.

Malvina Nissim, Lasha Abzianidze, Kilian Evang, Rob van der Goot, Hessel Haagsma, Barbara Plank, and Martijn Wieling. 2017. Sharing Is Caring: The Future of Shared Tasks. Computational Linguistics, 43(4):897–904.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.org.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018).

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius Von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence Generation Model for Multi-Label Classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.