Error Analysis in a Hate Speech Detection Task: the Case of HaSpeeDe-TW at EVALITA 2018

Chiara Francesconi
Dipartimento di Lingue e Letterature Straniere e Culture Moderne, University of Turin
chiara.francesconi@edu.unito.it

Cristina Bosco, Fabio Poletto, Manuela Sanguinetti
Dipartimento di Informatica, University of Turin
{bosco,poletto,msanguin}@di.unito.it

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Taking as a case study the Hate Speech Detection task at EVALITA 2018, the paper discusses the distribution and typology of the errors made by the five best-scoring systems. The focus is on the sub-task where Twitter data was used both for training and testing (HaSpeeDe-TW). In order to highlight the complexity of hate speech and the reasons behind the failures in its automatic detection, the annotation provided for the task is enriched with orthogonal categories annotated in the original reference corpus, such as aggressiveness, offensiveness, irony and the presence of stereotypes.

1 Introduction

The field of Natural Language Processing witnesses an ever-growing number of automated systems trained on annotated data and built to solve, with remarkable results, the most diverse tasks. As performances increase, the resources, settings and features that contributed to the improvement are (understandably) emphasized, but sometimes little or no room is given to an analysis of the factors that caused the system to misclassify some items. This paper aims to draw attention to the importance of a thorough error analysis of the performance of supervised systems, as a means to produce advancement in the field. The errors made by a system may reveal not only the weakness of the system itself, but also the sparseness of the training data, the failure of the annotation scheme to describe the observed phenomena, or the inherent ambiguity of the data. The presence of the same errors in the results of several systems involved in a shared task may also provide interesting hints about the directions to be followed in the improvement of both data and systems.

As a case study for carrying out error analysis, data from a shared task have been used in this paper. Shared tasks offer clean, high-quality annotated datasets on which different systems are trained and tested. Although researchers often omit to reflect on what caused a system to fail (Nissim et al., 2017), shared tasks are an ideal ground for sharing negative results and encouraging reflection on "what did not work": an excellent opportunity to carry out a comparative error analysis and search for patterns that may, in turn, suggest improvements in both the dataset and the systems.

Here we analyze the case of the Hate Speech Detection (HaSpeeDe) task (Bosco et al., 2018) presented at EVALITA 2018, the Evaluation Campaign for NLP and Speech Tools for Italian (Caselli et al., 2018). HS detection is a really complex task, starting from the definition of the notion on which it is centered. Considering the growing attention it is gaining (see, e.g., the variety of resources and tasks for HS developed in the last few years), we believe that error analysis could be especially interesting and useful in this case, as well as in other tasks where the outcome of systems meaningfully depends on the resources exploited for training and testing.

The paper outlines the background and motivations behind this research (Section 2), describes the sub-task on which the study is based (Section 3), reports on the error analysis process (Section 4), discusses its results (Section 5), and presents some conclusive remarks (Section 6).

2 Background and Motivations

There are several issues connected to the identification of HS: its juridical definition, the subjectivity of its perception, the need to remove potentially illegal content from the web without unjustly removing legal content, and a list of linguistic phenomena that partly overlap with HS but need to be kept apart.

Many works have recently contributed to the field by releasing novel annotated resources or presenting automated classifiers. Two surveys on HS detection were recently published by Schmidt and Wiegand (2017) and Fortuna and Nunes (2018). Since 2016, shared tasks on the detection of HS or related phenomena (such as abusive language or misogyny) have been organized, effectively enhancing advancements in resource building and system development. These include HatEval at SemEval 2019 (Basile et al., 2019), AMI at IberEval 2018 (Fersini et al., 2018), HaSpeeDe at EVALITA 2018 (Bosco et al., 2018) and more. Nevertheless, the growing interest in HS detection suggests that the task is far from being solved: improving the quality and interoperability of resources, designing suitable annotation schemes and reducing biases in the annotation are still as needed as work on system engineering. Establishing standards and good practices in error analysis can enhance these processes and push towards the development of effective classifiers for HS.

While the academic literature is rich with works on human annotation and evaluation metrics, it is not as easy to find works dedicated to the error analysis of automated classification systems. Such an analysis is more often found as a section of papers describing a system (see, e.g., Mohammad et al. (2018)); this section, however, is not always present. Examining the errors made by a system, classifying them and searching for linguistic patterns appear to be a somewhat undervalued job, especially when the system had an overall good performance. Yet, it is crucial to understand why a system proved to be a weak solution to certain instances of a problem, even while being excellent for other instances.

In the context of COLING 2018, error analysis emerged as one of the most relevant features to be addressed in NLP research (https://coling2018.org/error-analysis-in-research-and-writing/). This attention to error analysis encouraged authors to submit papers with a dedicated section, with Yang et al. (2018) winning the award for the best error analysis, and is a step towards establishing good practices in the NLP community.

In the wake of this awareness, we apply linguistic insights to one of the annotated corpora used within the HaSpeeDe shared task, namely the HaSpeeDe-TW sub-task dataset (described in Section 3). The characteristics of this dataset make it ideal for our purpose: each tweet is connected to a target and is annotated not only for the presence of HS but for four other parameters. While a comparative analysis of two corpora presenting different textual genres (HaSpeeDe-TW and HaSpeeDe-FB) might have offered interesting perspectives, the lack of such characteristics in the FB dataset prevents a thorough comparison. Furthermore, among the in-domain HaSpeeDe sub-tasks, HaSpeeDe-TW is the one where systems achieved the lower F1-scores, thus providing more material for our analysis.

3 HaSpeeDe-TW at EVALITA 2018: A Brief Overview

While a description of the HaSpeeDe task as a whole has been provided in the organizers' overview (Bosco et al., 2018), here we focus on HaSpeeDe-TW, one of the three sub-tasks into which the competition was structured (the other two being HaSpeeDe-FB, where Facebook data were used both for training and testing the systems, and Cross-HaSpeeDe, further subdivided into Cross-HaSpeeDe-FB and Cross-HaSpeeDe-TW, where systems were trained using Facebook data and tested against Twitter data in the former, and the opposite in the latter). The sub-task consisted in a binary classification of hateful vs non-hateful tweets. The training set and test set contain 3,000 and 1,000 tweets respectively, labeled with 1 or 0 for the presence of HS, with a distribution, in both sets, of around 1/3 hateful against 2/3 non-hateful tweets. Data are drawn from an already existing HS corpus (Poletto et al., 2017), whose original annotation scheme was simplified for the purposes of the task (see Section 4).

Nine teams participated in the task, submitting fifteen runs. The five best scores, submitted by the teams ItaliaNLP (whose runs ranked 1st and 2nd) (Cimino and De Mattei, 2018), RuG (Bai et al., 2018), InriaFBK (Corazza et al., 2018) and sbMMP (von Grünigen et al., 2018), ranged from 0.7993 to 0.7809 in terms of macro-averaged F1-score (all official ranks are available at https://goo.gl/xPyPRW). They applied both classical machine learning approaches, in particular Linear Support Vector Machines (ItaliaNLP, RuG), and more recent deep learning algorithms, such as Convolutional Neural Networks (sbMMP) or Bi-LSTMs (ItaliaNLP, who adopted a multi-task learning approach exploiting the SENTIPOLC 2016 (Barbieri et al., 2016) dataset as well).
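Since the official ranking is based on the macro-averaged F1-score, which gives the minority (hateful) class the same weight as the majority class, it may help to recall how the metric is computed. The following is our own minimal illustration under the standard definition, not the official evaluation script:

```python
def prf(gold, pred, cls):
    """Precision, recall and F1 of `pred` w.r.t. `gold` for class `cls`."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1-scores (classes 0 and 1)."""
    return sum(prf(gold, pred, c)[2] for c in (0, 1)) / 2

# toy example: one hateful tweet missed (a false negative)
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(round(macro_f1(gold, pred), 4))  # 0.7333
```

Because the mean is unweighted, missing hateful tweets is penalized heavily even though they make up only about a third of the data.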
Learning architectures resorted both to surface features, such as word and character n-grams (RuG), and to linguistic information, such as Part of Speech (ItaliaNLP).

In the next section, we provide a description of the errors collected from these five best runs, put in relation with the specific factors we chose to analyze in this study, encompassing and merging qualitative and quantitative observations. Our analysis is strictly based on the results provided by those systems. An analysis focused on the features of the systems that determined the errors is unfortunately beyond the scope of this work, since HaSpeeDe participants were only requested to provide their results after training their systems.

4 Error Analysis

Error analysis can be used in between runs to improve results or to test different feature settings. With the aim of weaving a broader reflection on the especially hard linguistic patterns within a HS detection task, here it is performed a posteriori and on the aggregated results of the five systems on the HaSpeeDe-TW test set (1,000 tweets). We focus on the answers given by the majority of the five best systems because we believe they provide a faithful representation of the errors, without the noise due to the presence of the worst runs.

Even though only the annotation concerning the presence of HS was distributed to the teams, the corpus from which the training and test set of HaSpeeDe-TW were extracted was provided with additional labels (Poletto et al., 2017; Sanguinetti et al., 2018). These labels (see Table 1) were meant to mark the user's intention to be aggressive (aggressiveness), the potentially hurtful effect of a tweet (offensiveness), the use of ironic devices to possibly mitigate a hateful message (irony), and whether the tweet contains any implicit or explicit reference to negative beliefs about the targeted group (stereotype).

label            values
aggressiveness   no, weak, strong
offensiveness    no, weak, strong
irony            yes, no
stereotype       yes, no

Table 1: The original annotation scheme of the HS corpus that was (partially) used in HaSpeeDe-TW.

These labels were conceived with the aim of identifying some particular aspects that may intersect HS but occur independently. As a matter of fact, hateful content towards a given target might be expressed using aggressive tones or offensive/stereotypical slurs, but also in much subtler forms. At the same time, aggressive or offensive content, though addressed to a potential HS target, does not necessarily imply the presence of HS. Our assumption while carrying out this study was that such a close, but at times misleading, relation between HS on one side and these phenomena on the other could be a source of error for the automatic systems.

In addition, other aspects of both a linguistic and extra-linguistic nature were taken into account, so as to complement the analysis. We thus considered the tweets' targets, i.e. Roma, Immigrants and Muslims (also information available from the original HS corpus). Finally, we selected three features that are typical of computer-mediated communication and of social platforms such as Twitter: the presence of links, multi-word hashtags, and the use of capitalized words.

The test set was composed of 32.4% hateful and 67.6% non-hateful tweets. As the first step of our analysis, we compared the gold label assigned to each tweet in the test set with the one attributed by the majority of the five runs considered for the task. An error was considered to occur when the label assigned by the majority of the systems was different from the gold label. If we extend our analysis to all the fifteen submitted runs, 156 out of 1,000 tweets have been misclassified by the majority of them. However, this number increases to 172 if only the five best runs are taken into account.

Regardless of the correct label, agreement among the five best runs is higher than that among all runs, and among any other set of runs: those systems which have best modeled the phenomenon on the data provided appear to have made similar mistakes. This supports our hypothesis that errors mostly depend on data-dependent features rather than on the systems, which are all different in approach and feature setting.

As for the method adopted, the percentage of errors for the gold positives and the gold negatives in the whole test set was calculated. First, the rates were calculated considering the two labels (hateful and non-hateful) separately, in order to balance their different distribution in the test set; then the results were halved to represent the whole corpus in percentage and to maintain the proportion between the results of the tags. All the percentages correlating two different tags were calculated this way, so that the results could be easily compared. The percentages of mistakes for each label of the categories were determined and compared to the general result, to understand whether they influenced it positively or negatively. Table 2 summarizes the results for each label, showing the distribution of the false negatives (FN), false positives (FP), true positives (TP) and true negatives (TN). The error percentages higher than the general result are in bold font.

5 Results and Discussion

In order to find some answers to our research questions, and evidence of the influence of the annotated features on the systems' results, we provide in this section an analysis driven by the categories described in the previous section.

Aggressiveness and Offensiveness. The different degrees of aggressiveness did not affect the systems' recall, but we measured more FPs when weak or strong aggressiveness is involved (more than thrice as many as in the overall results when strong aggressiveness is present). Offensiveness seems to hold a similar but heavier influence on performance, causing better recall but worse precision: FPs are more than doubled when strong offensiveness is present.

The presence of offensiveness is often associated with slurs or vulgar terms: these are not a consistent presence in the dataset (the most vulgar tweets are probably quickly removed by the platform), and mostly appear in tweets classified as HS. However, about half of the non-hateful tweets containing offensive words were wrongly classified as hateful, proving that offensiveness can be misleading for systems. In these cases, a lexicon-based approach can fail, while attention to the context could be crucial: in the most common instances of false positives, in fact, the offensive words did not refer to the targets.

HS Targets. Analyzing the three targets of HS allowed us to understand how the systems reacted to different ways of expressing hate. Most of the errors were caused by the target Roma: few hateful tweets were recognized, and FNs are more than 30%. Results for the target Immigrants are similar to the overall performance, only with a slightly higher number of FPs. The target Muslims caused a low number of FNs but almost twice as many FPs as in the general performance.

The systems seem to struggle to recognize hateful content against Roma: this may be caused by an imbalance in the test set (only 6.3% of the tweets with the target Roma are labelled as HS, while the targets Immigrants and Muslims have 12.6% and 13.4% of hateful tweets respectively) or by biases in the annotation. The poor results achieved in classifying messages with target Roma can also be explained by the subtler ways of expressing HS when this target is involved, more heavily based on stereotypes than with the other targets. The hate against the other two targets, in particular Muslims, was instead very explicit. See the following examples extracted from the test set.

2235. Roma, colpisce una pecora con il pallone: bambino rom accecato da un pastore https://t.co/KsSAS3fUx9 @ilmessaggeroit HA DIFESO I SUOI AVERI! ("Rome, Roma child hits a sheep with a ball: blinded by a shepherd @ilmessaggeroit HE DEFENDED HIS PROPERTY!") [FN, strong aggressiveness, target: Roma]

4749. @Corriere Uccidere gli islamici, prima di tutto. ("@Corriere Kill the Muslims, first of all.") [TP, strong aggressiveness, target: religion]

Other features. Some other features were considered in our analysis. The presence of stereotype was more frequent in hateful tweets, which caused a slight increase in FPs; conversely, cases of HS without stereotype posed no issues to the systems. Moreover, as expected, the presence of irony slightly increased the error rate in both hateful and non-hateful tweets.

The presence of Twitter's linguistic devices also negatively influenced the results, probably because of the difficulty encountered by systems when some semantic content assumes non-standard forms, e.g. links, multi-word hashtags and capitalized words.

URLs frequently occur in the data, but mostly in non-hateful tweets (although this may be a peculiarity of this dataset). Systems appear to have trouble recognizing hateful tweets that contain URLs (errors increased by 14%). Conversely, the absence of URLs caused an increase in FPs. This feature is unlikely to be directly connected to hateful language: we rather believe that it could somehow affect predictions regardless of the actual content.

                        FN    FP    TP    TN    Gold HS   Gold Not-HS
general                 15%    6%   35%   44%    32.3%     67.7%
no aggressiveness       15%    4%   35%   46%    13.5%     56.8%
weak aggressiveness     15%   10%   35%   40%    11.2%     10.1%
strong aggressiveness   15%   19%   35%   31%     7.6%      0.8%
no offensiveness        20%    5%   30%   45%    10.9%     60%
weak offensiveness      13%   11%   37%   39%    14.6%      4.9%
strong offensiveness    12%   16%   38%   34%     6.8%      2.8%
no irony                15%    5%   35%   45%    27.8%     59%
yes irony               18%    9%   32%   41%     4.5%      8.7%
no stereotype           15%    5%   35%   45%    11.6%     49.7%
yes stereotype          15%    8%   35%   42%    20.7%     18%
Immigrants              15%    9%   35%   41%    12.6%     22.4%
Muslims                  8%   11%   42%   39%    13.4%     12.2%
Roma                    31%    1%   19%   49%     6.3%     33.1%
no link                 11%   13%   37%   39%    25.4%     24.4%
yes link                29%    1%   21%   49%     7%       43.2%
multi hashtags          23%    8%   27%   42%     3%        1.9%
no capitalized words    15%    5%   35%   45%    29.1%     64.1%
yes capitalized words   14%    9%   36%   41%     3.3%      3.5%

Table 2: Percentage of correct (TPs and TNs) and erroneous (FPs and FNs) results in relation to the features considered in the analysis, along with the actual distribution of these features in the test set.
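The aggregation and normalization that produce the figures in Table 2 can be sketched in a few lines of Python (an illustrative reconstruction on toy data; the function and variable names are ours, not the authors' code): the label assigned by the majority of the five runs is compared with the gold label, FN and TP rates are computed over the gold positives, FP and TN rates over the gold negatives, and each rate is then halved so that the four cells of a row sum to 100%.

```python
def majority(runs):
    """Majority label per tweet across runs (each run: a list of 0/1 labels)."""
    return [int(sum(run[i] for run in runs) * 2 > len(runs))
            for i in range(len(runs[0]))]

def halved_rates(gold, pred):
    """FN/TP rates over the gold positives, FP/TN rates over the gold
    negatives, each halved so that the four values sum to 1 (cf. Table 2)."""
    pos = [p for g, p in zip(gold, pred) if g == 1]
    neg = [p for g, p in zip(gold, pred) if g == 0]
    return {"FN": pos.count(0) / len(pos) / 2,
            "TP": pos.count(1) / len(pos) / 2,
            "FP": neg.count(1) / len(neg) / 2,
            "TN": neg.count(0) / len(neg) / 2}

# toy data: 5 runs over 4 tweets
runs = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0]]
gold = [1, 1, 0, 0]
maj = majority(runs)            # [1, 0, 1, 0]: one FN, one FP
print(halved_rates(gold, maj))  # {'FN': 0.25, 'TP': 0.25, 'FP': 0.25, 'TN': 0.25}
```

With the real class distribution (about one third hateful), this normalization is what makes, e.g., the general row of Table 2 read FN 15% + TP 35% = 50% and FP 6% + TN 44% = 50%.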
Also multi-word hashtags influenced the results, especially for hateful content: their presence increased FNs by 8%. The reason for this kind of error might lie in the fact that our dataset contains some cases where the crucial element in a hateful tweet is precisely the hashtag, as in the example below:

2149. Quando vedremo lo stessa tema portato in piazza con la stessa forza e determinazione? Mai credo. #stopislam https://t.co/dDYLZB1BlJ ("When will we see people fighting for the same issue with the same strength and determination? Never, I believe.") [multi-word hashtag, FN]

The text of this tweet is not hateful, but an element of hatred is conveyed by the hashtag "#stopislam". The ability to separate multi-word hashtags into the words composing them would improve the performance of the systems: tweets with a multi-word hashtag clarifying the text would have a better chance of being correctly identified.

Finally, some capitalized words were found in the dataset, mostly in hateful tweets, which again caused an increase in FPs. Despite their small number, we noticed that, in non-hateful tweets, a higher percentage of the capitalized words are named entities (names of places, people, newspapers, etc.), while in hateful tweets capitalized words are more often used to intensify opinions or feelings.

Among all the features taken into account, offensiveness seems to have affected the performance in various ways: its absence led systems to classify as non-hateful tweets that are indeed hateful, while its presence caused the inverse error. A possible explanation for this is that, as shown in Sanguinetti et al. (2018), offensiveness does not correlate with HS even though it can be one of its features. The systems might have taken offensive terms as indicators of HS, as humans also tend to do (see, for example, Bohra et al. (2018)), but this is a false assumption that systems should be trained to avoid. Aggressiveness also caused a certain degree of errors, but only affecting precision.

6 Lessons Learned and Conclusion

This paper presents a detailed error analysis of the results obtained within the context of a shared task for HS detection. In our study, we took into account two types of data: content information, provided by the gold standard labels assigned to each tweet, and metadata information, namely the presence of URLs, hashtags and capitalized words. The results prove the importance of considering other categories related to the one on which the task was centered.

The analysis of performances in relation to URLs poses a controversial result. There are two reasons why tweets collected via Twitter's API may contain a URL: the tweet may have been cut off and a URL automatically generated as a link to the complete tweet, or the URL may be part of the original tweet and lead to an external page. In both cases, unless the URL is followed, the tweet is likely to be harder to understand than a tweet that contains no URL. This may cause lower agreement among human judges, and it is a very complicated issue for automated systems to deal with, especially when the meaning of the tweet is unintelligible without first opening the URL. Tweets containing URLs are, for the time being, less reliable as training data and pose a tougher challenge for Sentiment Analysis tasks at large; we encourage an effort towards solving this issue.

As for capitalized words, future work may include investigating how they affect human annotation, as some judges may show a bias towards associating capitalized words with HS or other categories. Furthermore, improvements may come from considering the PoS tags of such words, or the number of consecutive capitalized words.

Multi-word hashtags also need to be treated with care, as they may affect and even overturn the meaning of the whole tweet. Moreover, a hashtag might require syntactic, semantic and world-knowledge processing in order to be fully understood: for example, by comparing the phrase "stop Islam" with, e.g., "stop harassment", we can see that the word "stop" is not necessarily negative, and becomes so only because it is followed by the name of a religion whose members are, nowadays and in Western society, particularly subject to discrimination.

Overall, our analysis suggests that the systems' failures are motivated by the difficulty of dealing with cases where HS is less directly expressed, and paves the way for future work on, e.g., the development of tools that perform a more careful analysis of the text.

Acknowledgments

The work of C. Bosco and M. Sanguinetti is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01), while that of F. Poletto is funded by Fondazione Giovanni Goria and Fondazione CRT (Talenti della Società Civile 2018).

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate Speech Detection in Italian Social Media. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). CEUR.org.

Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 54–63.

Aditya Bohra, Deepanshu Vijay, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection. In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media, pages 36–41.

Cristina Bosco, Felice Dell'Orletta, Fabio Poletto, Manuela Sanguinetti, and Maurizio Tesconi. 2018. Overview of the EVALITA 2018 Hate Speech Detection Task. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Andrea Cimino and Lorenzo De Mattei. 2018. Multi-task Learning in Deep Neural Networks for Hate Speech Detection in Facebook and Twitter. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing Different Supervised Approaches to Hate Speech Detection. In Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), pages 214–228. CEUR-WS.org.

Paula Fortuna and Sérgio Nunes. 2018. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR), 51(4):85.

Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko. 2018. SemEval-2018 Task 1: Affect in Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, pages 1–17.

Malvina Nissim, Lasha Abzianidze, Kilian Evang, Rob van der Goot, Hessel Haagsma, Barbara Plank, and Martijn Wieling. 2017. Sharing Is Caring: The Future of Shared Tasks. Computational Linguistics, 43(4):897–904.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.org.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018).

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius Von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, and Houfeng Wang. 2018. SGM: Sequence Generation Model for Multi-Label Classification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3915–3926.