Overview of the EVALITA 2018 Hate Speech Detection Task

Cristina Bosco (University of Torino, Italy) bosco@di.unito.it
Felice Dell'Orletta (ILC-CNR, Pisa, Italy) felice.dellorletta@ilc.cnr.it
Fabio Poletto (Acmos, Torino, Italy) fabio.poletto@edu.unito.it
Manuela Sanguinetti (University of Torino, Italy) msanguin@di.unito.it
Maurizio Tesconi (IIT-CNR, Pisa, Italy) maurizio.tesconi@iit.cnr.it

Abstract

English. The Hate Speech Detection (HaSpeeDe) task is a shared task on Italian social media (Facebook and Twitter) for the detection of hateful content, and it has been proposed for the first time at EVALITA 2018. Providing two datasets from two online social platforms that differ from the linguistic and communicative point of view, we organized the task into three sub-tasks in which systems must be trained and tested on the same resource, or trained on one resource and tested on the other: HaSpeeDe-FB, HaSpeeDe-TW and Cross-HaSpeeDe (further subdivided into the Cross-HaSpeeDe FB and Cross-HaSpeeDe TW sub-tasks). Overall, 9 teams participated in the task, and the best system achieved a macro F1-score of 0.8288 for HaSpeeDe-FB, 0.7993 for HaSpeeDe-TW, 0.6541 for Cross-HaSpeeDe FB and 0.6985 for Cross-HaSpeeDe TW. In this report, we describe the datasets released and the evaluation measures, and we discuss the results.

Italiano. HaSpeeDe is the first evaluation campaign for systems that automatically identify hate speech on social media (Facebook and Twitter) in Italian, proposed within EVALITA 2018. Providing participants with two datasets drawn from two platforms that differ from a linguistic and communicative point of view, we organized HaSpeeDe into three tasks in which systems are trained and tested on the same type of data, or trained on one type and tested on the other: HaSpeeDe-FB, HaSpeeDe-TW and Cross-HaSpeeDe (in turn subdivided into Cross-HaSpeeDe FB and Cross-HaSpeeDe TW). Overall, 9 teams participated in the campaign, and the best system achieved a macro F1 score of 0.8288 in HaSpeeDe-FB, 0.7993 in HaSpeeDe-TW, 0.6541 in Cross-HaSpeeDe FB and 0.6985 in Cross-HaSpeeDe TW. The paper describes the released datasets and the evaluation procedure, and discusses the results obtained.

1 Introduction and Motivations

Online hateful content, or Hate Speech (HS), is characterized by some key aspects (such as virality, or presumed anonymity) which distinguish it from offline communication and make it potentially more dangerous and hurtful. Therefore, its identification becomes a crucial mission in many fields.

The task we have proposed for this edition of EVALITA consists in automatically annotating messages from two popular micro-blogging platforms, Twitter and Facebook, with a boolean value indicating the presence (or not) of HS.

HS can be defined as any expression "that is abusive, insulting, intimidating, harassing, and/or incites to violence, hatred, or discrimination. It is directed against people on the basis of their race, ethnic origin, religion, gender, age, physical condition, disability, sexual orientation, political conviction, and so forth" (Erjavec and Kovačič, 2012).
Although definitions of and approaches to HS vary a lot and depend on the juridical tradition of each country, many agree that what is identified as such cannot fall under the protection granted by the right to freedom of expression, and must be prohibited. Also in order to put the Code of Conduct of the European Union [1] into practice, online platforms like Twitter, Facebook or YouTube discourage hateful content, but its removal mainly relies on reports by users and trusted flaggers, and lacks systematic control.

[1] On May 31, 2016, the EU Commission presented, together with Facebook, Microsoft, Twitter and YouTube, a "Code of conduct on countering illegal hate speech online".

Although HS analysis and identification require a multidisciplinary approach that includes knowledge from different fields (psychology, law and social sciences, among others), NLP plays a fundamental role in this respect. Therefore, the development of high-accuracy automatic tools able to identify HS assumes the utmost relevance not only for NLP – and Italian NLP in particular – but also for all the practical applications such a task lends itself to. Furthermore, as also suggested in Schmidt and Wiegand (2017), the community would considerably benefit from a benchmark dataset for HS detection underlying a commonly accepted definition of the task.

As regards the state of the art, a large number of contributions have been proposed on this topic, ranging from lexicon-based approaches (Gitari et al., 2015) to various machine learning techniques: naïve Bayes classifiers (Kwok and Wang, 2013), Logistic Regression and Support Vector Machines (Burnap and Williams, 2015; Davidson et al., 2017), and the more recent Recurrent and Convolutional Neural Networks (Mehdad and Tetreault, 2016; Gambäck and Sikdar, 2017). However, there exist no comparative studies which would allow a judgement on the most effective learning method (Schmidt and Wiegand, 2017).
Furthermore, a large number of academic events and shared tasks have taken place in the recent past, thus reflecting the interest of the NLP community in HS and HS-related topics; to name a few, the first and second edition of the Workshop on Abusive Language [2] (Waseem et al., 2017), the First Workshop on Trolling, Aggression and Cyberbullying (Kumar et al., 2018), which also included a shared task on aggression identification, the tracks on Automatic Misogyny Identification (AMI) (Fersini et al., 2018b) and on authorship and aggressiveness analysis (MEX-A3T) (Carmona et al., 2018) proposed at the 2018 edition of IberEval, the GermEval Shared Task on the Identification of Offensive Language (Wiegand et al., 2018), the Automatic Misogyny Identification task at EVALITA 2018 (Fersini et al., 2018a), and finally the SemEval shared task on hate speech detection against immigrants and women (HatEval), which is still ongoing at the time of writing [3].

[2] https://sites.google.com/view/alw2018/
[3] https://competitions.codalab.org/competitions/19935

On the other hand, such contributions and events mainly concern other languages (English, for the most part), while very few of them deal with Italian (Del Vigna et al., 2017; Musto et al., 2016; Pelosi et al., 2017). Precisely for this reason, the Hate Speech Detection (HaSpeeDe) [4] task has been conceived and proposed within the EVALITA context (Caselli et al., 2018); its purpose is to encourage and promote the participation of several research groups, both from academia and industry, by making a shared dataset available, in order to allow an advancement of the state of the art in this field for Italian as well.

[4] http://www.di.unito.it/~tutreeb/haspeede-evalita18/

2 Task Organization

Considering the linguistic, as well as meta-linguistic, features that distinguish Twitter and Facebook posts, namely due to the differences in use between the two platforms and the character limitations imposed on their messages (especially on Twitter), the task has been further organized into three sub-tasks, based on the dataset used (see Section 3):

• Task 1: HaSpeeDe-FB, where only the Facebook dataset could be used to classify the Facebook test set
• Task 2: HaSpeeDe-TW, where only the Twitter dataset could be used to classify the Twitter test set
• Task 3: Cross-HaSpeeDe, which has been further subdivided into two sub-tasks:
  – Task 3.1: Cross-HaSpeeDe FB, where only the Facebook dataset could be used to classify the Twitter test set
  – Task 3.2: Cross-HaSpeeDe TW, where, conversely, only the Twitter dataset could be used to classify the Facebook test set

Cross-HaSpeeDe, in particular, has been proposed as an out-of-domain task that specifically aimed, on the one hand, at highlighting the challenging aspects of using social media data for classification purposes and, on the other, at enhancing the systems' ability to generalize their predictions across different datasets.

3 Datasets and Format

The datasets proposed for this task are the result of a joint effort of two research groups on harmonizing the annotation previously applied to two different datasets, in order to allow their exploitation in the task.
The first dataset is a collection of Facebook posts developed by the group from Pisa and created in 2016 (Del Vigna et al., 2017), while the other one is a Twitter corpus developed in 2017-2018 by the Turin group (Sanguinetti et al., 2018). Sections 3.1 and 3.2 briefly introduce the original datasets, while Section 3.3 describes the unified annotation scheme adopted in both corpora for the purposes of this task.

3.1 Facebook Dataset

This is a corpus of comments retrieved from the Facebook public pages of Italian newspapers, politicians, artists, and groups. Those pages were selected because they typically host discussions spanning a variety of topics.
The comments collected were related to a series of web pages and groups, chosen as being suspected to possibly contain hateful content: salviniofficial, matteorenziufficiale, lazanzarar24, jenusdinazareth, sinistracazzateliberta2, ilfattoquotidiano, emosocazzi, noiconsalviniufficiale.
Overall, 17,567 Facebook comments were collected from 99 posts crawled from the selected pages. Five bachelor students were asked to annotate the comments; in particular, 3,685 comments received at least 3 annotations. The annotators were asked to assign one class to each post, where the classes span over the following levels of hate: No hate, Weak hate, Strong hate.
Hateful messages were then divided into distinct categories: Religion, Physical and/or mental handicap, Socio-economical status, Politics, Race, Sex and Gender issues, and Other.

3.2 Twitter Dataset

The Twitter dataset released for the competition is a subset of a larger hate speech corpus developed at the University of Turin. The corpus is indeed part of the Hate Speech Monitoring program [5], coordinated by the Computer Science Department with the aim of detecting, analyzing and countering HS with an inter-disciplinary approach (Bosco et al., 2017). Its preliminary stage of development has been described in Poletto et al. (2017), while the fully developed corpus is described in Sanguinetti et al. (2018).

[5] http://hatespeech.di.unito.it/

The collection includes Twitter posts gathered with a classical keyword-based approach, more specifically by filtering the corpus using neutral keywords related to three social groups deemed as potential HS targets in the Italian context: immigrants, Muslims and Roma.
After a first annotation step that resulted in a collection of around 1,800 tweets, the corpus was further expanded by adding newly annotated data. The newly introduced tweets were annotated partly by experts and partly by CrowdFlower (now Figure Eight) contributors. The final version of the corpus consists of 6,928 tweets.
The main feature of this corpus is its annotation scheme, specifically designed to properly encode the multiplicity of factors that can contribute to the definition of hate speech, and to offer a broader tagset capable of better representing all those factors which may increase, or rather mitigate, the impact of the message. This resulted in a scheme that includes, besides the HS tag (no-yes), also its intensity degree (from 1 through 4 if HS is present, and 0 otherwise), the presence of aggressiveness (no-weak-strong) and offensiveness (no-weak-strong), as well as irony and stereotype (no-yes).
In addition, given that irony has been included as an annotation category in the scheme, part of this hate speech corpus (i.e. the tweets annotated as ironic) has also been used in another task proposed in this edition of EVALITA, namely the one on irony detection in Italian tweets (IronITA) [6] (Cignarella et al., 2018). More precisely, the overlapping tweets in the IronITA datasets are 781 in the training set and just 96 in the test set.

[6] http://www.di.unito.it/~tutreeb/ironita-evalita18/
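As a purely illustrative sketch (not part of the official task material), the Turin annotation scheme described above can be thought of as a small per-tweet record from which the binary HaSpeeDe label is then derived; all class and field names below are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TweetAnnotation:
        # One tweet of the Turin corpus, following the scheme sketched in Section 3.2;
        # field names are illustrative, not the official column names of the corpus.
        tweet_id: str
        text: str
        hs: bool                # hate speech present: yes/no
        intensity: int          # 1 through 4 if hs is True, 0 otherwise
        aggressiveness: str     # "no" | "weak" | "strong"
        offensiveness: str      # "no" | "weak" | "strong"
        irony: bool             # yes/no
        stereotype: bool        # yes/no

        def haspeede_label(self) -> int:
            # Simplified binary label released for HaSpeeDe: 1 if HS, 0 otherwise.
            return int(self.hs)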
3.3 Format and Data in HaSpeeDe

The annotation format provided for the task is the same for both datasets described above, and it consists of a simplified version of the schemes adopted in the two corpora introduced in Sections 3.1 and 3.2.
The data have been encoded in UTF-8 plain-text files with three tab-separated columns, each one representing the following information:

1. the ID of the Facebook comment or tweet [7],
2. the text,
3. the class: 1 if the text contains HS, and 0 otherwise (see Tables 1 and 2 for a few examples).

[7] In order to meet the GDPR requirements, texts have been pseudonymized by replacing all original IDs in both datasets with newly-generated ones.

id    text                                                                                                 hs
8     Io voterò NO NO E NO                                                                                 0
36    Matteo serve un colpo di stato. Qua tra poco dovremo andare in giro tutti armati come in America.    1

Table 1: Annotation examples from the Facebook dataset.

id       text                                                                             hs
1,783    Corriere: Mafia Capitale, 4 patteggiamenti Gli appalti truccati dei campi rom    0
3,290    altro che profughi? sono zavorre e tutti uomini                                  1

Table 2: Annotation examples from the Twitter dataset.

Both the Facebook and the Twitter dataset consist of a total amount of 4,000 comments/tweets retrieved from the main corpora introduced in Sections 3.1 and 3.2. The data were randomly split into a development and a test set, of 3,000 and 1,000 messages respectively.
The distribution of the labels expressing the presence or not of HS in both datasets is summarized in Tables 3 and 4.

        0       1
Train   1,618   1,382
Test    323     677
Total   1,941   2,059

Table 3: Label distribution in the Facebook dataset.

        0       1
Train   2,028   972
Test    676     324
Total   2,704   1,296

Table 4: Label distribution in the Twitter dataset.
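A minimal loading sketch, assuming only the format described above (UTF-8 plain text, three tab-separated columns) and a hypothetical file name; the actual names of the distributed files may differ:

    import csv

    def load_haspeede(path):
        # Read one HaSpeeDe file: UTF-8 plain text, three tab-separated columns
        # (id, text, class), where class is 1 for hate speech and 0 otherwise.
        ids, texts, labels = [], [], []
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
                if len(row) < 3:
                    continue  # skip blank or malformed lines, if any
                ids.append(row[0])
                texts.append(row[1])
                labels.append(int(row[2]))
        return ids, texts, labels

    # hypothetical file name, used for illustration only
    train_ids, train_texts, train_labels = load_haspeede("haspeede_FB-train.tsv")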
4 Evaluation

Participants were allowed to submit up to 2 runs for each task, and a separate official ranking has been provided for each task.
The evaluation has been performed according to the standard metrics known in the literature, i.e. Precision, Recall and F1-score. However, given the imbalanced distribution of hateful vs. not hateful messages, and in order to get more useful insights into a system's performance on a given class, the scores have been computed for each class separately; the F1-score has then been macro-averaged, so as to obtain the overall results.
For all tasks, the baseline score has been computed as the performance of a classifier that always assigns the most frequent class.
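The following is a minimal sketch of this scoring procedure as we read it (per-class Precision, Recall and F1, macro-averaged F1, and a most-frequent-class baseline estimated on the training labels), written with scikit-learn; it is not the official evaluation script.

    from collections import Counter
    from sklearn.metrics import f1_score, precision_recall_fscore_support

    def evaluate(gold, predicted):
        # Precision, recall and F1 for each class (0 and 1), plus macro-averaged F1.
        p, r, f1, _ = precision_recall_fscore_support(
            gold, predicted, labels=[0, 1], zero_division=0)
        macro = f1_score(gold, predicted, labels=[0, 1],
                         average="macro", zero_division=0)
        return {"class_0": (p[0], r[0], f1[0]),
                "class_1": (p[1], r[1], f1[1]),
                "macro_f1": macro}

    def most_frequent_class_baseline(train_labels, gold):
        # Always predict the class that is most frequent in the training data.
        majority = Counter(train_labels).most_common(1)[0][0]
        return evaluate(gold, [majority] * len(gold))

With the label distributions in Tables 3 and 4, always predicting class 0 yields a macro F1 of about 0.24 on the Facebook test set and about 0.40 on the Twitter test set, consistent with the baseline rows reported in the result tables below.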
5 Overview of the Task: Participation and Results

5.1 Task Participants and Submissions

A total amount of 9 teams [8] participated in at least one of the three HaSpeeDe main tasks. Table 5 provides an overview of the teams and their affiliations.
Except for one case, where a single run was sent for HaSpeeDe-TW only, all teams submitted at least one run for all the tasks.

[8] In fact, 11 teams submitted their results, but one team withdrew its submissions, and another one's submissions were removed from the official rankings by the task organizers.

Team             Affiliation
GRCP             Univ. Politècnica de València + CERPAMID, Cuba
InriaFBK         Univ. Côte d'Azur, CNRS, Inria + FBK, Trento
ItaliaNLP        ILC-CNR, Pisa + Univ. of Pisa
Perugia          Univ. for Foreigners of Perugia + Univ. of Perugia + Univ. of Florence
RuG              University of Groningen + Univ. degli Studi di Salerno
sbMMP            Zurich Univ. of Applied Sciences
StopPropagHate   INESC TEC + Univ. of Porto + Eurecat, Centre Tecn. de Catalunya
HanSEL           University of Bari Aldo Moro
VulpeculaTeam    University of Perugia

Table 5: Participants overview.

5.2 Systems

As participants were allowed to submit up to 2 runs for each task, several training options were adopted in order to properly classify the texts.
Furthermore, unlike other tasks, we chose not to establish any distinction between constrained and unconstrained runs, and to allow participants to use all the additional resources they deemed useful for the task (other annotated resources, lexicons, pre-trained word embeddings, etc.), on the sole condition that these were explicitly mentioned in their final report.
Table 6 summarizes the external resources (if any) used by participants to enhance their systems' performance, while the remainder of this section offers a brief overview of the teams' systems and the core methods adopted to participate in the task.

Team             External Resources
GRCP             pre-trained word embeddings
InriaFBK         emotion lexicon
ItaliaNLP Lab    polarity and subjectivity lexicons + 2 word-embedding lexicons
Perugia          Twitter corpus + hate speech lexicon + polarity lexicon
RuG              pre-trained word embeddings + bad/offensive word lists
sbMMP            pre-trained word embeddings
StopPropagHate   –
HanSEL           pre-trained word embeddings
VulpeculaTeam    polarity lexicon + lists of bad words + pre-trained word embeddings

Table 6: Overview of the additional resources used by participants, besides the datasets provided by the task organizers.

GRCP (De la Peña Sarracén et al., 2018). The authors proposed a bidirectional Long Short-Term Memory Recurrent Neural Network with an attention-based mechanism that estimates the importance of each word; this context vector is then used with another LSTM model to estimate whether a text is hateful or not.

HanSEL (Polignano and Basile, 2018). The proposed system is based on an ensemble of three classification strategies, mediated by a majority vote algorithm: Support Vector Machine with RBF kernel, Random Forest and Deep Multilayer Perceptron. The input social media text is represented as a concatenation of word2vec sentence vectors and a TF-IDF bag of words.

InriaFBK (Corazza et al., 2018). The authors implemented three different classifier models, based on recurrent neural networks, n-gram-based models and a linear SVC.

ItaliaNLP (Cimino et al., 2018). The participants tested three different classification models: one based on a linear SVM, another one based on a 1-layer BiLSTM, and a newly-introduced one based on a 2-layer BiLSTM which exploits multi-task learning with additional data from the 2016 SENTIPOLC task (Barbieri et al., 2016).

Perugia (Santucci et al., 2018). The participants' system uses a document classifier based on an SVM algorithm. The features used by the system are a combination of features extracted through mathematical operations on FastText word embeddings and another 20 features extracted from the raw text.

RuG (Bai et al., 2018). The authors proposed two different classifiers: an SVM with a linear kernel and an ensemble system composed of an SVM classifier and a Convolutional Neural Network combined by a logistic regression meta-classifier. The features of each classifier are algorithm-dependent and exploit word embeddings, raw text features and features from lexical resources.

sbMMMP. The authors tested two different systems, in a similar fashion to what is described in von Grünigen et al. (2018). The first one is based on an ensemble of Convolutional Neural Networks (CNN), whose outputs are then used as features by a meta-classifier for the final prediction. The second system uses a combination of a CNN and a Gated Recurrent Unit (GRU) together with a transfer-learning approach based on pre-training with a large, automatically-translated dataset.

StopPropagHate (Fortuna et al., 2018). The authors use a classifier based on Recurrent Neural Networks with binary cross-entropy as the loss function. In their system, each input word is represented by a 10,000-dimensional one-hot vector.

VulpeculaTeam (Bianchini et al., 2018). According to the description provided by the participants, a neural network with three hidden layers was used, with word embeddings trained on a set of previously extracted Facebook comments.
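Purely as an illustration of the ensemble-with-majority-vote strategy mentioned in several of the descriptions above (and not a reproduction of any team's actual implementation), a scikit-learn sketch over word n-gram TF-IDF features might look as follows; the actual systems also relied on word embeddings and lexical resources.

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Hard (majority) voting over three heterogeneous classifiers.
    ensemble = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        VotingClassifier(
            estimators=[
                ("svm_rbf", SVC(kernel="rbf", gamma="scale")),
                ("random_forest", RandomForestClassifier(n_estimators=200)),
                ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)),
            ],
            voting="hard",
        ),
    )

    # train_texts/train_labels as loaded in the sketch after Section 3.3;
    # test_texts would be read from the test file in the same way.
    ensemble.fit(train_texts, train_labels)
    predictions = ensemble.predict(test_texts)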
5.3 Results and Discussion

In Tables 7, 8, 9 and 10 we report the final results of HaSpeeDe, separated according to the respective sub-task and ranked by macro F1-score (as described in Section 4) [9]. In case of multiple runs, the suffixes "_1" and "_2" have been appended to each team name, in order to distinguish the run number of the submitted file. Furthermore, some of the runs in the tables have been marked with *: this means that they were re-submitted because of file incompatibility with the evaluation script or other minor issues that did not affect the evaluation process.

[9] Due to space constraints, the complete evaluation for all classes has been made available here: https://goo.gl/xPyPRW

Team                 Macro F1-score
baseline             0.2441
ItaliaNLP_2          0.8288
ItaliaNLP_1          0.8106
InriaFBK_1           0.8002
InriaFBK_2           0.7863
Perugia_2            0.7841
RuG_1                0.7751
HanSEL               0.7738
VulpeculaTeam*       0.7554
RuG_2                0.7428
GRCP_2               0.7147
GRCP_1               0.7144
StopPropagHate_2*    0.6532
StopPropagHate_1*    0.6419
Perugia_1            0.2424

Table 7: Results of the HaSpeeDe-FB task.

Team                 Macro F1-score
baseline             0.4033
ItaliaNLP_2          0.7993
ItaliaNLP_1          0.7982
RuG_1                0.7934
InriaFBK_2           0.7837
sbMMMP               0.7809
InriaFBK_1           0.78
VulpeculaTeam*       0.7783
Perugia_2            0.7744
RuG_2                0.753
StopPropagHate_2*    0.7426
StopPropagHate_1*    0.7203
GRCP_1               0.6638
GRCP_2               0.6567
HanSEL               0.6491
Perugia_1            0.4033

Table 8: Results of the HaSpeeDe-TW task.

Team                 Macro F1-score
baseline             0.4033
InriaFBK_2           0.6541
InriaFBK_1           0.6531
VulpeculaTeam        0.6542
Perugia_2            0.6279
ItaliaNLP_1          0.6068
ItaliaNLP_2          0.5848
GRCP_2               0.5436
RuG_1                0.5409
RuG_2                0.4845
GRCP_1               0.4544
HanSEL               0.4502
StopPropagHate       0.443
Perugia_1            0.4033

Table 9: Results of the Cross-HaSpeeDe FB sub-task.

Team                 Macro F1-score
baseline             0.2441
ItaliaNLP_2          0.6985
InriaFBK_2           0.6802
ItaliaNLP_1          0.6693
InriaFBK_1           0.6547
VulpeculaTeam*       0.6189
RuG_1                0.6021
RuG_2                0.5545
HanSEL               0.4838
Perugia_2            0.4594
GRCP_1               0.4451
StopPropagHate*      0.4378
GRCP_2               0.318
Perugia_1            0.2441

Table 10: Results of the Cross-HaSpeeDe TW sub-task.

In absolute terms, i.e. based on the score of the first-ranked team, the best results have been achieved in the HaSpeeDe-FB task, with a macro F1 of 0.8288, followed by HaSpeeDe-TW (0.7993), Cross-HaSpeeDe TW (0.6985) and Cross-HaSpeeDe FB (0.6541).
The robustness of an approach benefiting from a polarity and subjectivity lexicon is confirmed by the fact that the best-ranking team in both HaSpeeDe-FB and HaSpeeDe-TW, i.e. ItaliaNLP, also achieved valuable results in the cross-domain sub-tasks, ranking at fifth and first position in Cross-HaSpeeDe FB and Cross-HaSpeeDe TW, respectively. But these results may also depend on the combination of the polarity and subjectivity lexicon with word embeddings, which alone did not allow the achievement of particularly high results.
Furthermore, it is not surprising that the best results have been obtained on HaSpeeDe-FB, given that messages posted on this platform are longer and better formed than those on Twitter, allowing systems (and humans too) to find more, and clearer, indications of the presence of HS.
The coarse granularity of the annotation scheme, which is a simplification of the schemes originally proposed for the datasets and was merged specifically for the purpose of this task, probably influenced the scores, which are indeed very promising and high with respect to other tasks in the sentiment analysis area.
As regards the Cross-HaSpeeDe FB and Cross-HaSpeeDe TW sub-tasks, the lower results with respect to the in-domain tasks can be attributed to several factors, among which – as expected – the different distribution of the HS and not-HS classes in the Facebook and Twitter datasets. As a matter of fact, the percentage of HS in the Facebook training and test sets is around 46% and 68%, respectively, while in the Twitter dataset it is around 32% in both sets. Such an imbalanced distribution is reflected in the overall system outputs in the two sub-tasks: in Cross-HaSpeeDe FB, where systems have been evaluated against the Twitter test set, most of the labels predicted as HS were not classified as such in the gold standard; conversely, in Cross-HaSpeeDe TW, the majority of labels predicted as not HS were actually considered as HS in the gold corpus.
Another feature that distinguishes the Facebook dataset from the Twitter one is the wider range of hate categories in the former, compared to the latter (see Sections 3.1 and 3.2). Especially in Cross-HaSpeeDe TW, the identification of hateful messages may have been made even more difficult by the reduced number of potential hate targets in the training set with respect to the test set.
Overall, the heterogeneous nature of the datasets provided for the task – both in terms of class distribution and data composition – together with their quite small size, made the whole task even more challenging; nonetheless, this did not prevent participants from finding appropriate solutions, thus improving the state of the art for HS identification in the Italian language as well.

6 Closing Remarks

This paper describes the HaSpeeDe task for the detection of HS in Italian texts from Facebook and Twitter. The novelty of the task mainly consists in allowing the comparison between the results obtained on the two platforms, and experiments on training on one typology of texts and testing on the other. The results confirmed the difficulty of cross-platform HS detection, but also produced very promising scores in the tasks where data from the same social network were exploited both for training and testing.
Future work can be devoted to an in-depth analysis of errors and to the observation of the contribution that different resources can give to systems performing this task.

Acknowledgments

The work of Cristina Bosco and Manuela Sanguinetti is partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01).

References

Xiaoyu Bai, Flavio Merenda, Claudia Zaghi, Tommaso Caselli, and Malvina Nissim. 2018. RuG @ EVALITA 2018: Hate Speech Detection in Italian Social Media. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Francesco Barbieri, Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and Viviana Patti. 2016. Overview of the Evalita 2016 SENTIment POLarity Classification Task. In Proceedings of the Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016).
Giulio Bianchini, Lorenzo Ferri, and Tommaso Giorni. 2018. Text Analysis for Hate Speech Detection in Italian Messages on Twitter and Facebook. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Cristina Bosco, Viviana Patti, Marcello Bogetti, Michelangelo Conoscenti, Giancarlo Ruffo, Rossano Schifanella, and Marco Stranisci. 2017. Tools and Resources for Detecting Hate and Prejudice Against Immigrants in Social Media. In Proceedings of First Symposium on Social Interactions in Complex Intelligent Systems (SICIS), AISB Convention 2017, AI and Society.

Pete Burnap and Matthew L. Williams. 2015. Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making. Policy & Internet, 7(2).

Miguel Ángel Álvarez Carmona, Estefanía Guzmán-Falcón, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor Pineda, Verónica Reyes-Meza, and Antonio Rico Sulayes. 2018. Overview of MEX-A3T at IberEval 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets. In IberEval@SEPLN. CEUR-WS.org.

Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso. 2018. EVALITA 2018: Overview of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, and Paolo Rosso. 2018. Overview of the Evalita 2018 Task on Irony Detection in Italian Tweets (IronITA). In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task Learning in Deep Neural Networks at EVALITA 2018. In Proceedings of the 6th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA'18), Turin, Italy. CEUR.org.

Michele Corazza, Stefano Menini, Pinar Arslan, Rachele Sprugnoli, Elena Cabrio, Sara Tonelli, and Serena Villata. 2018. Comparing Different Supervised Approaches to Hate Speech Detection. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. CoRR, abs/1703.04009.

Gretel Liz De la Peña Sarracén, Reynaldo Gil Pons, Carlos Enrique Muñiz Cuza, and Paolo Rosso. 2018. Hate Speech Detection Using Attention-based LSTM. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate Me, Hate Me Not: Hate Speech Detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17).

Karmen Erjavec and Melita Poler Kovačič. 2012. "You Don't Understand, This is a New War!" Analysis of Hate Speech in News Web Sites' Comments. Mass Communication and Society, 15(6).

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the EVALITA 2018 Task on Automatic Misogyny Identification (AMI). In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the Task on Automatic Misogyny Identification at IberEval 2018. In IberEval@SEPLN. CEUR-WS.org.

Paula Fortuna, Ilaria Bonavita, and Sérgio Nunes. 2018. Merging Datasets for Hate Speech Classification in Italian. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Björn Gambäck and Utpal Kumar Sikdar. 2017. Using Convolutional Neural Networks to Classify Hate-Speech. In Proceedings of the First Workshop on Abusive Language.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A Lexicon-based Approach for Hate Speech Detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4).
Ritesh Kumar, Atul Kr. Ojha, Marcos Zampieri, and Shervin Malmasi, editors. 2018. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018). Association for Computational Linguistics.

Irene Kwok and Yuzhou Wang. 2013. Locate the Hate: Detecting Tweets Against Blacks. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI Press.

Yashar Mehdad and Joel Tetreault. 2016. Do Characters Abuse More Than Words? In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue.

Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, and Pasquale Lops. 2016. Modeling Community Behavior through Semantic Analysis of Social Data: The Italian Hate Map Experience. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization, UMAP 2016.

Serena Pelosi, Alessandro Maisto, Pierluigi Vitale, and Simonetta Vietri. 2017. Mining Offensive Language on Social Media. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017).

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.

Marco Polignano and Pierpaolo Basile. 2018. HanSEL: Italian Hate Speech Detection through Ensemble Learning and Deep Neural Networks. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference 2018.

Valentino Santucci, Stefania Spina, Alfredo Milani, Giulio Biondi, and Gabriele Di Bari. 2018. Detecting Hate Speech for Italian Language in Social Media. In Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018). CEUR.org.

Anna Schmidt and Michael Wiegand. 2017. A Survey on Hate Speech Detection using Natural Language Processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media. Association for Computational Linguistics.

Dirk von Grünigen, Ralf Grubenmann, Fernando Benites, Pius Von Däniken, and Mark Cieliebak. 2018. spMMMP at GermEval 2018 Shared Task: Classification of Offensive Content in Tweets using Convolutional Neural Networks and Gated Recurrent Units. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).

Zeerak Waseem, Wendy Hui Kyong Chung, Dirk Hovy, and Joel Tetreault, editors. 2017. Proceedings of the First Workshop on Abusive Language Online. Association for Computational Linguistics.

Michael Wiegand, Melanie Siegel, and Josef Ruppenhofer. 2018. Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. In Proceedings of GermEval 2018, 14th Conference on Natural Language Processing (KONVENS 2018).