Annotating Hate Speech: Three Schemes at Comparison

Fabio Poletto, Valerio Basile, Cristina Bosco, Viviana Patti
Dipartimento di Informatica, University of Turin
{poletto,basile,bosco,patti}@di.unito.it

Marco Stranisci
Acmos
marco.stranisci@acmos.net

Abstract

Annotated data are essential to train and benchmark NLP systems. The reliability of the annotation, i.e. low inter-annotator disagreement, is a key factor, especially when dealing with highly subjective phenomena occurring in human language. Hate speech (HS), in particular, is intrinsically nuanced and hard to fit in any fixed scale, therefore crisp classification schemes for its annotation often show their limits. We test three annotation schemes on a corpus of HS, in order to produce more reliable data. While rating scales and best-worst scaling are more expensive annotation strategies, our experimental results suggest that they are worth implementing in a HS detection perspective.¹

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Automated detection of hateful language and similar phenomena — such as offensive or abusive language, slurs, threats and so on — is being investigated by a fast-growing number of researchers. Modern approaches to Hate Speech (HS) detection are based on supervised classification, and therefore require large amounts of manually annotated data. Reaching acceptable levels of inter-annotator agreement on phenomena as subjective as HS is notoriously difficult. Poletto et al. (2017), for instance, report a "very low agreement" in the HS annotation of a corpus of Italian tweets, and similar annotation efforts showed similar results (Del Vigna et al., 2017; Waseem, 2016; Gitari et al., 2015; Ross et al., 2017). In an attempt to tackle the agreement issue, annotation schemes have been proposed based on numeric scales, rather than strict judgments (Kiritchenko and Mohammad, 2017). Ranking, rather than rating, has also proved to be a viable strategy to produce high-quality annotation of subjective aspects in natural language (Yannakakis et al., 2018). Our hypothesis is that binary schemes may oversimplify the target phenomenon, leaving it uniquely to the judges' subjectivity to sort less prototypical cases and likely causing higher disagreement. Rating or ranking schemes, on the other hand, are typically more complex to implement, but they could provide higher quality annotation.

A framework is first tested by annotators: inter-annotator agreement, number of missed test questions and overall opinion are some common standards against which the quality of the task can be tested. A certain degree of subjectivity and bias is intrinsic to the task, but an effective scheme should be able to channel individual interpretations into unambiguous categories.

A second reliability test involves the use of annotated data to train a classifier that assigns the same labels used by humans to previously unseen data. This process, jointly with a thorough error analysis, may help spot bias in the annotation or flaws in the dataset construction.

We aim to explore whether and how different frameworks differ in modeling HS, what problems they pose to human annotators and how suitable they are for training. In particular, we apply a binary annotation scheme, as well as a rating scale scheme and a best-worst scale scheme, to a corpus of HS. We set up experiments in order to assess whether such schemes help achieve a lower disagreement and, ultimately, a higher quality dataset for benchmarking and for supervised learning.
The experiment we set up involves two stages: after having the same dataset annotated with three different schemes on the crowdsourcing platform Figure Eight², we first compare their agreement rates and label distributions, then we map all schemes to a "yes/no" structure to perform a cross-validation test with a SVM classifier. We launched three separate tasks on the platform: Task 1 with a binary scheme, Task 2 with an asymmetric rating scale, and Task 3 with a best-worst scale. For each task, a subset has been previously annotated by experts within the research team, to be used as a gold standard against which to evaluate contributors' trustworthiness on Figure Eight.

² https://www.figure-eight.com/

2 Related Work

Several frameworks have been proposed and tested so far for HS annotation, ranging from straightforward binary schemes to complex, multi-layered ones including a variety of linguistic features. Dichotomous schemes are used, for example, by Alfina et al. (2017), Ross et al. (2017) and Gao et al. (2017) for HS, by Nithyanand et al. (2017) for offensiveness and by Hammer (2016) for violent threats. Slightly more nuanced frameworks try to highlight particular features. Davidson et al. (2017) distinguish between hateful, offensive but not hateful and not offensive, as do Mathur et al. (2018), who use the label abusive for the second type; similarly, Mubarak et al. (2017) use the labels obscene, offensive and clean. Waseem (2016) differentiates hate according to its target, using the labels sexism, racism, both and none. Nobata et al. (2016) use a two-layer scheme, where a content can first be labeled either as abusive or clean and, if abusive, as hate speech, derogatory or profanity. Del Vigna et al. (2017) use a simple scale that distinguishes between no hate, weak hate and strong hate. Where to draw the line between weak and strong hate is still highly subjective but, if nothing else, the scheme avoids feebly hateful comments being classified as not hateful (thus potentially neutral or positive) just because, strictly speaking, they cannot be called HS. Other authors, such as Olteanu et al. (2018) and Fišer et al. (2017), use heavier and more elaborate schemes. Olteanu et al. (2018), in particular, experimented with a rating-based annotation scheme, reporting low agreement. Sanguinetti et al. (2018) also use a complex scheme in which HS is annotated both for its presence (binary value) and for its intensity (1–4 rating scale). Such frameworks potentially provide valuable insights into the investigated issue, but as a downside they make the whole annotation process very time-consuming. More recently, a ranking scheme has been applied to the annotation of a small dataset of German hate speech messages (Wojatzki et al., 2018).

3 Annotation Schemes

In this section, we introduce the three annotation schemes tested in our study.

Binary. Binary annotation implies assigning a binary label to each instance. Besides HS, binary classification is common in a variety of NLP tasks and beyond. Its simplicity allows for quick manual annotation and easy computational processing of the data. As a downside, such a dichotomous choice presupposes that it is always possible to clearly and objectively determine which answer is true. This may be acceptable in some tasks, but it is not always the case with human language, especially for more subjective and nuanced phenomena.

Rating Scales. Rating Scales (RS) are widely used for annotation and evaluation in a variety of tasks. The Likert scale is the best known (Likert, 1932): values are arranged at regular intervals on a symmetric scale, from the most to the least typical of a given concept. It is suitable for measuring subjective opinion or perception about a given topic with a variable number of options. Compared to a binary scheme, scales are better at managing subjectivity and intermediate nuances of a concept. On the other hand, as pointed out by Kiritchenko and Mohammad (2017), they present some flaws: high inter-annotator disagreement (the more fine-grained the scale, the higher the chance of disagreement), individual inconsistencies (judges may express different values for similar items, or the same value for different items), scale region bias (judges may tend to prefer values in one part of the scale, often the middle) and fixed granularity (which may not represent the actual nuances of a concept).

Best-Worst Scaling. The Best-Worst Scaling model (BWS) is a comparative annotation process developed by Louviere and Woodworth (1991). In a nutshell, a BWS model presents annotators with n items at a time (where n > 1 and normally n = 4) and asks them to pick the best and worst ones with regard to a given property. The model has been used in particular by Kiritchenko and Mohammad (2017) and Mohammad and Kiritchenko (2018), who proved it to be particularly effective for subjective tasks such as sentiment intensity annotation, which are intrinsically nuanced and hardly fit in any fixed scale.
4 Dataset and task description

For our experiment, we employ a dataset of 4,000 Italian tweets, extracted from a larger corpus collected within the project Contro l'odio³. For the purpose of this research, we filtered all the tweets written between November 1st and December 31st with a list of keywords. This list, reported in Table 1, is the same proposed in Poletto et al. (2017) for collecting a dataset focused on three typical targets of discrimination — namely Immigrants, Muslims and Roma.

³ https://controlodio.it/

Table 1: List of keywords used to filter our dataset.
  ethnic group: immigrat*, immigrazione, migrant*, profug*, stranier*
  religion:     terrorismo, terrorist*, islam, mus[s]ulman*, corano
  Roma:         rom, nomad*
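As an illustration of this filtering step, the minimal sketch below applies the Table 1 keywords as wildcard patterns over the stated date window. The helper names, the data layout and the year are assumptions made for the example; they are not the actual Contro l'odio collection code.

```python
import re
from datetime import date

# Keywords from Table 1; '*' is read as a wildcard for any suffix
# (an assumption about how the patterns are meant to be applied).
KEYWORDS = [
    "immigrat*", "immigrazione", "migrant*", "profug*", "stranier*",
    "terrorismo", "terrorist*", "islam", "mus[s]ulman*", "corano",
    "rom", "nomad*",
]

def to_regex(pattern: str) -> re.Pattern:
    # "mus[s]ulman*" already contains a character class, so only '*' is expanded.
    return re.compile(r"\b" + pattern.replace("*", r"\w*") + r"\b", re.IGNORECASE)

PATTERNS = [to_regex(k) for k in KEYWORDS]

def keep(text: str, created: date) -> bool:
    """Keep a tweet if it falls in the collection window and matches a keyword.
    The year is not stated in the paper and is assumed here for illustration."""
    in_window = date(2018, 11, 1) <= created <= date(2018, 12, 31)
    return in_window and any(p.search(text) for p in PATTERNS)

print(keep("Basta migranti nelle nostre città!", date(2018, 11, 15)))  # True
```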
The concept of HS underlying all three annotation tasks includes any expression based on intolerance and promoting or justifying hatred towards a given target. For each task we explicitly asked the annotators to consider only HS directed towards one of the three above-mentioned targets, ignoring other targets if present. Each message is annotated by at least three contributors. Figure Eight also reports a measure of agreement computed as a Fleiss' κ weighted by a score indicating the trustworthiness of each contributor on the platform. We note, however, that the agreement measured on the three tasks is not directly comparable, since they follow different annotation schemes.

4.1 Task 1: Binary Scheme

The first scheme is very straightforward and simply asks judges to tell whether a tweet contains HS or not. Each line will thus receive the label HS yes or HS no. The definition of HS is drawn from Poletto et al. (2017). In order to be labeled as hateful, a tweet must:

• address one of the above-mentioned targets;
• either incite, promote or justify hatred, violence or intolerance towards the target, or demean, dehumanise or threaten it.

We also provided a list of expressions that are not to be considered HS although they may seem so: for example, these include slurs and offensive expressions, slanders, and blasphemy. An example of annotation for this task is presented in Table 2.

Table 2: Annotation examples for Task 1 (gold labels).
  yes: "Allora dobbiamo stringere la corda: pena capitale per tutti i musulmani in Europa immediatamente!" (Then we have to adopt stricter measures: death penalty for all Muslims in Europe now!)
  no:  "I migranti hanno sempre il posto e non pagano." (Migrants always get a seat and never pay.)

4.2 Task 2: Unbalanced Rating Scale

This task requires judges to assign a label to each tweet on a 5-degree asymmetric scale (from +1 to -3) that encompasses the content and tone of the message as well as the writer's intention. Again, the target of the message must be one of the three mentioned above. The scheme structure is reported in Table 3, while Table 4 shows an example for each label.

Table 3: Annotation scheme for Task 2: evaluate the stance or opinion expressed in each tweet.
  +1: positive
   0: neutral, ambiguous or unclear
  -1: negative and polite, dialogue-oriented attitude
  -2: negative and insulting/abusive, aggressive attitude
  -3: strongly negative with overt incitement to hatred, violence or discrimination, attitude oriented at attacking or demeaning the target

This scale was designed with a twofold aim: to avoid a binary choice that could leave too many doubtful cases, and to split up negative contents into more precise categories, in order to distinguish different degrees of "hatefulness".

We tried not to influence annotators by matching the grades of our scale in Task 2 to widespread concepts such as stereotypes, abusive language or hateful language, which people might tend to apply by intuition rather than by following strict rules. Instead, we provided definitions as neutral and objective as possible, in order to differentiate this task from the others and avoid biases. An asymmetric scale, although unusual, fits our purpose of an in-depth investigation of negative language very well. A possible downside of this scheme is that grades in the scale are supposed to be evenly spread, while the real phenomena they represent may not be so.

Table 4: Examples of annotation for Task 2 (gold labels).
  +1: "Gorino Alla fine questi profughi l'hanno scampata bella. Vi immaginate avere tali soggetti come vicini di casa?" (These asylum-seekers had a narrow escape. Can you imagine having such folks (TN: racist Gorino inhabitants) as neighbours?)
   0: "Bellissimo post sulle cause e conseguenze dell'immigrazione, da leggere!" (Great post on causes and consequences of immigration, recommended!)
  -1: "I migranti hanno sempre il posto e non pagano." (Migrants always get a seat and never pay.)
  -2: "Con tutti i soldi elargiti ai rom, vedere il degrado nel quale si crogiolano, non meritano di rimanere in un paese civile!" (Seeing the decay Roma people wallow in, despite all the money lavished on them, they don't deserve to stay in a civilized country!)
  -3: "Allora dobbiamo stringere la corda: pena capitale per tutti i musulmani in Europa immediatamente!" (Then we have to adopt stricter measures: death penalty for all Muslims in Europe now!)

4.3 Task 3: Best-Worst Scaling

The structure of this task differs from the previous two. We created a set of tuples made up of four tweets (4-tuples), grouped so that each tweet is repeated four times in the dataset, combined with three different tweets each time. Then we provided contributors with a set of 4-tuples: for each 4-tuple they were asked to point out the most hateful and the least hateful of the four. Judges have thus seen a given tweet four times, but have had to compare it with different tweets every time⁴. This method avoids assigning a discrete value to each tweet and gathers information on their "hatefulness" by comparing them to other tweets. An example of annotation, with the least and most hateful tweets marked in a set of four, is provided in Table 5.

⁴ The details of the tuple generation process are explained in this blog post: http://valeriobasile.github.io/Best-worst-scaling-and-the-clock-of-Gauss/

Table 5: Examples of annotation for Task 3: a 4-tuple with marks for the least hateful and the most hateful tweets.
  least:      "Roma, ondata di controlli anti-borseggio in centro: arrestati 8 nomadi, 6 sono minorenni." (Rome, anti-pickpocketing patrolling in the centre: 8 nomads arrested, 6 of them are minors.)
  (unmarked): "Tutti i muslims presenti in Europa rappresentano un pericolo mortale latente. L'islam è incompatibile con i valori occidentali." (All Muslims in Europe are a dormant deadly danger. Islam is incompatible with Western values.)
  (unmarked): "Trieste, profughi cacciano disabile dal bus: arrivano le pattuglie di Forza Nuova sui mezzi pubblici." (Trieste, asylum-seekers throw disabled person off the bus: Forza Nuova (TN: far-right, nationalist fringe party) to patrol public transport.)
  most:       "Unica soluzione è cacciare TUTTI i musulmani NON integrati fino alla 3a gen che si ammazzassero nei loro paesi come fanno da secoli MALATI!" (Only way is to oust EVERY NON-integrated Muslim down to 3rd generation let them kill each other in their own countries as they've done for centuries INSANE!)
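The 4-tuple construction described in this section can be sketched roughly as follows. This is a simplified stand-in assuming a plain reshuffling strategy: it guarantees that every tweet appears in exactly four tuples, but the authors' actual procedure (see the blog post in footnote 4) also controls which tweets end up together, which this sketch does not.

```python
import random

def make_4tuples(items, seed=0):
    """Arrange items into 4-tuples over four rounds, so that each item
    appears in exactly four tuples (one per round). Companions may repeat
    across rounds in this simplified version."""
    assert len(items) % 4 == 0, "pad or trim the dataset to a multiple of 4"
    rng = random.Random(seed)
    tuples = []
    for _ in range(4):                      # four rounds -> each item appears 4 times
        order = items[:]
        rng.shuffle(order)
        tuples.extend(tuple(order[i:i + 4]) for i in range(0, len(order), 4))
    return tuples

tweets = [f"tweet_{i}" for i in range(4000)]   # hypothetical ids
tuples = make_4tuples(tweets)
print(len(tuples))   # 4000 tuples, each tweet occurring in exactly 4 of them
```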
5 Task annotation results

In Task 1, the distribution of the labels yes and no, referring to the presence of HS, conforms to that of other similar annotated HS datasets, such as Burnap and Williams (2015) in English and Sanguinetti et al. (2018) in Italian. After applying a majority criterion to non-unanimous cases, tweets labeled as HS are around 16% of the dataset (see Figure 1). Figure Eight measures the agreement in terms of confidence, with a κ-like function weighted by the trust of each contributor, i.e., a measure of their reliability across their history on the platform. On Task 1, about 70% of the tweets were associated with a confidence score of 1, while the remaining 30% follow a low-variance normal distribution around .66.

As for Task 2, the label distribution tells a different story. When measuring inter-annotator agreement, the mean value over all annotations has been computed instead of using the majority criterion. Therefore, results are grouped in intervals rather than in discrete values, but we can still easily map these intervals to the original labels. As shown in Figure 1, tweets labeled as having a neutral or positive content (in green) are only around 27%, less than one third of the tweets labeled as non-hateful in Task 1. Exactly half of the whole dataset is labeled as negative but oriented to dialogue (in yellow), while 20% is labeled as negative and somewhat abusive (orange) and only less than 3% is labeled as an open incitement to hatred, violence or discrimination (red). With respect to the inter-annotator agreement, only 25% of the instances are associated with the maximum confidence score of 1, while the distribution of confidence presents a high peak around .66 and a minor peak around 0.5. Note that this confidence distribution is not directly comparable to Task 1, since the schemes are different.

In Task 3, similarly to Task 2, the result of the annotation is a real value. More precisely, we compute for each tweet the percentage of times it has been indicated as best (most indicative of HS in its tuple) and worst (least indicative of HS in its tuple), and compute the difference between these two values, resulting in a value between −1 (non-hateful end of the spectrum) and 1 (hateful end of the spectrum). The bottom chart in Figure 1 shows that the distribution of values given by the BWS annotation has a higher variance than the scalar case, and is skewed slightly towards the hateful side. The confidence score for Task 3 follows a similar pattern to Task 2, while being slightly higher on average, with about 40% of the tweets having confidence 1.

[Figure 1: Label distribution for Tasks 1, 2 and 3 (red portion of Task 2 bar corresponds to 2.63%).]
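In other words, each tweet t receives the score best%(t) − worst%(t). A minimal sketch of this counting step is given below, assuming the judgements are available as (most hateful, least hateful) pairs; the data structures and names are illustrative, not the authors' code.

```python
from collections import Counter

def bws_scores(judgements, appearances):
    """judgements: one (most_hateful_id, least_hateful_id) pair per annotated
    4-tuple; appearances: dict mapping tweet id -> number of annotated tuples
    in which it appeared. Returns a score in [-1, 1] for each tweet."""
    best = Counter(most for most, _ in judgements)
    worst = Counter(least for _, least in judgements)
    return {t: (best[t] - worst[t]) / n for t, n in appearances.items()}

# Toy example: tweet "a" is picked as most hateful in 3 of its 4 tuples and
# as least hateful in 1, so its score is (3 - 1) / 4 = 0.5.
judgements = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "a")]
print(bws_scores(judgements, {"a": 4, "b": 4, "c": 4, "d": 4})["a"])  # 0.5
```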
A last consideration concerns the cost of annotation tasks in terms of time and resources. We measured the cost of our three tasks: T1 and T2 had almost the same cost in terms of contributor remuneration, but T2 required about twice the time to be completed; T3 turned out to be the most expensive in terms of both money and time. With nearly equal results, one strategy could be preferred over the others because it is quicker or cheaper: therefore, when designing a research strategy, we deem it important not to forget this factor.

6 Classification tests with different schemes at comparison

Having described the process and results for each task, we will now observe how they affect the quality of the resulting datasets. Our running hypothesis is that a better quality dataset provides better training material for a supervised classifier, thus leading to higher predictive capabilities.

Assuming that the final goal is to develop an effective system for recognizing HS, we opted to test the three schemes against the same binary classifier. In order to do so, it was necessary to make our schemes comparable without losing the information each of them gives: we mapped the Task 2 and Task 3 schemes down to a binary structure, directly comparable to the Task 1 scheme. For Task 2, this was done by drawing an arbitrary line that splits the scale in two. We tested different thresholds, mapping the judgements above each threshold to the label HS no from Task 1 and all judgements below the threshold to the label HS yes. We experimented with three values: -0.5, -1.0 and -1.5. For Task 3, similarly, we tried setting different thresholds along the hateful end of the answer distribution spectrum (see Section 5), respectively at 0, 0.25, 0.5 and 0.75. We mapped all judgements below each threshold to the label HS no from Task 1 and all judgements above the threshold to the label HS yes.
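A minimal sketch of this binarisation is shown below, assuming a tweet is represented by its Task 2 average rating or its Task 3 BWS score. The default thresholds are just two of the values tested above, and whether the boundary value itself counts as hateful is an assumption.

```python
def rs_to_binary(avg_rating: float, threshold: float = -0.5) -> str:
    """Map a Task 2 average rating (scale +1 .. -3) to the Task 1 labels:
    ratings below the threshold become 'HS yes', the rest 'HS no'."""
    return "HS yes" if avg_rating < threshold else "HS no"

def bws_to_binary(bws_score: float, threshold: float = 0.25) -> str:
    """Map a Task 3 BWS score (-1 .. 1) to the Task 1 labels:
    scores above the threshold become 'HS yes', the rest 'HS no'."""
    return "HS yes" if bws_score > threshold else "HS no"

print(rs_to_binary(-1.3, threshold=-1.0))  # HS yes
print(bws_to_binary(0.1, threshold=0.25))  # HS no
```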
When considering as HS yes all tweets whose average value for Task 2 is below -0.5, the number of hateful tweets increases (25.35%); when the threshold is set at -1.0, it slightly decreases (10.22%); but as soon as the threshold is lowered further to -1.5, the number drops dramatically. A possible explanation for this is that a binary scheme is not adequate to depict the complexity of HS and forces judges to squeeze contents into a narrow black-or-white frame. Conversely, the thresholds for Task 3 return different (albeit partial) results. The threshold 0.5 is the closest to the Task 1 partition, with a similar percentage of HS (16.90%), while lower thresholds allow for much higher percentages of tweets classified as hateful: setting the value at 0, for example, results in 40.52% of tweets classified as HS.

To better understand the impact of the different annotation strategies on the quality of the resulting datasets, we performed a cross-validation experiment. We implemented a SVM classifier using n-grams (1 ≤ N ≤ 4) as features and measured its precision, recall and F1 score in a stratified 10-fold fashion. Results are shown in Table 6.

Table 6: Results of 10-fold cross-validation on datasets obtained with different annotation strategies.

  Dataset  Threshold  support (0)  support (1)  P (0)  R (0)  F1 (0)  P (1)  R (1)  F1 (1)  F1 (macro)
  binary      -          3365          635      .878   .923   .899    .450   .316   .354     .627
  RS        -0.5         2976         1014      .785   .841   .812    .408   .322   .359     .585
  RS        -1.0         3581          409      .912   .966   .938    .391   .186   .250     .594
  RS        -1.5         3845          145      .964   .991   .978    .200   .028   .047     .512
  BWS        0.0         2206         1782      .677   .703   .690    .614   .585   .599     .644
  BWS        0.25        2968         1020      .806   .860   .832    .492   .398   .439     .635
  BWS        0.5         3480          508      .893   .949   .920    .390   .222   .281     .601
  BWS        0.75        3835          153      .963   .992   .977    .147   .039   .060     .518
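A minimal sketch of a comparable cross-validation pipeline with scikit-learn is given below. The paper does not specify the n-gram type (word or character), the SVM variant or its hyperparameters, so those choices are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(texts, labels, n_splits=10, seed=0):
    """Stratified 10-fold cross-validation of an SVM over 1-4 word n-grams.
    texts: list of tweet strings; labels: list of 0/1 (HS no / HS yes)."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 4)),   # n-grams with 1 <= N <= 4
        LinearSVC(),                            # linear SVM (an assumption)
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(model, texts, np.array(labels), cv=cv,
                            scoring=["precision_macro", "recall_macro", "f1_macro"])
    return {k: v.mean() for k, v in scores.items() if k.startswith("test_")}
```

Per-class figures such as those in Table 6 could be obtained in the same setting with cross_val_predict followed by classification_report.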
From the results of this cross-validation experiment, we draw some observations. When mapping the non-binary classification to a binary one, choosing an appropriate threshold has a key impact on the classifier performance. For both RS and BWS, the strictness of the threshold (i.e., how close it is to the hateful end of the spectrum) is directly proportional to the performance on the negative class (0) and inversely proportional to the performance on the positive class (1). This may be explained by the different amounts of training data available: as we set a stricter threshold, we have fewer examples for the positive class, resulting in a poorer performance, but more examples for the negative class, resulting in a more accurate classification. Yet, looking at the rightmost column, we observe that permissive thresholds return a higher overall F1-score for both RS and BWS.

Regardless of the threshold, RS appears to produce the worst performance, suggesting that reducing continuous values to crisp labels is not the best way to model the phenomenon, however accurate and pondered the labels are. Conversely, compared to the binary annotation, BWS returns higher F1-scores with permissive thresholds (0.0 and 0.25), thus resulting in the best method to obtain a stable dataset. Furthermore, performances with BWS are consistently higher for the positive class (HS): considering that the task is typically framed as a detection task (as opposed to a classification task), this result confirms the potential of ranking annotation (as opposed to rating) to generate better training material for HS detection.

7 Conclusion and Future Work

We performed annotation tasks with three annotation schemes on a HS corpus, and computed the inter-annotator agreement rate and label distribution for each task. We also performed cross-validation tests with the three annotated datasets, to verify the impact of the annotation schemes on the quality of the produced data.

We observed that the RS we designed seems easier to use for contributors, but its results are more complex to interpret, and it returns the worst overall performance in the cross-validation test. It is especially difficult to compare it with a binary scheme, since merging labels together and mapping them down to a dichotomous choice is in contrast with the nature of the scheme itself. Furthermore, such a scale necessarily oversimplifies a complex natural phenomenon, because it uses equidistant points to represent shades of meaning that may not be as evenly arranged.

Conversely, our experiment with BWS applied to HS annotation gave encouraging results. Unlike Wojatzki et al. (2018), we find that a ranking scheme is slightly better than a rating scheme, be it binary or scalar, in terms of prediction performance. As future work, we plan to investigate the extent to which such variations depend on circumstantial factors, such as how the annotation process is designed and carried out, as opposed to intrinsic properties of the annotation procedure.

The fact that similar distributions are observed when the dividing line for RS and BWS is drawn in a permissive fashion suggests that annotators tend to overuse the label HS yes when they work with a binary scheme, probably because they have no milder choice. This confirms that, whatever framework is used, the issue of hateful language requires a nuanced approach that goes beyond binary classification, with the awareness that an increase in complexity and resources will likely pay off in terms of more accurate and stable performances.

Acknowledgments

The work of V. Basile, C. Bosco and V. Patti is partially funded by Progetto di Ateneo/CSP 2016 Immigrants, Hate and Prejudice in Social Media (S1618 L2 BOSC 01) and by the Italian Ministry of Labor (Contro l'odio: tecnologie informatiche, percorsi formativi e storytelling partecipativo per combattere l'intolleranza, avviso n.1/2017 per il finanziamento di iniziative e progetti di rilevanza nazionale ai sensi dell'art. 72 del decreto legislativo 3 luglio 2017, n. 117 - anno 2017). The work of F. Poletto is funded by Fondazione Giovanni Goria and Fondazione CRT (Bando Talenti della Società Civile 2018).

References

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata. 2017. Hate speech detection in the Indonesian language: A dataset and preliminary study. In 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 233–238. IEEE.

Pete Burnap and Matthew L. Williams. 2015. Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making. Policy & Internet, 7(2):223–242.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media, pages 368–371.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate Me, Hate Me Not: Hate Speech Detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), pages 86–95.

Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2017. Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene. In Proceedings of the First Workshop on Abusive Language Online, pages 46–51.

Lei Gao, Alexis Kuppersmith, and Ruihong Huang. 2017. Recognizing explicit and implicit hate speech using a weakly supervised two-path bootstrapping approach. arXiv preprint arXiv:1710.07394.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4):215–230.

Hugo Lewi Hammer. 2016. Automatic detection of hateful comments in online discussion. In International Conference on Industrial Networks and Intelligent Systems, pages 164–173. Springer.
Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470. ACL.

Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140).

Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.

Puneet Mathur, Rajiv Shah, Ramit Sawhney, and Debanjan Mahata. 2018. Detecting offensive tweets in Hindi-English code-switched language. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 18–26.

Saif Mohammad and Svetlana Kiritchenko. 2018. Understanding emotions: A dataset of tweets to study interactions between affect categories. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 198–209.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Rishab Nithyanand, Brian Schaffner, and Phillipa Gill. 2017. Measuring offensive speech in online political discourse. In 7th USENIX Workshop on Free and Open Communications on the Internet (FOCI 17).

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

Alexandra Olteanu, Carlos Castillo, Jeremy Boy, and Kush R. Varshney. 2018. The effect of extremist violence on hateful speech online. In Twelfth International AAAI Conference on Web and Social Media, pages 221–230.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.

Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2017. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. arXiv preprint arXiv:1701.08118.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), pages 2798–2805.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142.

Michael Wojatzki, Tobias Horsmann, Darina Gold, and Torsten Zesch. 2018. Do Women Perceive Hate Differently: Examining the Relationship Between Hate Speech, Gender, and Agreement Judgments. In Proceedings of the Conference on Natural Language Processing (KONVENS), pages 110–120, Vienna, Austria.
Georgios Yannakakis, Roddy Cowie, and Carlos Busso. 2018. The ordinal nature of emotions: An emerging approach. IEEE Transactions on Affective Computing, pages 1–20. Early Access.