Annotating Hate Speech: Three Schemes at Comparison

Fabio Poletto, Valerio Basile, Cristina Bosco, Viviana Patti
Dipartimento di Informatica, University of Turin
{poletto,basile,bosco,patti}@di.unito.it

Marco Stranisci
Acmos
marco.stranisci@acmos.net


Abstract

Annotated data are essential to train and benchmark NLP systems. The reliability of the annotation, i.e., low inter-annotator disagreement, is a key factor, especially when dealing with the highly subjective phenomena occurring in human language. Hate speech (HS), in particular, is intrinsically nuanced and hard to fit into any fixed scale, so crisp classification schemes for its annotation often show their limits. We test three annotation schemes on a corpus of HS in order to produce more reliable data. While rating scales and best-worst scaling are more expensive annotation strategies, our experimental results suggest that they are worth implementing from a HS detection perspective.¹

¹ Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Automated detection of hateful language and similar phenomena — such as offensive or abusive language, slurs, threats and so on — is being investigated by a fast-growing number of researchers. Modern approaches to Hate Speech (HS) detection are based on supervised classification and therefore require large amounts of manually annotated data. Reaching acceptable levels of inter-annotator agreement on phenomena as subjective as HS is notoriously difficult. Poletto et al. (2017), for instance, report a "very low agreement" in the HS annotation of a corpus of Italian tweets, and similar annotation efforts showed similar results (Del Vigna et al., 2017; Waseem, 2016; Gitari et al., 2015; Ross et al., 2017). In an attempt to tackle the agreement issue, annotation schemes have been proposed based on numeric scales, rather than strict judgments (Kiritchenko and Mohammad, 2017). Ranking, rather than rating, has also proved to be a viable strategy to produce high-quality annotation of subjective aspects of natural language (Yannakakis et al., 2018). Our hypothesis is that binary schemes may oversimplify the target phenomenon, leaving it entirely to the judges' subjectivity to sort less prototypical cases and likely causing higher disagreement. Rating or ranking schemes, on the other hand, are typically more complex to implement, but they could provide higher quality annotation.

A framework is first tested by its annotators: inter-annotator agreement, number of missed test questions and overall opinion are some common standards against which the quality of the task can be measured. A certain degree of subjectivity and bias is intrinsic to the task, but an effective scheme should be able to channel individual interpretations into unambiguous categories.

A second reliability test involves using the annotated data to train a classifier that assigns the same labels used by humans to previously unseen data. This process, jointly with a thorough error analysis, may help spot bias in the annotation or flaws in the dataset construction.

We aim to explore whether and how different frameworks differ in modeling HS, what problems they pose to human annotators, and how suitable they are for training. In particular, we apply a binary annotation scheme, as well as a rating scale scheme and a best-worst scaling scheme, to a corpus of HS. We set up experiments to assess whether such schemes help achieve lower disagreement and, ultimately, a higher quality dataset for benchmarking and for supervised learning.

The experiment we set up involves two stages: after having the same dataset annotated with three different schemes on the crowdsourcing platform Figure Eight², we first compare their agreement rates and label distributions, then we map all schemes to a "yes/no" structure to perform a cross-validation test with an SVM classifier. We launched three separate tasks on the platform: Task 1 with a binary scheme, Task 2 with an asymmetric rating scale, and Task 3 with a best-worst scale. For each task, a subset was previously annotated by experts within the research team, to be used as a gold standard against which to evaluate contributors' trustworthiness on Figure Eight.

² https://www.figure-eight.com/
2 Related Work

Several frameworks have been proposed and tested so far for HS annotation, ranging from straightforward binary schemes to complex, multi-layered ones, and including a variety of linguistic features. Dichotomous schemes are used, for example, by Alfina et al. (2017), Ross et al. (2017) and Gao et al. (2017) for HS, by Nithyanand et al. (2017) for offensiveness, and by Hammer (2016) for violent threats. Slightly more nuanced frameworks try to highlight particular features. Davidson et al. (2017) distinguish between hateful, offensive but not hateful and not offensive, as do Mathur et al. (2018), who use the label abusive for the second type instead; similarly, Mubarak et al. (2017) use the labels obscene, offensive and clean. Waseem (2016) differentiates hate according to its target, using the labels sexism, racism, both and none. Nobata et al. (2016) use a two-layer scheme, where a content is first labeled either as abusive or clean and, if abusive, further as hate speech, derogatory or profanity. Del Vigna et al. (2017) use a simple scale that distinguishes between no hate, weak hate and strong hate.

Where to draw the line between weak and strong hate is still highly subjective but, if nothing else, the scheme prevents feebly hateful comments from being classified as not hateful (thus potentially neutral or positive) just because, strictly speaking, they cannot be called HS. Other authors, such as Olteanu et al. (2018) and Fišer et al. (2017), use heavier and more elaborate schemes. Olteanu et al. (2018), in particular, experimented with a rating-based annotation scheme, reporting low agreement. Sanguinetti et al. (2018) also use a complex scheme in which HS is annotated both for its presence (binary value) and for its intensity (1–4 rating scale). Such frameworks potentially provide valuable insights into the investigated issue, but as a downside they make the whole annotation process very time-consuming. More recently, a ranking scheme has been applied to the annotation of a small dataset of German hate speech messages (Wojatzki et al., 2018).

3 Annotation Schemes

In this section, we introduce the three annotation schemes tested in our study.

Binary. Binary annotation implies assigning a binary label to each instance. Besides HS, binary classification is common in a variety of NLP tasks and beyond. Its simplicity allows quick manual annotation and easy computational processing of the data. As a downside, such a dichotomous choice presupposes that it is always possible to clearly and objectively determine which answer is true. This may be acceptable in some tasks, but it is not always the case with human language, especially for more subjective and nuanced phenomena.

Rating Scales. Rating Scales (RS) are widely used for annotation and evaluation in a variety of tasks. The Likert scale is the best known (Likert, 1932): values are arranged at regular intervals on a symmetric scale, from the most to the least typical of a given concept. It is suitable for measuring subjective opinion or perception about a given topic with a variable number of options. Compared to a binary scheme, scales are better at managing subjectivity and intermediate nuances of a concept. On the other hand, as pointed out by Kiritchenko and Mohammad (2017), they present some flaws: high inter-annotator disagreement (the more fine-grained the scale, the higher the chance of disagreement), individual inconsistencies (judges may express different values for similar items, or the same value for different items), scale region bias (judges may tend to prefer values in one part of the scale, often the middle) and fixed granularity (which may not represent the actual nuances of a concept).

Best-Worst Scaling. The Best-Worst Scaling model (BWS) is a comparative annotation process developed by Louviere and Woodworth (1991). In a nutshell, a BWS model presents annotators with n items at a time (where n > 1 and normally n = 4) and asks them to pick the best and worst ones with regard to a given property. The model has been used in particular by Kiritchenko and Mohammad (2017) and Mohammad and Kiritchenko (2018), who proved it to be particularly effective for subjective tasks such as sentiment intensity annotation, which are intrinsically nuanced and hardly fit any fixed scale.
4 Dataset and task description

For our experiment, we employ a dataset of 4,000 Italian tweets, extracted from a larger corpus collected within the project Contro l'odio³. For the purpose of this research, we filtered all the tweets written between November 1st and December 31st with a list of keywords. This list, reported in Table 1, is the same proposed in Poletto et al. (2017) for collecting a dataset focused on three typical targets of discrimination — namely Immigrants, Muslims and Roma.

³ https://controlodio.it/

  ethnic group              religion            Roma
  immigrat*, immigrazione   terrorismo          rom
  migrant*, profug*         terrorist*, islam   nomad*
  stranier*                 mus[s]ulman*
                            corano

Table 1: List of keywords used to filter our dataset.

The concept of HS underlying all three annotation tasks includes any expression based on intolerance and promoting or justifying hatred towards a given target. For each task we explicitly asked the annotators to consider only HS directed towards one of the three above-mentioned targets, ignoring other targets if present. Each message is annotated by at least three contributors. Figure Eight also reports a measure of agreement computed as a Fleiss' κ weighted by a score indicating the trustworthiness of each contributor on the platform. We note, however, that the agreement measured on the three tasks is not directly comparable, since they follow different annotation schemes.
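Figure Eight does not disclose the exact formula behind its trust-weighted confidence score, so the sketch below only illustrates the underlying idea: the confidence of an item's winning label is the trust mass of the contributors who chose it over the total trust. The function name, the aggregation rule and the example trust values are our own simplification, not the platform's implementation.

```python
from collections import defaultdict

def weighted_confidence(judgments):
    """Winning label and its confidence for one item.

    judgments: list of (label, trust) pairs, one per contributor,
    where trust is the contributor's reliability in [0, 1].
    Confidence is the trust mass of the winning label divided by
    the total trust (a simplified stand-in for the platform's score).
    """
    mass = defaultdict(float)
    for label, trust in judgments:
        mass[label] += trust
    winner = max(mass, key=mass.get)
    return winner, mass[winner] / sum(mass.values())

# Three contributors with different trust scores judge one tweet:
print(weighted_confidence([("yes", 0.9), ("yes", 0.8), ("no", 0.7)]))
# -> ('yes', 0.708...)
```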
4.1 Task 1: Binary Scheme

The first scheme is very straightforward and simply asks judges to tell whether a tweet contains HS or not. Each line thus receives the label HS yes or HS no. The definition of HS is drawn from Poletto et al. (2017). In order to be labeled as hateful, a tweet must:

• address one of the above-mentioned targets;

• either incite, promote or justify hatred, violence or intolerance towards the target, or demean, dehumanise or threaten it.

We also provided a list of expressions that are not to be considered HS although they may seem so: for example, these include slurs and offensive expressions, slanders, and blasphemy. An example of annotation for this task is presented in Table 2.

  label   tweet
  yes     Allora dobbiamo stringere la corda: pena capitale per tutti i musulmani in Europa immediatamente!
          (Then we have to adopt stricter measures: death penalty for all Muslims in Europe now!)
  no      I migranti hanno sempre il posto e non pagano.
          (Migrants always get a seat and never pay.)

Table 2: Annotation examples for Task 1 (gold labels).

4.2 Task 2: Unbalanced Rating Scale

This task requires judges to assign a label to each tweet on a 5-degree asymmetric scale (from +1 to -3) that encompasses the content and tone of the message as well as the writer's intention. Again, the target of the message must be one of the three mentioned above. The scheme structure is reported in Table 3, while Table 4 shows an example for each label.

  label   meaning
  +1      positive
   0      neutral, ambiguous or unclear
  -1      negative and polite, dialogue-oriented attitude
  -2      negative and insulting/abusive, aggressive attitude
  -3      strongly negative with overt incitement to hatred, violence or discrimination, attitude oriented at attacking or demeaning the target

Table 3: Annotation scheme for Task 2: evaluate the stance or opinion expressed in each tweet.

This scale was designed with a twofold aim: to avoid a binary choice that could leave too many doubtful cases, and to split up negative contents into more precise categories, in order to distinguish different degrees of "hatefulness".

We tried not to influence annotators by matching the grades of our scale in Task 2 to widespread concepts such as stereotypes, abusive language or hateful language, which people might tend to apply by intuition rather than by following strict rules. Instead, we provided definitions as neutral and objective as possible, in order to differentiate this task from the others and avoid biases. An asymmetric scale, although unusual, fits our purpose of an in-depth investigation of negative language very well. A possible downside of this scheme is that grades in the scale are supposed to be evenly spread, while the real phenomena they represent may not be so.

  label   tweet
  +1      Gorino Alla fine questi profughi l'hanno scampata bella. Vi immaginate avere tali soggetti come vicini di casa?
          (These asylum-seekers had a narrow escape. Can you imagine having such folks (TN: racist Gorino inhabitants) as neighbours?)
   0      Bellissimo post sulle cause e conseguenze dell'immigrazione, da leggere!
          (Great post on causes and consequences of immigration, recommended!)
  -1      I migranti hanno sempre il posto e non pagano.
          (Migrants always get a seat and never pay.)
  -2      Con tutti i soldi elargiti ai rom,vedere il degrado nel quali si crogiolano,non meritano di rimanere in un paese civile!
          (Seeing the decay Roma people wallow in, despite all the money lavished on them, they don't deserve to stay in a civilized country!)
  -3      Allora dobbiamo stringere la corda: pena capitale per tutti i musulmani in Europa immediatamente!
          (Then we have to adopt stricter measures: death penalty for all Muslims in Europe now!)

Table 4: Examples of annotation for Task 2 (gold labels).
4.3 Task 3: Best-Worst Scaling

The structure of this task differs from the previous two. We created a set of tuples made up of four tweets (4-tuples), grouped so that each tweet is repeated four times in the dataset, combined with three different tweets each time. Then we provided contributors with a set of 4-tuples: for each 4-tuple, they were asked to point out the most hateful and the least hateful of the four. Judges have thus seen a given tweet four times, but have had to compare it with different tweets every time⁴. This method avoids assigning a discrete value to each tweet and gathers information on its "hatefulness" by comparing it to other tweets. An example of annotation, with the least and most hateful tweets marked in a set of four, is provided in Table 5.

  label   tweet
  least   Roma, ondata di controlli anti-borseggio in centro: arrestati 8 nomadi, 6 sono minorenni.
          (Rome, anti-pickpocketing patrolling in the centre: 8 nomads arrested, 6 of them are minors.)
          Tutti i muslims presenti in Europa rappresentano un pericolo mortale latente. L'islam è incompatibile con i valori occidentali.
          (All Muslims in Europe are a dormant deadly danger. Islam is incompatible with Western values.)
          Trieste, profughi cacciano disabile dal bus: arrivano le pattuglie di Forza Nuova sui mezzi pubblici.
          (Trieste, asylum-seekers throw disabled person off the bus: Forza Nuova (TN: far-right, nationalist fringe party) to patrol public transport.)
  most    Unica soluzione è cacciare TUTTI i musulmani NON integrati fino alla 3a gen che si ammazzassero nei loro paesi come fanno da secoli MALATI!
          (Only way is to oust EVERY NON-integrated Muslim down to 3rd generation let them kill each other in their own countries as they've done for centuries INSANE!)

Table 5: Example of annotation for Task 3: a 4-tuple with marks for the least hateful and the most hateful tweets.

⁴ The details of the tuple generation process are explained in this blog post: http://valeriobasile.github.io/Best-worst-scaling-and-the-clock-of-Gauss/
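The actual tuple generation procedure is the one described in the blog post referenced in footnote 4. The sketch below is only a simplified stand-in that enforces the two constraints stated above (every tweet appears in exactly four 4-tuples and, as far as possible, never meets the same tweet twice) via rejection sampling; the function name and retry logic are ours, not the paper's.

```python
import random

def make_4tuples(tweets, appearances=4, k=4, retries=1000, seed=42):
    """Group tweets into k-tuples so that each tweet appears in
    exactly `appearances` tuples and, best-effort, never co-occurs
    with the same tweet twice."""
    rng = random.Random(seed)
    pool = list(tweets) * appearances      # each tweet occurs 4 times
    met = {t: set() for t in tweets}       # tweets already seen together
    tuples = []
    while pool:
        for _ in range(retries):
            cand = rng.sample(pool, k)
            if len(set(cand)) == k and all(
                    b not in met[a] for a in cand for b in cand if a != b):
                break
        # after `retries` failures the last candidate is accepted anyway,
        # so the no-repeat constraint is only enforced best-effort
        for t in cand:
            pool.remove(t)
            met[t].update(x for x in cand if x != t)
        tuples.append(tuple(cand))
    return tuples
```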
5 Task annotation results

In Task 1, the distribution of the labels yes and no, referring to the presence of HS, conforms to that of other similar annotated HS datasets, such as Burnap and Williams (2015) for English and Sanguinetti et al. (2018) for Italian. After applying a majority criterion to non-unanimous cases, tweets labeled as HS amount to around 16% of the dataset (see Figure 1). Figure Eight measures agreement in terms of confidence, a κ-like function weighted by the trust of each contributor, i.e., a measure of their reliability across their history on the platform. On Task 1, about 70% of the tweets were associated with a confidence score of 1, while the remaining 30% follow a low-variance normal distribution around .66.

As for Task 2, the label distribution tells a different story. When measuring inter-annotator agreement, the mean value of all annotations was computed instead of applying the majority criterion. Therefore, results are grouped in intervals rather than in discrete values, but we can still easily map these intervals to the original labels. As shown in Figure 1, tweets labeled as having neutral or positive content (in green) are only around 27%, less than one third of the tweets labeled as non-hateful in Task 1. Exactly half of the whole dataset is labeled as negative but oriented to dialogue (in yellow), while 20% is labeled as negative and somewhat abusive (orange) and less than 3% is labeled as an open incitement to hatred, violence or discrimination (red). With respect to inter-annotator agreement, only 25% of the instances are associated with the maximum confidence score of 1, while the distribution of confidence presents a high peak around .66 and a minor peak around 0.5. Note that this confidence distribution is not directly comparable to Task 1, since the schemes are different.
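As a concrete illustration of the Task 2 aggregation, the sketch below computes the mean of the annotators' scores and maps it back to the nearest label of the scale; the nearest-label mapping is our own reading of the interval grouping described above, not a procedure the paper specifies.

```python
def aggregate_rating(scores):
    """Mean of the annotators' scores on the +1..-3 scale."""
    return sum(scores) / len(scores)

def to_interval(mean_score):
    """Map a mean score back to the nearest original label
    (an assumed discretization, for illustration only)."""
    labels = [1, 0, -1, -2, -3]
    return min(labels, key=lambda lab: abs(lab - mean_score))

print(aggregate_rating([-1, -2, -1]))   # -> -1.333...
print(to_interval(-1.333))              # -> -1
```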
Figure 1: Label distribution for Tasks 1, 2 and 3 (the red portion of the Task 2 bar corresponds to 2.63%).

In Task 3, similarly to Task 2, the result of the annotation is a real value. More precisely, we compute for each tweet the percentage of times it was indicated as best (most indicative of HS in its tuple) and as worst (least indicative of HS in its tuple), and take the difference between these two values, obtaining a score between −1 (non-hateful end of the spectrum) and 1 (hateful end of the spectrum). The bottom chart in Figure 1 shows that the distribution of values given by the BWS annotation has a higher variance than the scalar case, and is skewed slightly towards the hateful side. The confidence score for Task 3 follows a similar pattern to Task 2, while being slightly higher on average, with about 40% of the tweets having confidence 1.
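The BWS score just described reduces to a simple counting procedure. The sketch below assumes the annotations are available as per-tuple (most hateful, least hateful) marks; the data layout is an assumption, the arithmetic is the one stated above.

```python
from collections import Counter

def bws_scores(judgments):
    """judgments: list of (tuple_of_4_tweets, most_hateful, least_hateful),
    one entry per annotated 4-tuple.

    Returns, per tweet, the fraction of its appearances in which it was
    marked most hateful minus the fraction in which it was marked least
    hateful: a score in [-1, 1], where 1 is the hateful end.
    """
    seen, most, least = Counter(), Counter(), Counter()
    for tweets, best, worst in judgments:
        seen.update(tweets)
        most[best] += 1
        least[worst] += 1
    return {t: (most[t] - least[t]) / seen[t] for t in seen}
```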
A last consideration concerns the cost of the annotation tasks in terms of time and resources. We measured the cost of our three tasks: T1 and T2 had almost the same cost in terms of contributor remuneration, but T2 required about twice as much time to complete; T3 was the most expensive in terms of both money and time. With nearly equal results, one strategy could be chosen over the others because it is quicker or cheaper: we therefore deem it important to keep this factor in mind when designing a research strategy.
6 Classification tests with different schemes at comparison

Having described the process and results for each task, we now observe how they affect the quality of the resulting datasets. Our running hypothesis is that a better quality dataset provides better training material for a supervised classifier, thus leading to higher predictive capabilities.

Assuming that the final goal is to develop an effective system for recognizing HS, we opted to test the three schemes against the same binary classifier. In order to do so, it was necessary to make our schemes comparable without losing the information each of them provides: we mapped the Task 2 and Task 3 schemes down to a binary structure, directly comparable to the Task 1 scheme. For Task 2, this was done by drawing an arbitrary line that splits the scale in two. We tested different thresholds, mapping the judgments above each threshold to the label HS no from Task 1 and all judgments below the threshold to the label HS yes. We experimented with three values: -0.5, -1.0 and -1.5. For Task 3, similarly, we tried setting different thresholds along the hateful end of the answer distribution spectrum (see Section 5), respectively at 0, 0.25, 0.5 and 0.75. We mapped all judgments below each threshold to the label HS no from Task 1 and all judgments above the threshold to the label HS yes.
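In code, this mapping reduces to one comparison per tweet; note that the two scales point in opposite directions (for RS, lower means more hateful; for BWS, higher means more hateful). A minimal sketch:

```python
def rs_to_binary(mean_score, threshold):
    """Task 2: mean scores below the threshold (e.g. -0.5, -1.0, -1.5)
    become HS yes, scores above it HS no."""
    return "HS yes" if mean_score < threshold else "HS no"

def bws_to_binary(bws_score, threshold):
    """Task 3: BWS scores above the threshold (e.g. 0, 0.25, 0.5, 0.75)
    become HS yes, scores below it HS no."""
    return "HS yes" if bws_score > threshold else "HS no"

print(rs_to_binary(-1.33, threshold=-1.0))   # -> 'HS yes'
print(bws_to_binary(0.4, threshold=0.25))    # -> 'HS yes'
```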
When considering as HS yes all tweets whose average value for Task 2 is below -0.5, the number of hateful tweets increases (25.35%); when the threshold is set at -1.0, it slightly decreases (10.22%); but as soon as the threshold is moved down to -1.5, the number drops dramatically. A possible explanation for this is that a binary scheme is not adequate to depict the complexity of HS and forces judges to squeeze contents into a narrow black-or-white frame. Conversely, the thresholds for Task 3 return different, though only partial, results. The threshold 0.5 is the closest to the Task 1 partition, with a similar percentage of HS (16.90%), while lower thresholds allow for much higher percentages of tweets classified as hateful — setting the value at 0, for example, results in 40.52% of tweets classified as HS.

To better understand the impact of the different annotation strategies on the quality of the resulting datasets, we performed a cross-validation experiment. We implemented an SVM classifier using n-grams (1 ≤ N ≤ 4) as features, measuring its precision, recall and F1 score in a stratified 10-fold fashion. Results are shown in Table 6.
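This setup can be reproduced with standard components. The sketch below (scikit-learn, word n-grams with 1 ≤ N ≤ 4, a linear-kernel SVM, stratified 10-fold cross-validation) is one plausible configuration; details the paper does not specify, such as tokenization, the kernel and the hyperparameters, are filled in with library defaults and should be read as assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(tweets, labels):
    """tweets: list of strings; labels: list of 0/1 (HS no / HS yes).
    Returns mean precision, recall and F1 over a stratified 10-fold CV."""
    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 4)),  # word n-grams, 1 <= N <= 4
        LinearSVC(),
    )
    scores = cross_validate(
        clf, tweets, labels,
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
        scoring=["precision", "recall", "f1", "f1_macro"],
    )
    return {metric: values.mean() for metric, values in scores.items()}
```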
  Dataset   Threshold   support (0)   support (1)   P (0)   R (0)   F1 (0)   P (1)   R (1)   F1 (1)   F1 (macro)
  binary    —                  3365           635    .878    .923     .899    .450    .316     .354         .627
  RS        -0.5               2976          1014    .785    .841     .812    .408    .322     .359         .585
  RS        -1.0               3581           409    .912    .966     .938    .391    .186     .250         .594
  RS        -1.5               3845           145    .964    .991     .978    .200    .028     .047         .512
  BWS        0.0               2206          1782    .677    .703     .690    .614    .585     .599         .644
  BWS        0.25              2968          1020    .806    .860     .832    .492    .398     .439         .635
  BWS        0.5               3480           508    .893    .949     .920    .390    .222     .281         .601
  BWS        0.75              3835           153    .963    .992     .977    .147    .039     .060         .518

Table 6: Results of the 10-fold cross-validation on the datasets obtained with the different annotation strategies.


From the results of this cross-validation experiment, we draw some observations. When mapping the non-binary classifications to a binary one, choosing an appropriate threshold has a key impact on the classifier performance. For both RS and BWS, the strictness of the threshold (i.e., how close it is to the hateful end of the spectrum) is directly proportional to the performance on the negative class (0) and inversely proportional to the performance on the positive class (1). This may be explained by the different amounts of training data available: as we set a stricter threshold, we have fewer examples for the positive class, resulting in poorer performance, but more examples for the negative class, resulting in more accurate classification. Yet, looking at the rightmost column, we observe that permissive thresholds return a higher overall F1-score for both RS and BWS.

Regardless of the threshold, RS produces the worst performance, suggesting that reducing continuous values to crisp labels is not the best way to model the phenomenon, however accurate and well-considered the labels are. Conversely, compared to the binary annotation, BWS returns higher F1-scores with permissive thresholds (0.0 and 0.25), thus resulting in the best method to obtain a stable dataset. Furthermore, performance with BWS is consistently higher on the positive class (HS): considering that the task is typically framed as a detection task (as opposed to a classification task), this result confirms the potential of ranking annotation (as opposed to rating) to generate better training material for HS detection.

7 Conclusion and Future Work

We performed annotation tasks with three annotation schemes on a HS corpus, and computed the inter-annotator agreement rate and label distribution for each task. We also performed cross-validation tests with the three annotated datasets, to verify the impact of the annotation schemes on the quality of the produced data.

We observed that the RS we designed seems easier for contributors to use, but its results are more complex to interpret, and it returns the worst overall performance in the cross-validation test. It is especially difficult to compare it with a binary scheme, since merging labels together and mapping them down to a dichotomous choice is at odds with the nature of the scheme itself. Furthermore, such a scale necessarily oversimplifies a complex natural phenomenon, because it uses equidistant points to represent shades of meaning that may not be as evenly arranged.

Conversely, our experiment with BWS applied to HS annotation gave encouraging results. Unlike Wojatzki et al. (2018), we find that a ranking scheme is slightly better than a rating scheme, be it binary or scalar, in terms of prediction performance. As future work, we plan to investigate the extent to which such variations depend on circumstantial factors, such as how the annotation process is designed and carried out, as opposed to intrinsic properties of the annotation procedure.

The fact that similar distributions are observed when the dividing line for RS and BWS is drawn in a permissive fashion suggests that annotators tend to overuse the label HS yes when they work with a binary scheme, probably because they have no milder choice. This confirms that, whatever framework is used, the issue of hateful language requires a nuanced approach that goes beyond binary classification, in the awareness that an increase in complexity and resources will likely pay off in terms of more accurate and stable performance.

Acknowledgments

The work of V. Basile, C. Bosco and V. Patti is partially funded by Progetto di Ateneo/CSP 2016 Immigrants, Hate and Prejudice in Social Media (S1618 L2 BOSC 01) and by the Italian Ministry of Labor (Contro l'odio: tecnologie informatiche, percorsi formativi e storytelling partecipativo per combattere l'intolleranza, avviso n.1/2017 per il finanziamento di iniziative e progetti di rilevanza nazionale ai sensi dell'art. 72 del decreto legislativo 3 luglio 2017, n. 117 - anno 2017). The work of F. Poletto is funded by Fondazione Giovanni Goria and Fondazione CRT (Bando Talenti della Società Civile 2018).
References

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata. 2017. Hate speech detection in the Indonesian language: A dataset and preliminary study. In 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pages 233–238. IEEE.

Pete Burnap and Matthew L. Williams. 2015. Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making. Policy & Internet, 7(2):223–242.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Eleventh International AAAI Conference on Web and Social Media, pages 368–371.

Fabio Del Vigna, Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate Me, Hate Me Not: Hate Speech Detection on Facebook. In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), pages 86–95.

Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2017. Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene. In Proceedings of the First Workshop on Abusive Language Online, pages 46–51.

Lei Gao, Alexis Kuppersmith, and Ruihong Huang. 2017. Recognizing explicit and implicit hate speech using a weakly supervised two-path bootstrapping approach. arXiv preprint arXiv:1710.07394.

Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate speech detection. International Journal of Multimedia and Ubiquitous Engineering, 10(4):215–230.

Hugo Lewi Hammer. 2016. Automatic detection of hateful comments in online discussion. In International Conference on Industrial Networks and Intelligent Systems, pages 164–173. Springer.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470. ACL.

Rensis Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology, 22(140).

Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.

Puneet Mathur, Rajiv Shah, Ramit Sawhney, and Debanjan Mahata. 2018. Detecting offensive tweets in Hindi-English code-switched language. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 18–26.

Saif Mohammad and Svetlana Kiritchenko. 2018. Understanding emotions: A dataset of tweets to study interactions between affect categories. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pages 198–209.

Hamdy Mubarak, Kareem Darwish, and Walid Magdy. 2017. Abusive language detection on Arabic social media. In Proceedings of the First Workshop on Abusive Language Online, pages 52–56.

Rishab Nithyanand, Brian Schaffner, and Phillipa Gill. 2017. Measuring offensive speech in online political discourse. In 7th USENIX Workshop on Free and Open Communications on the Internet (FOCI 17).

Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, pages 145–153. International World Wide Web Conferences Steering Committee.

Alexandra Olteanu, Carlos Castillo, Jeremy Boy, and Kush R. Varshney. 2018. The effect of extremist violence on hateful speech online. In Twelfth International AAAI Conference on Web and Social Media, pages 221–230.

Fabio Poletto, Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and Cristina Bosco. 2017. Hate Speech Annotation: Analysis of an Italian Twitter Corpus. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017). CEUR.

Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2017. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. arXiv preprint arXiv:1701.08118.

Manuela Sanguinetti, Fabio Poletto, Cristina Bosco, Viviana Patti, and Marco Stranisci. 2018. An Italian Twitter Corpus of Hate Speech against Immigrants. In Proceedings of the 11th Language Resources and Evaluation Conference 2018, pages 2798–2805.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In Proceedings of the First Workshop on NLP and Computational Social Science, pages 138–142.

Michael Wojatzki, Tobias Horsmann, Darina Gold, and Torsten Zesch. 2018. Do Women Perceive Hate Differently: Examining the Relationship Between Hate Speech, Gender, and Agreement Judgments. In Proceedings of the Conference on Natural Language Processing (KONVENS), pages 110–120, Vienna, Austria.

Georgios Yannakakis, Roddy Cowie, and Carlos Busso. 2018. The ordinal nature of emotions: An emerging approach. IEEE Transactions on Affective Computing, pages 1–20. Early Access.