<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Annotating Hate Speech: Three Schemes at Comparison</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Poletto</string-name>
          <email>fpoletto@di.unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <email>basile@di.unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristina Bosco</string-name>
          <email>bosco@di.unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <email>patti@di.unito.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Stranisci</string-name>
          <email>marco.stranisci@acmos.net</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Acmos</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica, University of Turin</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Annotated data are essential to train and benchmark NLP systems. The reliability of the annotation, i.e. low inter-annotator disagreement, is a key factor, especially when dealing with highly subjective phenomena occurring in human language. Hate speech (HS), in particular, is intrinsically nuanced and hard to fit in any fixed scale, therefore crisp classification schemes for its annotation often show their limits. We test three annotation schemes on a corpus of HS, in order to produce more reliable data. While rating scales and best-worst scaling are more expensive strategies for annotation, our experimental results suggest that they are worth implementing in a HS detection perspective.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Automated detection of hateful language and
similar phenomena — such as offensive or abusive
language, slurs, threats and so on — is being
investigated by a fast-growing number of researchers.
Modern approaches to Hate Speech (HS)
detection are based on supervised classification, and
therefore require large amounts of manually
annotated data. Reaching acceptable levels of
interannotator agreement on phenomena as subjective
as HS is notoriously difficult. Poletto et al. (2017),
for instance, report a “very low agreement” in
the HS annotation of a corpus of Italian tweets,
and comparable annotation efforts have shown similar
results
        <xref ref-type="bibr" rid="ref19 ref21 ref4 ref7">(Del Vigna et al., 2017; Waseem, 2016;
Gitari et al., 2015; Ross et al., 2017)</xref>
        . In an
attempt to tackle the agreement issue, annotation
schemes have been proposed based on numeric
scales, rather than strict judgments
        <xref ref-type="bibr" rid="ref14 ref15 ref3 ref5 ref6 ref9">(Kiritchenko
and Mohammad, 2017)</xref>
        . Ranking, rather than
rating, has also proved to be a viable strategy to
produce high-quality annotation of subjective aspects
in natural language (Yannakakis et al., 2018). Our
hypothesis is that binary schemes may
oversimplify the target phenomenon, leaving it uniquely
to the judges’ subjectivity to sort less prototypical
cases and likely causing higher disagreement.
Rating or ranking schemes, on the other hand, are
typically more complex to implement, but they could
provide higher quality annotation.
      </p>
      <p>A framework is first tested by annotators:
interannotator agreement, number of missed test
questions and overall opinion are some common
standards against which the quality of the task can be
tested. A certain degree of subjectivity and bias is
intrinsic to the task, but an effective scheme should
be able to channel individual interpretations into
unambiguous categories.</p>
      <p>A second reliability test involves the use of
annotated data to train a classifier that assigns the
same labels used by humans to previously unseen
data. This process, jointly with a thorough error
analysis, may help spot bias in the annotation or
flaws in the dataset construction.</p>
      <p>We aim to explore whether and how different
frameworks differ in modeling HS, what problems
do they pose to human annotators and how
suitable they are for training. In particular, we apply a
binary annotation scheme, as well as a rating scale
scheme and a best-worst scale scheme, to a corpus
of HS. We set up experiments in order to assess
whether such schemes help achieve a lower
disagreement and, ultimately, a higher quality dataset
for benchmarking and for supervised learning.</p>
      <p>The experiment we set up involves two stages:
after having the same dataset annotated with three
different schemes on the crowdsourcing platform
Figure Eight (https://www.figure-eight.com/), we first compare their agreement
rates and label distributions, then we map all
schemes to a “yes/no” structure to perform a
cross-validation test with an SVM classifier. We launched
three separate tasks on the platform: Task 1 with
a binary scheme, Task 2 with an asymmetric
rating scale, and Task 3 with a best-worst scale. For
each task, a subset had been previously annotated
by experts within the research team, to be used as
a gold standard against which to evaluate
contributors’ trustworthiness on Figure Eight.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>Several frameworks have been proposed and
tested so far for HS annotation, ranging from
straightforward binary schemes to complex,
multilayered ones and including a variety of linguistic
features. Dichotomous schemes are used, for
example, by Alfina et al. (2017), Ross et al. (2017)
and Gao et al. (2017) for HS, by Nithyanand et al.
(2017) for offensiveness and by Hammer (2016)
for violent threats. Slightly more nuanced
frameworks try to highlight particular features.
Davidson et al. (2017) distinguish between hateful,
offensive but not hateful and not offensive, as do
Mathur et al. (2018), who instead use the label
abusive for the second type; similarly, Mubarak et
al. (2017) use the labels obscene, offensive and
clean. Waseem (2016) differentiates hate according
to its target, using the labels sexism, racism, both
and none. Nobata et al. (2016) use a two-layer
scheme, where a content can first be labeled either
as abusive or clean and, if abusive, as hate speech,
derogatory or profanity. Del Vigna et al. (2017)
use a simple scale that distinguishes between no
hate, weak hate and strong hate.</p>
      <p>
        Where to draw the line between weak and
strong hate is still highly subjective but, if
nothing else, the scheme prevents feebly hateful
comments from being classified as not hateful (thus
potentially neutral or positive) just because, strictly
speaking, they cannot be called HS. Other
authors, such as Olteanu et al. (2018) and Fišer et al.
(2017), use heavier and more elaborate schemes.
Olteanu et al. (2018), in particular, experimented
with a rating-based annotation scheme, reporting
low agreement. Sanguinetti et al. (2018) also use
a complex scheme in which HS is annotated both
for its presence (binary value) and for its
intensity (1–4 rating scale).
potentially provide valuable insights into the
investigated issue, but as a downside they make the whole
annotation process very time-consuming. More
recently, a ranking scheme has been applied to
the annotation of a small dataset of German hate
speech messages
        <xref ref-type="bibr" rid="ref22">(Wojatzki et al., 2018)</xref>
        .
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Annotation Schemes</title>
      <p>In this section, we introduce the three annotation
schemes tested in our study.</p>
      <p>Binary. Binary annotation implies assigning a
binary label to each instance. Besides HS,
binary classification is common in a variety of NLP
tasks and beyond. Its simplicity allows quick
manual annotation and easy computational
processing of the data. As a downside, such a
dichotomous choice presupposes that it is always possible
to clearly and objectively determine which answer
is true. This may be acceptable in some tasks, but
it is not always the case with human language,
especially for more subjective and nuanced
phenomena.</p>
      <p>
        Rating Scales. Rating Scales (RS) are widely
used for annotation and evaluation in a variety
of tasks. The Likert scale is the best known
        <xref ref-type="bibr" rid="ref10">(Likert,
1932)</xref>
        : values are arranged at regular intervals on
a symmetric scale, from the most to the least
typical of a given concept. It is suitable for
measuring subjective opinion or perception about a given
topic with a variable number of options.
Compared to a binary scheme, scales are better at
managing subjectivity and intermediate nuances of a
concept. On the other hand, as pointed out by
        <xref ref-type="bibr" rid="ref14 ref15 ref3 ref5 ref6 ref9">(Kiritchenko and Mohammad, 2017)</xref>
        , they present
some flaws: high inter-annotator disagreement
(the more fine-grained the scale, the higher the
chance of disagreement), individual
inconsistencies (judges may express different values for
similar items, or the same value for different items),
scale region bias (judges may tend to prefer
values in one part of the scale, often the middle) and
fixed granularity (which may not represent the
actual nuances of a concept).
      </p>
      <p>Best-Worst Scaling. The Best-Worst Scaling
model (BWS) is a comparative annotation process
developed by Louviere and Woodworth (1991).
In a nutshell, a BWS model presents annotators
with n items at a time (where n &gt; 1 and
normally n = 4) and asks them to pick the best and
worst ones with regard to a given property. The
model has been used in particular by Kiritchenko
and Mohammad (2017) and Mohammad and
Kiritchenko (2018), who proved it to be particularly
effective for subjective tasks such as sentiment
intensity annotation, which are intrinsically nuanced
and hardly fit in any fixed scale.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Dataset and task description</title>
      <p>For our experiment, we employ a dataset of 4,000
Italian tweets, extracted from a larger corpus
collected within the project Contro l'odio
(https://controlodio.it/). For the
purpose of this research, we filtered all the tweets
written between November 1st and December 31st
with a list of keywords. This list, reported in Table
1, is the same proposed in Poletto et al. (2017) for
collecting a dataset focused on three typical targets
of discrimination — namely Immigrants, Muslims
and Roma.</p>
      <p>Table 1: Keywords used to collect the dataset, grouped by target category.
ethnic group: immigrat*, immigrazione, migrant*, profug*, stranier*
religion: terrorismo, terrorist*, islam, mus[s]ulman*, corano
Roma: rom, nomad*</p>
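      <p>For illustration, the keyword filtering step can be reproduced in a few lines of Python. This is only a sketch under our own assumptions (the trailing * is read as a prefix wildcard, and mus[s]ulman* is interpreted as an optional double "s"); it is not the actual Contro l'odio collection pipeline.</p>
      <preformat>
import re

# Keywords from Table 1; a trailing * marks a prefix wildcard.
# "mus[s]?ulman*" is our interpretation of the paper's mus[s]ulman*.
KEYWORDS = [
    "immigrat*", "immigrazione", "migrant*", "profug*", "stranier*",
    "terrorismo", "terrorist*", "islam", "mus[s]?ulman*", "corano",
    "rom", "nomad*",
]

def compile_filter(keywords):
    parts = []
    for kw in keywords:
        if kw.endswith("*"):
            parts.append(r"\b" + kw[:-1] + r"\w*")  # prefix match
        else:
            parts.append(r"\b" + kw + r"\b")        # whole-word match
    return re.compile("|".join(parts), re.IGNORECASE)

HS_FILTER = compile_filter(KEYWORDS)

tweets = ["I migranti hanno sempre il posto e non pagano.",
          "Che bella giornata!"]
filtered = [t for t in tweets if HS_FILTER.search(t)]  # keeps the first tweet only
      </preformat>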
      <p>The concept of HS underlying all three
annotation tasks includes any expression based on
intolerance and promoting or justifying hatred towards
a given target. For each task we explicitly asked
the annotators to consider only HS directed
towards one of the three above-mentioned targets,
ignoring other targets if present. Each message
is annotated by at least three contributors.
Figure Eight also reports a measure of agreement,
computed as a Fleiss' κ weighted by a score indicating
the trustworthiness of each contributor on the
platform. We note, however, that the agreement
measured on the three tasks is not directly comparable,
since the tasks follow different annotation schemes.</p>
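      <p>The exact confidence formula used by Figure Eight is not public; the following Python sketch only illustrates the general idea of trust-weighted agreement on one message, with a hypothetical weighted_confidence helper of our own.</p>
      <preformat>
from collections import defaultdict

def weighted_confidence(judgments):
    # judgments: list of (label, trust) pairs for one tweet,
    # with trust in [0, 1] reflecting each contributor's reliability.
    weight = defaultdict(float)
    for label, trust in judgments:
        weight[label] += trust
    total = sum(weight.values())
    winner = max(weight, key=weight.get)
    return winner, weight[winner] / total

label, confidence = weighted_confidence([("yes", 0.9), ("yes", 0.8), ("no", 0.6)])
# ("yes", 0.739...): two trusted "yes" votes outweigh a weaker "no"
      </preformat>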
      <sec id="sec-4-1">
        <title>4.1 Task 1: Binary Scheme</title>
        <p>
          The first scheme is very straightforward and
simply asks judges to tell whether a tweet contains HS
or not. Each line will thus receive the label HS yes
or HS no. The definition of HS is drawn by
          <xref ref-type="bibr" rid="ref18">(Poletto et al., 2017)</xref>
          . In order to be labeled as hateful,
a tweet must: (i) address one of the above-mentioned
targets; and (ii) either incite, promote or justify hatred,
violence or intolerance towards the target, or
demean, dehumanise or threaten it.
        </p>
        <p>We also provided a list of expressions that are not
to be considered HS although they may seem so:
for example, these include slurs and offensive
expressions, slanders, and blasphemy. An example
of annotation for this task is presented in Table 2.</p>
        <p>Table 2: Example of binary annotation.
yes: Allora dobbiamo stringere la corda: pena capitale per tutti i musulmani in Europa immediatamente! (Then we have to adopt stricter measures: death penalty for all Muslims in Europe now!)
no: I migranti hanno sempre il posto e non pagano. (Migrants always get a seat and never pay.)</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Task 2: Unbalanced Rating Scale</title>
        <p>This task requires judges to assign a label to each
tweet on a 5-degree asymmetric scale (from +1 to
-3) that encompasses the content and tone of the
message as well as the writer’s intention. Again,
the target of the message must be one of three
mentioned above. The scheme structure is
reported in Table 3, while Table 4 shows an example
for each label.</p>
        <p>Table 3: Rating scale used in Task 2.
+1: positive
0: neutral, ambiguous or unclear
-1: negative and polite, dialogue-oriented attitude
-2: negative and insulting/abusive, aggressive attitude
-3: strongly negative with overt incitement to hatred, violence or discrimination, attitude oriented at attacking or demeaning the target</p>
        <p>This scale was designed with a twofold aim: to
avoid a binary choice that could leave too many
doubtful cases, and to split up negative contents
in more precise categories, in order to distinguish
different degrees of “hatefulness”.</p>
        <p>We tried not to influence annotators by
matching the grades of our scale in Task 2 to widespread
concepts such as stereotypes, abusive language
or hateful language, which people might tend to
apply by intuition rather than by following strict
rules. Instead, we provided definitions as
neutral and objective as possible, in order to
differentiate this task from the others and avoid biases.
An asymmetric scale, although unusual, fits our
purpose of an in-depth investigation of negative
language very well. A possible downside of this
scheme is that grades in the scale are supposed to
be evenly spread, while the real phenomena they
represent may not be so.</p>
        <p>Table 5: Example of best-worst annotation on a 4-tuple, with the least and the most hateful tweets marked.
least: Roma, ondata di controlli anti-borseggio in centro: arrestati 8 nomadi, 6 sono minorenni. (Rome, anti-pickpocketing patrolling in the centre: 8 nomads arrested, 6 of them are minors.)
(unmarked): Tutti i muslims presenti in Europa rappresentano un pericolo mortale latente. L'islam è incompatibile con i valori occidentali. (All Muslims in Europe are a dormant deadly danger. Islam is incompatible with Western values.)
(unmarked): Trieste, profughi cacciano disabile dal bus: arrivano le pattuglie di Forza Nuova sui mezzi pubblici. (Trieste, asylum-seekers throw disabled person off the bus: Forza Nuova (TN: far-right, nationalist fringe party) to patrol public transport.)
most: Unica soluzione è cacciare TUTTI i musulmani NON integrati fino alla 3a gen che si ammazzassero nei loro paesi come fanno da secoli MALATI! (Only way is to oust EVERY NON-integrated Muslim down to 3rd generation, let them kill each other in their own countries as they've done for centuries, INSANE!)</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Task 3: Best-Worst Scaling</title>
        <p>The structure of this task differs from the previous
two. We created a set of tuples made up of four
tweets (4-tuples), grouped so that each tweet is
repeated four times in the dataset, combined with
three different tweets each time (the details of the
tuple generation process are explained in this blog post:
https://valeriobasile.github.io/Best-worst-scaling-and-the-clock-of-Gauss/).
Then we provided contributors with a set of 4-tuples:
for each 4-tuple they were asked to point out the most
hateful and the least hateful of the four. Judges have
thus seen a given tweet four times, but have had to
compare it with different tweets every time. This method
avoids assigning a discrete value to each tweet
and gathers information on their “hatefulness” by
comparing them to other tweets. An example of
annotation, with the least and most hateful tweets
marked in a set of four, is provided in Table 5.</p>
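        <p>The blog post linked above describes the tuple generation procedure actually used; the shuffle-based Python sketch below only illustrates the stated constraint (each tweet appears four times, grouped with new companions on each round) and does not strictly guarantee distinct companions.</p>
        <preformat>
import random

def make_tuples(tweets, k=4, appearances=4, seed=0):
    # Group tweets into k-tuples so that each tweet appears
    # `appearances` times, reshuffling the pool on every round.
    assert len(tweets) % k == 0
    rng = random.Random(seed)
    tuples = []
    for _ in range(appearances):
        order = tweets[:]          # fresh copy for each round
        rng.shuffle(order)
        tuples += [tuple(order[i:i + k]) for i in range(0, len(order), k)]
    return tuples

tuples = make_tuples(["tweet_%d" % i for i in range(4000)])
# 4000 tweets -> 4 rounds x 1000 tuples = 4000 4-tuples
        </preformat>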
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Task annotation results</title>
      <p>In Task 1, the distribution of the labels yes and
no, referring to the presence of HS, conforms to
that of other similar annotated HS datasets, such
as Burnap and Williams (2015) in English and
Sanguinetti et al. (2018) in Italian. After applying
a majority criterion to non-unanimous cases,
tweets labeled as HS are around 16% of the dataset
(see Figure 1). Figure Eight measures the agreement
in terms of confidence, with a κ-like function
weighted by the trust of each contributor, i.e.,
a measure of their reliability across their history
on the platform. On Task 1, about 70% of the
tweets were associated with a confidence score of
1, while the remaining 30% follow a low-variance
normal distribution around .66.</p>
      <p>As for Task 2, the label distribution tells a
different story. When measuring inter-annotator
agreement, the mean value of all annotations has
been computed instead of using the majority
criterion. Therefore, results are grouped in intervals
rather than in discrete values, but we can still
easily map these intervals to the original labels. As
shown in Figure 1, tweets labeled as having a
neutral or positive content (in green) are only around
27%, less than one third of the tweets labeled as
non-hateful in Task 1. Exactly half of the whole
dataset is labeled as negative but oriented to
dialogue (in yellow), while 20% is labeled as
negative and somewhat abusive (orange) and only less
than 3% is labeled as an open incitement to hatred,
violence or discrimination (red). With respect to
the inter-annotator agreement, only 25% of the
instances are associated with the maximum
confidence score of 1, while the distribution of
confidence presents a high peak around .66 and a minor
peak around 0.5. Note that this confidence
distribution is not directly comparable to Task 1, since
the schemes are different.</p>
      <p>In Task 3, similarly to Task 2, the result of the
annotation is a real value. More precisely, we
compute for each tweet the percentage of times it
has been indicated as best (more indicative of HS
in its tuple) and worst (least indicative of HS in its
tuple), and compute the difference between these
two values, resulting in a value between -1 (the
non-hateful end of the spectrum) and 1 (the hateful
end of the spectrum). The bottom chart in Figure 1
shows that the distribution of values given by the
BWS annotation has a higher variance than the scalar
case, and is skewed slightly towards the hateful
side. The confidence score for Task 3 follows
a similar pattern to Task 2, while being slightly
higher on average, with about 40% of the tweets
having confidence 1.</p>
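      <p>In code, the score described here reduces to a simple counting exercise; the helper below is our own illustrative sketch, assuming each annotation records the 4-tuple shown together with the tweets picked as most and least hateful.</p>
      <preformat>
from collections import Counter

def bws_scores(annotations):
    # annotations: list of (tuple_shown, best_tweet, worst_tweet).
    best, worst, seen = Counter(), Counter(), Counter()
    for shown, b, w in annotations:
        for t in shown:
            seen[t] += 1
        best[b] += 1
        worst[w] += 1
    # Score = %times best minus %times worst, in [-1, 1].
    return {t: (best[t] - worst[t]) / seen[t] for t in seen}

scores = bws_scores([(("a", "b", "c", "d"), "d", "a")])
# {"a": -1.0, "b": 0.0, "c": 0.0, "d": 1.0}
      </preformat>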
      <p>A last consideration concerns the cost of the
annotation tasks in terms of time and resources. We
measured the cost of our three tasks: T1 and T2
had almost the same cost in terms of contributor
retribution, but T2 required about twice the time to
be completed; T3 proved the most expensive in
terms of both money and time. With nearly equal
results, one strategy could be chosen over the
others for being quicker or cheaper: therefore, when
designing a research strategy, we deem it important
not to forget this factor.</p>
    </sec>
    <sec id="sec-5">
      <title>Classification tests with different schemes at comparison</title>
      <p>Having described the process and results for each
task, we will now observe how they affect the
quality of resulting datasets. Our running
hypothesis is that a better quality dataset provides better
training material for a supervised classifier, thus
leading to higher predictive capabilities.</p>
      <p>Assuming that the final goal is to develop an
effective system for recognizing HS, we opted to test
the three schemes against the same binary
classifier. In order to do so, it was necessary to make
our schemes comparable without losing the
information each of them gives: we mapped Task
2 and Task 3 schemes down to a binary
structure, directly comparable to Task 1 scheme. For
Task 2, this was done by drawing an arbitrary line
that would split the scale in two. We tested
different thresholds, mapping the judgements above
each threshold to the label HS no from Task 1 and
all judgements below the threshold to the label
HS yes. We experimented with three values: -0.5,
-1.0 and -1.5. For Task 3, similarly, we tried
setting different thresholds along the hateful end of
the answers distribution spectrum (see Section 5),
respectively at 0, 0.25, 0.5 and 0.75. We mapped
all judgements below each threshold to the label
HS no from Task 1 and all judgements above the
threshold to the label HS yes.</p>
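      <p>Expressed as code, the two mappings reduce to threshold comparisons; in this sketch the label names and the handling of values exactly at the threshold are our own choices.</p>
      <preformat>
def rs_to_binary(mean_score, threshold=-1.0):
    # Task 2: mean rating above the threshold maps to HS no,
    # at or below it to HS yes. Tested thresholds: -0.5, -1.0, -1.5.
    return "HS no" if mean_score > threshold else "HS yes"

def bws_to_binary(bws_score, threshold=0.5):
    # Task 3: best-worst score above the threshold maps to HS yes,
    # at or below it to HS no. Tested thresholds: 0, 0.25, 0.5, 0.75.
    return "HS yes" if bws_score > threshold else "HS no"

assert rs_to_binary(-2.1) == "HS yes"
assert bws_to_binary(0.3, threshold=0.25) == "HS yes"
      </preformat>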
      <p>When considering as HS yes all tweets whose
average value for Task 2 is below -0.5, the
number of hateful tweets increases (25.35%); when the
threshold is set at -1.0, it slightly decreases (10.22%);
but as soon as the threshold is moved to -1.5,
the number drops dramatically. A possible
explanation for this is that a binary scheme is not
adequate to depict the complexity of HS and forces
judges to squeeze contents into a narrow
black-or-white frame. Conversely, thresholds for Task
3 return different results (however partial). The
threshold 0.5 is the closest to the Task 1 partition,
with a similar percentage of HS (16.90%), while
lower thresholds allow for much higher
percentages of tweets classified as hateful — setting the
value at 0, for example, results in 40.52% of tweets
classified as HS.</p>
      <p>To better understand the impact of the different
annotation strategies on the quality of the
resulting datasets, we performed a cross-validation
experiment. We implemented an SVM classifier using
n-grams (1 ≤ n ≤ 4) as features and measuring
its precision, recall and F1 score in a stratified
10-fold fashion. Results are shown in Table 6.</p>
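      <p>The paper does not name its implementation; a minimal reproduction of this setup with scikit-learn, under that assumption, could look as follows.</p>
      <preformat>
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate(texts, labels):
    # Word n-grams from unigrams up to 4-grams, fed to a linear SVM.
    model = make_pipeline(CountVectorizer(ngram_range=(1, 4)), LinearSVC())
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_validate(model, texts, labels, cv=cv,
                            scoring=("precision", "recall", "f1"))
    return {m: scores["test_" + m].mean()
            for m in ("precision", "recall", "f1")}
      </preformat>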
      <p>From the results of this cross-validation
experiment, we draw some observations. When
mapping the non-binary classification to a binary one,
choosing an appropriate threshold has a key
impact on the classifier performance. For both RS
and BWS, the strictness of the threshold (i.e., how
close it is to the hateful end of the spectrum) is
directly proportional to the performance on the
negative class (0) and inversely proportional to the
performance on the positive class (1). This may
be explained by different amounts of training data
available: as we set a stricter threshold, we will
have fewer examples for the positive class,
resulting in a poorer performance, but more examples
for the negative class, resulting in a more accurate
classification. Yet, looking at the rightmost
column, we observe how permissive thresholds return
a higher overall F1-score for both RS and BWS.</p>
      <p>Regardless of the threshold, RS appears to
produce the worst performance, suggesting that
reducing continuous values to crisp labels is not the
best way to model the phenomenon, however
accurate and pondered the labels are. Conversely,
compared to the binary annotation, BWS returns
higher F1-scores with permissive threshold (0.0
and 0.25), thus resulting in the best method to
obtain a stable dataset. Furthermore, performances
with BWS are consistently higher for the positive
class (HS): considering that the task is typically
framed as a detection task (as opposed to a
classification task), this result confirms the potential of
ranking annotation (as opposed to rating) to
generate better training material for HS detection.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>We performed annotation tasks with three
annotation schemes on a HS corpus, and computed
interannotator agreement rate and label distribution for
each task. We also performed cross-validation
tests with the three annotated datasets, to verify
the impact of the annotation schemes on the
quality of the produced data.</p>
      <p>We observed that the RS we designed seems
easier to use for contributors, but its results are
more complex to understand, and it returns the
worst overall performance in a cross-validation
test. It is especially difficult to compare it with a
binary scheme, since merging labels together and
mapping them down to a dichotomous choice is
in contrast with the nature of the scheme itself.
Furthermore, such a scale necessarily oversimplifies
a complex natural phenomenon, because it uses
equidistant points to represent shades of meaning
that may not be as evenly arranged.</p>
      <p>Conversely, our experiment with BWS applied
to HS annotation gave encouraging results.
Unlike Wojatzki et al. (2018), we find that a ranking
scheme is slightly better than a rating scheme, be
it binary or scalar, in terms of prediction
performance. As future work, we plan to investigate the
extent to which such variations depend on
circumstantial factors, such as how the annotation process
is designed and carried out, as opposed to intrinsic
properties of the annotation procedure.</p>
      <p>The fact that similar distributions are observed
when the dividing line for RS and BWS is drawn
in a permissive fashion suggests that annotators
tend to overuse the label HS yes when they work
with a binary scheme, probably because they have
no milder choice. This confirms that, whatever
framework is used, the issue of hateful language
requires a nuanced approach that goes beyond the
binary classification, being aware that an increase
in complexity and resources will likely pay off in
terms of more accurate and stable performances.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>
        The work of V. Basile, C. Bosco and V. Patti is
partially funded by Progetto di Ateneo/CSP 2016
Immigrants, Hate and Prejudice in Social
Media (S1618 L2 BOSC 01) and by the Italian Ministry
of Labor (Contro l'odio: tecnologie informatiche,
percorsi formativi e storytelling partecipativo per
combattere l'intolleranza, avviso n.1/2017 per il
finanziamento di iniziative e progetti di rilevanza
nazionale ai sensi dell'art. 72 del decreto
legislativo 3 luglio 2017, n. 117 - anno 2017). The work
of F. Poletto is funded by Fondazione Giovanni
Goria and Fondazione CRT (Bando Talenti della
Società Civile 2018).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Ika</given-names>
            <surname>Alfina</surname>
          </string-name>
          , Rio Mulia, Mohamad Ivan Fanany, and
          <string-name>
            <given-names>Yudo</given-names>
            <surname>Ekanata</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hate speech detection in the Indonesian language: A dataset and preliminary study</article-title>
          .
          <source>In 2017 International Conference on Advanced Computer Science and Information Systems (ICACSIS)</source>
          , pages
          <fpage>233</fpage>
          -
          <lpage>238</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Pete</given-names>
            <surname>Burnap</surname>
          </string-name>
          and
          <string-name>
            <given-names>Matthew L.</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making</article-title>
          .
          <source>Policy &amp; Internet</source>
          ,
          <volume>7</volume>
          (
          <issue>2</issue>
          ):
          <fpage>223</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Davidson</surname>
          </string-name>
          , Dana Warmsley,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Macy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ingmar</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automated hate speech detection and the problem of offensive language</article-title>
          .
          <source>In Eleventh International AAAI Conference on Web and Social Media</source>
          , pages
          <fpage>368</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Del Vigna</surname>
          </string-name>
          , Andrea Cimino, Felice Dell'Orletta, Marinella Petrocchi, and
          <string-name>
            <given-names>Maurizio</given-names>
            <surname>Tesconi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hate Me, Hate Me Not: Hate Speech Detection on Facebook</article-title>
          .
          <source>In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17)</source>
          , pages
          <fpage>86</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Darja</given-names>
            <surname>Fišer</surname>
          </string-name>
          , Tomaž Erjavec, and Nikola Ljubešić.
          <year>2017</year>
          .
          <article-title>Legal framework, dataset and annotation schema for socially unacceptable online discourse practices in Slovene</article-title>
          .
          <source>In Proceedings of the first workshop on abusive language online</source>
          , pages
          <fpage>46</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Lei</given-names>
            <surname>Gao</surname>
          </string-name>
          , Alexis Kuppersmith, and
          <string-name>
            <given-names>Ruihong</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Recognizing explicit and implicit hate speech using a weakly supervised two-path bootstrapping approach</article-title>
          .
          <source>arXiv preprint arXiv:1710.07394</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Njagi Dennis</given-names>
            <surname>Gitari</surname>
          </string-name>
          , Zhang Zuping, Hanyurwimfura Damien, and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Long</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A lexicon-based approach for hate speech detection</article-title>
          .
          <source>International Journal of Multimedia and Ubiquitous Engineering</source>
          ,
          <volume>10</volume>
          (
          <issue>4</issue>
          ):
          <fpage>215</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Hugo Lewi</given-names>
            <surname>Hammer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Automatic detection of hateful comments in online discussion</article-title>
          .
          <source>In International Conference on Industrial Networks and Intelligent Systems</source>
          , pages
          <fpage>164</fpage>
          -
          <lpage>173</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Kiritchenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Saif</given-names>
            <surname>Mohammad</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation</article-title>
          .
          <source>In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , pages
          <fpage>465</fpage>
          -
          <lpage>470</lpage>
          . ACL.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Rensis</given-names>
            <surname>Likert</surname>
          </string-name>
          .
          <year>1932</year>
          .
          <article-title>A technique for the measurement of attitudes</article-title>
          .
          <source>Archives of psychology</source>
          ,
          <volume>22</volume>
          (
          <issue>140</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Jordan J.</given-names>
            <surname>Louviere</surname>
          </string-name>
          and
          <string-name>
            <given-names>George G.</given-names>
            <surname>Woodworth</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Best-worst scaling: A model for the largest difference judgments</article-title>
          . University of Alberta: Working Paper.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Puneet</given-names>
            <surname>Mathur</surname>
          </string-name>
          , Rajiv Shah, Ramit Sawhney, and
          <string-name>
            <given-names>Debanjan</given-names>
            <surname>Mahata</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Detecting offensive tweets in Hindi-English code-switched language</article-title>
          .
          <source>In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media</source>
          , pages
          <fpage>18</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Saif</given-names>
            <surname>Mohammad</surname>
          </string-name>
          and
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Kiritchenko</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Understanding emotions: A dataset of tweets to study interactions between affect categories</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)</source>
          , pages
          <fpage>198</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Hamdy</given-names>
            <surname>Mubarak</surname>
          </string-name>
          , Kareem Darwish, and
          <string-name>
            <given-names>Walid</given-names>
            <surname>Magdy</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Abusive language detection on Arabic social media</article-title>
          .
          <source>In Proceedings of the First Workshop on Abusive Language Online</source>
          , pages
          <fpage>52</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Rishab</given-names>
            <surname>Nithyanand</surname>
          </string-name>
          , Brian Schaffner, and
          <string-name>
            <given-names>Phillipa</given-names>
            <surname>Gill</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Measuring offensive speech in online political discourse</article-title>
          .
          <source>In 7th USENIX Workshop on Free and Open Communications on the Internet (FOCI 17).</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Chikashi</given-names>
            <surname>Nobata</surname>
          </string-name>
          , Joel Tetreault, Achint Thomas,
          <string-name>
            <given-names>Yashar</given-names>
            <surname>Mehdad</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yi</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Abusive language detection in online user content</article-title>
          .
          <source>In Proceedings of the 25th international conference on world wide web</source>
          , pages
          <fpage>145</fpage>
          -
          <lpage>153</lpage>
          . International World Wide Web Conferences Steering Committee.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Alexandra</given-names>
            <surname>Olteanu</surname>
          </string-name>
          , Carlos Castillo, Jeremy Boy, and
          <string-name>
            <given-names>Kush R.</given-names>
            <surname>Varshney</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The effect of extremist violence on hateful speech online</article-title>
          .
          <source>In Twelfth International AAAI Conference on Web and Social Media</source>
          , pages
          <fpage>221</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Poletto</surname>
          </string-name>
          , Marco Stranisci, Manuela Sanguinetti, Viviana Patti, and
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Hate Speech Annotation: Analysis of an Italian Twitter Corpus</article-title>
          .
          <source>In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)</source>
          . CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Björn</given-names>
            <surname>Ross</surname>
          </string-name>
          , Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wojatzki</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Measuring the reliability of hate speech annotations: The case of the European refugee crisis</article-title>
          .
          <source>arXiv preprint arXiv:1701.08118</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Fabio Poletto, Cristina Bosco, Viviana Patti, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Stranisci</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An Italian Twitter Corpus of Hate Speech against Immigrants</article-title>
          .
          <source>In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018)</source>
          , pages
          <fpage>2798</fpage>
          -
          <lpage>2805</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Zeerak</given-names>
            <surname>Waseem</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter</article-title>
          .
          <source>In Proceedings of the first workshop on NLP and computational social science</source>
          , pages
          <fpage>138</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Wojatzki</surname>
          </string-name>
          , Tobias Horsmann, Darina Gold, and
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Zesch</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Do Women Perceive Hate Differently: Examining the Relationship Between Hate Speech, Gender, and Agreement Judgments</article-title>
          .
          <source>In Proceedings of the Conference on Natural Language Processing (KONVENS)</source>
          , pages
          <fpage>110</fpage>
          -
          <lpage>120</lpage>
          , Vienna, Austria.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Georgios</given-names>
            <surname>Yannakakis</surname>
          </string-name>
          , Roddy Cowie, and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Busso</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The ordinal nature of emotions: An emerging approach</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . Early Access.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>