       Mining Annotator Perspectives from Hate Speech Corpora

                Michael Fell, Sohail Akhtar, and Valerio Basile

                 Università degli Studi di Torino, Turin, Italy
             mic.fell@gmail.com, {firstname.lastname}@unito.it



        Abstract. Disagreement in annotation, traditionally treated mostly as
        noise, is increasingly considered a source of valuable information. We
        investigate a particular form of disagreement,
        occurring when the focus of an annotated dataset is a subjective and
        controversial phenomenon, therefore inducing a certain degree of polar-
        ization among the annotators’ judgments. We argue that the polarization
        is indicative of the conflicting perspectives held by different annotator
        groups, and propose a quantitative method to model this phenomenon.
        Moreover, we introduce a method to automatically identify shared per-
        spectives stemming from a common background. We test our method on
        several corpora in English and Italian, manually annotated according to
        their hate speech content, validating prior knowledge about the groups of
        annotators, when available, and discovering characteristic traits among
        annotators with unknown background. We found several precisely de-
        fined perspectives, described in terms of increased sensitivity towards
        textual content expressing attitudes such as xenophobia, islamophobia,
        and homophobia.

        Keywords: Linguistic Annotation · Perspective Identification · Anno-
        tator Bias · Hate Speech · Polarization of Opinions




1     Introduction

Most modern approaches to Natural Language Processing (NLP) tasks rely on
supervised Machine Learning. This is true, among others, for text classification
tasks such as abusive language and hate speech detection [35,6]. However,
while bias in datasets has been investigated [33,27], the bias in the annotation
of the datasets used for training hate speech models has received comparatively little attention.
    Recent works highlight the importance of a “perspectivist turn”, i.e., a change
of paradigm in supervised machine learning moving away from datasets aggre-
gated by majority vote, and towards frameworks that consider multiple anno-
tator perspectives in data creation, model training, and evaluation [7]. Taking
    Copyright © 2021 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).

an inclusive stance towards disagreement in data annotation has not only
ethical implications, but also a practical impact on the performance of
predictive systems [30] and on the reliability of the evaluation [8].
    We focus on hate speech (HS), and more generally on abusive phenomena in on-
line verbal communication, for several reasons. Firstly, hateful discourse online
is growing at a worrying rate [36], and it is linked to an increase of violence
and hatred towards vulnerable communities, with strong negative social im-
pact [14,21,22]. Moreover, hate speech is a highly subjective phenomenon. While
no phenomenon is entirely subjective or entirely objective, the position of
hate speech on a hypothetical inter-subjectivity spectrum [18] is far from the cen-
ter, as its judgment is influenced by factors such as socioeconomic background,
ethnicity, and gender [31]. Furthermore, the hatred is typically directed
towards targets carrying specific socio-economic, cultural, or demographic traits,
which are likely aligned to the factors influencing the judgments of hateful mes-
sages by human annotators. Indeed, messages containing hateful content are of-
ten controversial, that is, they reference events, people, and issues that prompt
very different reactions depending on the recipient of the message [26].
    In the area of hate speech detection, Akhtar et al. [1] introduced a quantita-
tive measure of the polarization of the annotation induced by the controversiality
of the messages. They show that, in the presence of highly subjective phenomena like
hate speech, systematic patterns emerge that suggest a diversification of the
annotators’ perspectives beyond mere disagreement. In a follow-up work,
Akhtar et al. [2] leveraged the polarization in the annotation to create multiple
perspective-encoding classifiers, boosting the classification performance in the
process. In this work, we further explore the polarization of annotation, aiming
in particular to provide a methodology to qualitatively study emerging groups
of annotators holding different, and sometimes conflicting, perspectives.
    More specifically, in this work we deal with shared perspectives, that is,
the set of factors that cause a certain annotation by a group of human annota-
tors (each holding an individual perspective). By analyzing the annotation
with computational methods, we aim at i) distinguishing groups of annotators
holding different shared perspectives, and ii) identifying the nature of the shared
perspectives, providing a human-readable description. First, we provide a formal
definition of perspective in the context of the annotation of NLP datasets, hing-
ing on the difference between label agreement and the novel concept of feature
agreement. We then empirically demonstrate the emergence of perspectives in
real datasets of hate speech, computed with a straightforward yet effective pro-
cedure, and illustrated in the form of important words and selected examples.


2   Related Work

Most research related to the identification of perspectives focuses on the
perspective of the author of the messages. The literature is typically concerned
with subjective phenomena in natural language, such as abusive language, where
an abundance of expressions of emotions, opinions and sentiments is found. Sub-
jective language is considered a catalyst for multiple perspectives [32,28] and
varying opinions at the sentence level [34]. Political discourse analysis is an important
research area, and many researchers have worked on identifying different perspectives
on political topics, including election campaigns, as a qualitative analysis task
[24]. Lin et al. [17] automatically identified perspectives at the document and
sentence level with high accuracy by developing statistical models and learning
algorithms on articles about the Israeli-Palestinian conflict.
    In NLP, the task of stance detection [20] aims at identifying points of view,
judgments or opinions on a given topic of interest in natural language. The
social and political issues on which individuals tend to express their opinions are
usually controversial in nature, causing polarization among people [4]. Beigman
Klebanov et al. [10] worked on perspective identification in public stances on
controversial topics such as abortion.
    Highly controversial topics, such as hate speech, are a rich source for identifying
and analyzing conflicting perspectives in online environments. When social media
users express different opinions on topics or social issues, the resulting text exhibits
a high level of controversy due to the varying perspectives [26]. When such phenomena are
manually annotated by human judges, high controversy is bound to have an
impact on such annotations, in terms of agreement between the human judges.
    In the aforementioned work, Akhtar et al. [1] developed a novel method to
measure the level of polarization in conflicting annotations of social media cor-
pora: a quantitative index, called the polarization index, which measures the
extent to which polarized opinions appear in individual messages.
The authors extended their work [2] by developing perspective-aware models
based on automatically clustered groups of annotators. State-of-the-art machine
learning models are trained on gold standard training sets based on this divi-
sion, successfully picking up the divergence of opinions in group-based test data.
The same authors recently developed a novel multi-perspective abusive language
dataset [3] on Brexit to identify and model perspectives expressed by annota-
tors with different ethnic, cultural, and demographic backgrounds. In contrast to
traditionally published NLP corpora, this dataset provides a natural partition
of the annotators into groups with similar backgrounds.
    It is noteworthy that disagreement in annotation is a relevant topic also in
more objective tasks such as POS tagging [8] and even outside the scope of NLP;
for instance, Basile et al. [7] describe the high disagreement in the annotation of
medical images by experts.


3   Mining Perspectives in Annotations

We postulate that annotators and their individual perspectives influence how
they annotate items related to a given topic. This is particularly relevant to
annotation tasks that exhibit a high degree of subjectivity, since there the
influence of the perspective on the ratings may be stronger. According to a com-
mon definition, a judgment is considered subjective when it is mainly “based
on, or influenced by, personal feelings, tastes, or opinions”; we usually contrast
this concept with that of objective, a term that characterizes judgments that,
ideally, are not influenced by personal feelings or idiosyncrasies and which, on a
practical level, the vast majority of people would see and label in the same way.
For instance, in hate speech detection, different annotators have been shown to
diverge highly in their ratings and are polarized [5]. Furthermore, the offensive-
ness of words depends on the context in which the words are uttered [23]. For
example, consider the difference between the use of the word “nigga” in a rap
song, where it is considered only mildly offensive, as opposed to the use of the same
word in political discourse, where it is understood as highly offensive. We assume that
annotators implicitly or explicitly take perspectives on topics, and we model this
as described in the following.
    In order to mine shared annotator perspectives in a given dataset, we propose
a two-step procedure. First, we detect perspectives that are shared among
annotators. To this end, we measure how much the annotators agree on item
labels, the label agreement. Second, we measure to what extent annotators agree
on the importance of linguistic features of the items, the feature agreement.
Combined, our method ensures that annotators in the same shared perspective
label items similarly and do so for similar reasons. In this work, we only use
unigrams as linguistic features to allow simpler explanations. For instance, an-
notators holding the perspective that the word “fag” is especially hateful tend
to always label items containing this word as hate speech, i.e., they exhibit a
high feature agreement on this unigram.1 We finally perform analysis on shared
perspectives consisting of annotators that are similar both in label agreement
and feature agreement. Such annotators tend to agree both on their item labels
and the importances they give to the item features (unigrams).

Individual Perspectives Given a list of items, an annotator A judges these items,
according to their opinion on each of them. We call this labeling the individual
perspective of annotator A on the items. Formally, given n items, assume there
are possible opinions 0, 1, ..., c for each item. Then, an annotator A takes a per-
spective pA by holding an opinion on each item. We call pA ∈ {0, 1, ..., c}n the
perspective of A (on the items). By modelling annotator perspectives as vectors,
we can compare them quantitatively.
    In order to identify perspectives in annotations, we require items to have dis-
agreeing annotations. This is only possible when the annotations have
not previously been aggregated into a single label, a setting that in [11] has been called
diamond standard. This is in contrast with the usual gold standard paradigm
where multiple annotations are harmonized into one gold label, often imple-
mented by majority voting. Under the paradigm of annotator perspectives that
we have introduced above, the reduction of multiple labels (annotator opinions)
into a gold label by majority vote is equivalent to taking the majority perspective.
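    To make the notation concrete, the following minimal example (with made-up
labels) shows three individual perspective vectors over five items and how majority-vote
aggregation collapses them into a single majority perspective.

# Toy example of perspective vectors; all label values are made up.
# 1 = hate speech, 0 = not hate speech, one entry per annotated item.
import numpy as np

p_A = np.array([1, 0, 1, 1, 0])   # perspective of annotator A on 5 items
p_B = np.array([1, 0, 0, 1, 0])   # perspective of annotator B
p_C = np.array([0, 0, 0, 1, 0])   # perspective of annotator C

# Majority-vote aggregation keeps only the majority perspective:
gold = (np.vstack([p_A, p_B, p_C]).mean(axis=0) > 0.5).astype(int)
print(gold)  # [1 0 0 1 0] -- the individual disagreements are lost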

Shared Perspectives While each annotator takes their own perspective, we are
more interested in finding perspectives which are shared among annotators. We
1 Note that our method is agnostic to the type of features extracted from the messages,
  and it could therefore be used in conjunction with other, more refined features.

call perspectives pA , pB shared based on their similarity. We employ clustering
to find clusters of annotators that share perspectives. While shared perspectives
arise from an agreement of annotators on item labels (label agreement), we
also aim to understand how shared perspectives are linguistically defined. To
this end, we analyze the importance that different annotators give to different
linguistics features, i.e. which words annotators in a shared perspective agree to
be important (feature agreement).

3.1   Label Agreement
We measure label agreement in terms of inter-annotator agreement. We use
Krippendorff’s alpha reliability [15] and cluster the annotators based on the
label agreement. We proceed as follows:

 – Given n annotators that label the same k items, we obtain a label matrix
   V ∈ Rn×k , where Vi,j is the rating of annotator i of item j. We compute
   the similarity matrix A ∈ Rn×n , where Ai,j = α(Vi,: , Vj,: ) that encodes the
   pairwise agreement between annotators i, j, where α is Krippendorff’s alpha
   reliability, and Vi,: is the label vector of annotator i. Then, the distance
   matrix D = 1 − A induces a clustering of the annotators. The distances in
   D are the pairwise disagreements between annotators.
 – We use an off-the-shelf clustering algorithm to cluster the annotators based
   on their distances D to one another, into groups of annotators with high
   intra-group label agreement and low inter-group label agreement.
 – A high label agreement α(i, j) indicates that annotators i, j tend to give
   similar labels on the items (texts).

Note that Krippendorff’s alpha is also defined for incomplete annotation, i.e.,
where not all annotators covered all the instances. This is a typical scenario in
crowdsourcing, but could happen with other annotation procedures as well.
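    As an illustrative sketch of this procedure (not the authors’ exact implementation),
the following code computes the pairwise agreement matrix A, the distance matrix D = 1 − A,
and a clustering of the annotators. It assumes the third-party krippendorff package for
Krippendorff’s alpha and scikit-learn for clustering; applying KMeans to the rows of D
(each annotator’s disagreement profile) follows the setup described later in Section 5.1.

# Sketch of the label-agreement clustering; assumes the `krippendorff`
# package and scikit-learn. label_matrix: (n_annotators x n_items),
# with np.nan marking items an annotator did not label.
import numpy as np
import krippendorff
from sklearn.cluster import KMeans

def label_agreement_clusters(label_matrix, n_clusters=2):
    n = label_matrix.shape[0]
    A = np.eye(n)  # pairwise agreement matrix (agreement with oneself taken as 1)
    for i in range(n):
        for j in range(i + 1, n):
            a = krippendorff.alpha(reliability_data=label_matrix[[i, j], :],
                                   level_of_measurement="nominal")
            A[i, j] = A[j, i] = a
    D = 1.0 - A  # pairwise disagreement (distance) matrix
    # Each row of D is an annotator's disagreement profile; cluster the rows.
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(D)
    return clusters, A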

3.2   Feature Agreement
We want to measure whether annotators agree on the importance of linguis-
tic features of the textual items. In this paper, we use a simple bag of words
(BOW) to model the texts; the features are unigram counts. Feature agreement
between annotators i, j arises when i and j give similar importance to features.
We measure the importance of each feature to an annotator by computing the
chi-square (χ2) statistic between the feature distribution and the annotator’s
label distribution, following a univariate feature selection approach. This
measures how the annotator’s label depends on the presence of a word in an
item. For instance, the presence of the unigram bitch often coincides with the
label hate speech, while this is not the case for the word sunny. The χ2 statistic
captures this; it is much higher for bitch than it is for sunny. When annotators
tend to agree on the importance of words, they exhibit an overall high feature
agreement. Specifically, we compute feature agreement as follows:

    – We extract k features from the corpus. Since we employ the BOW model,
      given a corpus of n documents, we compute the term-document matrix F ∈
      Nn×k , where Fi,j indicates the count of word j in document i. The columns
      of F are the features, as F:,i is the word counts of word i over the documents.
    – For each feature fi = F:,i and annotator r, we compute the importance
      imp of feature fi to annotator r as imp(fi , r) = χ2 (fi , Vr,:ᵀ ), where V is the
      previously introduced label matrix.
    – We define the feature agreement between annotators i, j by comparing their
      feature importances. To this end, let I ∈ Rk×n be the importance matrix,
      where Ii,j = imp(fi , j). Then, the vector of all importances of annotator
      j is given by I:,j . The feature agreement β between annotators i, j is then
      computed by the cosine similarity of their importance vectors: β(i, j) =
      cosine(I:,i , I:,j ), where cosine(x, y) = (x · y) / (‖x‖ · ‖y‖).
    – A high feature agreement β(i, j) indicates that annotators i, j tend to give
      similar labels when similar words are present.
    – Given n annotators, we compute the similarity matrix Bi,j = β(i, j) that en-
      codes the pairwise feature agreement between annotators i, j. Analogously
      to the label agreement case, we use the distance matrix D = 1 − B to clus-
      ter the annotators into groups of annotators with high intra-group feature
      agreement and low inter-group feature agreement.
Since the χ2 statistic requires a dense label matrix, if an annotator has not
labelled an item, we insert the negative label (i.e., not hate speech). Truly unim-
portant words then correctly get low importance, while truly important words
get assigned a somewhat diminished importance.
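    A possible implementation of this step using scikit-learn’s univariate χ2 scoring is
sketched below; it is an illustration under our reading of the procedure, not the authors’
exact code. F is the term-document count matrix and V the label matrix with missing
annotations already filled with the negative label; the importance matrix is built
annotator by annotator (transposed with respect to the I defined above) and compared
with cosine similarity.

# Sketch of the feature-agreement computation with scikit-learn.
# F: (n_documents x k_features) unigram counts; V: (n_annotators x n_documents) labels.
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.metrics.pairwise import cosine_similarity

def feature_agreement(F, V):
    n_annotators = V.shape[0]
    # Row r holds the chi-square importance of every feature for annotator r.
    I = np.vstack([chi2(F, V[r, :])[0] for r in range(n_annotators)])
    I = np.nan_to_num(I)               # chi2 can be undefined for degenerate columns
    B = cosine_similarity(I)           # pairwise feature agreement beta
    D = 1.0 - B                        # distance matrix used for clustering
    return B, D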

3.3      Label Feature Agreement
We consider two different ways of clustering the annotators: by label agreement
and by feature agreement. These two clusterings sometimes differ, for instance,
when two annotators agree on the item labels (label agreement), but do not
agree on the importance of words related to those labels (feature agreement).
Since our goal is to find annotators that label similarly and do so for similar
reasons, we analyze all annotators that cluster in the exact same way in both
labels and features. Specifically, we perform two clusterings for the annotators
ai . First, we cluster the ai into k different clusters {1, 2, ..., k} according to label
agreement, assigning each ai a cluster Lab(ai ) ∈ {1, 2, ..., k}. Analogously, each
ai is assigned a cluster F eat(ai ) ∈ {1, 2, ..., k} according to feature agreement.
Then, we only consider such annotators ai that cluster in the same way, i.e.,
label feature agreement is defined as {ai : Lab(ai ) = F eat(ai )}.
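    Because the two clusterings are computed independently, their cluster IDs are
arbitrary; the sketch below reflects our reading of this step for k = 2, first aligning
the IDs by maximal overlap and then keeping the annotators that fall into corresponding
clusters.

# Sketch of the label-feature agreement filter for k = 2 clusters.
# lab_clusters, feat_clusters: arrays of cluster ids in {0, 1}, one per annotator.
import numpy as np

def label_feature_agreement(lab_clusters, feat_clusters):
    lab = np.asarray(lab_clusters)
    feat = np.asarray(feat_clusters)
    # Cluster ids are arbitrary, so try both mappings of feature ids onto label ids
    # and keep the one with maximal overlap (an assumption on our side).
    same, flipped = (lab == feat), (lab == 1 - feat)
    agree = same if same.sum() >= flipped.sum() else flipped
    return np.where(agree)[0]  # indices of annotators kept for the analysis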

3.4      Cluster Analysis
Given the clusters of annotators we obtain, we analyze several cluster statistics,
how the clusters differ, and which words are important to each cluster.
Specifically, we perform the following analyses.

     Quantitative cluster description: the number of annotators in the cluster, the
positive label rate (%), the label agreement α, the number of features, and the
feature agreement β. We also compare the cluster numbers with the numbers for
all annotators, regardless of their cluster affiliation.
     Qualitative cluster description: we inspect the most characteristic unigrams
for the clusters, i.e., the words with the highest relative importance R to the clus-
ter. We measure the relative importance RC (w) of a word w to a cluster C as
RC (w) = (1 + med{imp(w, i)}i∈C ) / (1 + med{imp(w, i)}i∈¬C ), i.e., the ratio of the
median importance to all annotators inside the cluster and the median importance
to all annotators outside the cluster. We also inspect examples that are polarized
between the clusters, i.e., examples that are annotated with disagreement between
the clusters. These examples often show the important words in context.
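    The relative-importance score can be computed directly from the importance matrix
of the feature-agreement step; the sketch below assumes the matrix is oriented
annotators × features and that cluster membership is passed as a boolean mask (both
are our conventions, not spelled out in the text).

# Sketch of the relative importance R_C(w) for every feature.
# I: (n_annotators x k_features) chi-square importances; in_cluster: boolean mask.
import numpy as np

def relative_importance(I, in_cluster):
    in_cluster = np.asarray(in_cluster, dtype=bool)
    med_in = np.median(I[in_cluster], axis=0)     # median importance inside C
    med_out = np.median(I[~in_cluster], axis=0)   # median importance outside C
    return (1.0 + med_in) / (1.0 + med_out)

# Characteristic words of a cluster: the top-20 features by relative importance,
# e.g. top = np.argsort(relative_importance(I, mask))[::-1][:20]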

4     Datasets
The experiments described in this paper are conducted on several hate speech
corpora consisting of Twitter messages (tweets), published in the context of various
research studies on hate speech. This section provides details about the datasets,
such as the annotation process, scheme, and guidelines, as well as information on
the annotators.

4.1   HS Dataset on Brexit
The hate speech dataset on Brexit was recently published [3]. Originally, the au-
thors gathered the data from a study on stance detection in political debates [16]
where around 5 million tweets were collected during the Brexit voting period,
June 2016. The authors developed a multi-perspective dataset to automatically
detect abusive language on social media with the intention to model annotator
perspectives and polarized opinions. The collected tweets are filtered with a list
of selected abusive keywords based on a previous study [19]. 1,120 tweets were
randomly sampled and annotated for hate speech, aggressiveness, offensiveness,
and stereotype, following the scheme described in [25,29]. In total, six annota-
tors contributed to the dataset. Three of the annotators were researchers with
a Western background and experience in linguistic annotation who volunteered to
annotate the data. The other three volunteers, of Muslim background, were first- or
second-generation immigrants or students who had migrated from developing countries
to Europe and the UK. The group of migrants is named Target and
the locals are named Control. The dataset is unique in that it involves
migrants, the victims of abuse on social media, as annotators. Personal details of all anno-
tators such as cultural and demographic background and ethnicity are known
and considered a valuable source of information for perspective-aware abusive
language detection. For the current study, we only used the hate speech label.

4.2   HS Dataset in Italian
The hate speech dataset in Italian language [13] (HS Italian) consists of 3,200
tweets collected from TWITA [9] in 2017, partially overlapping with the Italian
Hate Speech Corpus [29]. The collection was filtered by the authors of the orig-
inal dataset with a list of handcrafted keywords related to migrants and ethnic
and religious minorities in Italy. The tweets were annotated on the Figure Eight
platform2 . The whole corpus was annotated by a minimum of three annotators per
tweet; the annotations were subsequently aggregated by the crowd-sourcing platform
to create a gold standard dataset. We requested and obtained the dataset from the authors.

4.3   HS Dataset in English
Davidson et al. [12] developed a hate speech dataset to perform automatic hate
speech detection as a multi-class classification task. The authors gathered around
85.4 million tweets from a total of 33,458 Twitter users. A hate speech lexicon
containing hateful words and phrases was used to query the tweets. This lexicon
was compiled by Hatebase.org and the hateful words in the lexicon were identified
by internet users. The authors randomly selected about 27,000 tweets from the
dataset by using the keywords from the hate speech lexicon. CrowdFlower (now
Appen) workers were hired to manually annotate the tweets. The annotation
scheme comprises the labels hate speech, offensive but not hate speech, and neither
offensive nor hate speech. The authors developed detailed guidelines with their
own definitions of different hate speech terms including the context in which
the words were used. Each tweet in the dataset was annotated by three or more
annotators. The Davidson dataset is only available for download in an aggregated
gold standard form3 , therefore we requested and obtained the non-aggregated
dataset from the authors.4


5     Perspective Mining Experiments
We performed analysis on all the datasets that we introduced in the previous
section. An important factor for our experiments is what prior knowledge we
have about the annotators that annotated the datasets. Where such background
information is given, we can confirm or reject our findings by comparing our em-
pirically found annotator clusters (shared perspectives) with groupings of human
annotators. As stated in the previous section, we have the following information
on the dataset annotators. For the Brexit dataset, the personal details of all
annotators, such as cultural and demographic background and ethnicity, are known;
for all other datasets, no background information on the annotators is available.
Note that the HS Italian and Davidson datasets are sparsely annotated, as each
annotator labeled only a fraction of the instances, in contrast to the Brexit
dataset, which has a dense annotation matrix.
2 https://www.figure-eight.com/, now Appen.
3 https://github.com/t-davidson/hate-speech-and-offensive-language
4 Only the aggregated gold standard data is available on the authors’ GitHub repository;
  as we needed non-aggregated data for our work, we asked the authors for the
  non-aggregated annotations, and we are grateful to them for providing the dataset
  in the required format.

                        Brexit             HS Italian              Davidson
 Cluster             A    all     B     A    all     B           A    all     B
 Annotators          3      6     2     7     14     7          45    111    41
 Pos. labels %    20.5   12.9   5.8  30.9   26.6  22.3   (off.) 77.2  71.2  65.6
                                                          (HS)   4.1  10.0  15.4
 Label agr. α     0.58   0.35  0.44  0.03   0.23  0.42          0.64  0.58  0.60
 Features                 266               623                      2366
 Feature agr. β   0.86   0.70  0.86  0.34   0.35  0.48          0.22  0.29  0.46

Table 1. Quantitative cluster statistics for the different datasets. Note that in the Davidson
dataset there are two kinds of positive labels, one for “offensive language” content (off.)
and one for “hate speech” (HS, the stronger label).



   Table 1 gives an overview of the quantitative statistics and differences be-
tween the clusters. As for the qualitative analysis, we provide examples where
important words are shown in context. For all clustering experiments, we used
the experimental setup as described below.

5.1     Experimental Setup
 – Preprocessing: we removed URLs and Twitter handles (@username) from the
   tweets, tokenized them using the NLTK5 Tweet Tokenizer, and lemmatized
   them using spaCy6 (a sketch combining these preprocessing and vectorization
   steps is given after this list).
 – BOW features: we created the BOW feature space with the scikit-learn7
   CountVectorizer, setting the minimum document frequency to 10. This rather
   high threshold was chosen because some tweets occur as duplicates or near-
   duplicates: due to the dialog structure of Twitter, users often cite each other.
   Furthermore, we counted each word at most once per document (“binary”)
   and extracted unigrams only.
 – Clustering: we used the KMeans algorithm with different numbers k of clus-
   ters and settled on k = 2, which appeared most reasonable based on the
   inspection of 2D-PCA embeddings of the datasets. This parameter choice
   also conforms with the polarization paradigm, i.e., we analyze two conflict-
   ing/polarized perspectives.
 – Important words: for each cluster, the top 20 words with highest relative
   importance are considered. The polarized examples are extracted using the
   polarization index method [1].
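    The sketch below combines the preprocessing and vectorization steps listed above.
The spaCy model name and the URL/handle regular expression are our assumptions; the
text only specifies the tools used and the min_df=10, binary, unigram settings of the
vectorizer.

# Illustrative preprocessing and BOW pipeline (cf. the setup above).
import re
import spacy
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # assumed model name
tok = TweetTokenizer()

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+|@\w+", " ", tweet)   # strip URLs and handles
    tokens = tok.tokenize(tweet)
    return " ".join(t.lemma_.lower() for t in nlp(" ".join(tokens)))

vectorizer = CountVectorizer(min_df=10, binary=True, ngram_range=(1, 1))
# F = vectorizer.fit_transform(preprocess(t) for t in tweets)  # term-document matrix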

5.2     Perspectives in the Brexit Corpus
In the Brexit dataset, the inter-annotator agreement is measured as α = 0.35.
The positive label rate is 12.9%. We extract 266 features from the corpus and
obtain β = 0.7 as feature agreement between all annotators.
5 https://www.nltk.org
6 https://spacy.io
7 https://scikit-learn.org

    After label feature agreement (see Section 3.3), we obtain the clusters A =
{3, 4, 5} and B = {1, 2}. Since we know the annotator backgrounds, we know
that A corresponds to the migrants with Muslim background (target group) and
that B corresponds to the non-migrants (control group). This result effectively
validates our clustering methodology based on label and feature agreement to
extract perspectives empirically.
    Quantitatively, we find the following differences between the clusters: i) The
positive label rate is much higher in A (20.5%) as compared to positive labels
in B (5.8%), indicating the annotators in A are more sensitive in this task
(all annotators 12.9%). ii) The label agreement is higher in cluster A (αA =
0.58) as compared to αB = 0.44, indicating that cluster A holds more coherent
opinions than B. Both values are much higher than the average, meaning that
the groups hold polarized opinions. iii) The feature agreement is higher in both
clusters (βA = 0.86 = βB ) compared to the dataset feature agreement (β = 0.7),
indicating polarization of the feature agreements of the clusters as well.
    Qualitatively, we find that certain words are highly correlated with the posi-
tive label in both groups, and some words are specific to the annotator clusters.
The shared vocabulary contains words such as “islam”, “kill” and hashtags re-
lated to US president Donald Trump (#maga, #trump2016). When inspecting
the corpus, we find examples such as the following that exemplify the use of the
words; matched words are shown in bold. And indeed, in these examples, both
annotator groups give the positive label.
        RT @ davidmatheson27 : The U.K. Must ban Islam and close all mosques!
      URL

        London should kick all Muslim Refugees out before they all kill them. #Trump2016
      URL

    From these shared vocabulary examples, we can see that since “islam” is one
of the important words, the hate speech in this corpus appears to be at least
partially islamophobic. An inspection of the important words for the potential
target group of islamophobia, cluster A, supports this claim. We find a specific
and distinctive vocabulary related to muslims, invasion, and terrorists.8 The following
examples illustrate the important words for cluster A. These examples received the
positive label in A and the negative label (“no, this is not hate speech”) in
cluster B. Interestingly, while “islam” is a shared top word, in cluster A we
typically found it in the combination “radical islam”.
        FYI world, the ppl of GB supporting #Brexit know if they don’t control their
      own immigration/borders radical Islam will end their lives.

The accusation of stealing jobs, a well-known negative prejudice towards foreigners,
also appears among the examples that are important for cluster A:
         Bloody foreigners coming here & taking our jobs though! #Brexit URL
8 Words with highest relative importance for cluster A: radical, job, illegal, invasion,
  love, can, let, merkel, mayor, then.

Identified Perspectives Overall, we find two polarized groups, both by label and
feature agreement. Cluster A - the target group - is much more likely to give the
positive label and this group of annotators consistently bases their opinion on a
specific and distinct vocabulary which can be described as Islamophobic. Given
the background information we have on all annotators, we identify cluster A as
the Muslim perspective on the topic, highly sensitive to Islamophobic content. In
contrast, for cluster B we did not find a characteristic vocabulary; those anno-
tators rather form a counter position to the migrant group, and we therefore describe
them as the control group or non-Muslim perspective. We conclude that annotators in
cluster A are very sensitive towards islamophobic and, more generally, xenophobic
textual content.


5.3   Perspectives in the HS Italian Dataset

In this dataset, we found large differences between the number of items anno-
tated by the different annotators. To avoid biasing our model, we only analyze
annotators with a high rating count9 . When clustering using both α and β
agreement, we obtained the same clustering into two clusters A and B of 7
annotators each.
    Quantitatively, we found an anomaly here, as the label agreement in cluster
A is almost zero (αA = 0.03), whilst in cluster B it is rather high (αB = 0.42).
This already indicates that A is a cluster of outliers. Furthermore, the feature
agreement is higher in cluster B (βB = 0.48) as compared to A (βA = 0.34).
The latter appears to be due to noise only.
    Qualitatively, we found that degrading talk about immigrants gets positive
labels from both clusters. For cluster B, we found examples with complaints about
immigrants driving up public costs by living in “hotels”, as well as concerns about
“sicurezza” (security) being diminished in the country after immigration.10
    Identified Perspectives: in this dataset, we found a defined perspective in
cluster B. These annotators tend to label a large spectrum of content as hate,
ranging from critical, through conservative and nationalistic, to openly hateful tweets.
Hence, annotators in cluster B are very sensitive towards xenophobic textual
content.


5.4   Perspectives in the Davidson Dataset

Analogously to the HS Italian dataset, we only analyze annotators with a high
rating count11 . We obtained different clusters according to α and β agreement.
After computing the label feature agreement, we obtained cluster A of size 45
and cluster B of size 41. Note that, in contrast with all previous datasets, we
9 This means, for this dataset, at least 800 ratings per annotator.
10 Words with highest relative importance for cluster B: hotel, #immigrati, spesa,
   se, clandestino, #gabbiaopen, giusto, tangere, succedere, #sicurezza (hotel, #immi-
   grants, expense, if, illegal alien, #opencage, right, touch, succeed, #safety).
11 For this dataset, at least 500 annotations per annotator.

have two kinds of positive labels in this dataset, one for “offensive language”
content and one for “hate speech” (stronger label).
    Quantitatively, we found cluster B to have a much higher hate speech label
rate (15.4%) than cluster A (4.1%); the base rate is 10.0%. Since both clusters
have comparable overall positive label rates, this indicates that cluster B has a
tendency to give the hate speech label where cluster A would rather give the
offensive label. Further, the feature agreement is much lower in cluster A
(βA = 0.22) as opposed to cluster B (βB = 0.46), indicating that the annotators in
cluster B agree much more on their important words.
    Qualitatively, we found that some words are understood by both clusters as hate-
ful. As cluster A has a much lower hate speech label rate, A was rarely more critical
than B. For cluster B, we find several examples with the same keywords, centered
around homophobic slurs such as “faggot”.12
    Identified Perspectives: in this dataset, we find a defined perspective for clus-
ter B. Annotators in this group give harsher labels when homophobic slurs are
present in a tweet, as compared to annotators in A. We conclude that the anno-
tators in B are highly sensitive towards homophobic textual content.



6    Conclusion


In this paper, we analyzed a number of annotated hate speech corpora, showing
how the opinions of the annotators, reflected in their annotation, are far from
uniformly distributed. In fact, the annotation of hate speech tends to be polar-
ized, and our methodology is able to highlight the groups of annotators sharing
similar opinions. We identified perspectives in the datasets, defined as increased
sensitivity towards certain types of textual content (xenophobic, islamophobic,
homophobic). Further, we introduced an automated method to support the man-
ual exploration of the perspectives emerging from a polarized annotation of hate
speech, resulting in consistent patterns describing why certain groups of people
are more or less inclined to judge a message as hateful.
    As future work, we plan to test our methods with deeper and more refined
linguistic features, to abstract away from individual words and therefore provide
a more robust analysis. We also plan on investigating other NLP tasks tradi-
tionally considered less subjective, but recently found to contain informative
disagreement [30], as well as non-linguistic tasks such as image labeling.
    Finally, we note that this work was only possible thanks to the availability of
non-aggregated datasets. In line with [5] and the Perspectivist Data Manifesto13 ,
we consider this factor crucial for research like ours.

12 Words with highest relative importance for cluster B: hypocrite, til, mike, warn, fag,
   spread, faggot, jealous, tat, texas.
13 https://pdai.info/

References
 1. Akhtar, S., Basile, V., Patti, V.: A new measure of polarization in the annotation of
    hate speech. In: Alviano, M., Greco, G., Scarcello, F. (eds.) AI*IA 2019 – Advances
    in Artificial Intelligence. pp. 588–603. Springer International Publishing, Cham
    (2019)
 2. Akhtar, S., Basile, V., Patti, V.: Modeling annotator perspective and polarized
    opinions to improve hate speech detection. Proceedings of the AAAI Conference
    on Human Computation and Crowdsourcing 8(1), 151–154 (Oct 2020), https://
    ojs.aaai.org/index.php/HCOMP/article/view/7473
 3. Akhtar, S., Basile, V., Patti, V.: Whose opinions matter? perspective-aware models
    to identify opinions of hate speech victims in abusive language detection (2021),
    https://arxiv.org/abs/2106.15896
 4. AlDayel, A., Magdy, W.: Stance detection on social media: State of
    the art and trends. Information Processing & Management 58(4), 102597
    (2021). https://doi.org/https://doi.org/10.1016/j.ipm.2021.102597, https://www.
    sciencedirect.com/science/article/pii/S0306457321000960
 5. Basile, V.: It’s the end of the gold standard as we know it. on the impact of pre-
    aggregation on the evaluation of highly subjective tasks. In: 2020 AIxIA Discussion
    Papers Workshop, AIxIA 2020 DP. vol. 2776, pp. 31–40. CEUR-WS (2020)
 6. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo,
    F.M., Rosso, P., Sanguinetti, M.: SemEval-2019 task 5: Multilingual detec-
    tion of hate speech against immigrants and women in twitter. In: Pro-
    ceedings of the 13th International Workshop on Semantic Evaluation. pp.
    54–63. Association for Computational Linguistics, Minneapolis, Minnesota,
    USA (Jun 2019). https://doi.org/10.18653/v1/S19-2007, https://www.aclweb.org/
    anthology/S19-2007
 7. Basile, V., Cabitza, F., Campagner, A., Fell, M.: Toward a perspectivist turn in
    ground truthing for predictive computing. In: Proceedings of the XVIII Confer-
    ence of the Italian chapter of AIS - Digital Resiliance and Sustainability: People,
    Organizations, and Society. pp. 1–16. Association for Intelligent Systems, Trento
    (Oct 2021), http://www.itais.org/itais2021-proceedings/pdf/21.pdf
 8. Basile, V., Fell, M., Fornaciari, T., Hovy, D., Paun, S., Plank, B., Poe-
    sio, M., Uma, A.: We need to consider disagreement in evaluation. In:
    Proceedings of the 1st Workshop on Benchmarking: Past, Present and Fu-
    ture. pp. 15–21. Association for Computational Linguistics, Online (Aug
    2021). https://doi.org/10.18653/v1/2021.bppf-1.3, https://aclanthology.org/2021.
    bppf-1.3
 9. Basile, V., Lai, M., Sanguinetti, M.: Long-term social media data collection at the
    university of turin. In: CLiC-it (2018)
10. Beigman Klebanov, B., Beigman, E., Diermeier, D.: Vocabulary choice as an indi-
    cator of perspective. In: Proceedings of the ACL 2010 Conference Short Papers. pp.
    253–257. Association for Computational Linguistics, Uppsala, Sweden (Jul 2010),
    https://aclanthology.org/P10-2047
11. Campagner, A., Ciucci, D., Svensson, C.M., Figge, M.T., Cabitza, F.: Ground
    truthing from multi-rater labeling with three-way decision and possibility theory.
    Information Sciences 545, 771–790 (2021)
12. Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection
    and the problem of offensive language. In: Proceedings of the 11th International
    AAAI Conference on Web and Social Media. pp. 512–515. ICWSM ’17 (2017)
13. Florio, K., Basile, V., Lai, M., Patti, V.: Leveraging hate speech detection to in-
    vestigate immigration-related phenomena in italy. In: 2019 8th International Con-
    ference on Affective Computing and Intelligent Interaction Workshops and Demos
    (ACIIW). pp. 1–7. IEEE (2019)
14. Izsak-Ndiaye, R.: Report of the special rapporteur on minority issues, Rita Izsak:
    comprehensive study of the human rights situation of Roma worldwide, with a
    particular focus on the phenomenon of anti-gypsyism. Tech. rep., UN, Geneva,
    2015-05-11 (May 2015), http://digitallibrary.un.org/record/797194, submitted
    pursuant to Human Rights Council resolution 26/4.
15. Krippendorff, K.: Estimating the reliability, systematic error and random error of
    interval data. Educational and Psychological Measurement 30, 61 – 70 (1970)
16. Lai, M., Tambuscio, M., Patti, V., Ruffo, G., Rosso, P.: Stance polarity in
    political debates: A diachronic perspective of network homophily and conver-
    sations on twitter. Data & Knowledge Engineering 124, 101738 (09 2019).
    https://doi.org/10.1016/j.datak.2019.101738
17. Lin, W.H., Wilson, T., Wiebe, J., Hauptmann, A.: Which side are you on? iden-
    tifying perspectives at the document and sentence levels. In: Proceedings of the
    Tenth Conference on Computational Natural Language Learning (CoNLL-X). pp.
    109–116. Association for Computational Linguistics, New York City (Jun 2006),
    https://www.aclweb.org/anthology/W06-2915
18. Maul, A., Mari, L., Wilson, M.: Intersubjectivity of measurement across the sci-
    ences. Measurement 131, 764–770 (2019)
19. Miller, C., Arcostanzo, F., Smith, J., Krasodomski-Jones, A., Wiedlitzka, S.,
    Jamali, R., Dale, J.: From brussels to brexit: Islamophobia, xenophobia, racism
    and reports of hateful incidents on twitter. DEMOS. Available at
    www.demos.co.uk/wpcontent/uploads/2016/07/From-Brussels-to-Brexit-Islamophobia-
    Xenophobia-Racism-and-Reports-of-Hateful-Incidents-on-Twitter-Research-
    Prepared-for-Channel-4-Dispatches- (2016)
20. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: SemEval-
    2016 task 6: Detecting stance in tweets. In: Proceedings of the 10th Inter-
    national Workshop on Semantic Evaluation (SemEval-2016). pp. 31–41. As-
    sociation for Computational Linguistics, San Diego, California (Jun 2016).
    https://doi.org/10.18653/v1/S16-1003, https://aclanthology.org/S16-1003
21. Mossie, Z., Wang, J.H.: Vulnerable community identification using hate speech
    detection on social media. Information Processing & Management 57, 102087 (07
    2019). https://doi.org/10.1016/j.ipm.2019.102087
22. O’Keeffe, G.S., Clarke-Pearson, K.: The impact of social media on children, ado-
    lescents, and families. Pediatrics 127, 800 – 804 (2011)
23. Pamungkas, E.W., Basile, V., Patti, V.: Do you really want to hurt me? predicting
    abusive swearing in social media. In: The 12th Language Resources and Evaluation
    Conference. pp. 6237–6246. European Language Resources Association (2020)
24. Pan, Z., Lee, C.C., Man, J., So, C.: One event, three stories. Gazette 61, 99–112
    (04 1999). https://doi.org/10.1177/0016549299061002001
25. Poletto, F., Stranisci, M., Sanguinetti, M., Patti, V., Bosco, C.: Hate speech anno-
    tation: Analysis of an italian twitter corpus. In: Proceedings of the Fourth Italian
    Conference on Computational Linguistics (CLiC-it 2017), Rome, Italy, December
    11-13, 2017. CEUR Workshop Proceedings, vol. 2006. CEUR-WS.org (2017)
26. Popescu, A.M., Pennacchiotti, M.: Detecting controversial events from twit-
    ter. In: Proceedings of the 19th ACM International Conference on Informa-
    tion and Knowledge Management. pp. 1873–1876. CIKM ’10, ACM, New York,
    NY, USA (2010). https://doi.org/10.1145/1871437.1871751, http://doi.acm.org/
    10.1145/1871437.1871751
27. Razo, D., Kübler, S.: Investigating sampling bias in abusive language
    detection. In: Proceedings of the Fourth Workshop on Online Abuse
    and Harms. pp. 70–78. Association for Computational Linguistics, Online
    (Nov 2020). https://doi.org/10.18653/v1/2020.alw-1.9, https://www.aclweb.org/
    anthology/2020.alw-1.9
28. Riloff, E., Wiebe, J.: Learning extraction patterns for subjective expressions. In:
    Proceedings of the 2003 Conference on Empirical Methods in Natural Language
    Processing. pp. 105–112 (2003), https://www.aclweb.org/anthology/W03-1014
29. Sanguinetti, M., Poletto, F., Bosco, C., Patti, V., Stranisci, M.: An italian twitter
    corpus of hate speech against immigrants. In: Proceedings of the Eleventh Interna-
    tional Conference on Language Resources and Evaluation (LREC-2018). European
    Language Resource Association (2018), http://aclweb.org/anthology/L18-1443
30. Uma, A., Fornaciari, T., Hovy, D., Paun, S., Plank, B., Poesio, M.: A case for
    soft-loss functions. In: Proceedings of the 8th AAAI Conference on Human Com-
    putation and Crowdsourcing. pp. 173–177 (2020), https://ojs.aaai.org/index.php/
    HCOMP/article/view/7478
31. Warner, W., Hirschberg, J.: Detecting hate speech on the world wide web. In: Pro-
    ceedings of the Second Workshop on Language in Social Media. pp. 19–26. LSM
    ’12, Association for Computational Linguistics, Stroudsburg, PA, USA (2012),
    http://dl.acm.org/citation.cfm?id=2390374.2390377
32. Wiebe, J., Wilson, T., Bruce, R., Bell, M., Martin, M.: Learning sub-
    jective language. Computational Linguistics 30, 277–308 (09 2004).
    https://doi.org/10.1162/0891201041850885
33. Wiegand, M., Ruppenhofer, J., Kleinbauer, T.: Detection of Abusive Lan-
    guage: the Problem of Biased Datasets. In: Proceedings of the 2019 Confer-
    ence of the North American Chapter of the Association for Computational
    Linguistics: Human Language Technologies, Volume 1 (Long and Short Pa-
    pers). pp. 602–608. Association for Computational Linguistics, Minneapolis, Min-
    nesota (Jun 2019). https://doi.org/10.18653/v1/N19-1060, https://www.aclweb.
    org/anthology/N19-1060
34. Yu, H., Hatzivassiloglou, V.: Towards answering opinion questions: Separating facts
    from opinions and identifying the polarity of opinion sentences. In: EMNLP (2003)
35. Zampieri, M., Nakov, P., Rosenthal, S., Atanasova, P., Karadzhov, G., Mubarak,
    H., Derczynski, L., Pitenis, Z., Çöltekin, c.: SemEval-2020 Task 12: Multilingual
    Offensive Language Identification in Social Media (OffensEval 2020). In: Proceed-
    ings of SemEval (2020)
36. Zhang, Z., Luo, L.: Hate speech detection: A solved problem? the chal-
    lenging case of long tail on twitter. Semantic Web Accepted (10 2018).
    https://doi.org/10.3233/SW-180338