<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>British Journal of Mathematical and Sta</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1162/089120104773633402</article-id>
      <title-group>
        <article-title>Social or Individual Disagreement? Perspectivism in the Annotation of Sexist Jokes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Berta Chulvi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lara Fontanella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Labadie-Tamayo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>G. d'Annunzio University of Chieti-Pescara</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de València</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1995</year>
      </pub-date>
      <volume>71</volume>
      <fpage>530</fpage>
      <lpage>535</lpage>
      <abstract>
        <p>The purpose of this paper is to show that the disagreement expressed in the data does not come from individual differences but from diverse, and sometimes conflicting, social positions. Using a medium-sized dataset (210 sexist jokes and 76 annotators), we test the hypothesis that, from a certain point on (a group size of 12 in our data), adding more subjects to the annotation process does not increase the disagreement. We also measure the attitudes of subjects towards sexism, introducing a new scale of Hostile Neosexism, and the consistent or inconsistent behaviour of annotators with respect to their attitudes. We propose that perspectives are a combination of attitudes and behaviours, and we explore how they affect inter-rater agreement and how many annotators are needed to include all the perspectives in an annotation strategy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Artificial Intelligence (AI) applications often perpetuate and accentuate unfair biases that can originate from multiple sources, such as data sampling, labelling processes, training data, etc. This paper focuses on new strategies for reducing bias in the labelling process following the Learning from Disagreement paradigm (for a recent review, see [1]). This new approach in Natural Language Processing (NLP) tries to avoid the bias of considering a unique and correct vision of one phenomenon captured by a gold standard corpus, even when the problem addressed is the object of a strong social debate such as hate speech or sexist language. The research we present raises two fundamental questions, one of a theoretical nature (what is the nature of these disagreements that we need to consider?) and the other of a methodological nature (how to approach an annotation process that includes the different perspectives of a phenomenon considering the existence of limited resources for the labelling process?).</p>
      <p>Regarding the first theoretical question, in social psychology there is strong evidence that humans disagree even in seemingly objective tasks like estimating which line has the same length as a standard line [2, 3]. It has been studied in detail how these disagreements do not occur in a social vacuum due to individual differences in perception, but instead are the result of social influence strategies with implications for the individuals at the level of their social relations or their social identity (for a recent review of this literature, see [4]). In line with this tradition of research, the present study tries to demonstrate that the Learning from Disagreement paradigm needs to consider disagreement as a social phenomenon and not at the individual level. Individual attitudes towards various issues, such as equality, abortion, or immigration, are the expression of ideological and social conflicts in which individuals take part. Then, the general idea underlying this research is that when dealing with socially relevant problems, NLP tasks need to consider that different perspectives in the data respond to different social positions in the social realm. The hypothesis derived from this assumption is that from a certain point on, the inclusion of more individuals in an annotation process does not produce more disagreement [H1]. If the results verify this hypothesis, the following research question is how to estimate the optimal size of a group of annotators from which disagreement does not change significantly [RQ1].</p>
      <p>To identify bias in the labelling process, recent research in NLP focuses on demographic, ideological, and attitudinal differences among individuals [5]. We propose that considering only attitudes and ideology is insufficient to approach the perspectivism paradigm correctly. A characteristic of human beings that we know from the beginning of social psychological research is that attitudes do not always predict behaviour [6] or do not directly predict behaviour [7]. People’s inclination for consistency is widely acknowledged, and while they occasionally manage to maintain it, more often than not, they fall short of achieving it. Social psychology has developed a vast theoretical and empirical effort to understand consistency and inconsistency in human attitudes and behaviour [8, 9].</p>
      <p>As labelling is a behaviour, a second assumption arising from our research is that different perspectives in annotation will be related not only to the expression of certain attitudes but also to the fact of acting consistently or inconsistently with the values these attitudes express.</p>
      <p>2nd Workshop on Perspectivist Approaches to NLP. * Corresponding author: berta.chulvi@upv.es (B. Chulvi).</p>
      <p>ORCID: 0000-0003-1169-0978 (B. Chulvi); 0000-0002-5441-0035 (L. Fontanella); 0000-0003-4928-8706 (R. Labadie-Tamayo); 0000-0002-8922-1242 (P. Rosso).</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>[Figure 1: Taxonomy of tasks of Perez and Mugny [10]. Axes: relevance of error (high to low) and social relevance (high to low). Task types: aptitude tasks (TAP; ambiguous perception, problem resolution, decision-making), non-ambiguous tasks (TONA; perceptual evidence, familiar information tasks, simple logic tasks), opinion tasks (TOP; opinions, attitudes, and values tasks), and non-implicating tasks (TANI; tasks of personal taste, predictions on a game of chance).]</p>
      <p>Then a hypothesis derived from this assumption is that agreement in an annotation process will change depending on individuals’ attitudes related to the issue and on whether annotators behave consistently or inconsistently with those attitudes in the annotation process [H2]. If the results verify this hypothesis, the research question is which size of the annotators’ group ensures that our team of annotators reproduces the mix of perspectives, reflecting both attitudes and the consistent or inconsistent behaviour with them, that gives the complete picture of a controversial debate [RQ2].</p>
      <p>Using a relatively small corpus (210 sexist jokes) and a large group of 76 annotators, we test hypotheses 1 and 2 and try to answer the two research questions: what the optimal size of the group is to include different perspectives [RQ1], and how to ensure that our annotators reproduce a representative mix of perspectives [RQ2].</p>
      <p>The rest of the article is organised as follows. Section 2 presents previous research related to the concepts that we use. In Section 3, we present our empirical research: data, task, and procedure. Details about the statistical analyses are given in Section 4. We present the results of our empirical evaluation in Section 5 and conclusions and limitations in Section 6.</p>
      <p>2. Related work</p>
      <p>2.1. The perspectivism shift and the labelling bias</p>
      <p>In modern computational linguistics, the standardised annotation process of a corpus includes different techniques to classify a single piece of language in a given taxonomy. It implies training annotators, multiple classification subjects, measures of inter-annotator agreement, harmonisation, aggregation by the majority, and construction of a “gold standard” corpus representing the truth against which future predictions of NLP models will be compared. According to the tasks’ taxonomy of Perez and Mugny [10], it means that the labelling process is being approached as an aptitude task, that is, a task with a correct answer (see Figure 1). This approach is hardly applicable when confronted with what different authors have referred to as a “highly subjective task” [11, 12]. We propose to denominate these tasks opinion tasks, following the taxonomy of [10], because their main characteristic is not their subjectivity but the fact that, looking at the way that society considers them, it seems that a correct answer does not exist (low relevance of error). Still, all the possible answers situate the person at a point of a continuum whose extremes are defined by a social confrontation (high social relevance). We view the shift paradigm, proposed in the Perspectivist Data Manifesto<sup>1</sup>, as a more stringent approach to handling opinion tasks in NLP. The shift paradigm advocates for the publication of datasets in pre-aggregated form and the development of new measures for the evaluation of models that take into account all the perspectives linked to different backgrounds.</p>
      <p><sup>1</sup> https://pdai.info/</p>
      <p><sup>2</sup> For the tasks classification, we have kept the original acronyms from the French version.</p>
      <p>The research adopting perspectivism in NLP grows year by year (for a recent review, see [1]) and one main concern is the labelling bias introduced by the cultural background of annotators [13, 14]. In recent research, Sap and colleagues [5] have shown strong associations between annotator identity and beliefs and their ratings of toxicity. Specifically, their results show that more conservative annotators and those who scored highly on a racist beliefs scale were less likely to rate anti-black language as toxic. Closer to our research questions is the work of Akhtar et al. [15, 16], which leverages different opinions emerging from groups of annotators with the goal of studying how polarised instances affect the performance of the classifiers. Considering binary classification tasks, they introduce a novel measure of the polarisation of opinions able to identify which instances in a dataset are more controversial. In a pilot study about xenophobia arguments in the context of Brexit, the annotation process was organised to contrast the annotation done by three people with an immigrant background (target group) with that of three people with a mainstream background as a control group. Using their polarisation index, the authors show how in several tweets, all the members of the target group (immigrants) marked the message as racist and hateful, while the members of the control group marked it as conveying no hate or racism. It is interesting to note that they only found a few tweets (1.13%) on which all the annotators agreed that they contained hateful messages. Implicitly, in this work the authors assume, similar to our perspective, that the nature of the disagreement is social and sustained by a social conflict, but they do not provide any empirical measure of annotators’ attitudes. Their results suggest that consensus-based methods to create gold standard data are not necessarily the best choice when dealing with what they call highly subjective phenomena and we consider opinion tasks.</p>
      <p>2.2. Attitudes and behaviour relation</p>
      <p>In binary classification tasks, annotating a corpus is a behaviour more than the expression of an opinion. The annotators will use their attitudes and beliefs to decide, but it is hard to expect that attitudes perfectly predict this behaviour. Attitudes influence behaviour, as we have already seen in the work of [5], but the attitude-behaviour relation is not a settled question in the social-psychology literature (for a classical review, see [17]). For example, Donald Campbell [18], in the sixties, argued that people who hold negative attitudes toward minorities may be reluctant to express their attitudes through public behaviour because norms of tolerance and politeness were typically held in American society. Things have changed a lot regarding the open expression of hate towards minorities, which is why The New York Times published, in 2019, an editorial with the suggestive headline “Racism Comes Out of the Closet”<sup>3</sup>.</p>
      <p>Not only does agreeing with social norms and situational constraints explain the inconsistencies between attitudes and behaviours, but there are also specific domains, such as humour, that significantly facilitate these kinds of inconsistencies. Often, some groups use humour to avoid moral judgement that penalises discrimination. Offensive people find support from a majority who consider that some messages are "only" jokes. When a society begins to overcome its prejudices towards certain social groups, we can observe that humour becomes a space in which these prejudiced attitudes are maintained. In fact, when we examine offensive jokes, we find they are mainly related to some social minorities [19]. These inconsistencies between attitudes and the behaviour of the annotators could also be a symptom of changes or resistances of subjects and capture the evolution of some opinion groups in controversial debates.</p>
      <sec id="sec-1-1">
        <title>2.3. The Hostile Neosexism</title>
        <p>Traditionally, sexism [20] has been viewed as the holding of discriminatory attitudes toward women, both manifest and subtle. This distinction in the tone of sexism was proposed by the ambivalent sexism theory [21, 22]. It was developed to account for a sort of evolution from a hostile component of sexism (overtly negative attitudes towards women) to a benevolent component (attitudes towards women that seem subjectively positive but are actually discriminatory). These two components differ in tone but are positively correlated and work together to perpetuate gender inequalities (for a recent review, see [23]). Also related to the evolution of sexism is the concept of neosexism [24] or modern sexism [25]. Like modern racism, modern sexism is characterised by the denial of continued discrimination, antagonism toward women’s demands, and lack of support for policies designed to improve women’s position in society.</p>
        <p>In a recent review on ambivalent sexism, Barreto and Doyle [23] point out future directions in the study of sexism due to the rapid developments in societal norms and attitudes towards sex, gender, and sexuality across many countries. Surprisingly, despite an important amount of research noting a rise in the number of men with a self-proclaimed anti-feminist agenda [26, 27, 28], these authors do not consider investigating the link between hostile sexism and anti-feminist attitudes as future work. Going deeper into the interaction between hostile sexism and anti-feminist attitudes seems relevant because a new kind of strong hostility towards women uses anti-feminist frames, but also supports certain feminist policies, such as equality [29]. This new latent attitude, which we denominate Hostile neosexism, is difficult to capture with old scales of attitudes towards feminism, such as the one developed by Smith in the seventies [30], because most of the items of this instrument fit with the feminist values that this new Hostile neosexism seems to support. It also seems to fall outside the scope of the whole ambivalent sexism inventory [21], which does not pay specific attention to feminism itself. Regarding the modern sexism scale [25] or the Neosexism scale [24], we argue that Hostile neosexism presents a high degree of hostility against women that the previous scales do not capture<sup>4</sup>. The core of this Hostile neosexism attitude is the claim that societal changes driven by the feminist movement are inherently unfair and put men as a group in a disadvantageous position. Although the hostile sexism subscale [21] was primarily driven by the idea that men’s dominance over women is both appropriate and desirable, some items of this subscale connect well with the idea that nowadays there is no reason for feminist demands and that the feminist movement overreacts (see items 3, 4 and 5 in Section 3.2.1).</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Study Design</title>
      <sec id="sec-2-1">
        <title>3.1. Data</title>
        <p>To carry out our study, we relied on a manually selected set of 210 jokes, conveying prejudice against women, from the corpus proposed in the shared task HUrtful HUmour (HUHU): Detection of Humour Spreading Prejudice in Twitter at IberLEF 2023 [31]. This dataset offers a gold standard corpus of tweets in Spanish containing prejudice against four minorities: women, the LGBTIQ community, immigrants and racially discriminated people, and overweight people. During the annotation process of the HUHU dataset each instance was assessed for the presence of humour and prejudice by 3 annotators. The criterion used for annotation was based on the relative majority agreement of the annotators, with a threshold of 2 out of 3. For the present study, we select jokes that convey different kinds of prejudice against women. We have classified the 210 jokes into 5 categories with the aim of describing the content of the dataset and providing some examples.</p>
        <p><sup>3</sup> https://www.nytimes.com/2019/07/15/opinion/trump-twitterracist.html</p>
        <p><sup>4</sup> Authors are currently conducting research to test the need for this new instrument and validate a longer version of the scale.</p>
        <sec id="sec-2-1-1">
          <title>3.2. Participants and procedure</title>
          <p>A total of 76 students of psychology (76.3% women and 23.7% men) took part in the experiments as an activity of a practical workshop in the first year of the degree. The activity was done in silence, without any other distractions, and took two hours. Students were assigned a secret number to preserve anonymity and accessed an Excel document to label the jokes. Annotation task 1 consisted of reading the 210 jokes and classifying them as sexist (containing a prejudice against women) or not. The annotators also had to say whether the text contains humour or not (task 2) and how offensive the prejudice was (task 3) on an ordinal scale (0=not at all, 1=slightly, 2=somewhat, 3=very much). After completing the annotation task, using the secret number, students responded to a questionnaire containing the Hostile neosexism scale and a question about their ideology.</p>
          <p><sup>5</sup> Data are public at https://github.com/Bertachulvi/ECAI2023</p>
          <p><sup>6</sup> https://allea.org/code-of-conduct/</p>
          <p>3.2.1. Annotators’ attitudes and ideology</p>
          <p>To measure annotators’ attitudes in Hostile Neosexism, we created a short scale that we denominate Brief Hostile Neosexism Scale. It is composed of six items: three of them (4 to 6) are part of the Hostile Sexism subscale of the Ambivalent Sexism Scale from Glick and Fiske [32] and the other three (1 to 3) are new items that we created ad hoc to measure anti-feminist attitude:</p>
          <p>1. Some of the demands of the feminist movement seem to me to be a bit exaggerated.</p>
          <p>2. I sometimes feel that our society pays too much attention to the rights of certain minorities.</p>
          <p>3. In the name of equality, many women try to gain certain privileges.</p>
          <p>4. Many women interpret innocent comments and actions as sexist.</p>
          <p>5. Women are easily offended.</p>
          <p>6. Women exaggerate the problems they suffer because they are women.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>As discussed in the Introduction, our research aims to evaluate the influence of attitudes on the annotation process and the relation between attitudes and behaviour.</p>
        <p>To derive annotators’ latent attitude and behaviour, we
exploit an Item Factor Analytic approach, which
constitutes an extension of classical linear factor analysis and is
particularly suitable for addressing categorical variables.</p>
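        <p>To make the categorical extension concrete, the following minimal sketch computes the category probabilities of a two-parameter normal ogive model with ordered thresholds, the formulation adopted in this section. The function names, the argument names (theta, loading, thresholds), and the example values are illustrative assumptions; the Bayesian estimation and the covariates used in the study are not reproduced here.</p>
        <preformat><![CDATA[
```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def category_probabilities(theta, loading, thresholds):
    """Probability of each ordered category for one unit and one item.

    theta      -- latent trait of the unit (hypothetical value)
    loading    -- factor loading of the item (hypothetical value)
    thresholds -- increasing interior cut points; C - 1 values give C categories
    """
    # Assumption: the outer cut points are -inf and +inf, so the first and
    # last categories are handled by the same difference of two ogives.
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    eta = loading * theta
    return [
        normal_cdf(eta - cuts[c]) - normal_cdf(eta - cuts[c + 1])
        for c in range(len(cuts) - 1)
    ]


if __name__ == "__main__":
    # Four response categories, symmetric thresholds, neutral latent trait.
    for c, p in enumerate(category_probabilities(0.0, 1.0, [-1.0, 0.0, 1.0]), 1):
        print(f"category {c}: {p:.4f}")
```
]]></preformat>
        <p>Because consecutive terms cancel, the probabilities sum to one for any latent trait value, and a larger latent trait moves probability mass towards the higher categories.</p>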
        <p>Specifically, within the framework of Item Response Theory (IRT) [35], we adopt the two-parameter normal ogive (2PNO) formulation [36]
<disp-formula><label>(1)</label><tex-math>P(Y_{ij} = c \mid \theta_i, \alpha_j, \boldsymbol{\gamma}_j) = \Phi(\alpha_j \theta_i - \gamma_{j,c}) - \Phi(\alpha_j \theta_i - \gamma_{j,c+1})</tex-math></disp-formula>
where <inline-formula><tex-math>\Phi(\cdot)</tex-math></inline-formula> is the normal cumulative function. Here the probability of observing a given category <inline-formula><tex-math>c = 1, \ldots, C</tex-math></inline-formula>, for unit <inline-formula><tex-math>i = 1, \ldots, N</tex-math></inline-formula> and item <inline-formula><tex-math>j = 1, \ldots, J</tex-math></inline-formula>, is modelled in terms of the latent trait <inline-formula><tex-math>\theta_i</tex-math></inline-formula>, the factor loading <inline-formula><tex-math>\alpha_j</tex-math></inline-formula> and a vector of ordered thresholds <inline-formula><tex-math>\boldsymbol{\gamma}_j</tex-math></inline-formula>. To estimate the model parameters, we embrace a fully Bayesian approach that incorporates the handling of missing values [37].</p>
        <p>We are also interested in measuring inter-rater agreement in the task of annotating sexism. As expected, because our data come from the HUHU dataset, we have observed that in the binary annotation scheme, most of the texts are categorised as jokes conveying prejudice against women, with 81% of the annotations falling into this category. This skewed distribution of data leads to a low level of agreement among different raters when using traditional inter-rater agreement measures such as Fleiss’ κ or Krippendorff’s α. This discrepancy arises from the paradoxical situation where the observed agreement appears to be very high, while the chance-corrected agreement is actually low [38]. To address this issue, we employ Gwet’s AC1 measure of inter-rater agreement [39], which utilises a probabilistic model of agreement [40]. This approach estimates the difficulty levels of the items within the corpus through probabilistic inference and then estimates the probability of chance agreement separately for easy and hard items. This probabilistic modelling approach helps mitigate the impact of the skewed data distribution on the agreement assessment process.</p>
        <p>5. Results</p>
        <p>5.1. Do more annotators produce more disagreement?</p>
        <p>To test hypothesis 1, which considers disagreement as a social phenomenon and not at the individual level, we need to investigate the influence of the number of annotators on inter-rater agreement. To do so, we randomly selected samples without replacement from the population of 76 annotators, with sample sizes n ranging from 3 to 45. To ensure statistical robustness, 10,000 iterations were performed for each sample size. The results of this analysis are presented in Figure 2. In particular, Figure 2(a) depicts the mean and 95% confidence interval for each sample size. To determine the optimal annotator sample size that leads to stabilisation in the variability of Gwet’s AC1 coefficient, the knee-point method was employed [41]. This method is commonly used to identify the point at which a graph exhibits a significant change in slope. In this study, the knee-point method was applied to the amplitude of the confidence intervals (see Figure 2(b)). Through the application of the knee-point method, an annotator sample size of n = 12 was determined to be the point of stabilisation for AC1 variability, indicating that further increases in the number of annotators do not yield significant modification in agreement [RQ1].</p>
        <p>[Figure 2: Simulation results: (a) Mean and 95% confidence interval of Gwet’s AC1 coefficient; (b) Amplitude of the 95% confidence interval of Gwet’s AC1 coefficient and knee-point (n = 12). Horizontal axis in both panels: annotator sample size.]</p>
        <p>5.2. How do attitudes affect the agreement among annotators?</p>
        <p>A Bayesian exploratory IRT analysis was employed, following the approach described in [42], in order to evaluate the construct validity of the scale outlined in Section 3.2.1. The results of the analysis indicated that the scale exhibits unidimensionality, supporting its validity as a measurement tool for the intended construct. Therefore, a unidimensional 2PNO model (Equation 1) was used to estimate the Hostile neosexism attitude of the annotators, taking into account the influence of their gender and ideology as relevant features. The estimated values for the model parameters can be found in Table 1. The factor loadings indicate the weight of the corresponding items in the derivation of the latent trait scores, while the location values give insights on the level of consolidation of the corresponding Hostile neosexism attitude: lower values correspond to a belief that gains more support in our sample [43]. As for the regression parameter estimates, the only covariate that seems to significantly impact the Hostile neosexism attitude is endorsing right ideology.</p>
        <p>To assess the influence of the Hostile neosexism attitude on the level of agreement, we contrast the inter-rater agreement among n = 12 annotators in three subgroups: a homogeneous group with the lowest scores on the Hostile neosexism attitude, a homogeneous group with the highest scores, and a mixed group with six annotators positioned at the lower end of the Hostile neosexism attitude and six annotators positioned at the higher end. The observed and expected agreements and the Gwet’s AC1 coefficients for all the 76 annotators and for the 3 subgroups are displayed in Table 2. The results demonstrate a clear distinction in the level of agreement among the annotators with lower Hostile neosexism attitude compared to the other groups. On the other hand, the agreement within the mixed group is similar to that observed in the overall population of annotators, indicating a comparable level of consensus among individuals with varying levels of Hostile neosexism attitude.</p>
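        <p>As a rough illustration of the observed agreement, expected agreement, and AC1 coefficient compared across groups, the sketch below implements the commonly cited closed form of Gwet’s AC1 for several raters and nominal categories. It is an assumption that this closed form matches the computation used in the study; the function name gwet_ac1 and the toy ratings are hypothetical, and every item is assumed to be labelled by at least two raters.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def gwet_ac1(ratings, categories=(0, 1)):
    """Return (observed agreement, chance agreement, AC1).

    ratings    -- one list of labels per item, one label per rater
    categories -- all labels that raters could have used
    """
    q = len(categories)
    n_items = len(ratings)
    agreement_sum = 0.0
    prevalence = {k: 0.0 for k in categories}
    for item in ratings:
        r = len(item)
        counts = Counter(item)
        # Share of rater pairs that agree on this item.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        # Share of raters choosing each category on this item.
        for k in categories:
            prevalence[k] += counts[k] / r
    p_a = agreement_sum / n_items
    pi = [v / n_items for v in prevalence.values()]
    # Chance agreement is driven by how far the prevalences are from 0 or 1.
    p_e = sum(p * (1.0 - p) for p in pi) / (q - 1)
    return p_a, p_e, (p_a - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Four jokes, three annotators, heavily skewed towards the label 1.
    toy = [[1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 1]]
    print(gwet_ac1(toy))
```
]]></preformat>
        <p>With skewed labels the chance-agreement term stays small, so the coefficient remains close to the observed agreement instead of collapsing as prevalence-sensitive coefficients can.</p>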
        <p>We develop a second sub-sampling strategy to test the influence of attitudes on the level of agreement. A simulation was conducted with a sample size of n = 12, and the sample units were randomly selected from sub-populations characterised by scores on the latent trait below the first quartile (Low Hostile Neosexism), above the third quartile (High Hostile Neosexism), and evenly distributed between the two sub-populations (Mixed Hostile Neosexism). From each group, we selected 10,000 samples without replacement. The findings (see Figure 3) provide further evidence of the influence of attitude on the level of agreement in the annotation process.</p>
        <p><sup>7</sup> We use the adjective classical here because 77% of jokes refer to traditional misogynistic stereotypes that present women as dumb, body-centred, gossipy, incomprehensible for men or malicious.</p>
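        <p>The quartile-based sub-sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name draw_group, the dictionary of latent scores, and the quartile convention (the default of Python’s statistics.quantiles) are all hypothetical choices, not details taken from the study.</p>
        <preformat><![CDATA[
```python
import random
from statistics import quantiles


def draw_group(scores, kind, size=12, rng=random):
    """Draw annotator ids without replacement from a quartile-based sub-population.

    scores -- dict mapping annotator id to latent trait score (hypothetical data)
    kind   -- "low" (below Q1), "high" (above Q3) or "mixed" (half from each)
    """
    q1, _, q3 = quantiles(list(scores.values()), n=4)
    low = [a for a, s in scores.items() if s < q1]
    high = [a for a, s in scores.items() if s > q3]
    if kind == "low":
        return rng.sample(low, size)
    if kind == "high":
        return rng.sample(high, size)
    if kind == "mixed":
        half = size // 2
        return rng.sample(low, half) + rng.sample(high, size - half)
    raise ValueError(f"unknown group kind: {kind}")


if __name__ == "__main__":
    # 76 fake annotators whose score equals their id, for illustration only.
    fake_scores = {i: float(i) for i in range(76)}
    print(sorted(draw_group(fake_scores, "mixed", rng=random.Random(0))))
```
]]></preformat>
        <p>Repeating the draw many times and computing an agreement coefficient on each drawn group gives the kind of sampling distribution that the simulation compares across the three sub-populations.</p>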
        <p>Following the two strategies, we find that the level of
agreement decreases among the Mixed Hostile
Neosexism group but also among High Hostile Neosexism. The
decline in agreement among mixed groups is
understandable but would not be expected among homogeneous
groups high in Hostile Neosexism. Then we address the
inconsistency between attitude and behaviour discussed
in Section 2.2.</p>
        <sec id="sec-2-2-1">
          <title>5.3. Are attitudes consistent with the annotators’ behaviour?</title>
          <p>An alternative approach based on IRT models, as
proposed in [44], can be employed to gain insights into
consistency in annotators’ behaviour across the 210 tweets,
specifically regarding their ability to recognise instances
of sexism in the jokes. This alternative formulation of
the IRT model deviates from the traditional approach by
treating the annotators as items, allowing the threshold
parameter in the binary annotation task to be interpreted
in terms of the level of dificulty in recognising the
presence of classical sexist content in jokes7. We denominate
this variable Sexism Recognition Shortcoming because all
text comes from a dataset that expresses sexism, but we
do not interpret these recognition problems as a lack
of skill, but rather, as the expression of an opinion. As
the pragmatic of communication emphasises, every be- haviour dimensions may be related to some annotators’
haviour is a communication act, even the silence [45]. characteristics. Table 4 provides the percentage
compo</p>
          <p>As depicted in Figure 4, there is evidence of a positive sition of the identified groups in terms of gender and
correlation between the Hostile Neosexism attitude of the ideology. The chi-square test of independence leads to
annotators and their Sexism Recognition Shortcoming be- conclude that there is a significant association between
haviour, reinforcing the idea that attitude and behaviour those characteristics and the group identified along the
are connected. However, the intriguing result is that the sexist latent traits (gender: p-value 0.0018; ideology:
pstrength of this association is relatively modest, as indi- value 0.0014). As we can see, the expected result on the
cated by the Pearson’s correlation coeficient (  = 0.234). impact of gender and ideology showed in Table 4 are
This suggests that the impact of attitude on the behaviour especially manifest in consistent groups. The left is the
of identifying the presence of sexist content is somewhat majority in Low-Low group, and the right in the
Highlimited and we need to introduce a more complex view High group. The novelty is that we can mostly link the
to identify the diferent perspectives. inconsistencies with the moderate left. This group finds
diferent partners in the inconsistency behaviour: the
left in the low Hostile Neosexism-high Sexism Recognition</p>
          <p>Shortcoming (Low-High) group and the right in the high
Linear: Ry2==00.1.09535*x - 1.308 (HHoisgthil-eLNoweo)sgexroisump-.low Sexism Recognition Shortcoming</p>
          <p>To further explore the relationship between attitude and behaviour, we classified the annotators into four groups based on their positioning relative to the means of the two identified variables: Hostile Neosexism attitude and Sexism Recognition Shortcoming (see Table 3). As we can see, the most numerous are the consistent groups: Low-Low (34%) and High-High (27%). However, the number of individuals exhibiting annotation behaviour inconsistent with their expressed attitudes (22.4% and 15.8%) is not negligible.</p>
          <table-wrap>
            <label>Table 3</label>
            <caption>
              <p>Groups’ composition according to Hostile Neosexism attitude and Sexism Recognition Shortcoming</p>
            </caption>
            <table>
              <thead>
                <tr><th>Hostile Neosexism</th><th>Sexism Recognition Shortcoming: Low</th><th>Sexism Recognition Shortcoming: High</th></tr>
              </thead>
              <tbody>
                <tr><td>Low</td><td>34.2% (26)</td><td>22.4% (17)</td></tr>
                <tr><td>High</td><td>15.8% (12)</td><td>27.6% (21)</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>With the inclusion of two supplementary annotation tasks as outlined in Section 3.2, we can assess whether the inconsistencies among annotators are related to the perception of humour in tweets or to their judgement of the level of offensiveness associated with each text. To this end, we used a procedure similar to the one described in Section 5.2 in order to derive annotators’ scores on the latent dimensions of Humour recognition and Degree of offensiveness. Figure 5 shows the distribution of the estimated scores for the recognition of humorous content and for the evaluation of the degree of offensiveness across the four annotators’ groups.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>This grouping allows for a more nuanced examination of how different positions on the attitude and be</title>
        <p>In Figure 5, we can see that the inconsistency between attitudes and behaviour in the case of individuals with Low Hostile Neosexism attitude but High Sexism Recognition Shortcoming relies on a higher recognition of the texts as humorous. This inconsistency supports the implicit and extended assumption that humour does not hurt. This group is also the one that rates tweets as less offensive. In this group, the left and the moderate left represent 82.4% of the total. Humour recognition also plays a role in the other inconsistent group, the individuals with High Hostile Neosexism attitude but Low Sexism
Recognition Shortcoming, where the moderate left and the right sum to 66.6%. We believe that the inconsistency of this group expresses that these annotators embrace Hostile Neosexism, which targets the feminist movement as overreacting, but recognise well the classical sexism expressed in 77% of the jokes. For this interpretation, it is important to take into account that our data mostly fit categories that express classical prejudices and stereotypes against women (see Section 3.1). The position of the two consistent groups (Low-Low and High-High) seems coherent: for different reasons, some because the jokes contain prejudice (Low-Low), others because they may think the jokes describe reality well (High-High), both find the tweets less humorous, but they differ in the degree of offensiveness. As expected, for the High-High group the tweets are less offensive than for the Low-Low group. These results lead us to affirm that perspectives are expressed through a combination of attitudes and behaviours.</p>
        <sec id="sec-2-3-1">
          <title>5.4. Agreement and perspectives</title>
          <p>In this section, we explore whether agreement changes when we consider individuals’ attitudes and their consistent or inconsistent behaviour [H2]. As we see in Table 5, individuals with a similar attitude, Low Hostile Neosexism, exhibit very different inter-rater agreement (0.83 &gt; 0.37) depending on the consistency between attitudes and behaviour. The same occurs with the opposite attitude: High Hostile Neosexism individuals exhibit very different inter-rater agreement (0.82 &gt; 0.49) depending on the consistency between attitudes and behaviour.</p>
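          <p>As an illustration of the agreement coefficient compared above, the following is a minimal sketch of Gwet’s AC1 for several raters, assuming the standard multi-rater definition and a complete items-by-raters matrix of integer labels. The function name gwet_ac1 and the toy data are hypothetical; the computation actually used for Table 5 is not specified in this section.</p>
          <preformat>
```python
import numpy as np


def gwet_ac1(ratings, n_categories=2):
    """Gwet's AC1 for a complete items x raters matrix of integer labels.

    Assumes every rater labelled every item with a label in
    0 .. n_categories - 1 (an assumption of this sketch).
    """
    ratings = np.asarray(ratings)
    n_items, n_raters = ratings.shape
    # counts[i, k] = number of raters who assigned category k to item i
    counts = np.stack(
        [(ratings == k).sum(axis=1) for k in range(n_categories)], axis=1
    )
    # Observed agreement: share of agreeing rater pairs, averaged over items
    pairs = (counts * (counts - 1)).sum(axis=1)
    p_a = np.mean(pairs / (n_raters * (n_raters - 1)))
    # Chance agreement based on the average prevalence of each category
    prevalence = (counts / n_raters).mean(axis=0)
    p_e = (prevalence * (1 - prevalence)).sum() / (n_categories - 1)
    return (p_a - p_e) / (1 - p_e)


# Toy example: four items, two raters, binary labels (hypothetical data)
toy = [[1, 1], [1, 1], [0, 0], [1, 0]]
ac1 = gwet_ac1(toy)
```
          </preformat>
          <p>For this toy matrix the observed agreement is 0.75 and the chance agreement is 0.46875, so the coefficient equals 9/17, roughly 0.53.</p>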
          <p>We cannot conclude that inconsistent behaviour reduces agreement because, in the Low Hostile Neosexism group, high agreement occurs in the consistent subgroup, while in the High Hostile Neosexism group, it occurs in the inconsistent subgroup. As we argue in Section 5.3, individuals communicate their opinions not only through attitude expression but also through behaviour, as the pragmatics of communication holds [45]. In this regard, we interpret high inter-rater agreement as the identification of a clear social position and low inter-rater agreement as the existence of a changing social position. By changing social position we mean a process in which individuals do not find a clear indication in the social realm of which action is expected from them in the given context. The interpretation of the different perspectives must therefore focus on identifying which kind of consensus or conflict causes the respective high or low agreement. We do not think that different perspectives must be matched with different groups showing strong agreement, because groups that are not polarised on a particular issue could exhibit a low level of agreement (in line with what [15] propose). Such a group might also express a different perspective as a way to approach a controversial issue even if there is no polarised position, because this lack of polarisation is what defines the group. Moreover, we need to consider controversial issues dynamically: it is reasonable to think that new or changing perspectives will register low levels of agreement because they reflect a social position that is being formed or one that is in crisis. Our interpretation of the different perspectives that we find in our data, taking into account the nature of the task of labelling a corpus that consists entirely of sexist jokes, is the following:
1. Low-Low group: People who strongly support the modern feminist movement (Low Hostile Neosexism) and who do not find classical sexist jokes funny (Low Sexism Recognition Shortcoming). It is a clear social position in sociological terms, and so we find high agreement (Gwet’s AC1 = 0.838).
2. Low-High group: People who support the modern feminist movement (Low Hostile Neosexism) but still find classical sexist jokes funny (High Sexism Recognition Shortcoming). It is a changing social position in sociological terms because the mainstream message is that this humour is not funny, and so we find lower agreement (Gwet’s AC1 = 0.37).
3. High-Low group: People who do not support the modern feminist movement (they think that some feminists overreact) but support the old feminist movement (the one that emphasises equality) and are able to recognise offensiveness in sexist jokes. This is a clear social position because it fits with 20th-century feminism, and so we find a high level of agreement (Gwet’s AC1 = 0.829).
4. High-High group: People who represent the new phenomenon that we have labelled Hostile Neosexism. They manifest a strong hostility to the modern feminist movement that could lead to a failure to recognise classical sexist jokes, that is, it can endanger the achievements of the equality movement during the 20th century. This is a new social position, and so we find a low level of agreement (Gwet’s AC1 = 0.49).</p>
          <p>Beyond this interpretation of the various views, we believe that multiple perspectives should be present in an ideal team of annotators. The next research question concerns the ideal size of the group needed to include all of them, based on our data.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>5.5. Size of the group and perspectives</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <p>Assuming the composition of the annotators’ population detailed in Table 3, our objective in this section is to investigate the sample size required to ensure the inclusion of all diverse perspectives within an annotator team [RQ2]. To achieve this, we randomly selected, with replacement, 100 samples from the original population for each sample size in the range 2-45. The representativeness of each sample with respect to the composition of the original population was assessed using the Frobenius distance between the original and the sample composition. The knee-point method was employed to identify the optimal sample size, meaning the sample size that guarantees a minimal distance between the sample and the population composition in terms of the proportion of annotators belonging to the four identified groups. To ensure the robustness of our findings, we repeated the simulation procedure 1000 times, resulting in an empirical distribution of the optimal sample size across the repetitions (see Table 6). From the results, we can conclude that, for our study, a sample size ranging from 10 to 12 will most likely guarantee a fair representation of the different perspectives in the annotators’ team.</p>
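        <p>The simulation described above can be sketched as follows. This is only an illustrative outline under stated assumptions: the group counts are those of Table 3 (26, 17, 12 and 21 annotators), the knee point is located with a simple farthest-from-chord heuristic because the exact knee-point method is not detailed here, and the number of repetitions is reduced from 1000 to 50 to keep the example fast. All function and variable names are hypothetical.</p>
        <preformat>
```python
import numpy as np

# Annotators per group, in the order Low-Low, Low-High, High-Low, High-High
GROUP_COUNTS = np.array([26, 17, 12, 21])


def mean_distance_curve(rng, sizes, n_samples=100):
    """Mean distance between sample and population composition per size."""
    population = np.repeat(np.arange(4), GROUP_COUNTS)
    target = GROUP_COUNTS / GROUP_COUNTS.sum()
    curve = []
    for n in sizes:
        # n_samples samples of size n, drawn with replacement
        draws = rng.choice(population, size=(n_samples, n), replace=True)
        # Proportion of each of the four groups within every sample
        comp = np.stack([(draws == g).mean(axis=1) for g in range(4)], axis=1)
        # For vectors, the Frobenius norm is the Euclidean norm
        curve.append(np.linalg.norm(comp - target, axis=1).mean())
    return np.array(curve)


def knee_point(sizes, curve):
    """Assumed heuristic: point farthest from the chord between endpoints."""
    x = (sizes - sizes[0]) / (sizes[-1] - sizes[0])
    y = (curve - curve.min()) / (curve.max() - curve.min())
    chord = y[0] + (y[-1] - y[0]) * x
    return int(sizes[np.argmax(np.abs(chord - y))])


sizes = np.arange(2, 46)          # sample sizes 2 to 45
rng = np.random.default_rng(0)    # fixed seed for reproducibility
# 50 repetitions here; the study reports 1000
optimal = [knee_point(sizes, mean_distance_curve(rng, sizes)) for _ in range(50)]
values, freq = np.unique(optimal, return_counts=True)
```
        </preformat>
        <p>The arrays values and freq then give an empirical distribution of the selected sample size across repetitions, analogous in form to Table 6. The sizes this sketch selects depend on the assumed knee heuristic and need not match the reported range.</p>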
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and limitations</title>
      <sec id="sec-3-1">
        <p>In this paper, we presented a methodology that approaches several common problems that arise when we intend to translate the perspectivism paradigm into a coherent annotation strategy. We tested H1, and our results in Section 5.1 suggest that the nature of the disagreement in the annotation is social and not individual because, from a certain point, it does not increase by adding more individuals. We apply a social psychology-grounded taxonomy for classifying tasks that could be helpful for dealing with what, in NLP research, is referred to as a subjective task. We also verify that different perspectives arise not only from attitudes but also from the consistency or inconsistency of the annotators’ behaviour with these attitudes. We find this important because it shows that we cannot assume that we will include all perspectives in a dataset by relying only on attitude or biographical differences. We also argue that these inconsistencies are valuable information about how controversial issues evolve in social debate. We propose that perspectives are a combination of attitudes and behaviour. We evaluate the size of the group needed to include all the perspectives detected in our data.</p>
        <p>Several limitations of this work must be considered. First, the annotator team is composed of psychology students, but even within this homogeneous group we have seen that different perspectives arise. Also, we chose to work with a dataset containing only sexist jokes, because we tried to avoid the diversity coming from the data and to concentrate on annotators’ perspectives, but a deep analysis of the texts would give us more insights and a more complex view. The most challenging future work is to translate the knowledge obtained in this research into a feasible methodology for including all perspectives in an annotation plan, which might need to proceed in three steps when creating the corpus: (i) a first exploratory step that identifies perspectives and how these perspectives are reflected in the data, (ii) a second step to ensure the representativeness of the data in terms of perspectivism, and (iii) a final step that checks whether, at the end of the annotation procedure, the data reflect all the perspectives.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>Berta Chulvi and Paolo Rosso are supported by FairTransNLP-Stereotypes PID2021–124361OB-C31, funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe. The work of Roberto Labadie was supported by valgrAI - Valencian Graduate School and Research Network of Artificial Intelligence and the Generalitat Valenciana. Lara Fontanella is supported by the ICOMIC (Identifying and Counteracting Online Misogyny in Cyberspace) Project funded by EU Next Generation, MUR-Fondo Promozione e Sviluppo-DM 737/2021.</p>
        <p>References</p>
        <p>[1] A. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, Learning from Disagreement: A Survey, Journal of Artificial Intelligence Research 72 (2021) 1385–1470.</p>
        <p>[2] S. E. Asch, Studies of independence and conformity: I. A minority of one against a unanimous majority, Psychological Monographs: General and Applied 70 (1956) 1–70. doi:10.1007/s11135-022-01494-7.</p>
        <p>[3] J. D. Campbell, P. J. Fairey, Informational and normative routes to conformity: The effect of faction size as a function of norm extremity and attention to the stimulus, Journal of Personality and Social Psychology 57 (1989) 457–468. doi:10.1037/0022-3514.57.3.457.</p>
        <p>[4] R. Spears, Social Influence and Group Identity, Annual Review of Psychology 72 (2021) 367–390. doi:10.1146/annurev-psych-070620-111818.</p>
        <p>[5] M. Sap, S. Swayamdipta, L. Vianna, X. Zhou, Y. Choi, N. A. Smith, Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 5884–5906. URL: https://aclanthology.org/2022.naacl-main.431. doi:10.18653/v1/2022.naacl-main.431.</p>
        <p>[6] R. T. LaPiere, Attitudes vs. actions, Social Forces 13 (1934) 230–237.</p>
        <p>[7] I. Ajzen, M. Fishbein, Attitude-behavior relations: A theoretical analysis and review of empirical research, Psychological Bulletin 84 (1977) 888.</p>
        <p>[8] A. W. Kruglanski, K. Jasko, M. Milyavsky, M. Chernikova, D. Webber, A. Pierro, D. Di Santo, Cognitive consistency theory in social psychology: A paradigm reconsidered, Psychological Inquiry 29 (2018) 45–59.</p>
        <p>[9] J. Cooper, Cognitive Dissonance: Where We’ve Been and Where We’re Going, International Review of Social Psychology (2019). doi:10.5334/irsp.277.</p>
        <p>[10] J. A. Pérez, G. Mugny, Influences sociales : la théorie de l’élaboration du conflit, 1993.</p>
        <p>[11] V. Basile, It’s the End of the Gold Standard as we Know it. On the Impact of Pre-aggregation on the Evaluation of Highly Subjective Tasks, in: DP@AI*IA, 2020.</p>
        <p>[12] V. Basile, T. Caselli, A. Balahur, L. Ku, Editorial: Bias, subjectivity and perspectives in natural language processing, Frontiers in Artificial Intelligence 5 (2022). doi:10.3389/frai.2022.926435.</p>
        <p>[13] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The Risk of Racial Bias in Hate Speech Detection, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1668–1678. doi:10.18653/v1/P19-1163.</p>
        <p>[14] Z. Waseem, Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational Social Science, Association for Computational Linguistics, Austin, Texas, 2016, pp. 138–142. doi:10.18653/v1/W16-5618.</p>
        <p>[15] S. Akhtar, V. Basile, V. Patti, A New Measure of Polarization in the Annotation of Hate Speech, in: M. Alviano, G. Greco, F. Scarcello (Eds.), AI*IA 2019 – Advances in Artificial Intelligence, Springer International Publishing, Cham, 2019, pp. 588–603.</p>
        <p>[16] S. Akhtar, V. Basile, V. Patti, Whose Opinions Matter? Perspective-aware Models to Identify Opinions of Hate Speech Victims in Abusive Language Detection, 2021. URL: arXiv:2106.15896v1 [cs.CL], 30 Jun 2021.</p>
        <p>[17] A. H. Eagly, S. Chaiken, The psychology of attitudes, Harcourt Brace Jovanovich College Publishers, 1993, pp. 155–218.</p>
        <p>[18] D. T. Campbell, Social attitudes and other acquired behavioral dispositions, in: S. Koch (Ed.), Psychology: A study of a science. Study II. Empirical substructure and relations with other sciences. Vol. 6. Investigations of man as socius: Their place in psychology and the social sciences, McGraw-Hill, 1963.</p>
        <p>[19] L. I. Merlo, B. Chulvi, R. Ortega-Bueno, P. Rosso, When humour hurts: linguistic features to foster explainability, Procesamiento del Lenguaje Natural 70 (2023) 85–98.</p>
        <p>[20] J. K. Swim, L. L. Hyers, Sexism, in: T. D. Nelson (Ed.), Handbook of prejudice, stereotyping, and discrimination, Psychology Press, 2009, pp. 407–430.</p>
        <p>[21] P. Glick, S. T. Fiske, The ambivalent sexism inventory: Differentiating hostile and benevolent sexism, Journal of Personality and Social Psychology 70 (1996) 491.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>