Social or Individual Disagreement? Perspectivism in the Annotation of Sexist Jokes

Berta Chulvi

Lara Fontanella

Roberto Labadie-Tamayo

Paolo Rosso

G. d'Annunzio University of Chieti-Pescara; Universitat Politècnica de València

The purpose of this paper is to show that the disagreement expressed in the data does not come from individual differences but from diverse, and sometimes conflicting, social positions. Using a medium-sized dataset (210 sexist jokes and 76 annotators), we test the hypothesis that, from a certain point (a group size of 12 in our data), adding more subjects to the annotation process does not increase the disagreement. We also measure the subjects’ attitudes towards sexism, introducing a new scale of Hostile Neosexism, and whether annotators behave consistently or inconsistently with those attitudes. We propose that perspectives are a combination of attitudes and behaviours, and we explore how they affect inter-rater agreement and how many annotators are needed to include all the perspectives in an annotation strategy.

1. Introduction

Artificial Intelligence (AI) applications often perpetuate and accentuate unfair biases that can originate from multiple sources, such as data sampling, labelling processes, training data, etc. This paper focuses on new strategies for reducing bias in the labelling process following the Learning from Disagreements paradigm (for a recent review, see [1]). This new approach in Natural Language Processing (NLP) tries to avoid the bias of considering a unique and correct vision of one phenomenon captured by a gold standard corpus, even when the problem addressed is the object of a strong social debate such as hate speech or sexist language. The research we present raises two fundamental questions, one of a theoretical nature - what is the nature of these disagreements that we need to consider? - and the other of a methodological nature: how to approach an annotation process that includes the different perspectives of a phenomenon considering the existence of limited resources for the labelling process?

Regarding the first theoretical question, in social psychology there is strong evidence that humans disagree even in seemingly objective tasks like estimating which line has the same length as a standard line [2, 3]. It has been studied in detail how these disagreements do not occur in a social vacuum due to individual differences in perception, but instead are the result of social influence strategies with implications for the individuals at the level of their social relations or their social identity (for a recent review of this literature, see [4]). In line with this tradition of research, the present study tries to demonstrate that the Learning from Disagreement paradigm needs to consider disagreement as a social phenomenon and not at the individual level. Individual attitudes towards various issues, such as equality, abortion, or immigration, are the expression of ideological and social conflicts in which individuals take part. Then, the general idea underlying this research is that when dealing with socially relevant problems, NLP tasks need to consider that different perspectives in the data respond to different social positions in the social realm. The hypothesis derived from this assumption is that from a certain point on, the inclusion of more individuals in an annotation process does not produce more disagreement [H1]. If the results verify this hypothesis, the following research question is how to estimate the optimal size of a group of annotators from which disagreement does not change significantly [RQ1].

To identify bias in the labelling process, recent research in NLP focuses on demographic, ideological, and attitudinal differences among individuals [5]. We propose that considering only attitudes and ideology is insufficient to approach the perspectivism paradigm correctly. A characteristic of human beings that we know from the beginning of social psychological research is that attitudes do not always predict behaviour [6] or do not directly predict behaviour [7]. People’s inclination for consistency is widely acknowledged, and while they occasionally manage to maintain it, more often than not, they fall short of achieving it. Social psychology has developed a vast theoretical and empirical effort to understand consistency and inconsistency in human attitudes and behaviour [8, 9].

2nd Workshop on Perspectivist Approaches to NLP
* Corresponding author.
berta.chulvi@upv.es (B. Chulvi)
0000-0003-1169-0978 (B. Chulvi); 0000-0002-5441-0035 (L. Fontanella); 0000-0003-4928-8706 (R. Labadie-Tamayo); 0000-0002-8922-1242 (P. Rosso)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

As labelling is a behaviour, a second assumption arising from our research is that different perspectives in annotation will be related not only to the expression of certain attitudes but also to the fact of acting consistently or inconsistently with the values these attitudes express.

Then a hypothesis derived from this assumption is that agreement in an annotation process will change depending on individuals’ attitudes related to the issue and on whether the annotators behave consistently or inconsistently with those attitudes in the annotation process [H2]. If the results verify this hypothesis, the research question is which size of the annotators’ group ensures that our team of annotators reproduces the mix of perspectives, reflecting both attitudes and consistent or inconsistent behaviour with them, that gives the complete picture of a controversial debate [RQ2].

Figure 1: Tasks’ taxonomy of Perez and Mugny [10], organised by relevance of error (high to low) and social relevance (high to low): aptitude tasks (TAP), non-ambiguous tasks (TONA), opinion tasks (TOP), and non-implicant tasks (TANI).

Using a relatively small corpus (210 sexist jokes) and a large group of 76 annotators, we test hypotheses 1 and 2 and try to answer the two research questions: what is the optimal size of the group of annotators [RQ1], and how to ensure that our annotators reproduce a representative mix of the different perspectives [RQ2].

The rest of the article is organised as follows. Section 2 presents previous research related to the concepts that we use. In Section 3, we present our empirical research: data, task, and procedure. Details about the statistical analyses are given in Section 4. We present the results of our empirical evaluation in Section 5 and conclusions and limitations in Section 6.

2. Related work

2.1. The perspectivism shift and the labelling bias

In modern computational linguistics, the standardised annotation process of a corpus includes different techniques to classify a single piece of language in a given taxonomy. It implies training annotators, multiple classification subjects, measures of inter-annotator agreement, harmonisation, aggregation by the majority, and construction of a “gold standard” corpus representing the truth against which future predictions of NLP models will be compared. According to the tasks’ taxonomy of Perez and Mugny [10], it means that the labelling process is being approached as an aptitude task, that is, a task with a correct answer (see Figure 1). This approach is hardly applicable when confronted with what different authors have referred to as a “highly subjective task” [11, 12]. We propose to denominate these tasks opinion tasks, following the taxonomy of [10], because their main characteristic is not their subjectivity but the fact that, looking at the way that society considers them, it seems that a correct answer does not exist (low relevance of error). Still, all the possible answers situate the person at the point of a continuum whose extremes are defined by a social confrontation (high social relevance). We view the shift paradigm, proposed in the Perspective Data Manifesto¹, as a more stringent approach to handling opinion tasks in NLP. The shift paradigm advocates for the publication of datasets in pre-aggregated form and the development of new measures for the evaluation of models that take into account all the perspectives linked to different backgrounds.

The research adopting perspectivism in NLP grows year by year (for a recent review, see [1]) and one main concern is the labelling bias introduced by the cultural background of annotators [13, 14]. In recent research, Sap and colleagues [5] have shown strong associations between annotator identity and beliefs and their ratings of toxicity. Specifically, their results show that more conservative annotators and those who scored highly on a racist beliefs scale were less likely to rate anti-black language as toxic. Closer to our research questions is the work of Akhtar et al. [15, 16], which leverages different opinions emerging from groups of annotators with the goal of studying how polarised instances affect the performance of the classifiers. Considering binary classification tasks, they introduce a novel measure of the polarisation of opinions able to identify which instances in a dataset are more controversial. In a pilot study about xenophobia arguments in the context of Brexit, the annotation process was organised to contrast the annotation done by three people with an immigrant background (target group) with that of three people with a mainstream background as a control group. Using their polarisation index, the authors show how in several tweets, all the members of the target group (immigrants) marked the message as racist and hateful, while the members of the control group marked it as conveying no hate or racism. It is interesting to note that they only found a few tweets (1.13%) on which all the annotators agreed that they contained hateful messages. Implicitly, in this work the authors assume, similar to our perspective, that the nature of the disagreement is social and sustained by a social conflict, but they do not provide any empirical measure of annotators’ attitudes. Their results suggest that consensus-based methods to create gold standard data are not necessarily the best choice when dealing with what they call highly subjective phenomena and we consider opinion tasks.

2.2. Attitudes and behaviour relation

In binary classification tasks, annotating a corpus is a behaviour more than the expression of an opinion. The annotators will use their attitudes and beliefs to decide, but it is hard to expect that attitudes predict perfectly this behaviour. Attitudes influence behaviour, as we have already seen in the work of [5], but the attitude-behaviour relation is not a settled question in the social-psychology literature (for a classical review, see [17]). For example, Donald Campbell [18], in the sixties, argued that people who hold negative attitudes toward minorities may be reluctant to express their attitudes through public behaviour because norms of tolerance and politeness were typically held in American society. Things have changed a lot regarding the open expression of hate towards minorities, which is why The New York Times published, in 2019, an editorial with the suggestive headline of “Racism Comes Out of the Closet”³.

Not only does agreeing with social norms and situational constraints explain the inconsistencies between attitudes and behaviours, but there are also specific domains, such as humour, that significantly facilitate these kinds of inconsistencies. Often, some groups use humour to avoid moral judgement that penalises discrimination. Offensive people find support from a majority who consider that some messages are "only" jokes. When a society begins to overcome its prejudices towards certain social groups, we can observe that humour becomes a space in which these prejudiced attitudes are maintained. In fact, when we examine offensive jokes, we find they are mainly related to some social minorities [19]. These inconsistencies between attitudes and the behaviour of the annotators could also be a symptom of changes or resistances of subjects and capture the evolution of some opinion groups in controversial debates.

2.3. The Hostile Neosexism

Traditionally, sexism [20] has been viewed as the holding of discriminatory attitudes toward women, both manifest and subtle. This distinction in the tone of sexism was proposed by the ambivalent sexism theory [21, 22]. It was developed to account for a sort of evolution from a hostile component of sexism (overtly negative attitudes towards women) to a benevolent component (attitudes towards women that seem subjectively positive but are actually discriminatory). These two components differ in tone but are positively correlated and work together to perpetuate gender inequalities (for a recent review, see [23]). Also related to the evolution of sexism is the concept of neosexism [24] or modern sexism [25]. Like modern racism, modern sexism is characterised by the denial of continued discrimination, antagonism toward women’s demands, and lack of support for policies designed to improve women’s position in society.

In a recent review on ambivalent sexism, Barreto and Doyle [23] point out future directions in the study of sexism due to the rapid developments in societal norms and attitudes towards sex, gender, and sexuality across many countries. Surprisingly, despite an important amount of research noting a rise in the number of men with a self-proclaimed anti-feminist agenda [26, 27, 28], these authors do not propose, as future work, investigating the link between hostile sexism and anti-feminist attitudes. Going deeper into the interaction between hostile sexism and anti-feminist attitudes seems relevant because a new kind of strong hostility towards women uses anti-feminist frames, but also supports certain feminist policies, such as equality [29]. This new latent attitude, which we denominate Hostile neosexism, is difficult to capture with old scales of attitudes towards feminism, such as the one developed by Smith in the seventies [30], because most of the items of this instrument fit with the feminist values that this new Hostile neosexism seems to support. It also seems to fall outside the scope of the whole ambivalent sexism inventory [21], which does not pay specific attention to feminism itself. Regarding the modern sexism scale [25] or the Neosexism scale [24], we argue that Hostile neosexism presents a high degree of hostility against women that the previous scales do not capture⁴. The core of this Hostile neosexism attitude is the claim that societal changes driven by the feminist movement are inherently unfair and put men as a group in a disadvantageous position. Although the hostile sexism subscale [21] was primarily driven by the idea that men’s dominance over women is both appropriate and desirable, some items of this subscale connect well with the idea that nowadays there is no reason for feminist demands and that the feminist movement overreacts (see items 3, 4 and 5 in Section 3.2.1).

¹ https://pdai.info/
² For the tasks classification, we have kept the original acronyms from the French version.

³ https://www.nytimes.com/2019/07/15/opinion/trump-twitterracist.html
⁴ Authors are currently conducting research to test the need for this new instrument and validate a longer version of the scale.

3. Study Design

3.1. Data

To carry out our study, we relied on a manually selected set of 210 jokes, conveying prejudice against women, from the corpus proposed in the shared task HUrtful HUmour (HUHU): Detection of Humour Spreading Prejudice in Twitter at IberLEF 2023 [31]. This dataset offers a gold standard corpus of tweets in Spanish containing prejudice against four minorities: women, the LGBTIQ community, immigrants and racially discriminated people, and overweight people. During the annotation process of the HUHU dataset each instance was assessed for the presence of humour and prejudice by 3 annotators. The criterion used for annotation was based on the relative majority agreement of the annotators, with a threshold of 2 out of 3. For the present study, we select jokes that convey different kinds of prejudice against women. We have classified the 210 jokes into 5 categories with the aim of describing the content of the dataset and providing some examples.

3.2. Participants and procedure

A total of 76 students of psychology (76.3% women and 23.7% men) took part in the experiments as an activity of a practical workshop in the first year of the degree. The activity was done in silence without any other distractions and took two hours. Students were assigned a secret number to preserve anonymity and accessed an Excel document to label the jokes. Annotation task 1 consisted in reading the 210 jokes and classifying them as sexist (containing a prejudice against women) or not. The annotators also had to say whether the text contains humour or not (task 2) and how offensive the prejudice was (task 3) on an ordinal scale (0=not at all, 1=slightly, 2=somewhat, 3=very much). After completing the annotation task, using the secret number, students responded to a questionnaire containing the Hostile neosexism scale and a question about their ideology.

3.2.1. Annotators attitudes and ideology

To measure annotators’ attitudes in Hostile Neosexism, we created a short scale that we denominate Brief Hostile Neosexism Scale. It is composed of six items: three of them (4 to 6) are part of the Hostile Sexism subscale of the Ambivalent Sexism Scale from Glick and Fiske [32] and the other three (1 to 3) are new items that we created ad hoc to measure anti-feminist attitude:

1. Some of the demands of the feminist movement seem to me to be a bit exaggerated.
2. I sometimes feel that our society pays too much attention to the rights of certain minorities.
3. In the name of equality, many women try to gain certain privileges.
4. Many women interpret innocent comments and actions as sexist.
5. Women are easily offended.
6. Women exaggerate the problems they suffer because they are women.

⁵ Data are public in https://github.com/Bertachulvi/ECAI2023
⁶ https://allea.org/code-of-conduct/

As discussed in the Introduction, our research aims to evaluate the influence of attitudes on the annotation process and the relation between attitudes and behaviour. To derive annotators’ latent attitude and behaviour, we exploit an Item Factor Analytic approach, which constitutes an extension of classical linear factor analysis and is particularly suitable for addressing categorical variables. Specifically, within the framework of Item Response Theory (IRT) [35], we adopt the two-parameter normal ogive (2PNO) formulation [36]

$$P(Y_{ij} = c \mid \theta_i, \alpha_j, \boldsymbol{\gamma}_j) = \Phi(\alpha_j \theta_i - \gamma_{j,c}) - \Phi(\alpha_j \theta_i - \gamma_{j,c+1}) \qquad (1)$$

where $\Phi(\cdot)$ is the normal cumulative function. Here the probability of observing a given category $c = 1, \ldots, C$, for unit $i = 1, \ldots, N$ and item $j = 1, \ldots, J$, is modelled in terms of the latent trait $\theta_i$, the factor loading $\alpha_j$ and a vector of ordered thresholds $\boldsymbol{\gamma}_j$. To estimate the model parameters, we embrace a fully Bayesian approach that incorporates the handling of missing values [37].

We are also interested in measuring inter-rater agreement in the task of annotating sexism. As expected, because our data come from the HUHU dataset, we have observed that in the binary annotation scheme, most of the texts are categorised as jokes conveying prejudice against women, with 81% of the annotations falling into this category. This skewed distribution of data leads to a low level of agreement among different raters when using traditional inter-rater agreement measures such as Fleiss’ κ or Krippendorff’s α. This discrepancy arises from the paradoxical situation where the observed agreement appears to be very high, while the chance-corrected agreement is actually low [38]. To address this issue, we employ Gwet’s AC1 measure of inter-rater agreement [39], which utilises a probabilistic model of agreement [40]. This approach estimates the difficulty levels of the items within the corpus through probabilistic inference and then estimates the probability of chance agreement separately for easy and hard items. This probabilistic modelling approach helps mitigate the impact of the skewed data distribution on the agreement assessment process.

5. Results

5.1. Do more annotators produce more disagreement?

To test hypothesis 1, which considers disagreement as a social phenomenon and not at the individual level, we need to investigate the influence of the number of annotators on inter-rater agreement. To do so, we randomly selected samples without replacement from the population of 76 annotators, with sample sizes ranging from 3 to 45. To ensure statistical robustness, 10,000 iterations were performed for each sample size. The results of this analysis are presented in Figure 2. In particular, Figure 2(a) depicts the mean and 95% confidence interval for each sample size. To determine the optimal annotator sample size that leads to stabilisation in the variability of Gwet’s AC1 coefficient, the knee-point method was employed [41]. This method is commonly used to identify the point at which a graph exhibits a significant change in slope. In this study, the knee-point method was applied to the amplitude of the confidence intervals (see Figure 2(b)). Through the application of the knee-point method, an annotator sample size of n = 12 was determined to be the point of stabilisation for AC1 variability, indicating that further increases in the number of annotators do not yield significant modification in agreement [RQ1].

Figure 2: Simulation results: (a) Mean and 95% confidence interval of Gwet’s AC1 coefficient; (b) Amplitude of the 95% confidence interval of Gwet’s AC1 coefficient and knee-point (n = 12). Horizontal axis in both panels: annotator sample size.

5.2. How do attitudes affect the agreement among annotators?

A Bayesian exploratory IRT analysis was employed, following the approach described in [42], in order to evaluate the construct validity of the scale outlined in Section 3.2.1. The results of the analysis indicated that the scale exhibits unidimensionality, supporting its validity as a measurement tool for the intended construct. Therefore, a unidimensional 2PNO model (Equation 1) was exploited to estimate the Hostile neosexism attitude of the annotators, taking into account the influence of their gender and ideology as relevant features. The estimated values for the model parameters can be found in Table 1. The factor loadings indicate the weight of the corresponding items in the derivation of the latent trait scores, while the location values give insights on the level of consolidation of the corresponding Hostile neosexism attitude: lower values correspond to a belief that gains more support in our sample [43]. As for the regression parameter estimates, the only covariate that seems to significantly impact the Hostile neosexism attitude is endorsing right ideology.

To assess the influence of the Hostile neosexism attitude on the level of agreement, we contrast the inter-rater agreement among n = 12 annotators in three subgroups: a homogeneous group with the lowest scores on the Hostile neosexism attitude, a homogeneous group with the highest scores, and a mixed group with six annotators positioned at the lower end of the Hostile neosexism attitude and six annotators positioned at the higher end. The observed and expected agreements and the Gwet’s AC1 coefficients for all the 76 annotators and for the 3 subgroups are displayed in Table 2. The results demonstrate a clear distinction in the level of agreement among the annotators with lower Hostile neosexism attitude compared to the other groups. On the other hand, the agreement within the mixed group is similar to that observed in the overall population of annotators, indicating a comparable level of consensus among individuals with varying levels of Hostile neosexism attitude.
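As a reading aid, the agreement coefficient and the subsampling procedure of Section 5.1 can be sketched in Python. This is a minimal sketch and not the authors' implementation: it assumes the standard multi-rater form of Gwet's AC1 for binary labels with no missing values, and it takes the 95% interval to be the empirical 2.5th to 97.5th percentile range. The function names `gwet_ac1` and `ci_amplitude` and the synthetic label matrix are hypothetical.

```python
import numpy as np


def gwet_ac1(labels):
    """Gwet's AC1 for an (n_items, n_raters) matrix of 0/1 labels.

    Assumes complete data and at least two raters.
    """
    labels = np.asarray(labels)
    n_raters = labels.shape[1]
    ones = labels.sum(axis=1)          # raters choosing category 1, per item
    zeros = n_raters - ones            # raters choosing category 0, per item
    pairs = n_raters * (n_raters - 1)  # ordered rater pairs per item
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_a = np.mean((ones * (ones - 1) + zeros * (zeros - 1)) / pairs)
    # Chance agreement for two categories: 2 * pi * (1 - pi).
    pi = np.mean(ones / n_raters)
    p_e = 2 * pi * (1 - pi)
    return (p_a - p_e) / (1 - p_e)


def ci_amplitude(labels, size, n_iter=1000, seed=0):
    """Width of the empirical 95% interval of AC1 over random annotator subsets."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    n_raters = labels.shape[1]
    values = [
        gwet_ac1(labels[:, rng.choice(n_raters, size=size, replace=False)])
        for _ in range(n_iter)
    ]
    low, high = np.percentile(values, [2.5, 97.5])
    return high - low


if __name__ == "__main__":
    # Synthetic stand-in for the 210 x 76 annotation matrix (about 81% positive).
    rng = np.random.default_rng(42)
    demo = (rng.random((210, 76)) < 0.81).astype(int)
    for size in (3, 12, 45):
        print(size, round(ci_amplitude(demo, size, n_iter=200), 3))
```

Plotting `ci_amplitude` against the sample size gives a curve of the kind shown in Figure 2(b), on which a knee-point rule can then be applied.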

We develop a second sub-sampling strategy to test the influence of attitudes on the level of agreement. A simulation was conducted with a sample size of n = 12, and the sample units were randomly selected from sub-populations characterised by scores on the latent trait below the first quartile (Low Hostile Neosexism), above the third quartile (High Hostile Neosexism), and evenly distributed between the two sub-populations (Mixed Hostile Neosexism). From each group, we selected 10,000 samples without replacement. The findings (see Figure 3) provide further evidence of the influence of attitude on the level of agreement in the annotation process.

⁷ We use the adjective classical here because 77% of the jokes refer to traditional misogynistic stereotypes that present women as dumb, body-centred, gossipy, incomprehensible for men or malicious.
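The quartile-based sub-sampling can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the paper: the latent scores and the label matrix are synthetic, agreement is computed with the standard binary multi-rater AC1 formula, and the names `gwet_ac1` and `subgroup_agreement` are hypothetical.

```python
import numpy as np


def gwet_ac1(labels):
    """Binary multi-rater Gwet's AC1 (complete data assumed)."""
    labels = np.asarray(labels)
    r = labels.shape[1]
    ones = labels.sum(axis=1)
    zeros = r - ones
    p_a = np.mean((ones * (ones - 1) + zeros * (zeros - 1)) / (r * (r - 1)))
    pi = np.mean(ones / r)
    p_e = 2 * pi * (1 - pi)
    return (p_a - p_e) / (1 - p_e)


def subgroup_agreement(labels, scores, n=12, n_draws=1000, seed=0):
    """AC1 over random teams of n annotators drawn from low, high and mixed pools."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    q1, q3 = np.quantile(scores, [0.25, 0.75])
    low = np.flatnonzero(scores < q1)    # annotators below the first quartile
    high = np.flatnonzero(scores > q3)   # annotators above the third quartile
    half = n // 2
    out = {"low": [], "high": [], "mixed": []}
    for _ in range(n_draws):
        teams = {
            "low": rng.choice(low, size=n, replace=False),
            "high": rng.choice(high, size=n, replace=False),
            "mixed": np.concatenate([
                rng.choice(low, size=half, replace=False),
                rng.choice(high, size=n - half, replace=False),
            ]),
        }
        for name, idx in teams.items():
            out[name].append(gwet_ac1(labels[:, idx]))
    return {name: np.array(vals) for name, vals in out.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(7)
    scores = rng.normal(size=76)                         # synthetic latent scores
    labels = (rng.random((210, 76)) < 0.81).astype(int)  # synthetic annotations
    for name, vals in subgroup_agreement(labels, scores, n_draws=200).items():
        print(name, round(vals.mean(), 3))
```

With 76 annotators and no tied scores, each quartile pool holds 19 annotators, so teams of 12 can be drawn without replacement from either pool.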

Following the two strategies, we find that the level of agreement decreases not only in the Mixed Hostile Neosexism group but also in the High Hostile Neosexism group. The decline in agreement among mixed groups is understandable, but it would not be expected among homogeneous groups high in Hostile Neosexism. We therefore turn to the inconsistency between attitude and behaviour discussed in Section 2.2.

5.3. Are attitudes consistent with the annotators’ behaviour?

An alternative approach based on IRT models, as proposed in [44], can be employed to gain insights into consistency in annotators’ behaviour across the 210 tweets, specifically regarding their ability to recognise instances of sexism in the jokes. This alternative formulation of the IRT model deviates from the traditional approach by treating the annotators as items, allowing the threshold parameter in the binary annotation task to be interpreted in terms of the level of difficulty in recognising the presence of classical sexist content in jokes⁷. We denominate this variable Sexism Recognition Shortcoming because all the texts come from a dataset that expresses sexism, but we do not interpret these recognition problems as a lack of skill, but rather as the expression of an opinion. As the pragmatics of communication emphasises, every behaviour is a communication act, even silence [45].

As depicted in Figure 4, there is evidence of a positive correlation between the Hostile Neosexism attitude of the annotators and their Sexism Recognition Shortcoming behaviour, reinforcing the idea that attitude and behaviour are connected. However, the intriguing result is that the strength of this association is relatively modest, as indicated by the Pearson’s correlation coefficient (r = 0.234). This suggests that the impact of attitude on the behaviour of identifying the presence of sexist content is somewhat limited and we need to introduce a more complex view to identify the different perspectives.

Figure 4: Hostile Neosexism attitude of the annotators against their Sexism Recognition Shortcoming, with a linear fit.

To further explore the relationship between attitude and behaviour, we classified the annotators into four groups based on their positioning relative to the means of the two identified variables: Hostile Neosexism attitude and Sexism Recognition Shortcoming (see Table 3). As we can see, the most numerous are the consistent groups: low-low (34%) or high-high (27%). However, the number of individuals exhibiting annotation behaviour inconsistent with expressed attitudes (22.4% and 15.8%) is not negligible.

Table 3: Groups’ composition according to Hostile Neosexism attitude and Sexism Recognition Shortcoming

                      Sexism Recognition Shortcoming
Hostile Neosexism     Low            High
Low                   34.2% (26)     22.4% (17)
High                  15.8% (12)     27.6% (21)

This grouping allows for a more nuanced examination of how different positions on the attitude and behaviour dimensions may be related to some annotators’ characteristics. Table 4 provides the percentage composition of the identified groups in terms of gender and ideology. The chi-square test of independence leads us to conclude that there is a significant association between those characteristics and the group identified along the sexist latent traits (gender: p-value 0.0018; ideology: p-value 0.0014). As we can see, the expected results on the impact of gender and ideology shown in Table 4 are especially manifest in the consistent groups. The left is the majority in the Low-Low group, and the right in the High-High group. The novelty is that we can mostly link the inconsistencies with the moderate left. This group finds different partners in the inconsistent behaviour: the left in the low Hostile Neosexism-high Sexism Recognition Shortcoming (Low-High) group and the right in the high Hostile Neosexism-low Sexism Recognition Shortcoming (High-Low) group.

Table 4: Percentage composition of the identified groups in terms of gender and ideology. High-High row: 47.6%, 52.4%, 19.0%, 9.5%, 28.6%, 42.9%.

With the inclusion of two supplementary annotation tasks as outlined in Section 3.2, we can assess whether the inconsistencies among annotators are related to the perception of humour in tweets or to their judgement of the level of offensiveness associated with each text. To this end, we used a procedure similar to the one described in Section 5.2 in order to derive annotators’ scores on the latent dimensions of Humour recognition and Degree of offensiveness. Figure 5 shows the distribution of the estimated scores for the recognition of humorous content and for the evaluation of the degree of offensiveness across the four annotators’ groups.

In Figure 5, we can see that the inconsistency between attitudes and behaviour in the case of individuals with Low Hostile Neosexism attitude but High Sexism Recognition Shortcoming relies on a higher recognition of the text as humorous. This inconsistency supports the implicit and extended assumption that humour does less hurt. This group is also the one that rates tweets as less offensive. In this group, the left and the moderate left represent 82.4% of the total. Humour recognition also plays a role in the other inconsistent group, the individuals with High Hostile Neosexism attitude but Low Sexism Recognition Shortcoming, where moderate left and right sum to 66.6%. We believe that this group, with its inconsistency, is expressing that annotators embrace Hostile Neosexism, which targets the feminist movement as overreacting, but recognise well the classical sexism expressed in 77% of the jokes. For this interpretation, it is important to take into account that our data mostly fit categories that express classical prejudices and stereotypes against women (see Section 3.1). The position of the two consistent groups (Low-Low and High-High) seems coherent: for different reasons, some because jokes contain prejudice (Low-Low), others because maybe they think jokes describe reality well (High-High), both find the tweets less humorous, but they differ in the degree of offensiveness. As expected, for the High-High group tweets are less offensive than for the Low-Low group. These results lead us to affirm that perspectives are expressed through a combination of attitudes and behaviours.
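The four-group classification used in this section, which splits annotators at the mean of each latent variable, can be sketched as follows. This is an illustrative sketch under stated assumptions: the scores are synthetic, ties at the mean are assigned to the "Low" side, and the function names `classify_groups` and `composition` are hypothetical.

```python
import numpy as np

GROUP_NAMES = np.array(["Low-Low", "Low-High", "High-Low", "High-High"])


def classify_groups(attitude, shortcoming):
    """Label each annotator as <attitude level>-<behaviour level>.

    The first term refers to the Hostile Neosexism attitude and the second to
    the Sexism Recognition Shortcoming, each split at its sample mean.
    """
    attitude = np.asarray(attitude, dtype=float)
    shortcoming = np.asarray(shortcoming, dtype=float)
    att_high = (attitude > attitude.mean()).astype(int)
    beh_high = (shortcoming > shortcoming.mean()).astype(int)
    return GROUP_NAMES[2 * att_high + beh_high]


def composition(groups):
    """Percentage of annotators in each group that occurs in `groups`."""
    names, counts = np.unique(groups, return_counts=True)
    return {str(name): 100.0 * count / counts.sum()
            for name, count in zip(names, counts)}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attitude = rng.normal(size=76)                      # synthetic attitude scores
    shortcoming = 0.2 * attitude + rng.normal(size=76)  # weakly related behaviour
    groups = classify_groups(attitude, shortcoming)
    print(composition(groups))
    print("Pearson r:", round(np.corrcoef(attitude, shortcoming)[0, 1], 3))
```

A weak positive correlation between the two scores, as in the synthetic example, still leaves a sizeable share of annotators in the two inconsistent groups, which is the pattern the section describes.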

5.4. Agreement and perspectives

In this section, we explore whether the agreement changes depending on individuals’ attitudes and consistent or inconsistent behaviour [H2]. As we see in Table 5, individuals with a similar attitude, Low Hostile Neosexism, exhibit very different inter-rater agreement (0.83 > 0.37) depending on the consistency between attitudes and behaviour. The same occurs with the opposite attitude: individuals with High Hostile Neosexism exhibit very different inter-rater agreement (0.82 > 0.49) depending on the consistency between attitudes and behaviour.

We cannot conclude that inconsistent behaviour reduces agreement because, in the Low Hostile Neosexism group, high agreement occurs in the consistent subgroup, while in the High Hostile Neosexism group, it occurs in the inconsistent subgroup. As we argue in Section 5.3, individuals communicate their opinions not only through attitude expression but also through behaviour, as the pragmatics of communication asserts [45]. In this regard, we interpret high inter-rater agreement as the identification of a clear social position and low inter-rater agreement as the existence of a changing social position. By changing social position we mean a process in which individuals do not find a clear indication in the social realm about which action is expected from them in the given context. The interpretation of the different perspectives must therefore focus on identifying which kind of consensus or conflict causes the respective high or low agreement. We do not think that different perspectives must be matched with different groups showing strong agreement, because groups that are not polarised on a particular issue could exhibit a low level of agreement (in line with what [15] propose). Such a group might also express a different perspective as a way to approach a controversial issue even if there is no polarised position, because this lack of polarisation is what defines the group. Moreover, we need to consider controversial issues dynamically, and it is then reasonable to think that new perspectives, or changing ones, will register low levels of agreement because they reflect a social position that is being formed or one that is in crisis. Our interpretation of the different perspectives that we find in our data, taking into account the nature of the task of labelling a corpus that consists entirely of sexist jokes, is the following:

1. Low-Low group: People who strongly support the modern feminist movement (Low Hostile Neosexism) and who do not find classical sexist jokes funny (Low Sexism Recognition Shortcomings). It is a clear social position in sociological terms, and so we find high agreement (Gwet's AC1 = 0.838).
2. Low-High group: People who support the modern feminist movement (Low Hostile Neosexism) but still find classical sexist jokes funny (High Sexism Recognition Shortcomings). It is a changing social position in sociological terms because the mainstream message is that this humour is not funny, and so we find lower agreement (Gwet's AC1 = 0.37).
3. High-Low group: People who do not support the modern feminist movement (they think that some feminists overreact) but support the old feminist movement (the one that emphasises equality) and are able to recognise offensiveness in the sexist jokes. This is a clear social position because it fits with 20th-century feminism, and so we find a high level of agreement (Gwet's AC1 = 0.829).
4. High-High group: People who represent a new phenomenon that we have labelled Hostile Neosexism. They manifest a strong hostility to the modern feminist movement that could lead to a failure to recognise classical sexist jokes as such; that is, it can endanger the achievements of the equality movement during the 20th century. This is a new social position, and so we find a low level of agreement (Gwet's AC1 = 0.49).
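For readers who want to reproduce agreement figures of this kind, the following is a minimal sketch of Gwet's AC1 for several annotators and nominal labels, using the standard multi-rater formulation (observed pairwise agreement corrected by a chance term based on average category shares). The function name, the input layout, and the toy labels are assumptions made for illustration; they are not taken from the study's data or code, and every item is assumed to carry at least two ratings.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for nominal labels.

    ratings: one list per item, holding the labels given by the annotators
             who rated that item (hypothetical layout).
    categories: all possible labels (at least two).
    """
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two categories")
    n_items = len(ratings)
    agreement_sum = 0.0
    share_sum = {c: 0.0 for c in categories}
    for item in ratings:
        r = len(item)
        if r < 2:
            raise ValueError("each item needs at least two ratings")
        counts = Counter(item)
        # Share of agreeing annotator pairs for this item.
        agreement_sum += sum(c * (c - 1) for c in counts.values()) / (r * (r - 1))
        # Share of annotators choosing each category for this item.
        for cat in categories:
            share_sum[cat] += counts.get(cat, 0) / r
    p_a = agreement_sum / n_items                      # observed agreement
    pi = [share_sum[c] / n_items for c in categories]  # mean category shares
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)       # chance agreement
    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Toy example: two items, three annotators, binary labels.
    toy = [[1, 1, 0], [1, 1, 1]]
    print(round(gwet_ac1(toy, categories=[0, 1]), 3))
```

Applied separately to the ratings of each attitude-behaviour subgroup, a function of this shape would yield one coefficient per group, which is the kind of comparison reported above.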

Beyond this interpretation of the various perspectives, we believe that multiple perspectives should be present in an ideal team of annotators. The next research question concerns the ideal group size needed to include all of them, based on our data.

5.5. Size of the group and perspectives

Assuming the composition of the annotators' population detailed in Table 3, our objective in this section is to investigate the sample size required to ensure the inclusion of all diverse perspectives within an annotator team [RQ2]. To achieve this, we randomly selected, with replacement, 100 samples from the original population for each sample size in the range 2 to 45. The representativeness of each sample with respect to the composition of the original population was assessed using the Frobenius distance between the original and the sample composition. The knee-point method was employed to identify the optimal sample size, meaning the sample size that guarantees a minimal distance between the sample and the population composition in terms of the proportion of annotators belonging to the four identified groups. To ensure the robustness of our findings, we repeated the simulation procedure 1000 times, resulting in an empirical distribution of the optimal sample size across the repetitions (see Table 6). From the results, we can conclude that, for our study, a sample size ranging from 10 to 12 will most likely guarantee a fair representation of the different perspectives in the annotators' team.
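The resampling procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the group counts in `HYPOTHETICAL_COUNTS` are invented placeholders (Table 3 is not reproduced here), the knee is located with a simple "farthest point from the chord" heuristic standing in for the unspecified knee-point method, and the number of repetitions is kept small so the sketch runs quickly. For composition vectors, the Frobenius distance reduces to the Euclidean norm of the difference.

```python
import math
import random
from collections import Counter

GROUPS = ["Low-Low", "Low-High", "High-Low", "High-High"]
# Hypothetical group sizes, for illustration only (not the values in Table 3).
HYPOTHETICAL_COUNTS = {"Low-Low": 30, "Low-High": 12, "High-Low": 14, "High-High": 20}


def composition(labels):
    """Proportion of annotators in each of the four groups."""
    counts = Counter(labels)
    return [counts.get(g, 0) / len(labels) for g in GROUPS]


def distance(p, q):
    """Frobenius (here: Euclidean) distance between two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_distance_curve(population, sizes, n_samples, rng):
    """Mean distance to the population composition for each sample size."""
    target = composition(population)
    curve = []
    for size in sizes:
        total = 0.0
        for _ in range(n_samples):
            sample = rng.choices(population, k=size)  # sampling with replacement
            total += distance(composition(sample), target)
        curve.append(total / n_samples)
    return curve


def knee_point(sizes, curve):
    """Size whose point lies farthest from the chord joining the curve's ends.

    Both axes are rescaled to [0, 1] first. This is a stand-in heuristic.
    """
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return sizes[0]
    xs = [(s - sizes[0]) / (sizes[-1] - sizes[0]) for s in sizes]
    ys = [(d - lo) / (hi - lo) for d in curve]
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = math.hypot(dx, dy)
    gaps = [abs(dx * (ys[0] - y) - (xs[0] - x) * dy) / norm for x, y in zip(xs, ys)]
    return sizes[gaps.index(max(gaps))]


if __name__ == "__main__":
    rng = random.Random(0)
    population = [g for g, n in HYPOTHETICAL_COUNTS.items() for _ in range(n)]
    sizes = list(range(2, 46))
    # The procedure described above uses 1000 repetitions; 10 keeps this demo fast.
    optimal = Counter(
        knee_point(sizes, mean_distance_curve(population, sizes, 100, rng))
        for _ in range(10)
    )
    print(sorted(optimal.items()))
```

Tabulating the knee over repetitions gives an empirical distribution of the optimal size, analogous in form to the one summarised in Table 6; the numbers produced by this sketch depend entirely on the placeholder counts and the chosen heuristic.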

6. Conclusion and limitations

In this paper, we presented a methodology that approaches several common problems that arise when we intend to translate the perspectivism paradigm into a coherent annotation strategy. We tested H1, and our results in Section 5.1 suggest that the nature of the disagreement in the annotation is social and not individual because, from a certain point, it does not increase by adding more individuals. We applied a social psychology-grounded taxonomy for classifying tasks that could be helpful for dealing with what, in NLP research, is referred to as a subjective task. We also verified that different perspectives arise not only from attitudes but also from the consistent or inconsistent behaviour of the annotators with respect to these attitudes. We find this important because it shows that we cannot assume that we will include all perspectives in a dataset by relying only on attitude or biographical differences. We also argue that these inconsistencies are valuable information about how controversial issues evolve in social debate. We propose that perspectives are a combination of attitudes and behaviour. Finally, we estimated the size of the group needed to include all the perspectives detected in our data.

Several limitations of this work must be considered. First, the annotator team is composed of psychology students, but even within this homogeneous group, we have seen that different perspectives arise. Also, we chose to work with a dataset containing only sexist jokes, because we wanted to avoid the diversity coming from the data and to concentrate on annotators' perspectives, but a deeper analysis of the text would give us more insights and a more complex view. The most challenging future work is to translate the knowledge obtained in this research into a feasible methodology for including all perspectives in an annotation plan, which might need to proceed in three steps when creating the corpus: (i) a first exploratory step that identifies perspectives and how these perspectives are reflected in the data, (ii) a second step to ensure the representativeness of the data in terms of perspectivism, and (iii) a final step that checks whether, at the end of the annotation procedure, the data reflect all the perspectives.

Acknowledgments

Berta Chulvi and Paolo Rosso are supported by FairTransNLP-Stereotypes PID2021–124361OB-C31 funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe. The work of Roberto Labadie was supported by valgrAI - Valencian Graduate School and Research Network of Artificial Intelligence and the Generalitat Valenciana. Lara Fontanella is supported by the ICOMIC (Identifying and Counteracting Online Misogyny in Cyberspace) Project funded by EU Next Generation, MUR-Fondo Promozione e Sviluppo-DM 737/2021.

References

[1] A. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, Learning from Disagreement: A Survey, Journal of Artificial Intelligence Research 72 (2021) 1385–1470.
[2] S. E. Asch, Studies of independence and conformity: I. A minority of one against a unanimous majority, Psychological Monographs: General and Applied 70 (1956) 1–70. doi:10.1007/s11135-022-01494-7.
[3] J. D. Campbell, P. J. Fairey, Informational and normative routes to conformity: The effect of faction size as a function of norm extremity and attention to the stimulus, Journal of Personality and Social Psychology 57 (1989) 457–468. doi:10.1037/0022-3514.57.3.457.
[4] R. Spears, Social Influence and Group Identity, Annual Review of Psychology 72 (2021) 367–390. doi:10.1146/annurev-psych-070620-111818.
[5] M. Sap, S. Swayamdipta, L. Vianna, X. Zhou, Y. Choi, N. A. Smith, Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 5884–5906. URL: https://aclanthology.org/2022.naacl-main.431. doi:10.18653/v1/2022.naacl-main.431.
[6] R. T. LaPiere, Attitudes vs. actions, Social forces 13 (1934) 230–237.
[7] I. Ajzen, M. Fishbein, Attitude-behavior relations: A theoretical analysis and review of empirical research, Psychological bulletin 84 (1977) 888.
[8] A. W. Kruglanski, K. Jasko, M. Milyavsky, M. Chernikova, D. Webber, A. Pierro, D. Di Santo, Cognitive consistency theory in social psychology: A paradigm reconsidered, Psychological Inquiry 29 (2018) 45–59.
[9] J. Cooper, Cognitive Dissonance: Where We've Been and Where We're Going, International Review of Social Psychology (2019). doi:10.5334/irsp.277.
[10] J. A. Pérez, G. Mugny, Influences sociales : la théorie de l'élaboration du conflit, 1993.
[11] V. Basile, It's the End of the Gold Standard as we Know it. On the Impact of Pre-aggregation on the Evaluation of Highly Subjective Tasks, in: DP@AI*IA, 2020.
[12] V. Basile, T. Caselli, A. Balahur, L. Ku, Editorial: Bias, subjectivity and perspectives in natural language processing, Frontiers in Artificial Intelligence 5 (2022). doi:10.3389/frai.2022.926435.
[13] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The Risk of Racial Bias in Hate Speech Detection, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1668–1678. doi:10.18653/v1/P19-1163.
[14] Z. Waseem, Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter, in: Proceedings of the First Workshop on NLP and Computational Social Science, Association for Computational Linguistics, Austin, Texas, 2016, pp. 138–142. doi:10.18653/v1/W16-5618.
[15] S. Akhtar, V. Basile, V. Patti, A New Measure of Polarization in the Annotation of Hate Speech, in: M. Alviano, G. Greco, F. Scarcello (Eds.), AI*IA 2019 – Advances in Artificial Intelligence, Springer International Publishing, Cham, 2019, pp. 588–603.
[16] S. Akhtar, V. Basile, V. Patti, Whose Opinions Matter? Perspective-aware Models to Identify Opinions of Hate Speech Victims in Abusive Language Detection, 2021. URL: arXiv:2106.15896v1 [cs.CL], 30 Jun 2021.
[17] A. H. Eagly, S. Chaiken, The psychology of attitudes, Harcourt Brace Jovanovich College Publishers, 1993, pp. 155–218.
[18] D. T. Campbell, Social attitudes and other acquired behavioral dispositions, in: S. Koch (Ed.), Psychology: A study of a science. Study II. Empirical substructure and relations with other sciences. Vol. 6. Investigations of man as socius: Their place in psychology and the social sciences, McGraw-Hill, 1963.
[19] L. I. Merlo, B. Chulvi, R. Ortega-Bueno, P. Rosso, When humour hurts: linguistic features to foster explainability, Procesamiento del Lenguaje Natural 70 (2023) 85–98.
[20] J. K. Swim, L. L. Hyers, Sexism, in: T. D. Nelson (Ed.), Handbook of prejudice, stereotyping, and discrimination, Psychology Press, 2009, pp. 407–430.
[21] P. Glick, S. T. Fiske, The ambivalent sexism inventory: Differentiating hostile and benevolent sexism, Journal of personality and social psychology 70 (1996) 491.