<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Workshop on Perspectivist Approaches to NLP</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hierarchical Clustering of Label-based Annotator Representations for Mining Perspectives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soda Marem Lo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Turin</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Modeling annotator perspectives has emerged as a technique to model subjective linguistic phenomena more accurately. Authors in the NLP community have approached this issue by creating perspective-aware and personalized models, which require demographic data or previous annotations. In this paper, we explore two methodologies to represent annotators solely on the basis of the labels they assigned: label agreement and Kernel PCA. For these two techniques we computed 5 and 4 clusters respectively, trained perspective-aware models on each of them, and finally implemented majority-vote ensembles. The results show that the clusters obtained by the first mining technique are more balanced and homogeneous in terms of annotators' demographic traits, while those obtained by KPCA tend to correlate more with their nationalities. Despite these differences, both ensemble models outperform the baseline, confirming that leveraging annotations via clustering techniques is advantageous for the classification of a subjective phenomenon such as irony. We argue that this approach can be beneficial for taking annotators' perspectives into account when demographic data are not known, and when their annotations might be influenced by factors other than the given demographics.</p>
      </abstract>
      <kwd-group>
<kwd>Perspectivism</kwd>
        <kwd>clustering</kwd>
        <kwd>irony detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Subjective tasks in Natural Language Processing face the issue of correctly modeling the perception of the humans involved in the process, e.g., when producing the language resources used to train and evaluate models. In recent years, several authors have started to consider the importance of disagreement, criticizing the idea of a single valid truth [1] and examining its potential impact on several aspects of NLP [2]. This observation is fundamental especially for highly subjective tasks, where annotators’ opinions may differ in relation to their cultural and social background, or their personal experiences [3]. To this aim, the perspectivist approach works towards modeling raters’ perspectives, keeping all human labels during the training phase of the classifier [4].</p>
      <p>Authors moving along this paradigm shift have often pointed out the necessity to publish disaggregated [5] and well-documented data, with as much metadata as possible [6]. This information has been used in [7] to build perspective-aware models based on demographic traits such as gender, nationality and generation, which turned out to be more confident in detecting irony than the non-perspectivist ones.</p>
      <p>On the other hand, it is important to note that annotators’ opinions are not necessarily linked to these traits only, especially when considering phenomena where both demographic-dependent aspects, such as cultural background, and culturally shared linguistic expressions can be key elements of their definition. This is what happens with irony, which is influenced by elements such as the origin of the speaker [8, 9] and by linguistic patterns sometimes shared across languages [10].</p>
      <p>Starting from the idea that human labels encode values and possible interpretations of a linguistic phenomenon [6], we want to explore whether annotators’ choices overlap with their demographics, or whether they might be linked to other traits that induce a similarity of opinions despite different backgrounds. Specifically, we mined annotators’ perspectives to see how they group together on the basis of their annotations only. We propose two methods to vectorize annotators leveraging the set of their labels. Then, we trained cluster-based models and built a majority voting ensemble to validate our representation techniques in an in-dataset and a cross-dataset setting. The main contributions of this paper are the following:</p>
      <list list-type="bullet">
        <list-item><p>Two techniques to model annotators as vector representations and automatically cluster them;</p></list-item>
        <list-item><p>A quantitative and qualitative analysis of the automatically predicted clusters of annotators, both in terms of cluster quality and of the mapping between clusters and divisions of annotators based on demographic traits;</p></list-item>
        <list-item><p>Experimental evidence that leveraging an automatic grouping of the annotators in a disaggregated dataset is beneficial for the predictive power of an ensemble of classifiers for irony detection.</p></list-item>
      </list>
      <p>The experiments are conducted on EPIC (English Perspectivist Irony Corpus) [7], a disaggregated corpus for irony detection, described in Section 3. The methods are introduced in Section 4, where the results of the clustering are analysed, and applied to irony detection in Section 5.</p>
      <sec id="sec-1-3">
        <title>2. Related works</title>
        <p>The correlation between annotators’ choices, their demographic traits, beliefs and social backgrounds has become a subject of attention in tasks such as offensive language [11, 12], hate speech [13, 3] and toxicity detection [14]. These works have demonstrated how the identity of the annotators, their social groups and their beliefs can play a role in the annotation phase.</p>
        <p>Taking into account raters’ backgrounds can be of fundamental importance to avoid building machines biased toward the opinions of a majority [4, 15], especially when working on phenomena that cannot be objectively defined.</p>
        <p>The perspectivist approach aims at leveraging disagreement to model annotators’ points of view and culturally-driven perspectives [5]. In [16] the authors grouped annotators by measuring the polarization of their judgments on hate speech content, then created a gold standard for each group to obtain perspective-aware models, eventually including the learned perspectives in an ensemble classifier. Inspired by this work, the authors of [7] implemented perspective-aware models based on annotators’ demographic characteristics, and proposed to evaluate them on the confidence [17] of their predictions. The perspective-aware models turned out to be more confident than the non-perspectivist ones.</p>
        <p>Techniques for modeling annotators’ perspectives have also been developed using personalization methods, recently applied to NLP with the aim of processing diversity among annotators [18] in several subjective tasks, such as offensive content, sense of humor and emotion detection [19, 20, 21], but also in the classification of interpersonal conflict types [22]. This approach does not always rely on demographic data, but also on personal beliefs and opinions obtained from historical posts of the same user [22, 23]. For example, in [21] the authors developed a measure of human bias to model individual human perspectives, i.e., how a user’s perception differs from the others’, obtaining a representation of the subjectivity of each annotator. The authors of [12] propose both a mesoscopic (group-based) and a microscopic (user-based) approach to predict annotators’ beliefs, considering their metadata, the annotator identifier (id), and previous annotations, demonstrating improved classifier performance as users’ information increases. Moreover, they grouped annotators based on their agreement level, to extract social groups and analyze the impact of group profiles on the task of offensive content recognition. Interestingly, when testing the agreement measure on demographic groups, no significant correlation was found, showing that there might be other factors conditioning users’ perceptions of aggressiveness.</p>
        <p>Agreement was already used to mine annotators’ perspectives in [24], where the authors measured label and feature agreement in order to cluster together those who shared a perspective for similar reasons. Influenced by this work, this paper explores how annotators are clustered based on their annotations of ironic content. Thus, we compared two methodologies to mine raters’ opinions, observing whether these choices coincide with their demographic data; finally, we implemented cluster-based models inspired by [16] and [7].</p>
      </sec>
      <sec id="sec-1-4">
        <title>3. Corpus description</title>
        <p>In this section we present the two corpora used for our in-dataset and cross-dataset experiments: EPIC (English Perspectivist Irony Corpus), released by [7], and the corpus used for SemEval-2018 Task 3 "Irony Detection in English Tweets" [25].</p>
        <sec id="sec-1-5">
          <title>3.1. EPIC</title>
          <p>For the in-dataset setting we trained and tested our models on the English Perspectivist Irony Corpus [7, EPIC], a disaggregated corpus consisting of 3,000 ⟨Post, Reply⟩ pairs from Reddit (1,500) and Twitter (1,500), collected across five English-speaking countries: Australia, India, Ireland, the United Kingdom and the United States. For Twitter, the authors used the API geolocation service to identify the five English varieties. For Reddit, they collected data from the following subreddits, assuming the origin of the texts: r/AskReddit (United States), r/CasualUK (United Kingdom), r/britishproblems (United Kingdom), r/australia (Australia), r/ireland (Ireland), r/india (India). The 74 annotators were balanced across both gender and nationality, with a total of ∼15 raters for each of the aforementioned nationalities, who labelled around 200 instances each. Thus, the corpus consists of 14,172 annotations, with a median of 5 annotations per instance.</p>
          <p>The authors collected demographic information about the annotators (gender, age group, nationality, ethnicity, student status and employment status), and used data related to gender (female, male), age (Boomer, Generation X, Generation Y, Generation Z) and nationality (Australian, British, Indian, Irish and US-American) to build 11 demographic-based models, each trained only on the labels provided by one group, and tested on both a demographic-independent aggregated test set and perspective-based test sets. The former, to which we will refer as the gold test set, was obtained by applying a majority voting strategy to the entire corpus. The authors discarded those instances for which a majority was not available, resulting in an aggregated set of 2,767 instances. This set was split into a training set (80%, 440 ironic, 1,331 not ironic) and a test set (20%, 110 ironic and 443 not ironic), thus obtaining the gold test set of 553 instances (246 from Reddit and 307 from Twitter).</p>
          <p>We replicated this methodology to train and test the non-perspectivist (NP) model on this split, as in [7].</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3.2. SemEval-2018 Task 3</title>
        <p>To verify the robustness of our cluster-based models, we tested their performance in a cross-dataset setting on the corpus used for the SemEval-2018 shared task on irony detection [25].</p>
        <p>It consists of 4,792 tweets, collected between December 2014 and January 2015, and annotated by three students in linguistics who spoke English as a second language (other demographic data were not collected). For the shared task, the corpus was randomly split into a training set (1,445 ironic, 1,417 not ironic), a validation set (456 ironic, 499 not ironic) and a test set (784 instances: 311 ironic and 473 not ironic).</p>
        <p>For the experiment in the cross-dataset setting, we tested our models, previously trained on EPIC, on the SemEval-2018 test set.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Mining perspectives</title>
      <sec id="sec-2-1">
        <p>This section introduces the methodology used to automatically compute clusters of annotators. The core of our approach is to vectorize each annotator based on the labels they assigned to each of the 3,000 instances. Given n raters annotating k instances, we obtain an n × k matrix, which we call the label matrix.</p>
        <p>Considering that each ⟨Post, Reply⟩ pair has an average of 4.72 and a median of 5 annotations, each annotator can hold one of three possible values per instance: 0 (not ironic), 1 (ironic), or a missing value. Thus, for each annotator we obtain a vector with the dimensionality of the number of instances k, where the combination of the assigned labels represents the rater’s perspective. Since annotators annotated around 200 instances each, there are at least 2,800 missing values per annotator. For this reason we have chosen two methods to represent the annotators as vectors.</p>
        <p>First representation technique: label agreement (α). We computed a pairwise similarity matrix using Krippendorff’s alpha (α) [<xref ref-type="bibr" rid="ref1">26</xref>] as a metric able to handle missing values in the annotations when estimating how much each pair of annotators agrees with each other.</p>
        <p>Second representation technique: dimensionality reduction (KPCA). We opted for reducing the dimensionality of the label matrix by adopting a nonlinear form of Principal Component Analysis (Kernel PCA) [<xref ref-type="bibr" rid="ref2">27</xref>], then computing the pairwise distance matrix among annotators.</p>
        <p>The two methodologies are explained and discussed in the following sections.</p>
      </sec>
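<p>As an illustration of the label matrix just described, the following minimal sketch builds an n × k matrix with missing values from annotation triples. This is our own toy example, not EPIC data; the function and variable names are ours:</p>
<preformat>
```python
import numpy as np

# Hypothetical sketch: annotations as (annotator, instance, label) triples,
# with 1 = ironic and 0 = not ironic; names and data are ours, not EPIC's.
def build_label_matrix(triples, n_annotators, n_instances):
    """n x k label matrix: one row per annotator, one column per instance;
    np.nan marks the instances an annotator did not label."""
    L = np.full((n_annotators, n_instances), np.nan)
    for annotator, instance, label in triples:
        L[annotator, instance] = label
    return L

L = build_label_matrix([(0, 0, 1), (0, 1, 0), (1, 1, 1)], 2, 3)
# row 0: [1., 0., nan]   row 1: [nan, 1., nan]
```
</preformat>
<p>In EPIC’s case this yields a 74 × 3,000 matrix holding 14,172 labels, so each row contains at least 2,800 missing entries.</p>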
      <sec id="sec-2-2">
        <title>4.1. Label agreement</title>
        <p>Following [24], we measured label agreement in terms of Krippendorff’s α, since it has been developed both to take into account that some agreement can arise by chance (as does the more common Cohen’s Kappa agreement score), and to measure agreement among raters with incomplete annotations, in contrast with the Kappa measures (Cohen’s and Fleiss’) that rely on a complete annotation matrix.</p>
        <p>Considering n annotators labeling k instances, we firstly obtained the label matrix L of size n × k. We used α to compute the pairwise agreement between annotators i and j, resulting in the similarity matrix S of size n × n, computed as S(i, j) = α(L(i, :), L(j, :)). Finally, we obtained a distance matrix D = 1 − S, used as input for the unsupervised clustering algorithms.</p>
        <p>Given the high sparsity of the matrix, and the annotation distribution already discussed in Section 3, we encountered 82 cases in which two annotators did not have any common annotation. Since missing values are not acceptable in agglomerative clustering, we decided to assign α = 0. As a consequence, we assumed no correlation between the two in the clustering phase, relying entirely on the similarities that these annotators might have with other raters. While this is a strong assumption, made for practical reasons, the incidence of such pairs of annotators is very low, i.e., about 1% of all the pairs.</p>
        <p>Moreover, in computing α we encountered a major limitation of the metric itself, already pointed out by Checco et al. [<xref ref-type="bibr" rid="ref3">28</xref>] as a “paradox” that makes systematic agreement less reliable than random guessing. In fact, in 158 cases, although there was perfect agreement between a pair of annotators, the number of samples was not enough for α to be well-defined. In these cases, we relaxed this constraint by setting α = 1 for the sake of the further clustering steps.</p>
      </sec>
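<p>The pairwise α computation, including the two fallbacks just described (α = 0 for annotator pairs with no common instance, α = 1 for perfect but undefined agreement), can be sketched as follows. This is our own minimal two-rater implementation for binary nominal data, not the code used in the paper:</p>
<preformat>
```python
def alpha_pair(x, y):
    """Krippendorff's alpha for two raters on binary labels (None = missing),
    with the fallbacks used in the paper: 0.0 when the raters share no
    instances, 1.0 when agreement is perfect but alpha is undefined
    (zero expected disagreement)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    m = len(pairs)
    if m == 0:
        return 0.0                        # no common annotation
    n = 2 * m                             # total pairable judgments
    disagree = sum(1 for a, b in pairs if a != b)
    Do = 2 * disagree / n                 # observed disagreement
    n1 = sum(v for p in pairs for v in p)
    n0 = n - n1
    De = 2 * n0 * n1 / (n * (n - 1))      # disagreement expected by chance
    if De == 0:
        return 1.0                        # perfect agreement, alpha undefined
    return 1 - Do / De
```
</preformat>
<p>The similarity matrix S is then filled with these pairwise scores and converted to the distance matrix D = 1 − S.</p>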
      <sec id="sec-2-3">
        <title>4.2. Nonlinear PCA</title>
        <p>As a second method to vectorize annotators’ perspectives, we performed a dimensionality reduction of the label matrix n × k. Since it was a sparse matrix with a high number of missing values, we firstly applied a one-hot encoding considering the three possible categories: ironic (encoded as 01), not ironic (encoded as 10) and missing value (encoded as 00). We obtained a new matrix with twice as many columns as the original label matrix, which was then reduced via Kernel Principal Component Analysis, using the Scikit-learn decomposition package.</p>
        <p>Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of data by applying an orthogonal linear transformation into a low-dimensional subspace, keeping as much variance as possible in order to avoid losing relevant information. As an extension of it, Kernel PCA makes it possible to apply a nonlinear mapping of the data into a high-dimensional feature space [<xref ref-type="bibr" rid="ref2">27</xref>] using kernel methods.</p>
        <p>[Figure 1: dendrograms of the annotator clusterings: (a) label agreement (α); (b) dimensionality reduction (KPCA).]</p>
        <p>We firstly tried to apply regular Principal Component Analysis, selecting 59 components to keep 85.7% of the variance. When computing the pairwise distance of the reduced matrix with either the Euclidean, cosine or Manhattan metric, we obtained a poorly informative dendrogram, suggesting that our data might not be linearly separable.</p>
        <p>For this reason we opted for a nonlinear PCA: we computed a dendrogram for multiple kernels, and eventually chose cosine similarity as the kernel that resulted in the most balanced clustering. For the number of components, we calculated the ratio between the sum of the eigenvalues of the first c components and the sum of the eigenvalues of all N non-zero components, r = (λ1 + ⋯ + λc) / (λ1 + ⋯ + λN). We tried multiple fixed dimensionalities c, and stopped at 60 components, which explain 85.5% of the variance. Then we obtained a distance matrix by computing the pairwise distances between the rows of the reduced matrix with the Euclidean metric.</p>
        <sec id="sec-2-3-2">
          <title>4.3. Hierarchical clustering</title>
          <p>After obtaining a distance matrix of the annotators for each of the two representation techniques described in the previous sections, we used the Scikit-learn library to perform hard clustering on both. Specifically, we computed a dendrogram to obtain a graphical representation of how the annotators join together, and of how the clusters themselves are connected to each other, by analyzing the resulting nodes.</p>
          <p>In both cases we opted for Ward’s linkage criterion, calculating the linkage with the Euclidean distance metric, as the method requires, and computing the full tree. This resulted in the clusters illustrated by the dendrograms in Figure 1. DBSCAN and Affinity Propagation were also tried as clustering algorithms; however, they did not converge to usable clusters on our dataset.</p>
          <p>Choosing the number of clusters. Once the two clusterings were computed, we applied the Calinski-Harabasz [<xref ref-type="bibr" rid="ref4">29</xref>] and Davies-Bouldin [<xref ref-type="bibr" rid="ref5">30</xref>] indexes to measure, respectively, their density and their similarity. We used these intrinsic evaluation metrics to assess the best number of clusters between 2 and 5, adding a further analysis with 11 clusters, i.e., the sum of the number of demographic traits considered for the perspective-aware models in [7]. Since these two metrics do not need any ground-truth labels, we were able to perform an intrinsic clustering validation, comparing the scores among clusters of the same representation technique and considering the combination of the two measures together with the computed dendrograms. The results show that a lower number of clusters corresponds to an increase in density and separation (higher Calinski-Harabasz index), together with an increasing generalization, i.e., clusters more similar to each other (higher Davies-Bouldin index). We tried to balance these two effects by minimizing the ratio between the two metrics, and assigned 5 clusters to the clustering obtained with α, and 4 to the one obtained with KPCA.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>4.4. Quantitative analysis</title>
          <p>Comparing the two dendrograms, it is possible to notice that for the second representation technique, in cluster 1 and cluster 3 (Figure 1 (b)), the first nodes, formed when the two most similar items join, appear almost at the same level as the cluster formation itself. Moreover, as illustrated in Table 2, the four clusters join at nearly the same level, showing a lower distance between them. This is reflected by a systematically lower Silhouette score for the clusters obtained by applying Kernel PCA, with respect to the first representation technique (Figure 1 (a)), where the distance between the clusters is well defined and reflected by the different heights of all the nodes, including the ones where the clusters are formed (Table 1).</p>
          <p>Looking at the positive label rate, it is higher in clusters 1, 2 and 3 for the α representation technique (Table 1) and in cluster 3 for the KPCA representation technique (Table 2), indicating a higher sensitivity of these annotators to irony.</p>
        </sec>
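<p>The clustering and cluster-count selection can be sketched as follows. This is our reading of the procedure (Ward linkage on a precomputed distance matrix, with k chosen by minimizing the Davies-Bouldin to Calinski-Harabasz ratio), shown on synthetic data; the function name and the exact form of the ratio are our assumptions:</p>
<preformat>
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def ward_clusters(dist, features, candidate_ks=(2, 3, 4, 5)):
    """Ward linkage on a precomputed square distance matrix; the number of
    clusters is chosen by minimizing the Davies-Bouldin / Calinski-Harabasz
    ratio, our reading of the selection criterion described above."""
    Z = linkage(squareform(dist, checks=False), method="ward")
    scored = []
    for k in candidate_ks:
        labels = fcluster(Z, t=k, criterion="maxclust")
        ratio = (davies_bouldin_score(features, labels)
                 / calinski_harabasz_score(features, labels))
        scored.append((ratio, k, labels))
    best_ratio, best_k, best_labels = min(scored, key=lambda s: s[0])
    return best_k, best_labels
```
</preformat>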
        <p>[Table 3: ARI and AMI between each clustering (α, KPCA) and the demographic partitions by gender, nationality and generation.]</p>
        <sec id="sec-2-3-1">
          <title>4.5. Qualitative analysis</title>
          <p>To see whether there was a correlation between the obtained clusters and the demographics, we firstly leveraged the adjusted Rand index (ARI) [<xref ref-type="bibr" rid="ref6">31</xref>] and the adjusted mutual information (AMI) [<xref ref-type="bibr" rid="ref7">32</xref>], both corrected for chance. The former estimates the similarity between two clusterings, while the latter is an information-theoretic measure of the similarity between two label assignments. Both metrics are typically used to validate the output of a clustering algorithm. However, in this work they were used to infer a mapping between our clusters and each of the annotators’ demographics (gender, generation and nationality), treated as the ground truth. The results in Table 3 show a negative correlation for at least one of the two measures in most of the cases, with the exception of gender for the α representation technique, and nationality for the KPCA-based one.</p>
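<p>A minimal sketch of the ARI/AMI comparison, with toy labels standing in for our cluster assignments and for one demographic trait (values are illustrative, not EPIC data):</p>
<preformat>
```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Toy stand-ins: a clustering of six annotators and their nationality,
# treated as ground truth as in the analysis above.
clusters = [0, 0, 1, 1, 2, 2]
nationality = ["IE", "IE", "IN", "IN", "US", "US"]

ari = adjusted_rand_score(nationality, clusters)
ami = adjusted_mutual_info_score(nationality, clusters)
# a perfect one-to-one mapping between clusters and the trait gives 1.0;
# scores near zero (or negative) indicate no mapping beyond chance
```
</preformat>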
        </sec>
      </sec>
      <sec id="sec-2-4">
        <p>Especially in the latter, both the ARI and the AMI tend to be higher than the other scores, which are instead always very close to zero. This result is in line with recent observations that using demographic information about the annotators does not necessarily guarantee a better performance in terms of perspective modeling [<xref ref-type="bibr" rid="ref8">33</xref>].</p>
      </sec>
      <sec id="sec-2-5">
        <p>Consequently, we further explored the correlation with demographic data: we looked at the composition of the clusters with respect to gender, nationality and generation,² as illustrated in Table 4 and Table 5.</p>
      </sec>
      <sec id="sec-2-6">
        <p>From the clusters obtained via Krippendorff’s alpha (α), we did not find any systematic mapping between demographic traits and the clusters. In particular, in [...]</p>
      </sec>
      <sec id="sec-2-7">
        <p>² For this analysis, we excluded a single annotator for whom age was not disclosed, clustered in cluster 1 (α) and cluster 2 (KPCA).</p>
      </sec>
      <sec id="sec-2-8">
        <p>[...] GenZ annotators: the former are totally absent in clusters 0 and 1, and the latter are concentrated especially in cluster 0 and cluster 2 with respect to the remaining two.</p>
      </sec>
      <sec id="sec-2-9">
        <p>Nevertheless, no partition of demographic groups can be highlighted, since none of the considered social groups merges homogeneously into specific clusters.</p>
        <p>[Table 4: composition of the α-based clusters by gender (Female, Male), nationality (Australia, India, Ireland, UK, US) and generation (Boomer, GenX, GenY, GenZ).]</p>
      </sec>
      <sec id="sec-2-10">
        <p>Note, however, that these two cohorts of annotators are the least numerous.</p>
        <p>[Table 5: composition of the KPCA-based clusters by gender, nationality and generation.]</p>
        <p>[Table 6: number of instances per cluster-based dataset (α: clusters 0–4; KPCA: clusters 0–3).]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Modelling mined perspectives</title>
      <sec id="sec-3-1">
        <p>In this section we present the experiments carried out to validate our methodology. In particular, we created and explored the difference between non-perspectivist and cluster-based ensemble models, both in-dataset and cross-dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <p>As regards the experimental setup, we fine-tuned the uncased version of BERT³ [<xref ref-type="bibr" rid="ref9">34</xref>] for sequence classification.</p>
      </sec>
      </sec>
      <sec id="sec-3-3">
        <p>The input consisted of the ⟨Post, Reply⟩ pairs. We set a batch size of 16 and a learning rate of 5 · 10⁻⁵ and, to prevent overfitting, we customized the model to implement the Focal Loss [35].</p>
      </sec>
      <sec id="sec-3-4">
        <p>Finally, we set early stopping with a patience of 2 epochs on the validation loss (using 20% of the training data as the validation set).</p>
        <p>As a baseline (called NP, for non-perspectivist), we aggregated the annotations via majority voting and discarded the instances where a majority was not found, adopting the methodology explained in Section 3. Thus, we trained the model on the aggregated set of 1,771 instances, and tested it on the gold test set. For the models based on the two clustering techniques, we implemented an ensemble strategy inspired by [16]: for each cluster we created a gold standard to train a perspective-aware model, and applied majority voting on their predictions, obtaining an ensemble classifier per technique. We tested the models on the gold test set and compared the results with the baseline.</p>
      </sec>
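<p>The majority-voting aggregation with discarded ties, used for both the NP baseline and the per-cluster gold standards, can be sketched as follows (our own toy example; names are ours):</p>
<preformat>
```python
from collections import Counter

def aggregate(annotations):
    """annotations: dict instance_id to list of 0/1 labels.
    Returns instance_id to majority label, dropping tied instances,
    as in the aggregation procedure described above."""
    gold = {}
    for inst, labels in annotations.items():
        (top, n_top), *rest = Counter(labels).most_common()
        if rest and rest[0][1] == n_top:
            continue                  # no majority: discard the instance
        gold[inst] = top
    return gold

aggregate({"a": [1, 1, 0], "b": [0, 1]})
# {'a': 1}  ("b" is tied and therefore discarded)
```
</preformat>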
      <sec id="sec-3-5">
        <p>³ https://huggingface.co/bert-base-uncased</p>
        <p>To train the cluster-based models, we firstly excluded the gold test set, and grouped the remaining label–text pairs according to each of the obtained clusters, extracting 5 and 4 datasets, respectively, for the first and second representation technique. Eventually, we applied a majority voting strategy and excluded those instances where a majority was not present. Table 6 illustrates the number of instances per dataset.</p>
        <p>After training, we tested the models both in an in-dataset setting (on EPIC’s gold test set) and in a cross-dataset setting, specifically on the SemEval-2018 Task 3 test set [25], previously described in Section 3.2. Finally, we implemented a majority voting ensemble (M-ENS) that returns a final label by applying majority vote over the predictions of each cluster-based classifier. Table 7 shows the average precision, recall and F1-score over 10 runs. We found low variation in the scores, as illustrated by the standard deviations in parentheses.</p>
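<p>The M-ENS combination step can be sketched as follows (toy predictions; the paper does not specify tie-breaking for an even number of classifiers, and this sketch simply keeps the first value seen):</p>
<preformat>
```python
from collections import Counter

# Majority-vote ensemble (M-ENS) over the cluster-based classifiers:
# each inner list holds one classifier's predictions for the test set.
def m_ens(per_cluster_preds):
    n_items = len(per_cluster_preds[0])
    final = []
    for i in range(n_items):
        votes = Counter(preds[i] for preds in per_cluster_preds)
        final.append(votes.most_common(1)[0][0])
    return final

m_ens([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
# [1, 1, 0]
```
</preformat>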
        <p>Looking at Table 7, we can notice that the two majority ensembles obtained from the explored representation techniques always outperform the baseline, both in-dataset and cross-dataset. In the first setting, the macro-averaged F1 score of M-ENS α gives the best results, while M-ENS KPCA presents the best performance cross-dataset. The results demonstrate that modelling annotators’ opinions is necessary when working on highly subjective phenomena such as irony, as strongly confirmed by the performance of cluster-based ensembles in a cross-dataset setting. More importantly, these experiments prove that training perspective-aware models based on annotators’ mined opinions can be an effective instrument to capture a diversity of points of view.</p>
      </sec>
      <sec id="sec-3-6">
        <p>Notably, the increase in macro-F1 score is a reflection of a better prediction of the positive class. Considering that the classes were highly unbalanced (see Section 3.1), the accuracy measure is higher for the baseline model, which is less sensitive to the presence of irony and therefore over-predicts the negative class.</p>
        <p>Despite the clusters obtained with the two representation techniques being very different in terms of methodology (Section 4.1, Section 4.2) and composition (Section 4.3), the models exhibit comparable performance. In-dataset, the ensemble based on the α clusters gives slightly better scores than KPCA, but this trend is inverted in the second setting.</p>
        <p>These results confirm the idea that by mining annotator perspectives we can let the annotators’ opinions emerge regardless of their demographics, observing how social background can influence the individual’s definition of what is ironic, shared among characteristics that might go beyond common demographic traits.</p>
        <p>[Table 7: average precision, recall and F1 of the NP baseline, M-ENS α and M-ENS KPCA models, in-dataset and cross-dataset.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>In this paper, we implemented and tested two techniques
to mine annotator perspectives, moving from the idea
that the set of their annotations can be used as a representation
of their opinion on the topic they are annotating,
in our case ironic content in social media platforms. We
chose to perform this analysis on irony since it is a highly
subjective phenomenon where not only demographic, but
also linguistic and social aspects can influence annotators'
interpretation and judgement. For this reason, we
used the recently published English Perspectivist Irony
Corpus (EPIC).</p>
      <p>For mining annotators' perspectives we proposed two
methodologies. The former, inspired by [24], was to interpret
similarity of opinions in terms of inter-annotator
agreement, adapting Krippendorff's alpha and overcoming
its structural limitations. The latter consisted in a
dimensionality reduction of annotator vectors, using Kernel
Principal Component Analysis, thus applying a non-linear
mapping of our data. Then, we applied a hierarchical
clustering algorithm to analyse how annotators group
together. Looking at the composition of clusters with
respect to annotators' demographic data, the results demonstrate
how different the two mining techniques are. In
fact, Kernel PCA highlights the correlation between annotators'
nationality and irony perception, while the first
method returns more heterogeneous and better balanced
clusters.</p>
      <p>In the experimental phase, we trained perspective-aware
models for each cluster obtained via the two representation
techniques, and implemented an ensemble
strategy to select the predicted labels, based on majority
voting. Both in-dataset and cross-dataset performance
showed that the ensemble models always outperform the
baseline, demonstrating the robustness of our method
also when tested on a different corpus.</p>
      <p>Considering these promising results, we believe that
this approach can be of fundamental use for future
research in the perspectivist field. Firstly, it makes it possible
to mine annotators' opinions when demographic
information is not known. Secondly, it can help to avoid
built-in biases in creating perspective-aware classifiers,
testing whether annotators' choices might be driven by
factors uncorrelated to given demographics, but rather
linked to other elements of their social and individual
background.</p>
      <p>Although we tackled the Krippendorff's alpha paradox
described in Section 4.1, there are other anomalies of
the measure itself extensively described in [
        <xref ref-type="bibr" rid="ref3">28</xref>
        ], which
might have had a negative impact on the clusters obtained via
the first representation technique.</p>
      <p>Moreover, in this work we group annotators using a
hard clustering algorithm. However, as reality is more
nuanced and many dimensions interact in describing
human variability, a soft clustering approach could lead
to more accurate representations, although its application
is computationally more complex in this context.</p>
      <p>For the future, we plan to perform the same experiments
on multiple pre-trained language models, to further
test the consistency of our results, and test other
representation techniques such as autoencoders. Our
analysis of the composition of the annotator clusters
indicates some degree of intersectionality of demographic
traits with respect to the annotation of irony, which we
consider a research direction to pursue further. Another
aspect worth investigating is the relative position of individual
annotators among their assigned clusters, checking
whether it correlates with factors like annotation
quality. Finally, while our results are very encouraging, it
must be noted that the experimental task still involved an
aggregated test benchmark. We expect that our method
will produce more impactful results when measured on
a perspectivist, disaggregated benchmark, which we aim
to develop in the next steps of our research.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation, AI Magazine 36 (2015) 15–24.</p>
      <p>[2] V. Basile, M. Fell, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, A. Uma, et al., We need to consider disagreement in evaluation, in: Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Association for Computational Linguistics, 2021, pp. 15–21.</p>
      <p>[3] S. Akhtar, V. Basile, V. Patti, Whose opinions matter? Perspective-aware models to identify opinions of hate speech victims in abusive language detection, arXiv preprint arXiv:2106.15896 (2021).</p>
      <p>[4] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, Washington DC, USA, 2023.</p>
      <p>[5] V. Basile, et al., It's the end of the gold standard as we know it. On the impact of pre-aggregation on the evaluation of highly subjective tasks, in: CEUR Workshop Proceedings, volume 2776, CEUR-WS, 2020, pp. 31–40.</p>
      <p>[6] B. Plank, The 'problem' of human label variation: On ground truth in data, modeling and evaluation, arXiv preprint arXiv:2211.02570 (2022).</p>
      <p>[7] S. Frenda, A. Pedrani, V. Basile, S. M. Lo, A. T. Cignarella, R. Panizzon, C. Marco, B. Scarlini, V. Patti, C. Bosco, D. Bernardi, EPIC: Multi-perspective annotation of a corpus of irony, in: ACL 2023, 2023. URL: https://www.amazon.science/publications/epic-multi-perspective-annotation-of-a-corpus-of-irony.</p>
      <p>[8] A. Joshi, P. Bhattacharyya, M. J. Carman, Investigations in computational sarcasm, Springer, 2018.</p>
      <p>[9] R. Ortega-Bueno, F. Rangel, D. Hernández Farías, P. Rosso, M. Montes-y Gómez, J. E. Medina Pagola, Overview of the task on irony detection in Spanish variants, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019), CEUR-WS.org, volume 2421, 2019, pp. 229–256.</p>
      <p>[10] J. Karoui, F. Benamara, V. Moriceau, V. Patti, C. Bosco, N. Aussenac-Gilles, Exploring the impact of pragmatic phenomena on irony detection in tweets: A multilingual corpus study, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 262–272.</p>
      <p>[11] E. Leonardelli, S. Menini, A. P. Aprosio, M. Guerini, S. Tonelli, Agreeing to disagree: Annotating offensive language datasets with annotators' disagreement, arXiv preprint arXiv:2109.13563 (2021).</p>
      <p>[12] J. Kocoń, A. Figas, M. Gruza, D. Puchalska, T. Kajdanowicz, P. Kazienko, Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach, Information Processing &amp; Management 58 (2021) 102643.</p>
      <p>[13] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The risk of racial bias in hate speech detection, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1668–1678.</p>
      <p>[14] M. Sap, S. Swayamdipta, L. Vianna, X. Zhou, Y. Choi, N. A. Smith, Annotators with attitudes: How annotator beliefs and identities bias toxic language detection, arXiv preprint arXiv:2111.07997 (2021).</p>
      <p>[15] V. Prabhakaran, A. M. Davani, M. Diaz, On releasing annotator-level labels and information in datasets, arXiv preprint arXiv:2110.05699 (2021).</p>
      <p>[16] S. Akhtar, V. Basile, V. Patti, Modeling annotator perspective and polarized opinions to improve hate speech detection, in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, 2020, pp. 151–154.</p>
      <p>[17] A. A. Taha, L. Hennig, P. Knoth, Confidence estimation of classification based on the distribution of the neural network output layer, arXiv preprint arXiv:2210.07745 (2022).</p>
      <p>[18] L. Flek, Returning the N to NLP: Towards contextually personalized classification models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7828–7838.</p>
      <p>[19] J. Bielaniewicz, K. Kanclerz, P. Miłkowski, M. Gruza, K. Karanowski, P. Kazienko, J. Kocoń, Deep-SHEEP: Sense of humor extraction from embeddings in the personalized context, in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2022, pp. 967–974.</p>
      <p>[20] J. Kocoń, M. Gruza, J. Bielaniewicz, D. Grimling, K. Kanclerz, P. Miłkowski, P. Kazienko, Learning personal human biases and representations for subjective tasks in natural language processing, in: 2021 IEEE International Conference on Data Mining (ICDM), IEEE, 2021, pp. 1168–1173.</p>
      <p>[21] P. Kazienko, J. Bielaniewicz, M. Gruza, K. Kanclerz, K. Karanowski, P. Miłkowski, J. Kocoń, Human-centred neural reasoning for subjective content processing: Hate speech, emotions, and humor, Information Fusion (2023).</p>
      <p>[22] J. Plepi, B. Neuendorf, L. Flek, C. Welch, Unifying data perspectivism and personalization: An application to social norms, arXiv preprint arXiv:2210.14531 (2022).</p>
      <p>[23] K. Kanclerz, M. Gruza, K. Karanowski, J. Bielaniewicz, P. Miłkowski, J. Kocoń, P. Kazienko, What if ground truth is subjective? Personalized deep neural hate speech detection, in: Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @ LREC2022, 2022, pp. 37–45.</p>
      <p>[24] M. Fell, S. Akhtar, V. Basile, Mining annotator perspectives from hate speech corpora, in: NL4AI @ AI*IA, 2021.</p>
      <p>[25] C. Van Hee, E. Lefever, V. Hoste, SemEval-2018 task 3: Irony detection in English tweets, in: Proceedings of the 12th International Workshop on Semantic Evaluation, 2018, pp. 39–50.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krippendorff</surname>
          </string-name>
          ,
          <article-title>Computing Krippendorff's alpha-reliability</article-title>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-R. Müller</surname>
          </string-name>
          ,
          <article-title>Kernel principal component analysis</article-title>
          ,
          <source>in: Artificial Neural Networks-ICANN'97: 7th International Conference Lausanne, Switzerland, October</source>
          <volume>8</volume>
          -
          <issue>10</issue>
          ,
          <year>1997</year>
          Proceedings, Springer,
          <year>2005</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Checco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roitero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Maddalena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          ,
          <article-title>Let's agree to disagree: Fixing agreement measures for crowdsourcing</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing</source>
          , volume
          <volume>5</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>T.</given-names>
            <surname>Caliński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harabasz</surname>
          </string-name>
          ,
          <article-title>A dendrite method for cluster analysis</article-title>
          ,
          <source>Communications in Statistics-theory and Methods</source>
          <volume>3</volume>
          (
          <year>1974</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Bouldin</surname>
          </string-name>
          ,
          <article-title>A cluster separation measure</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>1979</year>
          )
          <fpage>224</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Arabie</surname>
          </string-name>
          ,
          <article-title>Comparing partitions</article-title>
          ,
          <source>Journal of classification 2</source>
          (
          <year>1985</year>
          )
          <fpage>193</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>N. X.</given-names>
            <surname>Vinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Epps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <article-title>Information theoretic measures for clusterings comparison: is a correction for chance necessary?</article-title>
          ,
          <source>in: Proceedings of the 26th annual international conference on machine learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1073</fpage>
          -
          <lpage>1080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Orlikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Röttger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>The ecological fallacy in annotation: Modelling human label variation goes beyond sociodemographics</article-title>
          ,
          <source>ArXiv abs/2306.11559</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>