<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Workshop on Perspectivist Approaches to NLP</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hierarchical Clustering of Label-based Annotator Representations for Mining Perspectives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soda Marem Lo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Turin</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Modeling annotator perspectives has emerged as a technique to model subjective linguistic phenomena more accurately. Authors in the NLP community have approached this issue by creating perspective-aware and personalized models, which require demographic data or previous annotations. In this paper, we explore two methodologies to represent annotators solely on the basis of the labels they assigned: label agreement and Kernel PCA. For these two techniques we computed 5 and 4 clusters respectively, trained perspective-aware models on each of them, and finally implemented majority-vote ensembles. The results show that the clusters obtained by the first mining technique are more balanced and homogeneous in terms of annotators' demographic traits, while those obtained by KPCA tend to correlate more with their nationalities. Despite these differences, both ensemble models outperform the baseline, confirming that leveraging annotations via clustering techniques is advantageous for the classification of a subjective phenomenon such as irony. We argue that this approach can be beneficial for taking annotators' perspectives into account when demographic data are not known, and when their annotations might be influenced by factors other than the given demographics.</p>
      </abstract>
      <kwd-group>
<kwd>Perspectivism</kwd>
        <kwd>clustering</kwd>
        <kwd>irony detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Subjective tasks in Natural Language Processing face the issue of correctly modeling the perception of the humans involved in the process, e.g., when producing the language resources used to train and evaluate models. In recent years, several authors have started to consider the importance of disagreement, criticizing the idea of a single valid truth [1] and examining its potential impact on several aspects of NLP [2]. This observation is fundamental especially for highly subjective tasks, where annotators’ opinions may differ in relation to their cultural and social background, or their personal experiences [3]. To this aim, the perspectivist approach works towards modeling raters’ perspectives, keeping all human labels during the training phase of the classifier [4].</p>
      <p>Authors moving along this paradigm shift have often pointed out the necessity to publish disaggregated [5] and well-documented data, with as much metadata as possible [6]. This information has been used in [7] to build perspective-aware models based on demographic traits such as gender, nationality and generation, which turned out to be more confident in detecting irony than the non-perspectivist ones.</p>
      <p>On the other hand, it is important to note that annotators’ opinions are not necessarily linked to these traits only, especially when considering phenomena where both demographic-dependent aspects, such as cultural background, and culturally shared linguistic expressions can be key elements of their definition. This is what happens with irony, which is influenced by elements such as the origin of the speaker [8, 9] and by linguistic patterns sometimes shared across languages [10].</p>
      <p>Starting from the idea that human labels encode values and possible interpretations of a linguistic phenomenon [6], we want to explore whether annotators’ choices overlap with their demographics, or whether they might be linked to other traits that induce a similarity of opinions despite different backgrounds. Specifically, we mined annotators’ perspectives to see how they group together on the basis of their annotations only. We propose two methods to vectorize annotators leveraging the set of their labels. Then, we trained cluster-based models and built a majority voting ensemble to validate our representation techniques in an in-dataset and a cross-dataset setting. The main contributions of this paper are the following:</p>
      <list list-type="bullet">
        <list-item><p>Two techniques to model annotators as vector representations and automatically cluster them;</p></list-item>
        <list-item><p>A quantitative and qualitative analysis of the automatically predicted clusters of annotators, both in terms of cluster quality and of the mapping between clusters and divisions of annotators based on demographic traits;</p></list-item>
        <list-item><p>Experimental evidence that leveraging an automatic grouping of the annotators in a disaggregated dataset is beneficial for the predictive power of an ensemble of classifiers for irony detection.</p></list-item>
      </list>
      <p>The experiments are conducted on EPIC (English Perspectivist Irony Corpus) [7], a disaggregated corpus for irony detection, described in Section 3. The methods are introduced in Section 4, where the results of the clustering are analysed, and applied to irony detection in Section 5.</p>
      <sec id="sec-1-3">
        <title>2. Related works</title>
        <p>The correlation between annotators’ choices, their demographic traits, beliefs and social backgrounds has become a subject of attention in tasks such as offensive language [11, 12], hate speech [13, 3] and toxicity detection [14]. These works have demonstrated how the identity of the annotators, their social groups and their beliefs can play a role in the annotation phase.</p>
        <p>Taking into account raters’ backgrounds can be of fundamental importance to avoid building machines biased toward the opinions of a majority [4, 15], especially when working on phenomena that cannot be objectively defined.</p>
        <p>The perspectivist approach aims at leveraging disagreement to model annotators’ points of view and culturally-driven perspectives [5]. In [16] the authors grouped annotators by measuring the polarization of their judgments on hate speech content, then created a gold standard for each group to obtain perspective-aware models, eventually including the learned perspectives in an ensemble classifier. Inspired by this work, the authors of [7] implemented perspective-aware models based on annotators’ demographic characteristics, and proposed to evaluate them on the confidence [17] of their predictions. The perspective-aware models turned out to be more confident than the non-perspectivist ones.</p>
        <p>Techniques for modeling annotators’ perspectives have also been developed using personalization methods, recently applied to NLP with the aim of processing diversity among annotators [18] in several subjective tasks, such as offensive content, sense of humor and emotion detection [19, 20, 21], but also in the classification of interpersonal conflict types [22]. This approach does not always rely on demographic data, but also on personal beliefs and opinions obtained from historical posts of the same user [22, 23]. For example, in [21] the authors developed a measure of human bias to model individual human perspectives, i.e., how a user’s perception differs from the others’, obtaining a representation of the subjectivity of each annotator. The authors of [12] propose both a mesoscopic (group-based) and a microscopic (user-based) approach to predict annotators’ beliefs, considering their metadata, the annotator identifier (id), and previous annotations, demonstrating improved classifier performance as users’ information increases. Moreover, they grouped annotators based on their agreement level, to extract social groups and analyze the impact of group profiles on the task of offensive content recognition. Interestingly, when testing the agreement measure on demographic groups, no significant correlation was found, showing that there might be other factors conditioning users’ perceptions of aggressiveness.</p>
        <p>Agreement was already used to mine annotators’ perspectives in [24], where the authors measured label and feature agreement in order to cluster together those who shared a perspective for similar reasons. Influenced by this work, this paper explores how annotators are clustered based on their annotations of ironic content. Thus, we compared two methodologies to mine raters’ opinions, observing whether these choices coincide with their demographic data; finally, we implemented cluster-based models inspired by [16] and [7].</p>
      </sec>
      <sec id="sec-1-4">
        <title>3. Corpus description</title>
        <p>In this section we present the two corpora used for our in-dataset and cross-dataset experiments: EPIC (English Perspectivist Irony Corpus), released by [7], and the corpus used for SemEval-2018 Task 3 "Irony Detection in English Tweets" [25].</p>
        <sec id="sec-1-5">
          <title>3.1. EPIC</title>
          <p>For the in-dataset setting we trained and tested our models on the English Perspectivist Irony Corpus [7, EPIC], a disaggregated corpus consisting of 3,000 ⟨Post, Reply⟩ pairs from Reddit (1,500) and Twitter (1,500), collected across five English-speaking countries: Australia, India, Ireland, the United Kingdom and the United States. For Twitter, the authors used the API geolocation service to identify the five English varieties. For Reddit, they collected data from the following subreddits, assuming the origin of the texts: r/AskReddit (United States), r/CasualUK (United Kingdom), r/britishproblems (United Kingdom), r/australia (Australia), r/ireland (Ireland), r/india (India). The 74 annotators were balanced across both gender and nationality, with a total of ∼15 raters for each of the aforementioned nationalities, who labelled around 200 instances each. Thus, the corpus consists of 14,172 annotations, with a median of 5 annotations per instance.</p>
          <p>The authors collected demographic information about the annotators (gender, age group, nationality, ethnicity, student status and employment status), and used data related to gender (female, male), age (Boomer, Generation X, Generation Y, Generation Z) and nationality (Australian, British, Indian, Irish and US-American) to build 11 demographic-based models, each trained only on the labels provided by one group, and tested on both a demographic-independent aggregated test set and perspective-based test sets. The former, to which we will refer as the gold test set, was obtained by applying a majority voting strategy to the entire corpus. The authors discarded those instances for which a majority was not available, resulting in an aggregated set of 2,767 instances. This set was split into a training set (80%, 440 ironic, 1,331 not ironic) and a test set (20%, 110 ironic and 443 not ironic), thus obtaining the gold test set of 553 instances (246 from Reddit and 307 from Twitter).</p>
          <p>We replicated this methodology to train and test the non-perspectivist (NP) model on this split, as in [7].</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3.2. SemEval-2018 Task 3</title>
        <p>To verify the robustness of our cluster-based models, we tested their performance in a cross-dataset setting on the corpus used for the SemEval-2018 shared task on irony detection [25].</p>
        <p>It consists of 4,792 tweets, collected between December 2014 and January 2015, and annotated by three students in linguistics who spoke English as a second language (other demographic data were not collected). For the shared task, the corpus was randomly split into a training set (1,445 ironic, 1,417 not ironic), a validation set (456 ironic, 499 not ironic) and a test set (784 instances: 311 ironic and 473 not ironic).</p>
        <p>For the experiment in the cross-dataset setting, we tested our models, previously trained on EPIC, on the SemEval-2018 test set.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Mining perspectives</title>
      <sec id="sec-2-1">
        <p>This section introduces the methodology used to automatically compute clusters of annotators. The core of our approach is to vectorize each annotator based on the labels they assigned to each of the 3,000 instances. Given n raters annotating k instances, we obtain an n × k matrix, which we call the label matrix.</p>
        <p>Considering that each ⟨Post, Reply⟩ pair has an average of 4.72 and a median of 5 annotations, each annotator can hold one of three possible values per instance: 0 (not ironic), 1 (ironic), or a missing value. Thus, for each annotator we obtain a vector with the dimensionality of the number of instances k, where the combination of the assigned labels represents the rater’s perspective. Since annotators annotated around 200 instances each, there are at least 2,800 missing values per annotator. For this reason we have chosen two methods to represent the annotators as vectors.</p>
        <p>First representation technique: label agreement (α). We computed a pairwise similarity matrix using Krippendorff’s alpha (α) [<xref ref-type="bibr" rid="ref1">26</xref>] as a metric able to handle missing values in the annotations when estimating how much each pair of annotators agrees with each other.</p>
        <p>Second representation technique: dimensionality reduction (KPCA). We opted for reducing the dimensionality of the label matrix by adopting a nonlinear form of Principal Component Analysis (Kernel PCA) [<xref ref-type="bibr" rid="ref2">27</xref>], then computing the pairwise distance matrix among annotators.</p>
        <p>The two methodologies are explained and discussed in the following sections.</p>
      </sec>
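<p>As an illustration of the label matrix just described, the following minimal sketch builds an n × k matrix with missing values from annotation triples. This is our own toy example, not EPIC data; the function and variable names are ours:</p>
<preformat>
```python
import numpy as np

# Hypothetical sketch: annotations as (annotator, instance, label) triples,
# with 1 = ironic and 0 = not ironic; names and data are ours, not EPIC's.
def build_label_matrix(triples, n_annotators, n_instances):
    """n x k label matrix: one row per annotator, one column per instance;
    np.nan marks the instances an annotator did not label."""
    L = np.full((n_annotators, n_instances), np.nan)
    for annotator, instance, label in triples:
        L[annotator, instance] = label
    return L

L = build_label_matrix([(0, 0, 1), (0, 1, 0), (1, 1, 1)], 2, 3)
# row 0: [1., 0., nan]   row 1: [nan, 1., nan]
```
</preformat>
<p>In EPIC’s case this yields a 74 × 3,000 matrix holding 14,172 labels, so each row contains at least 2,800 missing entries.</p>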
      <sec id="sec-2-2">
        <title>4.1. Label agreement</title>
        <p>Following [24], we measured label agreement in terms of Krippendorff’s α, since it has been developed both to take into account that some agreement can arise by chance (as does the more common Cohen’s Kappa agreement score), and to measure agreement among raters with incomplete annotations, in contrast with the Kappa measures (Cohen’s and Fleiss’) that rely on a complete annotation matrix.</p>
        <p>Considering n annotators labeling k instances, we firstly obtained the label matrix L of size n × k. We used α to compute the pairwise agreement between annotators i and j, resulting in the similarity matrix S of size n × n, computed as S(i, j) = α(L(i, :), L(j, :)). Finally, we obtained a distance matrix D = 1 − S, used as input for the unsupervised clustering algorithms.</p>
        <p>Given the high sparsity of the matrix, and the annotation distribution already discussed in Section 3, we encountered 82 cases in which two annotators did not have any common annotation. Since missing values are not acceptable in agglomerative clustering, we decided to assign α = 0. As a consequence, we assumed no correlation between the two in the clustering phase, relying entirely on the similarities that these annotators might have with other raters. While this is a strong assumption, made for practical reasons, the incidence of such pairs of annotators is very low, i.e., about 1% of all the pairs.</p>
        <p>Moreover, in computing α we encountered a major limitation of the metric itself, already pointed out by Checco et al. [<xref ref-type="bibr" rid="ref3">28</xref>] as a “paradox” that makes systematic agreement less reliable than random guessing. In fact, in 158 cases, although there was perfect agreement between a pair of annotators, the number of samples was not enough for α to be well-defined. In these cases, we relaxed this constraint by setting α = 1 for the sake of the further clustering steps.</p>
      </sec>
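<p>The pairwise α computation, including the two fallbacks just described (α = 0 for annotator pairs with no common instance, α = 1 for perfect but undefined agreement), can be sketched as follows. This is our own minimal two-rater implementation for binary nominal data, not the code used in the paper:</p>
<preformat>
```python
def alpha_pair(x, y):
    """Krippendorff's alpha for two raters on binary labels (None = missing),
    with the fallbacks used in the paper: 0.0 when the raters share no
    instances, 1.0 when agreement is perfect but alpha is undefined
    (zero expected disagreement)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    m = len(pairs)
    if m == 0:
        return 0.0                        # no common annotation
    n = 2 * m                             # total pairable judgments
    disagree = sum(1 for a, b in pairs if a != b)
    Do = 2 * disagree / n                 # observed disagreement
    n1 = sum(v for p in pairs for v in p)
    n0 = n - n1
    De = 2 * n0 * n1 / (n * (n - 1))      # disagreement expected by chance
    if De == 0:
        return 1.0                        # perfect agreement, alpha undefined
    return 1 - Do / De
```
</preformat>
<p>The similarity matrix S is then filled with these pairwise scores and converted to the distance matrix D = 1 − S.</p>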
      <sec id="sec-2-3">
        <title>4.2. Nonlinear PCA</title>
        <p>As a second method to vectorize annotators’ perspectives, we performed a dimensionality reduction of the label matrix n × k. Since it was a sparse matrix with a high number of missing values, we firstly applied a one-hot encoding considering the three possible categories: ironic (encoded as 01), not ironic (encoded as 10) and missing value (encoded as 00). We obtained a new matrix with twice as many columns as the original label matrix, which was then reduced via Kernel Principal Component Analysis, using the Scikit-learn decomposition package.</p>
        <p>Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of data by applying an orthogonal linear transformation into a low-dimensional subspace, keeping as much variance as possible in order to avoid losing relevant information. As an extension of it, Kernel PCA makes it possible to apply a nonlinear mapping of the data into a high-dimensional feature space [<xref ref-type="bibr" rid="ref2">27</xref>] using kernel methods.</p>
        <p>[Figure 1: dendrograms of the annotator clusterings: (a) label agreement (α); (b) dimensionality reduction (KPCA).]</p>
        <p>We firstly tried to apply regular Principal Component Analysis, selecting 59 components to keep 85.7% of the variance. When computing the pairwise distance of the reduced matrix with either the Euclidean, cosine or Manhattan metric, we obtained a poorly informative dendrogram, suggesting that our data might not be linearly separable.</p>
        <p>For this reason we opted for a nonlinear PCA: we computed a dendrogram for multiple kernels, and eventually chose cosine similarity as the kernel that resulted in the most balanced clustering. For the number of components, we calculated the ratio between the sum of the eigenvalues of the first c components and the sum of the eigenvalues of all N non-zero components, r = (λ1 + ⋯ + λc) / (λ1 + ⋯ + λN). We tried multiple fixed dimensionalities c, and stopped at 60 components, which explain 85.5% of the variance. Then we obtained a distance matrix by computing the pairwise distances between the rows of the reduced matrix with the Euclidean metric.</p>
        <sec id="sec-2-3-2">
          <title>4.3. Hierarchical clustering</title>
          <p>After obtaining a distance matrix of the annotators for each of the two representation techniques described in the previous sections, we used the Scikit-learn library to perform hard clustering on both. Specifically, we computed a dendrogram to obtain a graphical representation of how the annotators join together, and of how the clusters themselves are connected to each other, by analyzing the resulting nodes.</p>
          <p>In both cases we opted for Ward’s linkage criterion, calculating the linkage with the Euclidean distance metric, as the method requires, and computing the full tree. This resulted in the clusters illustrated by the dendrograms in Figure 1. DBSCAN and Affinity Propagation were also tried as clustering algorithms; however, they did not converge to usable clusters on our dataset.</p>
          <p>Choosing the number of clusters. Once the two clusterings were computed, we applied the Calinski-Harabasz [<xref ref-type="bibr" rid="ref4">29</xref>] and Davies-Bouldin [<xref ref-type="bibr" rid="ref5">30</xref>] indexes to measure, respectively, their density and their similarity. We used these intrinsic evaluation metrics to assess the best number of clusters between 2 and 5, adding a further analysis with 11 clusters, i.e., the sum of the number of demographic traits considered for the perspective-aware models in [7]. Since these two metrics do not need any ground-truth labels, we were able to perform an intrinsic clustering validation, comparing the scores among clusters of the same representation technique and considering the combination of the two measures together with the computed dendrograms. The results show that a lower number of clusters corresponds to an increase in density and separation (higher Calinski-Harabasz index), together with an increasing generalization, i.e., clusters more similar to each other (higher Davies-Bouldin index). We tried to balance these two effects by minimizing the ratio between the two metrics, and assigned 5 clusters to the clustering obtained with α, and 4 to the one obtained with KPCA.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>4.4. Quantitative analysis</title>
          <p>Comparing the two dendrograms, it is possible to notice that for the second representation technique, in cluster 1 and cluster 3 (Figure 1 (b)), the first nodes, formed when the two most similar items join, appear almost at the same level as the cluster formation itself. Moreover, as illustrated in Table 2, the four clusters join at nearly the same level, showing a lower distance between them. This is reflected by a systematically lower Silhouette score for the clusters obtained by applying Kernel PCA, with respect to the first representation technique (Figure 1 (a)), where the distance between the clusters is well defined and reflected by the different heights of all the nodes, including the ones where the clusters are formed (Table 1).</p>
          <p>Looking at the positive label rate, it is higher in clusters 1, 2 and 3 for the α representation technique (Table 1) and in cluster 3 for the KPCA representation technique (Table 2), indicating a higher sensitivity of these annotators to irony.</p>
        </sec>
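<p>The clustering and cluster-count selection can be sketched as follows. This is our reading of the procedure (Ward linkage on a precomputed distance matrix, with k chosen by minimizing the Davies-Bouldin to Calinski-Harabasz ratio), shown on synthetic data; the function name and the exact form of the ratio are our assumptions:</p>
<preformat>
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def ward_clusters(dist, features, candidate_ks=(2, 3, 4, 5)):
    """Ward linkage on a precomputed square distance matrix; the number of
    clusters is chosen by minimizing the Davies-Bouldin / Calinski-Harabasz
    ratio, our reading of the selection criterion described above."""
    Z = linkage(squareform(dist, checks=False), method="ward")
    scored = []
    for k in candidate_ks:
        labels = fcluster(Z, t=k, criterion="maxclust")
        ratio = (davies_bouldin_score(features, labels)
                 / calinski_harabasz_score(features, labels))
        scored.append((ratio, k, labels))
    best_ratio, best_k, best_labels = min(scored, key=lambda s: s[0])
    return best_k, best_labels
```
</preformat>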
        <p>[Table 3: ARI and AMI between each clustering (α, KPCA) and the demographic partitions by gender, nationality and generation.]</p>
        <sec id="sec-2-3-1">
          <title>4.5. Qualitative analysis</title>
          <p>To see whether there was a correlation between the obtained clusters and the demographics, we firstly leveraged the adjusted Rand index (ARI) [<xref ref-type="bibr" rid="ref6">31</xref>] and the adjusted mutual information (AMI) [<xref ref-type="bibr" rid="ref7">32</xref>], both corrected for chance. The former estimates the similarity between two clusterings, while the latter is an information-theoretic measure of the similarity between two label assignments. Both metrics are typically used to validate the output of a clustering algorithm. However, in this work they were used to infer a mapping between our clusters and each of the annotators’ demographics (gender, generation and nationality), treated as the ground truth. The results in Table 3 show a negative correlation for at least one of the two measures in most of the cases, with the exception of gender for the α representation technique, and nationality for the KPCA-based one.</p>
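<p>A minimal sketch of the ARI/AMI comparison, with toy labels standing in for our cluster assignments and for one demographic trait (values are illustrative, not EPIC data):</p>
<preformat>
```python
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# Toy stand-ins: a clustering of six annotators and their nationality,
# treated as ground truth as in the analysis above.
clusters = [0, 0, 1, 1, 2, 2]
nationality = ["IE", "IE", "IN", "IN", "US", "US"]

ari = adjusted_rand_score(nationality, clusters)
ami = adjusted_mutual_info_score(nationality, clusters)
# a perfect one-to-one mapping between clusters and the trait gives 1.0;
# scores near zero (or negative) indicate no mapping beyond chance
```
</preformat>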
        </sec>
      </sec>
      <sec id="sec-2-4">
        <p>Especially in the latter, both the ARI and the AMI tend to be higher than the other scores, which are instead always very close to zero. This result is in line with recent observations that using demographic information about the annotators does not necessarily guarantee a better performance in terms of perspective modeling [<xref ref-type="bibr" rid="ref8">33</xref>].</p>
      </sec>
      <sec id="sec-2-5">
        <p>Consequently, we further explored the correlation with demographic data: we looked at the composition of the clusters with respect to gender, nationality and generation,² as illustrated in Table 4 and Table 5.</p>
      </sec>
      <sec id="sec-2-6">
        <p>From the clusters obtained via Krippendorff’s alpha (α), we did not find any systematic mapping between demographic traits and the clusters. In particular, in [...]</p>
      </sec>
      <sec id="sec-2-7">
        <p>² For this analysis, we excluded a single annotator for whom age was not disclosed, clustered in cluster 1 (α) and cluster 2 (KPCA).</p>
      </sec>
      <sec id="sec-2-8">
        <p>[...] GenZ annotators: the former are totally absent in clusters 0 and 1, and the latter are concentrated especially in cluster 0 and cluster 2 with respect to the remaining two.</p>
      </sec>
      <sec id="sec-2-9">
        <p>Nevertheless, no partition of demographic groups can be highlighted, since none of the considered social groups merges homogeneously into specific clusters.</p>
        <p>[Table 4: composition of the α-based clusters by gender (Female, Male), nationality (Australia, India, Ireland, UK, US) and generation (Boomer, GenX, GenY, GenZ).]</p>
      </sec>
      <sec id="sec-2-10">
        <p>Note, however, that these two cohorts of annotators are the least numerous.</p>
        <p>[Table 5: composition of the KPCA-based clusters by gender, nationality and generation.]</p>
        <p>[Table 6: number of instances per cluster-based dataset (α: clusters 0–4; KPCA: clusters 0–3).]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Modelling mined perspectives</title>
      <sec id="sec-3-1">
        <p>In this section we present the experiments carried out to validate our methodology. In particular, we created and explored the difference between non-perspectivist and cluster-based ensemble models, both in-dataset and cross-dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <p>As regards the experimental setup, we fine-tuned the uncased version of BERT³ [<xref ref-type="bibr" rid="ref9">34</xref>] for sequence classification.</p>
      </sec>
      </sec>
      <sec id="sec-3-3">
        <p>The input consisted of the ⟨Post, Reply⟩ pairs. We set a batch size of 16 and a learning rate of 5 · 10⁻⁵ and, to prevent overfitting, we customized the model to implement the Focal Loss [35].</p>
      </sec>
      <sec id="sec-3-4">
        <p>Finally, we set early stopping with a patience of 2 epochs on the validation loss (using 20% of the training data as the validation set).</p>
        <p>As a baseline (called NP, for non-perspectivist), we aggregated the annotations via majority voting and discarded the instances where a majority was not found, adopting the methodology explained in Section 3. Thus, we trained the model on the aggregated set of 1,771 instances, and tested it on the gold test set. For the models based on the two clustering techniques, we implemented an ensemble strategy inspired by [16]: for each cluster we created a gold standard to train a perspective-aware model, and applied majority voting on their predictions, obtaining an ensemble classifier per technique. We tested the models on the gold test set and compared the results with the baseline.</p>
      </sec>
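<p>The majority-voting aggregation with discarded ties, used for both the NP baseline and the per-cluster gold standards, can be sketched as follows (our own toy example; names are ours):</p>
<preformat>
```python
from collections import Counter

def aggregate(annotations):
    """annotations: dict instance_id to list of 0/1 labels.
    Returns instance_id to majority label, dropping tied instances,
    as in the aggregation procedure described above."""
    gold = {}
    for inst, labels in annotations.items():
        (top, n_top), *rest = Counter(labels).most_common()
        if rest and rest[0][1] == n_top:
            continue                  # no majority: discard the instance
        gold[inst] = top
    return gold

aggregate({"a": [1, 1, 0], "b": [0, 1]})
# {'a': 1}  ("b" is tied and therefore discarded)
```
</preformat>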
      <sec id="sec-3-5">
        <p>³ https://huggingface.co/bert-base-uncased</p>
        <p>To train the cluster-based models, we firstly excluded the gold test set, and grouped the remaining label–text pairs according to each of the obtained clusters, extracting 5 and 4 datasets, respectively, for the first and second representation technique. Eventually, we applied a majority voting strategy and excluded those instances where a majority was not present. Table 6 illustrates the number of instances per dataset.</p>
        <p>After training, we tested the models both in an in-dataset setting (on EPIC’s gold test set) and in a cross-dataset setting, specifically on the SemEval-2018 Task 3 test set [25], previously described in Section 3.2. Finally, we implemented a majority voting ensemble (M-ENS) that returns a final label by applying majority vote over the predictions of each cluster-based classifier. Table 7 shows the average precision, recall and F1-score over 10 runs. We found low variation in the scores, as illustrated by the standard deviations in parentheses.</p>
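<p>The M-ENS combination step can be sketched as follows (toy predictions; the paper does not specify tie-breaking for an even number of classifiers, and this sketch simply keeps the first value seen):</p>
<preformat>
```python
from collections import Counter

# Majority-vote ensemble (M-ENS) over the cluster-based classifiers:
# each inner list holds one classifier's predictions for the test set.
def m_ens(per_cluster_preds):
    n_items = len(per_cluster_preds[0])
    final = []
    for i in range(n_items):
        votes = Counter(preds[i] for preds in per_cluster_preds)
        final.append(votes.most_common(1)[0][0])
    return final

m_ens([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
# [1, 1, 0]
```
</preformat>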
        <p>Looking at Table 7, we can notice that the two majority ensembles obtained from the explored representation techniques always outperform the baseline, both in-dataset and cross-dataset. In the first setting, the macro-averaged F1 score of M-ENS α gives the best results, while M-ENS KPCA presents the best performance cross-dataset. The results demonstrate that modelling annotators’ opinions is necessary when working on highly subjective phenomena such as irony, as strongly confirmed by the performance of cluster-based ensembles in a cross-dataset setting. More importantly, these experiments prove that training perspective-aware models based on annotators’ mined opinions can be an effective instrument to capture a diversity of points of view.</p>
      </sec>
      <sec id="sec-3-6">
        <p>Notably, the increase in macro-F1 score is a reflection of a better prediction of the positive class. Considering that the classes were highly unbalanced (see Section 3.1), the accuracy measure is higher for the baseline model, which is less sensitive to the presence of irony and therefore over-predicts the negative class.</p>
        <p>Despite the clusters obtained with the two representation techniques being very different in terms of methodology (Section 4.1, Section 4.2) and composition (Section 4.3), the models exhibit comparable performance. In-dataset, the ensemble based on the α clusters gives slightly better scores than KPCA, but this trend is inverted in the second setting.</p>
        <p>These results confirm the idea that by mining annotator perspectives we can let the annotators’ opinions emerge regardless of their demographics, observing how social background can influence the individual’s definition of what is ironic, shared among characteristics that might go beyond common demographic traits.</p>
        <p>[Table 7: average precision, recall and F1 of the NP baseline, M-ENS α and M-ENS KPCA models, in-dataset and cross-dataset.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>In this paper, we implemented and tested two techniques
to mine annotator perspectives, moving from the idea
that the set of their annotations can be used as a representation
of their opinion on the topic they are annotating,
in our case ironic content in social media platforms. We
chose to perform this analysis on irony since it is a highly
subjective phenomenon where not only demographic, but
also linguistic and social aspects can influence annotators'
interpretation and judgement. For this reason, we
used the recently published English Perspectivist Irony
Corpus (EPIC).</p>
      <p>For mining annotators' perspectives we proposed two
methodologies. The former, inspired by [24], was to interpret
similarity of opinions in terms of inter-annotator
agreement, adapting Krippendorff's alpha and overcoming
its structural limitations. The latter consisted in a
dimensionality reduction of annotator vectors, using Kernel
Principal Component Analysis, thus applying a non-linear
mapping of our data. Then, we applied a hierarchical
clustering algorithm to analyse how annotators group
together. Looking at the composition of clusters with
respect to annotators' demographic data, the results demonstrate
how different the two mining techniques are. In
fact, Kernel PCA highlights the correlation between annotators'
nationality and irony perception, while the first
method returns more heterogeneous and better balanced
clusters.</p>
      <p>In the experimental phase, we trained perspective-aware
models for each cluster obtained via the two representation
techniques, and implemented an ensemble
strategy to select the predicted labels, based on majority
voting. Both in-dataset and cross-dataset performance
showed that the ensemble models always outperform the
baseline, demonstrating the robustness of our method
also when tested on a different corpus.</p>
      <p>Considering these promising results, we believe that
this approach can be of fundamental use for future
research in the perspectivist field. Firstly, it makes it possible
to mine annotators' opinions when demographic
information is not known. Secondly, it can help to avoid
built-in biases in creating perspective-aware classifiers,
testing whether annotators' choices might be driven by
factors uncorrelated to given demographics, but rather
linked to other elements of their social and individual
background.</p>
      <p>Although we tackled the Krippendorff's alpha paradox
described in Section 4.1, there are other anomalies of
the measure itself extensively described in [
        <xref ref-type="bibr" rid="ref3">28</xref>
        ], which
might have had a negative impact on the clusters obtained via
the first representation technique.</p>
      <p>Moreover, in this work we group annotators using a
hard clustering algorithm. However, as reality is more
nuanced and many dimensions interact in describing
human variability, a soft clustering approach could lead
to more accurate representations, although its application
is computationally more complex in this context.</p>
      <p>For the future, we plan to perform the same experiments
on multiple pre-trained language models, to further
test the consistency of our results, and test other
representation techniques such as autoencoders. Our
analysis of the composition of the annotator clusters
indicates some degree of intersectionality of demographic
traits with respect to the annotation of irony, which we
consider a research direction to pursue further. Another
aspect worth investigating is the relative position of individual
annotators among their assigned clusters, checking
whether it correlates with factors like annotation
quality. Finally, while our results are very encouraging, it
must be noted that the experimental task still involved an
aggregated test benchmark. We expect that our method
will produce more impactful results when measured on
a perspectivist, disaggregated benchmark, which we aim
to develop in the next steps of our research.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[1] L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation, AI Magazine 36 (2015) 15–24.</p>
      <p>[2] V. Basile, M. Fell, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, A. Uma, et al., We need to consider disagreement in evaluation, in: Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Association for Computational Linguistics, 2021, pp. 15–21.</p>
      <p>[3] S. Akhtar, V. Basile, V. Patti, Whose opinions matter? Perspective-aware models to identify opinions of hate speech victims in abusive language detection, arXiv preprint arXiv:2106.15896 (2021).</p>
      <p>[4] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, Washington DC, USA, 2023.</p>
      <p>[5] V. Basile, et al., It's the end of the gold standard as we know it. On the impact of pre-aggregation on the evaluation of highly subjective tasks, in: CEUR Workshop Proceedings, volume 2776, CEUR-WS, 2020, pp. 31–40.</p>
      <p>[6] B. Plank, The 'problem' of human label variation: On ground truth in data, modeling and evaluation, arXiv preprint arXiv:2211.02570 (2022).</p>
      <p>[7] S. Frenda, A. Pedrani, V. Basile, S. M. Lo, A. T. Cignarella, R. Panizzon, C. Marco, B. Scarlini, V. Patti, C. Bosco, D. Bernardi, EPIC: Multi-perspective annotation of a corpus of irony, in: ACL 2023, 2023. URL: https://www.amazon.science/publications/epic-multi-perspective-annotation-of-a-corpus-of-irony.</p>
      <p>[8] A. Joshi, P. Bhattacharyya, M. J. Carman, Investigations in computational sarcasm, Springer, 2018.</p>
      <p>[9] R. Ortega-Bueno, F. Rangel, D. Hernández Farías, P. Rosso, M. Montes-y Gómez, J. E. Medina Pagola, Overview of the task on irony detection in Spanish variants, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019), CEUR-WS.org, volume 2421, 2019, pp. 229–256.</p>
      <p>[10] J. Karoui, F. Benamara, V. Moriceau, V. Patti, C. Bosco, N. Aussenac-Gilles, Exploring the impact of pragmatic phenomena on irony detection in tweets: A multilingual corpus study, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 262–272.</p>
      <p>[11] E. Leonardelli, S. Menini, A. P. Aprosio, M. Guerini, S. Tonelli, Agreeing to disagree: Annotating offensive language datasets with annotators' disagreement, arXiv preprint arXiv:2109.13563 (2021).</p>
      <p>[12] J. Kocoń, A. Figas, M. Gruza, D. Puchalska, T. Kajdanowicz, P. Kazienko, Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach, Information Processing &amp; Management 58 (2021) 102643.</p>
      <p>[13] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The risk of racial bias in hate speech detection, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1668–1678.</p>
      <p>[14] M. Sap, S. Swayamdipta, L. Vianna, X. Zhou, Y. Choi, N. A. Smith, Annotators with attitudes: How annotator beliefs and identities bias toxic language detection, arXiv preprint arXiv:2111.07997 (2021).</p>
      <p>[15] V. Prabhakaran, A. M. Davani, M. Diaz, On releasing annotator-level labels and information in datasets, arXiv preprint arXiv:2110.05699 (2021).</p>
      <p>[16] S. Akhtar, V. Basile, V. Patti, Modeling annotator perspective and polarized opinions to improve hate speech detection, in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 8, 2020, pp. 151–154.</p>
      <p>[17] A. A. Taha, L. Hennig, P. Knoth, Confidence estimation of classification based on the distribution of the neural network output layer, arXiv preprint arXiv:2210.07745 (2022).</p>
      <p>[18] L. Flek, Returning the N to NLP: Towards contextually personalized classification models, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7828–7838.</p>
      <p>[19] J. Bielaniewicz, K. Kanclerz, P. Miłkowski, M. Gruza, K. Karanowski, P. Kazienko, J. Kocoń, Deep-SHEEP: Sense of humor extraction from embeddings in the personalized context, in: 2022 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, 2022, pp. 967–974.</p>
      <p>[20] J. Kocoń, M. Gruza, J. Bielaniewicz, D. Grimling, K. Kanclerz, P. Miłkowski, P. Kazienko, Learning personal human biases and representations for subjective tasks in natural language processing, in: 2021 IEEE International Conference on Data Mining (ICDM), IEEE, 2021, pp. 1168–1173.</p>
      <p>[21] P. Kazienko, J. Bielaniewicz, M. Gruza, K. Kanclerz, K. Karanowski, P. Miłkowski, J. Kocoń, Human-centred neural reasoning for subjective content processing: Hate speech, emotions, and humor, Information Fusion (2023).</p>
      <p>[22] J. Plepi, B. Neuendorf, L. Flek, C. Welch, Unifying data perspectivism and personalization: An application to social norms, arXiv preprint arXiv:2210.14531 (2022).</p>
      <p>[23] K. Kanclerz, M. Gruza, K. Karanowski, J. Bielaniewicz, P. Miłkowski, J. Kocoń, P. Kazienko, What if ground truth is subjective? Personalized deep neural hate speech detection, in: Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @ LREC2022, 2022, pp. 37–45.</p>
      <p>[24] M. Fell, S. Akhtar, V. Basile, Mining annotator perspectives from hate speech corpora, in: NL4AI @ AI*IA, 2021.</p>
      <p>[25] C. Van Hee, E. Lefever, V. Hoste, SemEval-2018 task 3: Irony detection in English tweets, in: Proceedings of the 12th International Workshop on Semantic Evaluation, 2018, pp. 39–50.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krippendorff</surname>
          </string-name>
          ,
          <article-title>Computing Krippendorff's alpha-reliability</article-title>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-R. Müller</surname>
          </string-name>
          ,
          <article-title>Kernel principal component analysis</article-title>
          ,
          <source>in: Artificial Neural Networks-ICANN'97: 7th International Conference Lausanne, Switzerland, October</source>
          <volume>8</volume>
          -
          <issue>10</issue>
          ,
          <year>1997</year>
          Proceedings, Springer,
          <year>2005</year>
          , pp.
          <fpage>583</fpage>
          -
          <lpage>588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Checco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roitero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Maddalena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          ,
          <article-title>Let's agree to disagree: Fixing agreement measures for crowdsourcing</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Human Computation and Crowdsourcing</source>
          , volume
          <volume>5</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>T.</given-names>
            <surname>Caliński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harabasz</surname>
          </string-name>
          ,
          <article-title>A dendrite method for cluster analysis</article-title>
          ,
          <source>Communications in Statistics-theory and Methods</source>
          <volume>3</volume>
          (
          <year>1974</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Bouldin</surname>
          </string-name>
          ,
          <article-title>A cluster separation measure</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          (
          <year>1979</year>
          )
          <fpage>224</fpage>
          -
          <lpage>227</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Arabie</surname>
          </string-name>
          ,
          <article-title>Comparing partitions</article-title>
          ,
          <source>Journal of classification 2</source>
          (
          <year>1985</year>
          )
          <fpage>193</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>N. X.</given-names>
            <surname>Vinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Epps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <article-title>Information theoretic measures for clusterings comparison: is a correction for chance necessary?</article-title>
          ,
          <source>in: Proceedings of the 26th annual international conference on machine learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1073</fpage>
          -
          <lpage>1080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>M.</given-names>
            <surname>Orlikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Röttger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>The ecological fallacy in annotation: Modelling human label variation goes beyond sociodemographics</article-title>
          ,
          <source>ArXiv abs/2306.11559</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <article-title>Focal loss for dense object detection</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>