
PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge

Valerio Basile

Silvia Casola

Simona Frenda

Soda Marem Lo

Perspectivism, Irony Detection, Evaluation

0 Interaction Lab, Heriot-Watt University, Edinburgh, Scotland
1 MaiNLP & MCML, LMU Munich, Germany
2 University of Turin, Italy
3 aequa-tech, Turin, Italy

Works in perspectivism and human label variation have emphasized the need to collect and leverage various voices and points of view in the whole Natural Language Processing pipeline. PERSEID places itself in this line of work. We consider the task of irony detection from short social media conversations in Italian collected from Twitter (X) and Reddit. To do so, we leverage data from MultiPICo, a recent multilingual dataset with disaggregated annotations and annotators' metadata, whose Italian portion contains 1,000 <post, reply> pairs with five annotations each on average. We aim to evaluate whether prompting LLMs with additional annotators' demographic information (namely gender only, age only, and the combination of the two) results in improved performance compared to a baseline in which only the input text is provided. The evaluation is zero-shot, and we evaluate the results on the disaggregated annotations using F1.


1. Challenge: Introduction and Motivation

Recently, researchers have shown a growing interest in human-centered technologies to make Artificial Intelligence (AI) models and products more attentive to the users' sensitivity and needs.

In Natural Language Processing (NLP), works on perspectivism [1] and human label variation [2] have emphasized the intrinsic variability in human annotation and thus the importance of incorporating a diverse set of voices; this aspect affects all phases of the NLP pipeline, including collecting disaggregated datasets [3, 4, 5], analyzing existing disagreement [6], learning from disaggregated data [7, 8], and evaluating considering several voices as valid [9, 1].

During the data collection and annotation phase, works in this area have gone beyond considering disagreement as motivated by noise only and thus as an attribute to be minimized and resolved, e.g., through majority voting. In contrast, research has emphasized the necessity of collecting a variety of voices and considering all such voices as valid. The reason is twofold. On the one hand, researchers have argued that many tasks that are popular in the NLP community (including, for example, hate speech and humor detection) are intrinsically subjective [10], as points of view might differ depending on users' social background, beliefs, and demographics. Using a single aggregated label has thus been increasingly questioned [11, 12, 13], and preserving disaggregated data is preferred. On the other hand, recent work has shown that design choices and biases affect datasets and models and often result in models unexpectedly aligned with a given population segment more than with another [14]; in fact, aggregated data tend to reflect a minority of perspectives, under-representing others [15, 4].

As a result, disaggregated datasets have become more popular, as listed in the Perspectivist Data Manifesto (https://pdai.info/) and by Plank [2]. Researchers are increasingly reporting annotators' demographics and other metadata when describing the dataset, which was first advised as a good practice to avoid excluding, minimizing, and misrepresenting certain groups of users [16]. Recent work has also explored whether annotators' demographics and background, as described by available metadata, influence their annotation [5, 17, 18, 19, 4] and can help during the modeling of the phenomenon under study [20, 8, 21].

Despite the increasing interest in disaggregated and metadata-rich datasets, few such datasets for irony detection exist. Simpson et al. [22] released a corpus for humor detection in English, used as a benchmark in the shared task [23]. No annotators' metadata, however, are included. Frenda et al. [4] proposed a dataset for irony detection and investigated the influence of the annotators' demographics on their perception [6]. The dataset contains English texts only.

For this challenge at CALAMITA [24], we propose to use the Italian portion of MultiPICo (Multilingual Perspectivist Irony Corpus) [25], available at https://huggingface.co/datasets/Multilingual-Perspectivist-NLU/MultiPICo with a CC-BY 4.0 license. MultiPICo is a multilingual corpus of short Post-Reply conversational pairs extracted from Twitter and Reddit and annotated as ironic or not ironic by crowdsourcing workers with different demographics and backgrounds. MultiPICo covers 9 languages (Arabic, English, Dutch, French, German, Hindi, Italian, Portuguese, and Spanish) and 25 language varieties (for example, texts in Austrian, German, and Swiss German are included in the dataset), ranging from high- to low-resourced ones. Moreover, a rich set of annotators' sociodemographic information (balanced gender, age, nationality, ethnicity, student, and employment status) is provided.

While no perspectivist task leveraging the dataset has been proposed so far, PERSEID is related to the Learning With Disagreement task held at SemEval 2021 [12] and 2023 [13]. In LeWiDi, participant systems were challenged to learn the distribution of labels, tested by cross entropy-based metrics. In contrast, PERSEID aims at stimulating the development of models of human perspectives, in order to explain the label distributions rather than just quantifying them.

2. Challenge: Description

The task of Perspectivist Irony Detection aims to measure models' capability to detect irony in a short verbal exchange for each annotator, conditioned on the knowledge of demographic information about them. To this purpose, we want to look at different model performances if it is informed by one demographic trait or a combination of two. In particular, we focus on the gender and age of the annotator, due to the balanced number of male and female annotators by design (see Section 3.2), and due to the fact that age was shown to be one of the most polarized dimensions in [25].

The input to the task does not consist only of a text, but rather of a tuple <perspective, post, reply>.

In this iteration of PERSEID, we considered several variables for the perspective attribute:

• None (Task 0): acting as a baseline, we want to investigate the models' outputs when no information about the annotator is provided.
• Age (Task 1): the perspective is one of four values encoding the age group of the annotator.
• Gender (Task 2): the perspective is the binary self-identified gender of the annotator.
• Age + Gender (Task 3): in this case, both attributes are provided as the perspective.

The post is a textual post, to which the target reply is a reply. The output of the prediction is a binary label indicating whether the reply is ironic (or non-ironic) for a human bearing the characteristic of the perspective annotating the text. The performance of the model is evaluated through a global F1 metric on the disaggregated annotations.

The challenge is zero-shot: no training, fine-tuning, or in-context learning is considered for this version of PERSEID and the whole dataset can be used for inference. Note that since each annotator can be described by no traits (Task 0), one single trait (Task 1 and Task 2), and two traits (Task 3), we do not aim at optimal performance when considering personalized irony detection; instead, our goal is to understand whether models improve their performance when one or multiple traits are provided and to understand the impact of different configurations.

3. Data description

3.1. Origin of data

The data for the challenge are part of MultiPICo [25], a corpus of 18,778 short conversations collected from Reddit (8,956) and Twitter (9,822) in 9 languages, and a total of 25 varieties.

Data were collected to reproduce the structure of short conversations. For both Reddit and Twitter, the post is typically a message initiating a thread and the reply a direct reply to that message. (For Reddit, second-level replies were collected in a minority of cases; for Twitter, the post is a reply to a thread-starting message in a minority of cases.)

Reddit data were retrieved using the Pushshift repository (https://redditsearch.io/) from January 2020 to June 2021. For Italian, data were downloaded from the subreddit /r/Italy. Pairs having at least one deleted or removed comment were filtered out, and the language of the messages was further validated using the Python library for language identification LangID (https://github.com/saffsd/langid.py).

Twitter data were collected via Twitter Stream API, using the geolocation service and excluding quotes and retweets. Then, the full conversation was retrieved, and tweets that directly replied to the starting ones were retained.

The data collection resulted in 18,778 instances, together with their metadata, consisting of Post-Reply original IDs, subreddits, and geolocation information.

Table 1. #Texts for each of the nine languages, followed by the total: 2,181; 1,000; 2,999; 1,760; 2,375; 786; 1,000; 1,994; 4,683; 18,778.
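The two Reddit filtering steps described in this section (dropping pairs with a deleted or removed comment, then validating the language) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: `keep_pair`, `toy_identifier`, and the rule that both messages must match the target language are assumptions, and the language identifier is passed in as a function so that the sketch runs without external libraries.

```python
from typing import Callable

# Placeholder strings Reddit shows for deleted or moderator-removed comments.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def keep_pair(post: str, reply: str,
              identify_language: Callable[[str], str],
              target_lang: str = "it") -> bool:
    """Return True if a <post, reply> pair survives both filters (hypothetical helper)."""
    texts = (post.strip(), reply.strip())
    # Filter 1: drop the pair if either side was deleted or removed.
    if any(t in DELETED_MARKERS for t in texts):
        return False
    # Filter 2 (assumption): both messages must be identified as the target language.
    return all(identify_language(t) == target_lang for t in texts)


def toy_identifier(text: str) -> str:
    """Stand-in for a real language identifier; for illustration only."""
    italian_cues = ("che", "non", "è")
    return "it" if any(w in text.lower().split() for w in italian_cues) else "en"


pairs = [
    ("Che bella giornata", "Non direi proprio"),
    ("[deleted]", "Non direi proprio"),
    ("Che bella giornata", "What a day"),
]
kept = [p for p in pairs if keep_pair(*p, identify_language=toy_identifier)]
print(kept)
```

With the langid.py library installed, a function such as `lambda t: langid.classify(t)[0]` could be passed as `identify_language` in place of the toy stand-in.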

For Italian, data account for 1,000 <post, reply> pairs, equally sourced from Reddit and Twitter.

3.2. Annotation details

Annotators were asked to read a set of post and reply pairs and answer whether the text of the reply was ironic or not, given the context.

The human annotation of the collected data was performed on the crowdsourcing platform Prolific, through a custom-built annotation interface designed to collect a diverse and balanced set of annotators. The interface mimicked a message conversation, having the post as context and asking whether the reply was Ironic or Not ironic.

For Italian, 24 native-speaker annotators were hired, who performed 4,790 annotations in total, resulting in a mean of 4.79 annotations per instance (see Table 1).

Annotators were selected based on three criteria:

• Their completion rate had to be greater than or equal to 99%.
• They had to be native speakers of the considered language (i.e., Italian, for the portion of data used in the challenge).
• The set of annotators needed to be balanced across genders.

The quality of the annotation was further assured using attention check questions in the form of “Please answer X to this question”. Annotators had a 1% probability of receiving these special questions. Annotators who failed to respond correctly to at least 50% of these questions were excluded from the final corpus.

A rich set of metadata is also provided. These include the self-identified Gender (balanced by design), their nationality, their Age Group (1 GenX, 15 GenY, 8 GenZ, for Italian), Ethnicity (23 white people, 1 mixed person, for Italian), Student status (14 yes, 9 no, for Italian), and Employment status (9 in full-time jobs, 7 unemployed, 5 working part-time, 1 not in paid work and 1 due to start, for Italian), as reported in Table 2. No workers whose age is > 42, i.e., from the baby boomer generations, participated in the annotation of the Italian portion of the dataset.

3.3. Data format

'reply_id': 2497527360959166890,
'source': 'twitter',
'timestamp': '2022-12-07 15:49:50'

3.4. Example of prompts used for

Task 1 The perspective variable is a verbalization of the Age Group variable. It can be instantiated with one of four values:

• “una persona giovane della generazione Z” (“a young person of Generation Z”) if Generation == GenZ (Age < 26)
• “una persona giovane della generazione Y” (“a young person of Generation Y”) if Generation == GenY (26 ≤ Age < 42)
• “una persona adulta della generazione X” (“an adult person of Generation X”) if Generation == GenX (42 ≤ Age < 58)
• “una persona adulta della generazione baby boomer” (“an adult person of the baby boomer generation”) if Generation == Boomer (Age > 58)
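The age thresholds listed above can be read as a simple lookup from an annotator's age to the Italian verbalization. The sketch below is illustrative only: the function name `verbalize_age` is hypothetical, and because the listed ranges leave an age of exactly 58 unassigned, the sketch assumes it falls into the baby boomer group.

```python
def verbalize_age(age: int) -> str:
    """Map an annotator's age to the Italian perspective string (hypothetical helper)."""
    if age < 26:
        return "una persona giovane della generazione Z"
    if age < 42:
        return "una persona giovane della generazione Y"
    if age < 58:
        return "una persona adulta della generazione X"
    # Assumption: age 58 itself is treated as baby boomer, since the
    # listed ranges (Age < 58 and Age > 58) do not cover it.
    return "una persona adulta della generazione baby boomer"


for sample_age in (25, 26, 42, 60):
    print(sample_age, "->", verbalize_age(sample_age))
```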

Task 2 The perspective variable is a verbalization of the Gender variable, which is expressed as a string in English. It can be instantiated with one of two values:

• “una donna” (“a woman”) if Gender == “Female”
• “un uomo” (“a man”) if Gender == “Male”

Task 3 The perspective variable is a verbalization of both the Age and Gender variables, e.g., “una giovane donna della generazione Z” (“a young woman of Generation Z”).

4. Metrics

Inspired by Mokhberian et al. [26], the Perspectivist Irony Detection task is evaluated by means of global F1, that is, the F1-score computed across all the individual annotations in the dataset against the predictions of the model.

5. Limitations

Data The sociodemographic information about the annotators is partial, bound to what was available from the crowdsourcing platform, and following a discretization of human personal traits that could be perceived as forced (e.g., representing self-identified gender as a single binary label). Furthermore, as shown by Orlikowski et al. [21], annotators' sociodemographics do not always align with the most relevant grouping of annotators according to the language phenomenon under study.

Annotators of the Italian portion of MultiPICo tend to be young (with no annotators from the baby boomer generation and only one from GenX). This aspect might influence the results.

Similarly to Sachdeva et al. [5], Sap et al. [19], and Forbes et al. [27], we noticed the ethnicity of annotators is unbalanced, and all but one of the annotators are white for the considered data.

In the vast majority (∼90%) of cases, the conversation-starting messages and their direct replies were downloaded to capture the full conversational context. In a few cases, the downloaded reply was not direct but rather a second-level reply (a reply to a direct reply); thus, some conversational context might be missing.

Challenge design We describe annotators by no sociodemographic traits (Task 0), one single demographic trait (Task 1 and Task 2), or two demographic traits (Task 3). We evaluate disaggregated annotations at inference time, having the annotators represented only by those traits. Annotators' sociodemographic information does not always align with the most relevant grouping of annotators according to the language phenomenon under study [21, 28], and the limited amount of sociodemographic traits we provide is undoubtedly not enough to describe every single annotator. We are aware of this limitation. In fact, our main aim is to understand whether providing one or more annotator traits makes the model predictions more aligned with annotators having a given characteristic.

6. Ethical issues

This work places itself in an increasing amount of work that calls to consider and include the subjectivity of the annotators in NLP applications, encouraging reflection on the different perspectives encoded in annotated datasets to minimize the amplification of biases. We hope this challenge will be a starting point for investigating and evaluating LLMs in Italian to make them suitable for final users.

The dataset used in the challenge was built by adopting measures to protect the privacy of annotators, and the data handling protocols were designed to safeguard personal information (like anonymization of users' mentions). Although the attention during the collection of data was focused on ironic content spread online, we acknowledge that some of the material contains racist, sexist, stereotypical, violent, or generally disturbing content.

Annotators are balanced through their self-identified gender. However, we are aware that considering gender in a binary form is limited; moreover, a substantial unbalance for some dimensions, like the self-identified ethnicities, is present in the dataset. This pattern suggests the need to interact differently with annotators or social communities if we want a diversity of annotators and perspectives in terms of social background.
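As a concrete reading of the global F1 used to evaluate the task, the sketch below scores every individual annotation against the model prediction for the corresponding <perspective, post, reply> input. It is a minimal illustration under stated assumptions, not the official scorer: F1 is computed on the ironic class only (label 1), and the names `global_f1`, `annotations`, and `predictions` as well as the toy data are hypothetical.

```python
from typing import Dict, List, Tuple

# (instance_id, perspective, gold label), with 1 = ironic and 0 = not ironic.
Annotation = Tuple[str, str, int]


def global_f1(annotations: List[Annotation],
              predictions: Dict[Tuple[str, str], int]) -> float:
    """F1 on the ironic class over all individual (disaggregated) annotations."""
    tp = fp = fn = 0
    for instance_id, perspective, gold in annotations:
        # One prediction per (instance, perspective); annotators who share
        # a perspective are scored against the same prediction.
        pred = predictions[(instance_id, perspective)]
        if pred == 1 and gold == 1:
            tp += 1
        elif pred == 1 and gold == 0:
            fp += 1
        elif pred == 0 and gold == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


annotations = [
    ("pair-1", "una donna", 1),
    ("pair-1", "un uomo", 0),
    ("pair-1", "un uomo", 1),
    ("pair-2", "una donna", 0),
    ("pair-2", "un uomo", 1),
]
predictions = {
    ("pair-1", "una donna"): 1,
    ("pair-1", "un uomo"): 1,
    ("pair-2", "una donna"): 0,
    ("pair-2", "un uomo"): 0,
}
score = global_f1(annotations, predictions)
print(round(score, 3))
```

In the toy data, two annotators with the same perspective disagree on the same pair, so no single prediction can match both; this is one reason the challenge does not aim at optimal personalized performance.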

7. Data license and copyright issues

References

[1] V. Basile, M. Fell, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, A. Uma, et al., We need to consider disagreement in evaluation, in: Proceedings of the 1st workshop on benchmarking: past, present and future, Association for Computational Linguistics, 2021, pp. 15–21.

[2] B. Plank, The "problem" of human label variation: On ground truth in data, modeling and evaluation, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 10671–10682.

[3] F. Cabitza, A. Campagner, V. Basile, Toward a perspectivist turn in ground truthing for predictive computing, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023, pp. 6860–6868. URL: https://ojs.aaai.org/index.php/AAAI/article/view/25840.

[4] S. Frenda, A. Pedrani, V. Basile, S. M. Lo, A. T. Cignarella, R. Panizzon, C. Marco, B. Scarlini, V. Patti, C. Bosco, D. Bernardi, EPIC: Multi-perspective annotation of a corpus of irony, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 13844–13857. URL: https://aclanthology.org/2023.acl-long.774. doi:10.18653/v1/2023.acl-long.774.

[5] P. Sachdeva, R. Barreto, G. Bacon, A. Sahn, C. von Vacano, C. Kennedy, The measuring hate speech corpus: Leveraging rasch measurement theory for data perspectivism, in: G. Abercrombie, V. Basile, S. Tonelli, V. Rieser, A. Uma (Eds.), Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022, European Language Resources Association, Marseille, France, 2022, pp. 83–94. URL: https://aclanthology.org/2022.nlperspectives-1.11.

[6] S. Frenda, S. M. Lo, S. Casola, B. Scarlini, C. Marco, V. Basile, D. Bernardi, Does anyone see the irony here? Analysis of perspective-aware model predictions […]

[8] […] 18653/v1/2023.emnlp-main.212.

[9] A. N. Uma, T. Fornaciari, D. Hovy, S. Paun, B. Plank, M. Poesio, Learning from disagreement: A survey, Journal of Artificial Intelligence Research 72 (2021) 1385–1470.

[10] L. Aroyo, C. Welty, Truth is a lie: Crowd truth and the seven myths of human annotation, AI Magazine 36 (2015) 15–24. URL: https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2564. doi:10.1609/aimag.v36i1.2564.

[11] E. Leonardelli, S. Menini, A. P. Aprosio, M. Guerini, S. Tonelli, Agreeing to disagree: Annotating offensive language datasets with annotators' disagreement, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 10528–10539.

[12] A. Uma, T. Fornaciari, A. Dumitrache, T. Miller, J. Chamberlain, B. Plank, E. Simpson, M. Poesio, SemEval-2021 task 12: Learning with disagreements, in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), 2021, pp. 338–347.

[13] E. Leonardelli, A. Uma, G. Abercrombie, D. Almanea, V. Basile, T. Fornaciari, B. Plank, V. Rieser, M. Poesio, SemEval-2023 task 11: Learning with disagreements (LeWiDi), in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 2304–2318.

[14] S. Santy, J. Liang, R. Le Bras, K. Reinecke, M. Sap, NLPositionality: Characterizing design biases of datasets and models, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 9080–9102. URL: https://aclanthology.org/2023.acl-long.505. doi:10.18653/v1/2023.acl-long.505.

[15] V. Prabhakaran, A. M. Davani, M. Diaz, On releasing annotator-level labels and information in datasets, in: Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop, 2021, pp. 133–138.

[16] E. M. Bender, B. Friedman, Data statements for natural language processing: Toward mitigating system bias and enabling better science, Transactions of the Association for Computational Linguistics 6 (2018) 587–604.

[17] D. Almanea, M. Poesio, ArMIS - the Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 2282–2291. URL: https://aclanthology.org/2022.lrec-1.244.

[18] S. Akhtar, V. Basile, V. Patti, Whose opinions matter? Perspective-aware models to identify opinions of hate speech victims in abusive language detection, arXiv preprint arXiv:2106.15896 (2021).

[19] M. Sap, S. Swayamdipta, L. Vianna, X. Zhou, Y. Choi, N. A. Smith, Annotators with attitudes: How annotator beliefs and identities bias toxic language detection, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 5884–5906. URL: https://aclanthology.org/2022.naacl-main.431. doi:10.18653/v1/2022.naacl-main.431.

[20] R. Wan, J. Kim, D. Kang, Everyone's voice matters: Quantifying annotation disagreement using demographic information, in: Proceedings of the 37th AAAI Conference on Artificial Intelligence - AAAI Special Track on AI for Social Impact, 2023.

[21] M. Orlikowski, P. Röttger, P. Cimiano, D. Hovy, The ecological fallacy in annotation: Modeling human label variation goes beyond sociodemographics, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1017–1029. URL: https://aclanthology.org/2023.acl-short.88. doi:10.18653/v1/2023.acl-short.88.

[22] E. Simpson, E.-L. Do Dinh, T. Miller, I. Gurevych, Predicting humorousness and metaphor novelty with Gaussian process preference learning, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 5716–5728. URL: https://aclanthology.org/P19-1572. doi:10.18653/v1/P19-1572.

[23] A. Uma, T. Fornaciari, A. Dumitrache, T. Miller, J. Chamberlain, B. Plank, E. Simpson, M. Poesio, SemEval-2021 task 12: Learning with disagreements, in: A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, X. Zhu (Eds.), Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, Online, 2021, pp. 338–347. URL: https://aclanthology.org/2021.semeval-1.41. doi:10.18653/v1/2021.semeval-1.41.

[24] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.

[25] S. Casola, S. Frenda, S. Lo, E. Sezerer, A. Uva, V. Basile, C. Bosco, A. Pedrani, C. Rubagotti, V. Patti, D. Bernardi, MultiPICo: Multilingual perspectivist irony corpus, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 16008–16021. URL: https://aclanthology.org/2024.acl-long.849.

[26] N. Mokhberian, M. Marmarelis, F. Hopp, V. Basile, F. Morstatter, K. Lerman, Capturing perspectives of crowdsourced annotators in subjective learning tasks, in: K. Duh, H. Gomez, S. Bethard (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 7337–7349. URL: https://aclanthology.org/2024.naacl-long.407.

[27] M. Forbes, J. D. Hwang, V. Shwartz, M. Sap, Y. Choi, Social chemistry 101: Learning to reason about social and moral norms, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 653–670. URL: https://aclanthology.org/2020.emnlp-main.48. doi:10.18653/v1/2020.emnlp-main.48.

[28] S. M. Lo, V. Basile, Hierarchical clustering of label-based annotator representations for mining perspectives, in: G. Abercrombie, V. Basile, D. Bernardi, S. Dudy, S. Frenda, L. Havens, E. Leonardelli, S. Tonelli (Eds.), Proceedings of the 2nd Workshop on Perspectivist Approaches to NLP co-located with 26th European Conference on Artificial Intelligence (ECAI 2023), Kraków, Poland, September 30th, 2023, volume 3494 of CEUR Workshop Proceedings, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3494/paper8.pdf.