<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Subjectivity in Stereotypes Against Migrants in Italian: An Experimental Annotation Procedure</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soda Marem Lo</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco A. Stranisci</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Teresa Cignarella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Frenda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Jezek</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University, Language and Translation Technology Team</institution>
          ,
          <addr-line>Groot-Brittanniëlaan 45 - 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Interaction Lab, Heriot-Watt University</institution>
          ,
          <addr-line>EH14 4AS Edinburgh, Scotland</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Pavia, Department of Humanities</institution>
          ,
          <addr-line>Piazza del Lino 2 - 27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Torino</institution>
          ,
          <addr-line>Dipartimento di Informatica, Corso Svizzera 185 - 10149 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>aequa-tech</institution>
          ,
          <addr-line>Via Quarello 15/A - 10153 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The presence of social stereotypes in NLP resources is an emerging topic that challenges traditionally used approaches for the creation of corpora and resources. An increasing number of scholars have proposed strategies for considering annotators' subjectivity in order to reduce such bias both in computational resources and in NLP models. In this paper, we present Open-Stereotype, an annotated corpus of Italian tweets and news headlines regarding immigration in Italy, developed through an experimental procedure for the annotation of stereotypes aimed at investigating their different interpretations. The annotation is the result of a six-step process, where annotators identify text spans expressing stereotypes, generate rationales about these spans, and group them into a more comprehensive set of labels. Results show that humans exhibit high subjectivity in conceptualizing this phenomenon, and that the prior knowledge of an Italian LLM leads to more consistent classifications of specific labels that do not depend on annotators' backgrounds.</p>
      </abstract>
      <kwd-group>
<kwd>Subjectivity</kwd>
        <kwd>Annotation</kwd>
        <kwd>Italian</kwd>
        <kwd>Stereotypes</kwd>
        <kwd>Social Bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Developing fair Natural Language Processing (NLP) technologies for the detection of abusive language is still an open issue that gathers the attention of many scholars. The increasing awareness that corpora for hate speech detection exhibit significant biases, particularly favoring Western and white populations [1], has led scholars to foster explainability [2, 3] and cultural representativeness [4, 5] in the design of new resources. Furthermore, the growing number of perspectivist [6, 7] and multilingual [8] datasets contributes to a deeper and more culturally aware understanding of abusive language, paving the way for the development of less biased technologies.</p>
      <p>Recently, specific attention has been paid to the presence of stereotypes in different contexts, such as political discourse [9], reactions to fake news [10], news comments [11], and news and social media messages [12, 13], often through the development of taxonomies and annotated corpora. However, these advances do not encompass the diverse perceptions or interpretations of stereotypes in text. For instance, although some corpora for the detection of origin-related stereotypes have already been released [12, 14, 11, 15, 16], to the best of our knowledge only one of them has been designed to take subjectivity into account [17], presenting the annotations of three different annotators. This limitation intersects with the scarcity of studies on bias and disagreement in the design of annotation schemes [18, 19, 20].</p>
      <p>In this work we address this research gap by presenting the Open Stereotype (O-Ster)1 corpus: a sub-portion of 1,022 texts of the HaSpeeDe corpus [12] (see details in Section 3), newly re-annotated through an experimental annotation procedure in which labels are not defined a priori, but are rather defined throughout the annotation process, highlighting annotator subjectivity about stereotypes (a posteriori). The resulting annotated corpus allowed us to address the following research questions:</p>
      <p>∙ (RQ 1) How do annotators recognize and conceptualize stereotypes? We designed an annotation procedure that provides for the identification of textual spans expressing stereotypes, the open-ended generation of rationales about these spans, and the categorization of rationales within a closed set of labels. The procedure showed how stereotypes in the same texts are differently perceived by humans, leading to the categorization of the same expressions in different and creative ways that might depend on the subjectivity of annotators.</p>
      <p>∙ (RQ 2) How do models conceptualize stereotypes? In this first study, we prompted one specific Large Language Model (LLM), i.e., Minerva [21], to generate labels to categorize stereotypes. Observing which labels were created and with which annotator they agreed most of the time, we noticed that the LLM aligns more with the labels Exploiters, Dangerous and Protected, choosing them consistently throughout different classification runs.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>* Corresponding authors. † These authors contributed equally. sodamarem.lo@unito.it (S. M. Lo); marcoantonio.stranisci@unito.it (M. A. Stranisci); alessandrateresa.cignarella@ugent.be (A. T. Cignarella); s.frenda@hw.ac.uk (S. Frenda); valerio.basile@unito.it (V. Basile); elisabetta.jezek@unipv.it (E. Jezek); viviana.patti@unito.it (V. Patti)</p>
      <p>1https://github.com/SodaMaremLo/Open-Stereotype-corpus.</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>The detection and modeling of stereotypes in NLP has gained increasing attention in recent years, particularly as the field moves toward more socially responsible and inclusive language technologies. While early computational approaches primarily focused on gender bias and hate speech [22, 23], new work has begun to explore the broader phenomenon of stereotypes, including their implicit [24] and explicit manifestations across different social groups and languages [25, 26].</p>
        <p>Most current work emphasizes the importance of distinguishing between stereotypes, prejudice, and discrimination, and highlights the advantages of a more interdisciplinary approach between computational linguistics and social psychology [27]. The Stereotype Content Model (SCM) [28] and its extension, the ABC model [29], have often been adopted by NLP scholars to conceptualize stereotypes along dimensions such as warmth, competence, and belief alignment. These frameworks have informed both annotation schemes and computational models, enabling more structured analyses of stereotype content. Examples of their application are the works of Bosco et al. [25] and Schmeisser-Nieto et al. [14], in which the authors apply an SCM-based scheme for describing stereotypes towards migrants to a trilingual corpus of tweets.</p>
        <p>Concerning Italian, the HaSpeeDe2 shared task [12] was one of the first to explicitly address stereotype detection by means of a dedicated subtask. Its results pioneered the way for research into stereotype detection in Italian social media, investigating the connection between hate speech and stereotypical content in models. Furthermore, the results of the shared task suggest the need to approach stereotype detection as a subtle phenomenon independent from hate speech. Schmeisser-Nieto et al. [30], comparing human annotations and model predictions on stereotype detection, noted that models tend to show low confidence when annotators have more disagreement with each other, highlighting the importance of encoding plural interpretations in resources and models. In such a context, Cignarella et al. [31] developed the QUEEREOTYPES corpus, in which annotator perspectives are encoded in labeling stereotypes towards LGBTQIA+ people.</p>
        <p>Perspectives of annotators matter, and studies such as those of Sap et al. [5] and Xia et al. [32], for instance, have shown that demographic factors, such as ethnicity or personal and/or linguistic background, can significantly influence the perception of hate speech and stereotypes. The present work builds on the key concepts outlined in this section by proposing an experimental annotation procedure that (i) elevates annotator subjectivity and (ii) builds on narrative patterns in free-text descriptions of stereotypes against migrants. Rather than enforcing a harmonized gold standard, we create and release non-harmonized annotations to preserve the diversity of annotator perspectives.2 This approach aligns with emerging best practices in participatory NLP, and contributes to the growing body of resources for stereotype detection, particularly in languages other than English.</p>
      </sec>
      <sec id="sec-1-2">
        <p>2We also include a positionality statement in Appendix A.1.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Annotation Procedure</title>
<sec id="sec-2-1">
        <p>For the creation of the O-Ster corpus, we adopted a descriptive annotation scheme, as previously done by Röttger et al. [19], with the overarching goal of emphasizing the subjectivity of annotators in recognizing and describing the presence of stereotypes in texts. The annotation procedure is composed of several steps, as shown in Figure 1. In this section, we describe all the steps in detail.</p>
        <p>(1) Filtering the HaSpeeDe2 corpus.</p>
      </sec>
<sec id="sec-2-2">
        <p>The annotation process began with the extraction of a specific subset from the HaSpeeDe2 dataset [12]. This dataset, originally annotated with the presence/absence of hate speech and stereotypes, has also been extended in other works with the annotation of various dimensions of harmful language, including Intensity, Aggressiveness, Offensiveness, Irony, and Sarcasm [33].</p>
        <p>For our purposes, we focused on the subset of texts annotated with a stereotype value of 1. This filtered corpus consists of 1,022 tweets and news headlines, each explicitly marked as containing stereotypical content (of these, 522 texts are hateful and 500 are non-hateful).</p>
      </sec>
      <sec id="sec-2-3">
        <p>1. Text: “Mattinata di ieri passata in un’aula di tribunale. 23 udienze al ruolo. Imputati stranieri 19; imputati italiani 4. Imputati stranieri presenti 0; imputati italiani presenti 4. Conclusione (del tutto personale); vengono tutti a delinquere qua e se ne fregano della nostra giustizia” (translation: Yesterday morning spent in a courtroom. 23 judicial hearings on the register. Foreign defendants: 19; Italian defendants: 4. Foreign defendants present: 0; Italian defendants present: 4. Conclusion (entirely personal): they all come here to commit crimes and don’t give a damn about our justice.)</p>
        <p>2. Textual span: they all come here to commit a crime</p>
        <p>3. Rationale (S-V-O): [foreigners are delinquents, foreigners commit crimes, ..., immigrants are sneaky, ..., immigrants are criminals]</p>
        <p>4a. Targeted entity: foreign defendants = foreigners</p>
        <p>4b. Bare stereotype: [are dangerous, are threatening, are delinquents, are criminals]</p>
        <p>5. Descriptive label: are delinquents → Criminal</p>
        <p>6. Group: THREAT</p>
      </sec>
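<p>The filtering in step (1) can be sketched in a few lines of Python; the record layout below (text, stereotype, hateful fields) is an illustrative assumption, not the actual HaSpeeDe2 release format.</p>
      <p>
```python
# Illustrative sketch of step (1): keep only texts marked as containing
# a stereotype, then split counts by the hate-speech flag.
# The record layout (text / stereotype / hateful) is an assumption,
# not the actual HaSpeeDe2 schema.

def filter_stereotyped(records):
    """Return the records annotated with stereotype = 1."""
    return [r for r in records if r["stereotype"] == 1]

haspeede = [  # toy stand-in for the HaSpeeDe2 dataset
    {"text": "tweet A", "stereotype": 1, "hateful": 1},
    {"text": "headline B", "stereotype": 0, "hateful": 0},
    {"text": "tweet C", "stereotype": 1, "hateful": 0},
]

subset = filter_stereotyped(haspeede)
n_hateful = sum(r["hateful"] for r in subset)
print(len(subset), n_hateful, len(subset) - n_hateful)
```
      </p>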
<sec id="sec-2-4">
        <p>3https://labelstud.io/.</p>
      </sec>
      <sec id="sec-2-5">
        <p>4Roma people invaded Italy.</p>
      </sec>
<sec id="sec-2-6">
        <p>5Migrants are privileged.</p>
        <p>(2) Identification of textual spans. Five different annotators (all researchers in NLP) were instructed to identify one or more spans of text that explicitly conveyed stereotypical content. The annotation task was carried out using a simple spreadsheet, where annotators copied and pasted the identified spans into a designated column corresponding to each text entry, and partially relied on the Label Studio3 platform.</p>
        <p>(3) Writing of rationales. For each identified textual span, annotators were asked to provide a corresponding rationale that explicitly expresses the sense behind the stereotype and the targeted group. Rationales should be provided in the form of a simple sentence, typically following either a Subject-Verb-Object (S-V-O) or Subject-Noun Phrase (S-NP) structure. Examples include: “i rom hanno invaso l’Italia”4 (S-V-O) and “gli immigrati sono privilegiati”5 (S-NP). This step resulted in a total of 3,578 span–rationale pairs.</p>
      </sec>
      <sec id="sec-2-7">
        <p>(4) Text processing. We processed the rationales to ensure consistency and facilitate further linguistic analysis. In particular, 1) we extracted all the target entities mentioned in the sentences; 2) we identified the verbs associated with the targets; 3) we applied lemmatization to reduce verbs to their base forms; and 4) we conjugated them in the Present Indicative tense. This normalization step allowed us to reduce the rationales to a set of 576 distinct bare stereotypes. Finally, all rationales that appeared only once in the corpus were removed to ensure focus on recurring patterns, resulting in a total of 248 frequently occurring bare stereotypes.</p>
        <p>(5) Free-text labeling. To further consolidate the subset of bare stereotypes resulting from the previous step into a manageable and interpretable taxonomy, three annotators were independently tasked with grouping them by generating 10 descriptive labels. Each label was designed to capture the underlying theme or semantic core shared by multiple rationales. For example, the statements “(they) are delinquents” and “(they) are criminals” might have been grouped under the descriptive label Criminal, while “they are dangerous” might have been categorized under the descriptive label Dangerous. This process allowed the transformation of free-text rationales into a structured set of stereotype categories suitable for classification tasks.</p>
        <p>(6) Grouping. To reach a narrower level of the taxonomy, we asked the 3 annotators to reduce the initial set of 10 descriptive labels to 5 broader groups. This second round of refinement involved merging semantically related labels to enhance clarity and usability. For example, the rationales “(they) are delinquents” and “(they) are criminals”, previously grouped under the descriptive label Criminal, and “they are dangerous”, categorized under Dangerous, could all be further consolidated under the broader group Threat.</p>
        <p>It is important to emphasize that, throughout the entire annotation process, annotators were given minimal (if any) prescriptive instructions. They received very limited annotation guidelines, which allowed for a more open-ended and subjective interpretation of stereotype groupings. This deliberate lack of constraints is a central feature of our experimental design, aimed at capturing the annotators’ intuitive understanding (and subjectivity) of stereotypical content in Italian texts.</p>
        <p>An example of a fully annotated text, including its associated stereotype and final label, is presented in Table 1, to complement the information on the workflow of the annotation procedure already outlined in Figure 1.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4. Corpus Analysis</title>
        <p>O-Ster consists of 1,022 texts annotated by 5 people in different proportions (Table 2). Almost all posts were annotated by two people, except for 27 annotated by just one person. For each text, the annotator could assign multiple rationales, reaching an average of 1.77 per post and a total of 3,578 annotations.</p>
      </sec>
      <sec id="sec-2-9">
        <p>Table 2 (Annotator, Nickname, #Texts, #Annotations): _01, Duck, 747, 1,367; _02, Bear, 75, 112; _03, Lion, 100, 129; _04, Panda, 94, 178; _05, Rhino, 1,001, 1,792.</p>
      </sec>
      <sec id="sec-2-10">
        <sec id="sec-2-10-1">
          <title>Identifying ‘agents’ and ‘patients’ in the rationales.</title>
          <p>From the third step described in Section 3, a total of 1,547 rationales was reached. To better understand their construction, we looked into the role of the subject in terms of agents and patients. Specifically, we syntactically parsed each rationale and assigned the role of ‘agent’ to all the targets that are the subject of active verbs (Migrants are criminals), and ‘patient’ when they are the object of the sentence or the subject of a passive verb (Migrants must be kicked out). Finally, we performed a manual aggregation of Roma and Sinti into a unique category, as well as of politicians, including specific people and parties, and of ethnic minorities named by referring to their origin or with generic terms such as “foreigners”.</p>
          <p>Considering the unbalanced number and type of annotations across annotators, we computed the proportion of times each target was annotated as an agent (or patient) by each annotator. This was done by dividing the frequency of each target (as agent or patient) by the total number of agent or patient annotations made by that annotator. We then calculated per-annotator averages of these proportions to establish individual thresholds, used to highlight the most frequently annotated targets. Results are presented in Table 3 (Target, Agency, Annotator): Immigrants, Agent (Duck, Bear, Lion, Panda, Rhino); Italians, Agent (Bear, Panda); Ethnic minority, Agent (Bear, Rhino); Islamic, Agent (Duck, Lion, Panda, Rhino); Roma and Sinti, Agent (Duck, Panda, Rhino); Immigrants, Patient (Duck, Bear, Lion, Panda, Rhino); Roma and Sinti, Patient (Duck).</p>
          <p>Results show that, for all annotators, when targets are presented as immigrants they tend to be framed as both agents and patients in high percentages. However, Bear and Rhino often give agency to specific ethnic minorities. When Italians are targets, they only play the role of agents, especially presenting rationales linked to financial support, such as Italians pay for immigrants. Interestingly, Roma and Sinti are framed as patients by Duck, especially using the rationale Roma are treated better than Italians, and in a low percentage by Rhino (3.2%). Other annotators’ rationales present them only as agents, more often as criminals.</p>
        </sec>
        <sec id="sec-2-10-2">
          <title>Label analysis.</title>
        </sec>
      </sec>
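<p>The per-annotator proportion computation described above can be sketched as follows; the toy (annotator, target, role) annotations are invented for illustration, and the real pipeline derives roles from a syntactic parse rather than from a hand-made list.</p>
      <p>
```python
# Sketch of the per-annotator proportion computation: each
# (annotator, target, role) frequency is divided by that annotator's
# total number of agent/patient annotations, and the per-annotator mean
# of these proportions serves as a threshold for "frequent" targets.
# The toy annotations below are invented for illustration.
from collections import Counter, defaultdict

annotations = [  # (annotator, target, role), illustrative only
    ("Duck", "Immigrants", "agent"),
    ("Duck", "Immigrants", "patient"),
    ("Duck", "Roma and Sinti", "patient"),
    ("Rhino", "Immigrants", "agent"),
    ("Rhino", "Italians", "agent"),
]

counts = Counter(annotations)
totals = Counter(a for a, _, _ in annotations)

proportions = {key: counts[key] / totals[key[0]] for key in counts}

# Per-annotator average proportion, used as an individual threshold.
per_annotator = defaultdict(list)
for (annotator, _, _), p in proportions.items():
    per_annotator[annotator].append(p)
thresholds = {a: sum(ps) / len(ps) for a, ps in per_annotator.items()}

frequent = sorted(
    key for key, p in proportions.items() if p >= thresholds[key[0]]
)
print(thresholds, frequent)
```
      </p>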
<sec id="sec-2-11">
        <p>As described in Section 3, annotators were asked to group the bare stereotypes into 10 descriptive labels, and then to categorize them into 5 broader groups. Results of these steps are presented in Table 4. Focusing on the ten descriptive labels (grey columns), it is possible to notice similarities across annotators. They all individuated the idea of dangerousness (Dangerous), referring to stereotypes connected to being violent. However, analysing the dataset, Duck characterises this description with the idea of invasion, Bear includes non-violent forms of danger, such as bringing diseases, while Rhino involves those aspects that the other two separated into the Criminal label, such as stealing and cheating.</p>
        <p>Other similarities are in the idea of being degraded (Degraded by all annotators), lazy (Loafers by Bear, and Lazy by Rhino), and a burden (Burden by Duck and Bear, and Parasites by Rhino). The use of different words for similar concepts already suggests the different focus adopted by each annotator. For example, Loafers was connected to being useless, more than simply acting lazy.</p>
        <p>Another interesting commonality is the idea of being backward people, and this concept too is expressed through different labels across annotators. Duck used Uncivilized, Bear Different culture, while Rhino separated the concept into two descriptive labels: Savage and Worst than us.</p>
        <p>Finally, some stereotypes have been labeled in significantly different ways. An example is they are nomads, assigned to Privileged by Duck, Different from us by Bear, and Invader by Rhino, highlighting people’s fear of being conquered or having their territories squatted.</p>
        <p>The way an annotator looks at a phenomenon and its categorization becomes even more evident when analyzing the last step: grouping the descriptive labels into 5 categories (white columns in Table 4). In fact, annotators are required to choose which concepts they believe to be priorities and capable of encompassing multiple stereotypes. Duck does not connect the aspect of crime with the idea of danger, as might have been expected from looking at the choices of the other annotators (Degraded by Bear and Threat by Rhino). In contrast, Criminal was merged with Deceivers, combining the dimension of crime with cheating, and tagging the group as Subtle. On the other hand, Dangerous has been included with Radicalized in the broader imagery of incompatibility, implicitly defining what “we” is not. Bear’s groups better encapsulate a contrast of us vs. them, specifically with the labels Worsen our lives and Different culture, which concentrate in a single label the aspects of diversity, primarily religious and cultural. It is noteworthy how the annotators’ positionality (Appendix A.1), in this case, is most evident through their clear-cut distinction between us and them, a trait that is often absent in Rhino’s labels.</p>
        <p>Both Duck and Rhino group the idea of being respectively uncivilized and savage with being degraded, the former using the expression Immoral, thus framing the three descriptive labels into a moral stand, the latter choosing Ruin of Italy, referring to the effect of those acts. Finally, Exploiters unifies the dimension of being parasites and lazy with that of invasion, in a very broad group that defines exploitation from an economic and territorial point of view. Overall, there is a general focus on the exploitation of the country and on the caused sense of danger (respectively Parasites and Subtle by Duck, Do not contribute and Dangerous by Bear, and Exploiters and Threat by Rhino).</p>
        <p>Each annotator, however, has elements of uniqueness. For Duck, this is reflected in the creation of a single group of stereotypes that define aspects of the target groups’ identities perceived as problematic (Problem). Bear, on the other hand, is the only one to foreground the idea of a worsening of Italians’ lives, defined in relation to the risk of invasion and economic exploitation. Lastly, Rhino is the only one to maintain a single label for the religious dimension and the perspective of protection, concerning the perception of a privileged position of the target group relative to Italians.</p>
      </sec>
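<p>The two-level taxonomy above (descriptive labels folded into broader groups) can be represented as a simple lookup table; only pairings explicitly mentioned in the text are included below, the rest of any real mapping being annotator-specific.</p>
      <p>
```python
# Sketch of step (6): consolidating descriptive labels into broader
# groups as a lookup table. Only some pairings below are mentioned in
# the text (e.g., Criminal and Dangerous folded into Threat); a real
# mapping would be annotator-specific and cover all 10 labels.
label_to_group = {
    "Criminal": "Threat",
    "Dangerous": "Threat",
    "Parasites": "Exploiters",
    "Lazy": "Exploiters",
}

def group_of(descriptive_label):
    """Map a descriptive label to its broader group (None if unmapped)."""
    return label_to_group.get(descriptive_label)

print(group_of("Criminal"), group_of("Dangerous"))
```
      </p>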
<sec id="sec-2-12">
        <p>On average, the model generated a bad output 7.32% of the time. Messages that obtained a classification throughout all three runs are 1,922: 85.61% of the total. The analysis of results presented in this section considers only texts that obtained a classification in each run.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiment</title>
      <sec id="sec-3-1">
        <title>Hateful comments.</title>
<sec id="sec-3-1-1">
          <p>In this section, we present an experiment aimed at observing the behavior of an Italian LLM in the classification of stereotypes according to the labels derived from our annotation process (Section 3). The experimental setup was a zero-shot text generation task. We fed the LLM with a message and a list of the three labels defined by annotators, and asked the model to generate as output one of the three labels.6</p>
          <p>Focusing on the last phase of the pipeline, Figure 2 shows how the occurrence of the groups of labels changes based on the presence of hate speech. Labels such as Incompatible, Dangerous, and both Exploiters and Radicalized, respectively for Duck, Bear and Rhino, tend to be more frequent when the message was annotated as hateful. These results highlight how a stereotypical representation of the stranger as an invader, a religious extremist, or more generally a threatening individual, is linked to hate speech. It is worth noticing that the blue bars tend to be higher in most cases, although the texts are almost perfectly split across hateful and non-hateful (respectively 522 and 500). This indicates that the presence of hate speech also leads to the presence of multiple stereotypes in the same text.</p>
          <p>Label distribution across runs. Given the high number of cases in which the LLM provides at least two different labels for the same text across the classification runs (68.6% of the time), we provide an analysis of group labels in runs when the LLM always produces the same output and when it always produces a different output. We considered two types of distributions: Consistent are the labels that are always predicted across the runs; Inconsistent are the labels produced in runs with at least one different prediction. In Table 5 the top-5 Consistent labels and the top-5 Inconsistent labels are reported. As can be observed, there are some labels that are more likely to be consistently predicted by the LLM across runs. It is the case of Exploiters, Dangerous, and Protected, which combined represent 81.8% of the distribution.</p>
        </sec>
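<p>A zero-shot prompt of this kind can be assembled as sketched below, with the candidate labels re-shuffled for every run; the prompt wording here is an assumption except for the quoted output instruction, and the actual prompt is reported in Appendix A.2. Minerva itself would be queried through its own inference stack, which is omitted.</p>
          <p>
```python
# Sketch of the zero-shot setup: for each run, the candidate labels are
# re-shuffled before being placed in the prompt, and the model is asked
# to return exactly one of them. The prompt wording is an assumption,
# apart from the output instruction quoted from Appendix A.2.
import random

def build_prompt(message, labels, seed):
    rng = random.Random(seed)  # a fresh label order for every run
    options = list(labels)
    rng.shuffle(options)
    numbered = "\n".join(
        f"Option {i}: {lab}" for i, lab in enumerate(options, start=1)
    )
    prompt = (
        "Classify the stereotype in the following message by choosing "
        "one option.\n"
        f"Message: {message}\n{numbered}\n"
        "Return as output (Output) a single option in the form of a "
        "Python list (e.g., ['Option 1'])"
    )
    return prompt, options

prompt, options = build_prompt(
    "example tweet", ["Exploiters", "Dangerous", "Protected"], seed=0
)
print(prompt)
```
          </p>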
<sec id="sec-3-1-2">
          <p>We repeated the experiment three times with three different randomizations of the order of the labels in the prompt, and used Minerva-7B-instruct-v1.0 to solve the task.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>6see Appendix A.2 for details about the prompt.</title>
          <p>Stereotype Label
Exploiters
Dangerous
Protected
Threat
Subtle
Incompatible
Total
If the first two are the most occurring group labels in particular, at the labels generated by the model
defined by annotators (Section 3), Protected is not in this specific subset. Results show that it tends to
a common label, since it appears only 260 times in prefer Exploiters and Protected over other annotators’
the corpus. This suggests that there is 50% chance labels, selecting both way more frequently than Rhino.
that the LLM consistently predict the label Protected Coherently with the previous analysis, Exploiters has a
when encountering it, while the chance of having a distribution of 0.365 by the human annotator vs. 0.526
consistent prediction of Dangerous is 17% and 31.2% by the model, while Protected respectively of 0.135 vs.
for Exploiters. On the opposite side of this spectrum, 0.312. This shows that the reliance on Rhino should
there is Threat, which appears 691 times in the corpus but is consistently predicted only 55 times (7.9%). The distribution of labels predicted inconsistently by the LLM shows interesting results as well. There is a smaller gap between the most and least frequent labels among the top 5 (141 versus 93), suggesting that the model tends to spread inconsistent predictions over a more homogeneous pool of labels. Dangerous is the label that appears most often in the LLM’s inconsistent predictions, coherently with its distribution among the group labels in the corpus. Threat is the second most frequent, appearing in inconsistent predictions twice as often as in consistent ones (115). This confirms the LLM’s low ability to conceptualize this specific label. Protected, which is strongly present in consistent predictions, is not among the top 5 labels in inconsistent predictions, appearing only 14 times. Finally, it is worth mentioning that Incompatible appears 93 times in inconsistent predictions (third most frequent) but only 3 times in consistent ones, suggesting that the LLM struggles with the conceptualization of this group label as well.</p>
          <p>Consistent labels. As regards the Consistent labels, the model agreed across all the runs for a total of 603 annotations, selecting Rhino’s labels 68.99% of the time, Bear’s 26.37%, and Duck’s 4.64%. Considering the strong reliance on Rhino, we looked more closely at this subset: this result cannot be explained in terms of alignment to the annotators’ conceptualization of the stereotypes, but rather as a preference of the model towards this conceptualization. In fact, the other labels chosen by the same annotator rarely appear in this subset, with Ruin of Italy being totally missing.</p>
          <p>Inconsistent labels. Among the Inconsistent labels, we focused on cases where all runs disagree, resulting in 248 comments where the model chose a different annotator’s label in each run. Table 6 presents the humans’ and the model’s label distributions on this specific subset. The results show that the model leans toward one annotator at a time: Duck, Rhino, and Bear for the first, second, and third run, respectively. To further investigate this pattern, we checked whether the order of the options, randomized for each run, had an influence on this result. We examined how often the selected label appeared in first position, and found that the annotator’s label each run agrees with is almost always ranked first: specifically, 240, 227, and 234 times out of 248 for the three runs, respectively. This highlights that when the model is less confident it presents strong inconsistencies among the runs, and we infer that it relies on the instruction example “Return as output (Output) a single option in the form of a Python list (e.g., [’Option 1’])” (Appendix A.2). These results call for a further analysis of how LLMs handle texts that are challenging to annotate and low-confidence scenarios, which we plan to carry out in the future.</p>
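<p>The first-position check described above can be sketched as follows; this is a minimal illustration under the assumption that, for each run, the randomized option order shown to the model is stored together with its choice (the data layout and function name are hypothetical, not the paper’s code):</p>

```python
# Hypothetical sketch: for each annotation run, count how often the label the
# model selected was the option listed first in the randomized order it saw.
def first_position_counts(runs):
    """runs: list of runs; each run is a list of (options, chosen) pairs."""
    counts = []
    for run in runs:
        matches = sum(1 for options, chosen in run if options[0] == chosen)
        counts.append(matches)
    return counts


# Toy usage with two runs of three prompts each.
runs = [
    [(["a", "b", "c"], "a"), (["b", "a", "c"], "b"), (["c", "b", "a"], "b")],
    [(["a", "b", "c"], "c"), (["b", "c", "a"], "b"), (["a", "c", "b"], "a")],
]
print(first_position_counts(runs))  # [2, 2]
```

<p>Per-run counts close to the total number of items, as in the 240, 227, and 234 matches out of 248 reported above, would indicate a strong position bias.</p>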
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion and Future Work References</title>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The work of A. T. Cignarella is supported by the Eu</title>
        <p>ropean Union’s Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie Actions,</p>
      </sec>
      <sec id="sec-5-2">
        <title>Grant Agreement No. 101146287.</title>
        <p>In this paper, we presented O-Ster, a new corpus of Italian
stereotypes annotated through an experimental
framework. The corpus includes 1, 022 texts annotated at the
span level. Each span has been complemented by a
rationale expressing the individuated stereotype, and
rationales served as a basis for the annotators to create
labels associated with each text. This bottom-up process
of label generation enabled observing how annotators
with diferent backgrounds, and an LLM conceptualize
the phenomenon. Results show a high subjectivity in
the conceptualization of stereotypes by humans and the
alignment of the LLM with certain specific labels in a
zero-shot setting.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Future work will focus on expanding the corpus, in or</title>
        <p>der to better understand how subjectivity afects this
phenomenon and to what extent the annotation procedure
may be generalizable and transferable to other languages
and tasks of abusive language detection.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Speech Tools for Italian, CEUR, 2020, pp. 1–9. Minerva-7B-instruct-v1.0, 2024. Accessed in</title>
        <p>[13] P. Chiril, F. Benamara, V. Moriceau, “Be nice to June 2025.</p>
        <p>your wife! The restaurants are closed”: Can Gender [22] K. Stanczak, I. Augenstein, A Survey on Gender Bias</p>
      </sec>
      <sec id="sec-5-5">
        <title>Stereotype Detection Improve Sexism Classifica- in Natural Language Processing, 2021. URL: http:</title>
        <p>tion?, in: Findings of the Association for Compu- //arxiv.org/abs/2112.14168. doi:10.48550/arXiv.
tational Linguistics: EMNLP 2021, 2021, pp. 2833– 2112.14168, arXiv:2112.14168 [cs].
2844. [23] P. Fortuna, S. Nunes, A Survey on Automatic
De[14] W. S. Schmeisser-Nieto, A. T. Cignarella, tection of Hate Speech in Text, ACM Computing</p>
      </sec>
      <sec id="sec-5-6">
        <title>T. Bourgeade, S. Frenda, A. Ariza-Casabona, Surveys 51 (2019) 1–30. URL: https://dl.acm.org/doi/</title>
        <p>
          M. Laurent, P. G. Cicirelli, A. Marra, G. Corbelli, 10.1145/3232676. doi:10.1145/3232676.
F. Benamara, et al., Stereohoax: a multilingual [24] W. Schmeisser-Nieto, M. Nofre, M. Taulé, Criteria
corpus of racial hoaxes and social media reactions for the annotation of implicit stereotypes, in: N.
Calannotated for stereotypes, Language Resources zolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
and Evaluation (
          <xref ref-type="bibr" rid="ref6">2024</xref>
          ) 1–39. T. Declerck, S. Goggi, H. Isahara, B. Maegaard,
[15] M. Nadeem, A. Bethke, S. Reddy, StereoSet: Mea- J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.),
suring stereotypical bias in pretrained language Proceedings of the Thirteenth Language Resources
models, in: C. Zong, F. Xia, W. Li, R. Nav- and Evaluation Conference, European Language
igli (Eds.), Proceedings of the 59th Annual Meet- Resources Association, Marseille, France, 2022, pp.
ing of the Association for Computational Lin- 753–762. URL: https://aclanthology.org/2022.lrec-1.
guistics and the 11th International Joint Confer- 80/.
ence on Natural Language Processing (Volume [25] C. Bosco, V. Patti, S. Frenda, A. T. Cignarella,
        </p>
      </sec>
      <sec id="sec-5-7">
        <title>1: Long Papers), Association for Computational M. Paciello, F. D’Errico, Detecting racial</title>
      </sec>
      <sec id="sec-5-8">
        <title>Linguistics, Online, 2021, pp. 5356–5371. URL: stereotypes: An Italian social media corpus</title>
        <p>https://aclanthology.org/2021.acl-long.416/. doi:10. where psychology meets NLP, Information
18653/v1/2021.acl-long.416. Processing &amp; Management 60 (2023) 103118.
[16] Z. Wu, S. Bulathwela, M. Pérez-Ortiz, A. S. URL: https://linkinghub.elsevier.com/retrieve/pii/
Koshiyama, Stereotype detection in llms: A S0306457322002199. doi:10.1016/j.ipm.2022.
multiclass, explainable, and benchmark-driven ap- 103118.
proach, 2024. URL: https://api.semanticscholar.org/ [26] T. Bourgeade, A. T. Cignarella, S. Frenda, M.
Lau</p>
      </sec>
      <sec id="sec-5-9">
        <title>CorpusID:268856718. rent, W. Schmeisser-Nieto, F. Benamara, C. Bosco,</title>
        <p>[17] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda, V. Moriceau, V. Patti, M. Taulé, A
multilin</p>
      </sec>
      <sec id="sec-5-10">
        <title>A. Ariza-Casabona, M. Farrús, P. Rosso, M. Taulé, gual dataset of racial stereotypes in social media</title>
      </sec>
      <sec id="sec-5-11">
        <title>Overview of detests-dis at iberlef 2024: Detection conversational threads, in: Findings of the As</title>
        <p>
          and classification of racial stereotypes in spanish- sociation for Computational Linguistics: EACL
learning with disagreement, Procesamiento del 2023, Association for Computational Linguistics,
Lenguaje Natural 73 (
          <xref ref-type="bibr" rid="ref6">2024</xref>
          ) 323–333. Dubrovnik, Croatia, 2023, pp. 686–696. URL: https:
[18] D. Hovy, S. Prabhumoye, Five sources of bias in nat- //aclanthology.org/2023.findings-eacl.51/. doi: 10.
ural language processing, Language and linguistics 18653/v1/2023.findings-eacl.51.
compass 15 (2021) e12432. [27] A. T. Cignarella, A. Giachanou, E. Lefever,
Stereo[19] P. Röttger, B. Vidgen, D. Hovy, J. Pierrehumbert, type Detection in Natural Language
Process
        </p>
      </sec>
      <sec id="sec-5-12">
        <title>Two contrasting data annotation paradigms for sub- ing, 2025. URL: https://arxiv.org/abs/2505.17642.</title>
        <p>jective nlp tasks, in: Proceedings of the 2022 Con- arXiv:2505.17642.
ference of the North American Chapter of the As- [28] S. T. Fiske, A. J. C. Cuddy, P. Glick, J. Xu, A model of
sociation for Computational Linguistics: Human (often mixed) stereotype content: Competence and</p>
      </sec>
      <sec id="sec-5-13">
        <title>Language Technologies, 2022, pp. 175–190. warmth respectively follow from perceived status</title>
        <p>[20] A. Hautli-Janisz, E. Schad, C. Reed, Disagreement and competition, Journal of Personality and Social
space in argument analysis, in: G. Abercrom- Psychology (2002) 878–902.
bie, V. Basile, S. Tonelli, V. Rieser, A. Uma (Eds.), [29] A. Koch, R. Imhof, R. Dotsch, C.
Unkel</p>
      </sec>
      <sec id="sec-5-14">
        <title>Proceedings of the 1st Workshop on Perspectivist bach, H. Alves, The abc of stereotypes</title>
      </sec>
      <sec id="sec-5-15">
        <title>Approaches to NLP @LREC2022, European Lan- about groups: Agency/socioeconomic success,</title>
        <p>guage Resources Association, Marseille, France, conservative-progressive beliefs, and communion.,
2022, pp. 1–9. URL: https://aclanthology.org/2022. Journal of personality and social psychology 110
nlperspectives-1.1/. (2016) 675–709. URL: https://api.semanticscholar.
[21] SapienzaNLP, sapienzanlp/minerva-7b-instruct- org/CorpusID:6287638.</p>
        <p>v1.0, https://huggingface.co/sapienzanlp/ [30] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda,</p>
      </sec>
      <sec id="sec-5-16">
        <title>Annotator 2 self-describes as a 30-year-old white Italian</title>
        <p>woman with no migratory background. While she has
not experienced migration firsthand, she has lived abroad
as an expatriate, navigating international mobility from
a position of privilege. Her native language is Italian.</p>
      </sec>
      <sec id="sec-5-17">
        <title>Annotator 5 self-identifies as a 24-year-old woman</title>
        <p>second-generation migrant. She is a Master’s student
with a background in political activism, she has
volunteered with non-governmental organizations and has
written non-specialist articles about Italian second
generation of migrants.</p>
      </sec>
      <sec id="sec-5-18">
        <title>Presented below is the prompt in Italian:</title>
      </sec>
      <sec id="sec-5-19">
        <title>And its English translation:</title>
        <p>Ti viene fornita in input (Input) una frase
estratta dai social media, insieme a tre
possibili stereotipi (Opzioni). Il tuo compito
è individuare quale stereotipo è implicito
nella frase, scegliendo tra le opzioni
fornite. Restituisci in output (Output) una
singola opzione, sotto forma di lista Python
(es. [’Opzione 1’]).</p>
        <p>Input: Mattinata di ieri passata in un’aula
di tribunale. 23 udienze al ruolo.
Imputati stranieri 19; imputati italiani 4.
Imputati stranieri presenti 0; imputati italiani
presenti 4. Conclusione (del tutto
personale); vengono tutti a delinquere qua e se ne
fregano della nostra giustizia
Opzioni: [Sono subdoli, Sono pericolosi,
Sono una minaccia]
Output:
You are given as input (Input) a sentence
extracted from social media, along with three
possible stereotypes (Options). Your task
is to identify which stereotype is implied
in the sentence by selecting one of the
provided options. Return as output (Output) a
single option in the form of a Python list
(e.g., [’Option 1’]).</p>
        <p>Input: Yesterday morning spent in a courtroom. 23 judicial hearings on the docket. Foreign defendants 19; Italian defendants 4. Foreign defendants present 0; Italian defendants present 4. Conclusion (entirely personal); they all come here to commit crimes and don’t give a damn about our justice</p>
        <p>Options: [Subtle, Dangerous, Threat]</p>
        <p>Output:</p>
      </sec>
      <sec id="sec-5-20">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to improve the writing style and to check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
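<p>Since the prompts above ask the model to reply with a Python list literal (e.g., [’Option 1’]), the reply has to be parsed back into a label. A minimal, defensive way to do this with the standard library is sketched below; this is an illustration, not the authors’ pipeline, and the function name is hypothetical:</p>

```python
import ast


# Hypothetical helper: recover the single chosen option from a model reply
# that should look like "['Opzione 1']"; returns None for anything else.
def parse_choice(reply):
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None
    if isinstance(value, list) and len(value) == 1 and isinstance(value[0], str):
        return value[0]
    return None


print(parse_choice("['Opzione 1']"))  # Opzione 1
print(parse_choice("no list here"))   # None
```

<p>ast.literal_eval only evaluates literal expressions, so malformed or free-form model replies are rejected rather than executed.</p>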
      </sec>
    </sec>
  </body>
</article>