<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Subjectivity in Stereotypes Against Migrants in Italian: An Experimental Annotation Procedure</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soda Marem Lo</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco A. Stranisci</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandra Teresa Cignarella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Frenda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Jezek</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ghent University, Language and Translation Technology Team</institution>
          ,
          <addr-line>Groot-Brittanniëlaan 45 - 9000 Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Interaction Lab, Heriot-Watt University</institution>
          ,
          <addr-line>EH14 4AS Edinburgh, Scotland</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Pavia, Department of Humanities</institution>
          ,
          <addr-line>Piazza del Lino 2 - 27100 Pavia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Università di Torino</institution>
          ,
          <addr-line>Dipartimento di Informatica, Corso Svizzera 185 - 10149 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>aequa-tech</institution>
          ,
          <addr-line>Via Quarello 15/A - 10153 Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The presence of social stereotypes in NLP resources is an emerging topic that challenges traditionally used approaches for the creation of corpora and resources. An increasing number of scholars have proposed strategies for considering annotators' subjectivity in order to reduce such bias both in computational resources and in NLP models. In this paper, we present Open-Stereotype, an annotated corpus of Italian tweets and news headlines regarding immigration in Italy, developed through an experimental procedure for the annotation of stereotypes aimed at investigating their different interpretations. The annotation is the result of a six-step process, where annotators identify text spans expressing stereotypes, generate rationales about these spans, and group them into a more comprehensive set of labels. Results show that humans exhibit high subjectivity in conceptualizing this phenomenon, and that the prior knowledge of an Italian LLM leads to more consistent classifications of specific labels that do not depend on annotators' backgrounds.</p>
      </abstract>
      <kwd-group>
<kwd>Subjectivity</kwd>
        <kwd>Annotation</kwd>
        <kwd>Italian</kwd>
        <kwd>Stereotypes</kwd>
        <kwd>Social Bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Developing fair Natural Language Processing (NLP) technologies for the detection of abusive language is still an open issue that gathers the attention of many scholars. The increasing awareness that corpora for hate speech detection exhibit significant biases, particularly favoring Western and white populations [1], has led scholars to foster explainability [2, 3] and cultural representativeness [4, 5] in the design of new resources. Furthermore, the growing number of perspectivist [6, 7] and multilingual [8] datasets contributes to a deeper and more culturally aware understanding of abusive language, paving the way for the development of less biased technologies.</p>
      <p>Recently, specific attention has been paid to the presence of stereotypes in different contexts, such as political discourse [9], reactions to fake news [10], news comments [11], and news and social media messages [12, 13], often through the development of taxonomies and annotated corpora. However, these advances do not encompass the diverse perceptions or interpretations of stereotypes in text. For instance, although some corpora for the detection of origin-related stereotypes have already been released [12, 14, 11, 15, 16], to the best of our knowledge only one of them has been designed to take subjectivity into account [17], presenting the annotations of three different annotators. This limitation intersects with the scarcity of studies on bias and disagreement in the design of annotation schemes [18, 19, 20].</p>
      <p>In this work we address this research gap by presenting the Open Stereotype (O-Ster)1 corpus: a sub-portion of 1,022 texts of the HaSpeeDe corpus [12] (see details in Section 3), newly re-annotated through an experimental annotation procedure in which labels are not defined a priori, but are rather defined throughout the annotation process, highlighting annotator subjectivity about stereotypes (a posteriori). The resulting annotated corpus allowed us to address the following research questions:</p>
      <p>∙ (RQ 1) How do annotators recognize and conceptualize stereotypes? We designed an annotation procedure that provides for the identification of textual spans expressing stereotypes, the open-ended generation of rationales about these spans, and the categorization of rationales within a closed set of labels. The procedure showed how stereotypes in the same texts are differently perceived by humans, leading to the categorization of the same expressions in different and creative ways that might depend on the subjectivity of annotators.</p>
      <p>∙ (RQ 2) How do models conceptualize stereotypes? In this first study, we prompted one specific Large Language Model (LLM), i.e., Minerva [21], to generate labels to categorize stereotypes. Observing which labels were created and with which annotator they agreed most of the time, we noticed that the LLM aligns more with the labels Exploiters, Dangerous and Protected, choosing them consistently throughout different classification runs.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>* Corresponding authors. † These authors contributed equally. sodamarem.lo@unito.it (S. M. Lo); marcoantonio.stranisci@unito.it (M. A. Stranisci); alessandrateresa.cignarella@ugent.be (A. T. Cignarella); s.frenda@hw.ac.uk (S. Frenda); valerio.basile@unito.it (V. Basile); elisabetta.jezek@unipv.it (E. Jezek); viviana.patti@unito.it (V. Patti)</p>
      <p>1https://github.com/SodaMaremLo/Open-Stereotype-corpus.</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>The detection and modeling of stereotypes in NLP has gained increasing attention in recent years, particularly as the field moves toward more socially responsible and inclusive language technologies. While early computational approaches primarily focused on gender bias and hate speech [22, 23], new work has begun to explore the broader phenomenon of stereotypes, including their implicit [24] and explicit manifestations across different social groups and languages [25, 26].</p>
        <p>Most current work emphasizes the importance of distinguishing between stereotypes, prejudice, and discrimination, and highlights the advantages of a more interdisciplinary approach between computational linguistics and social psychology [27]. The Stereotype Content Model (SCM) [28] and its extension, the ABC model [29], have often been adopted by NLP scholars to conceptualize stereotypes along dimensions such as warmth, competence, and belief alignment. These frameworks have informed both annotation schemes and computational models, enabling more structured analyses of stereotype content. Examples of their application are the works of Bosco et al. [25] and Schmeisser-Nieto et al. [14], in which the authors apply an SCM-based scheme for describing stereotypes towards migrants to a trilingual corpus of tweets.</p>
        <p>Concerning Italian, the HaSpeeDe2 shared task [12] was one of the first to explicitly address stereotype detection by means of a dedicated subtask. Its results pioneered the way for research into stereotype detection in Italian social media, investigating the connection between hate speech and stereotypical content in models. Furthermore, the results of the shared task suggest the need to approach stereotype detection as a subtle phenomenon independent from hate speech. Schmeisser-Nieto et al. [30], comparing human annotations and model predictions on stereotype detection, noted that models tend to show low confidence when annotators have more disagreement with each other, highlighting the importance of encoding plural interpretations in resources and models. In such a context, Cignarella et al. [31] developed the QUEEREOTYPES corpus, in which annotator perspectives are encoded in labeling stereotypes towards LGBTQIA+ people.</p>
        <p>Perspectives of annotators matter, and studies such as those of Sap et al. [5] and Xia et al. [32], for instance, have shown that demographic factors, such as ethnicity or personal and/or linguistic background, can significantly influence the perception of hate speech and stereotypes. The present work builds on the key concepts outlined in this section by proposing an experimental annotation procedure that (i) elevates annotator subjectivity and (ii) builds on narrative patterns in free-text descriptions of stereotypes against migrants. Rather than enforcing a harmonized gold standard, we create and release non-harmonized annotations to preserve the diversity of annotator perspectives.2 This approach aligns with emerging best practices in participatory NLP, and contributes to the growing body of resources for stereotype detection, particularly in languages other than English.</p>
      </sec>
      <sec id="sec-1-2">
        <p>2We also include a positionality statement in Appendix A.1.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Annotation Procedure</title>
<sec id="sec-2-1">
        <p>For the creation of the O-Ster corpus, we adopted a descriptive annotation scheme, as previously done by Röttger et al. [19], with the overarching goal of emphasizing the subjectivity of annotators in recognizing and describing the presence of stereotypes in texts. The annotation procedure is composed of several steps, as shown in Figure 1. In this section, we describe all the steps in detail.</p>
        <p>(1) Filtering the HaSpeeDe2 corpus.</p>
      </sec>
<sec id="sec-2-2">
        <p>The annotation process began with the extraction of a specific subset from the HaSpeeDe2 dataset [12]. This dataset, originally annotated with the presence/absence of hate speech and stereotypes, has also been extended in other works with the annotation of various dimensions of harmful language, including Intensity, Aggressiveness, Offensiveness, Irony, and Sarcasm [33].</p>
        <p>For our purposes, we focused on the subset of texts annotated with a stereotype value of 1. This filtered corpus consists of 1,022 tweets and news headlines, each explicitly marked as containing stereotypical content (of these, 522 texts are hateful and 500 are non-hateful).</p>
      </sec>
      <sec id="sec-2-3">
        <p>1. Text: “Mattinata di ieri passata in un’aula di tribunale. 23 udienze al ruolo. Imputati stranieri 19; imputati italiani 4. Imputati stranieri presenti 0; imputati italiani presenti 4. Conclusione (del tutto personale); vengono tutti a delinquere qua e se ne fregano della nostra giustizia” (translation: Yesterday morning spent in a courtroom. 23 judicial hearings on the register. Foreign defendants: 19; Italian defendants: 4. Foreign defendants present: 0; Italian defendants present: 4. Conclusion (entirely personal): they all come here to commit crimes and don’t give a damn about our justice.)</p>
        <p>2. Textual span: they all come here to commit a crime</p>
        <p>3. Rationale (S-V-O): [foreigners are delinquents, foreigners commit crimes, ..., immigrants are sneaky, ..., immigrants are criminals]</p>
        <p>4a. Targeted entity: foreign defendants = foreigners</p>
        <p>4b. Bare stereotype: [are dangerous, are threatening, are delinquents, are criminals]</p>
        <p>5. Descriptive label: are delinquents → Criminal</p>
        <p>6. Group: THREAT</p>
      </sec>
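<p>The filtering in step (1) can be sketched in a few lines of Python; the record layout below (text, stereotype, hateful fields) is an illustrative assumption, not the actual HaSpeeDe2 release format.</p>
      <p>
```python
# Illustrative sketch of step (1): keep only texts marked as containing
# a stereotype, then split counts by the hate-speech flag.
# The record layout (text / stereotype / hateful) is an assumption,
# not the actual HaSpeeDe2 schema.

def filter_stereotyped(records):
    """Return the records annotated with stereotype = 1."""
    return [r for r in records if r["stereotype"] == 1]

haspeede = [  # toy stand-in for the HaSpeeDe2 dataset
    {"text": "tweet A", "stereotype": 1, "hateful": 1},
    {"text": "headline B", "stereotype": 0, "hateful": 0},
    {"text": "tweet C", "stereotype": 1, "hateful": 0},
]

subset = filter_stereotyped(haspeede)
n_hateful = sum(r["hateful"] for r in subset)
print(len(subset), n_hateful, len(subset) - n_hateful)
```
      </p>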
<sec id="sec-2-4">
        <p>3https://labelstud.io/.</p>
      </sec>
      <sec id="sec-2-5">
        <p>4Roma people invaded Italy.</p>
      </sec>
<sec id="sec-2-6">
        <p>5Migrants are privileged.</p>
        <p>(2) Identification of textual spans. Five different annotators (all researchers in NLP) were instructed to identify one or more spans of text that explicitly conveyed stereotypical content. The annotation task was carried out using a simple spreadsheet, where annotators copied and pasted the identified spans into a designated column corresponding to each text entry, and partially relied on the Label Studio3 platform.</p>
        <p>(3) Writing of rationales. For each identified textual span, annotators were asked to provide a corresponding rationale that explicitly expresses the sense behind the stereotype and the targeted group. Rationales should be provided in the form of a simple sentence, typically following either a Subject-Verb-Object (S-V-O) or Subject-Noun Phrase (S-NP) structure. Examples include: “i rom hanno invaso l’Italia”4 (S-V-O) and “gli immigrati sono privilegiati”5 (S-NP). This step resulted in a total of 3,578 span–rationale pairs.</p>
      </sec>
      <sec id="sec-2-7">
        <p>(4) Text processing. We processed the rationales to ensure consistency and facilitate further linguistic analysis. In particular, 1) we extracted all the target entities mentioned in the sentences; 2) we identified the verbs associated with the targets; 3) we applied lemmatization to reduce verbs to their base forms; and 4) we conjugated them in the Present Indicative tense. This normalization step allowed us to reduce the rationales to a set of 576 distinct bare stereotypes. Finally, all rationales that appeared only once in the corpus were removed to ensure focus on recurring patterns, resulting in a total of 248 frequently occurring bare stereotypes.</p>
        <p>(5) Free-text labeling. To further consolidate the subset of bare stereotypes resulting from the previous step into a manageable and interpretable taxonomy, three annotators were independently tasked with grouping them by generating 10 descriptive labels. Each label was designed to capture the underlying theme or semantic core shared by multiple rationales. For example, the statements “(they) are delinquents” and “(they) are criminals” might have been grouped under the descriptive label Criminal, while “they are dangerous” might have been categorized under the descriptive label Dangerous. This process allowed the transformation of free-text rationales into a structured set of stereotype categories suitable for classification tasks.</p>
        <p>(6) Grouping. To reach a narrower level of the taxonomy, we asked the 3 annotators to reduce the initial set of 10 descriptive labels to 5 broader groups. This second round of refinement involved merging semantically related labels to enhance clarity and usability. For example, the rationales “(they) are delinquents” and “(they) are criminals”, previously grouped under the descriptive label Criminal, and “they are dangerous”, categorized under Dangerous, could all be further consolidated under the broader group Threat.</p>
        <p>It is important to emphasize that, throughout the entire annotation process, annotators were given minimal (if any) prescriptive instructions. They received very limited annotation guidelines, which allowed for a more open-ended and subjective interpretation of stereotype groupings. This deliberate lack of constraints is a central feature of our experimental design, aimed at capturing the annotators’ intuitive understanding (and subjectivity) of stereotypical content in Italian texts.</p>
        <p>An example of a fully annotated text, including its associated stereotype and final label, is presented in Table 1, to complement the information on the workflow of the annotation procedure already outlined in Figure 1.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4. Corpus Analysis</title>
        <p>O-Ster consists of 1,022 texts annotated by 5 people in different proportions (Table 2). Almost all posts were annotated by two people, except for 27 annotated by just one person. For each text, the annotator could assign multiple rationales, reaching an average of 1.77 per post and a total of 3,578 annotations.</p>
      </sec>
      <sec id="sec-2-9">
        <p>Table 2 (Annotator, Nickname, #Texts, #Annotations): _01, Duck, 747, 1,367; _02, Bear, 75, 112; _03, Lion, 100, 129; _04, Panda, 94, 178; _05, Rhino, 1,001, 1,792.</p>
      </sec>
      <sec id="sec-2-10">
        <sec id="sec-2-10-1">
          <title>Identifying ‘agents’ and ‘patients’ in the rationales.</title>
          <p>From the third step described in Section 3, a total of 1,547 rationales was reached. To better understand their construction, we looked into the role of the subject in terms of agents and patients. Specifically, we syntactically parsed each rationale and assigned the role of ‘agent’ to all the targets that are the subject of active verbs (Migrants are criminals), and ‘patient’ when they are the object of the sentence or the subject of a passive verb (Migrants must be kicked out). Finally, we performed a manual aggregation of Roma and Sinti into a unique category, as well as of politicians, including specific people and parties, and of ethnic minorities named by referring to their origin or with generic terms such as “foreigners”.</p>
          <p>Considering the unbalanced number and type of annotations across annotators, we computed the proportion of times each target was annotated as an agent (or patient) by each annotator. This was done by dividing the frequency of each target (as agent or patient) by the total number of agent or patient annotations made by that annotator. We then calculated per-annotator averages of these proportions to establish individual thresholds, used to highlight the most frequently annotated targets. Results are presented in Table 3 (Target, Agency, Annotator): Immigrants, Agent (Duck, Bear, Lion, Panda, Rhino); Italians, Agent (Bear, Panda); Ethnic minority, Agent (Bear, Rhino); Islamic, Agent (Duck, Lion, Panda, Rhino); Roma and Sinti, Agent (Duck, Panda, Rhino); Immigrants, Patient (Duck, Bear, Lion, Panda, Rhino); Roma and Sinti, Patient (Duck).</p>
          <p>Results show that, for all annotators, when targets are presented as immigrants they tend to be framed as both agents and patients in high percentages. However, Bear and Rhino often give agency to specific ethnic minorities. When Italians are targets, they only play the role of agents, especially presenting rationales linked to financial support, such as Italians pay for immigrants. Interestingly, Roma and Sinti are framed as patients by Duck, especially using the rationale Roma are treated better than Italians, and in a low percentage by Rhino (3.2%). Other annotators’ rationales present them only as agents, more often as criminals.</p>
        </sec>
        <sec id="sec-2-10-2">
          <title>Label analysis.</title>
        </sec>
      </sec>
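<p>The per-annotator proportion computation described above can be sketched as follows; the toy (annotator, target, role) annotations are invented for illustration, and the real pipeline derives roles from a syntactic parse rather than from a hand-made list.</p>
      <p>
```python
# Sketch of the per-annotator proportion computation: each
# (annotator, target, role) frequency is divided by that annotator's
# total number of agent/patient annotations, and the per-annotator mean
# of these proportions serves as a threshold for "frequent" targets.
# The toy annotations below are invented for illustration.
from collections import Counter, defaultdict

annotations = [  # (annotator, target, role), illustrative only
    ("Duck", "Immigrants", "agent"),
    ("Duck", "Immigrants", "patient"),
    ("Duck", "Roma and Sinti", "patient"),
    ("Rhino", "Immigrants", "agent"),
    ("Rhino", "Italians", "agent"),
]

counts = Counter(annotations)
totals = Counter(a for a, _, _ in annotations)

proportions = {key: counts[key] / totals[key[0]] for key in counts}

# Per-annotator average proportion, used as an individual threshold.
per_annotator = defaultdict(list)
for (annotator, _, _), p in proportions.items():
    per_annotator[annotator].append(p)
thresholds = {a: sum(ps) / len(ps) for a, ps in per_annotator.items()}

frequent = sorted(
    key for key, p in proportions.items() if p >= thresholds[key[0]]
)
print(thresholds, frequent)
```
      </p>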
<sec id="sec-2-11">
        <p>As described in Section 3, annotators were asked to group the bare stereotypes into 10 descriptive labels, and then to categorize them into 5 broader groups. Results of these steps are presented in Table 4. Focusing on the ten descriptive labels (grey columns), it is possible to notice similarities across annotators. They all individuated the idea of dangerousness (Dangerous), referring to stereotypes connected to being violent. However, analysing the dataset, Duck characterises this description with the idea of invasion, Bear includes non-violent forms of danger, such as bringing diseases, while Rhino involves those aspects that the other two separated into the Criminal label, such as stealing and cheating.</p>
        <p>Other similarities are in the idea of being degraded (Degraded by all annotators), lazy (Loafers by Bear, and Lazy by Rhino), and a burden (Burden by Duck and Bear, and Parasites by Rhino). The use of different words for similar concepts already suggests the different focus adopted by each annotator. For example, Loafers was connected to being useless, more than simply acting lazy.</p>
        <p>Another interesting commonality is the idea of being backward people, and this concept too is expressed through different labels across annotators. Duck used Uncivilized, Bear Different culture, while Rhino separated the concept into two descriptive labels: Savage and Worst than us.</p>
        <p>Finally, some stereotypes have been labeled in significantly different ways. An example is they are nomads, assigned to Privileged by Duck, Different from us by Bear, and Invader by Rhino, highlighting people’s fear of being conquered or having their territories squatted.</p>
        <p>The way an annotator looks at a phenomenon and its categorization becomes even more evident when analyzing the last step: grouping the descriptive labels into 5 categories (white columns in Table 4). In fact, annotators are required to choose which concepts they believe to be priorities and capable of encompassing multiple stereotypes. Duck does not connect the aspect of crime with the idea of danger, as might have been expected from looking at the choices of the other annotators (Degraded by Bear and Threat by Rhino). In contrast, Criminal was merged with Deceivers, combining the dimension of crime with cheating, and tagging the group as Subtle. On the other hand, Dangerous has been included with Radicalized in the broader imagery of incompatibility, implicitly defining what “we” is not. Bear’s groups better encapsulate a contrast of us vs. them, specifically with the labels Worsen our lives and Different culture, which concentrate in a single label the aspects of diversity, primarily religious and cultural. It is noteworthy how the annotators’ positionality (Appendix A.1), in this case, is most evident through their clear-cut distinction between us and them, a trait that is often absent in Rhino’s labels.</p>
        <p>Both Duck and Rhino group the idea of being respectively uncivilized and savage with being degraded, the former using the expression Immoral, thus framing the three descriptive labels into a moral stand, the latter choosing Ruin of Italy, referring to the effect of those acts. Finally, Exploiters unifies the dimension of being parasites and lazy with that of invasion, in a very broad group that defines exploitation from an economic and territorial point of view. Overall, there is a general focus on the exploitation of the country and on the caused sense of danger (respectively Parasites and Subtle by Duck, Do not contribute and Dangerous by Bear, and Exploiters and Threat by Rhino).</p>
        <p>Each annotator, however, has elements of uniqueness. For Duck, this is reflected in the creation of a single group of stereotypes that define aspects of the target groups’ identities perceived as problematic (Problem). Bear, on the other hand, is the only one to foreground the idea of a worsening of Italians’ lives, defined in relation to the risk of invasion and economic exploitation. Lastly, Rhino is the only one to maintain a single label for the religious dimension and the perspective of protection, concerning the perception of a privileged position of the target group relative to Italians.</p>
      </sec>
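<p>The two-level taxonomy above (descriptive labels folded into broader groups) can be represented as a simple lookup table; only pairings explicitly mentioned in the text are included below, the rest of any real mapping being annotator-specific.</p>
      <p>
```python
# Sketch of step (6): consolidating descriptive labels into broader
# groups as a lookup table. Only some pairings below are mentioned in
# the text (e.g., Criminal and Dangerous folded into Threat); a real
# mapping would be annotator-specific and cover all 10 labels.
label_to_group = {
    "Criminal": "Threat",
    "Dangerous": "Threat",
    "Parasites": "Exploiters",
    "Lazy": "Exploiters",
}

def group_of(descriptive_label):
    """Map a descriptive label to its broader group (None if unmapped)."""
    return label_to_group.get(descriptive_label)

print(group_of("Criminal"), group_of("Dangerous"))
```
      </p>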
<sec id="sec-2-12">
        <p>On average, the model generated a bad output 7.32% of the time. Messages that obtained a classification throughout all three runs are 1,922: 85.61% of the total. The analysis of results presented in this section considers only texts that obtained a classification in each run.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Experiment</title>
      <sec id="sec-3-1">
        <title>Hateful comments.</title>
<sec id="sec-3-1-1">
          <p>In this section, we present an experiment aimed at observing the behavior of an Italian LLM in the classification of stereotypes according to the labels derived from our annotation process (Section 3). The experimental setup was a zero-shot text generation task. We fed the LLM with a message and a list of the three labels defined by annotators, and asked the model to generate as output one of the three labels.6</p>
          <p>Focusing on the last phase of the pipeline, Figure 2 shows how the occurrence of the groups of labels changes based on the presence of hate speech. Labels such as Incompatible, Dangerous, and both Exploiters and Radicalized, respectively for Duck, Bear and Rhino, tend to be more frequent when the message was annotated as hateful. These results highlight how a stereotypical representation of the stranger as an invader, a religious extremist, or more generally a threatening individual, is linked to hate speech. It is worth noticing that the blue bars tend to be higher in most cases, although the texts are almost perfectly split across hateful and non-hateful (respectively 522 and 500). This indicates that the presence of hate speech also leads to the presence of multiple stereotypes in the same text.</p>
          <p>Label distribution across runs. Given the high number of cases in which the LLM provides at least two different labels for the same text across the classification runs (68.6% of the time), we provide an analysis of group labels in runs when the LLM always produces the same output and when it always produces a different output. We considered two types of distributions: Consistent are the labels that are always predicted across the runs; Inconsistent are the labels produced in runs with at least one different prediction. In Table 5 the top-5 Consistent labels and the top-5 Inconsistent labels are reported. As can be observed, there are some labels that are more likely to be consistently predicted by the LLM across runs. It is the case of Exploiters, Dangerous, and Protected, which combined represent 81.8% of the distribution.</p>
        </sec>
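<p>A zero-shot prompt of this kind can be assembled as sketched below, with the candidate labels re-shuffled for every run; the prompt wording here is an assumption except for the quoted output instruction, and the actual prompt is reported in Appendix A.2. Minerva itself would be queried through its own inference stack, which is omitted.</p>
          <p>
```python
# Sketch of the zero-shot setup: for each run, the candidate labels are
# re-shuffled before being placed in the prompt, and the model is asked
# to return exactly one of them. The prompt wording is an assumption,
# apart from the output instruction quoted from Appendix A.2.
import random

def build_prompt(message, labels, seed):
    rng = random.Random(seed)  # a fresh label order for every run
    options = list(labels)
    rng.shuffle(options)
    numbered = "\n".join(
        f"Option {i}: {lab}" for i, lab in enumerate(options, start=1)
    )
    prompt = (
        "Classify the stereotype in the following message by choosing "
        "one option.\n"
        f"Message: {message}\n{numbered}\n"
        "Return as output (Output) a single option in the form of a "
        "Python list (e.g., ['Option 1'])"
    )
    return prompt, options

prompt, options = build_prompt(
    "example tweet", ["Exploiters", "Dangerous", "Protected"], seed=0
)
print(prompt)
```
          </p>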
<sec id="sec-3-1-2">
          <p>We repeated the experiment three times with three different randomizations of the order of the labels in the prompt, and used Minerva-7B-instruct-v1.0 to solve the task.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>6see Appendix A.2 for details about the prompt.</title>
          <p>Stereotype Label
Exploiters
Dangerous
Protected
Threat
Subtle
Incompatible
Total
If the first two are the most occurring group labels in particular, at the labels generated by the model
defined by annotators (Section 3), Protected is not in this specific subset. Results show that it tends to
a common label, since it appears only 260 times in prefer Exploiters and Protected over other annotators’
the corpus. This suggests that there is 50% chance labels, selecting both way more frequently than Rhino.
that the LLM consistently predict the label Protected Coherently with the previous analysis, Exploiters has a
when encountering it, while the chance of having a distribution of 0.365 by the human annotator vs. 0.526
consistent prediction of Dangerous is 17% and 31.2% by the model, while Protected respectively of 0.135 vs.
for Exploiters. On the opposite side of this spectrum, 0.312. This shows that the reliance on Rhino should
there is Threat, which appears 691 times in the corpus but is consistently predicted only 55 times (7.9%). The distribution of labels predicted inconsistently by the LLM shows interesting results as well. There is a smaller gap between the most and least frequent labels among the top 5 (141 versus 93), suggesting that the model tends to spread inconsistent predictions over a more homogeneous pool of labels. Dangerous is the label that appears most often in the LLM’s inconsistent predictions, coherently with its distribution among the group labels in the corpus. Threat is the second most frequent, appearing in inconsistent predictions twice as often as in consistent ones (115). This confirms the LLM’s low ability to conceptualize this specific label. Protected, which is strongly present in consistent predictions, is not among the top 5 labels in inconsistent predictions, appearing only 14 times. Finally, it is worth mentioning that Incompatible appears 93 times in inconsistent predictions (third most frequent) but only 3 times in consistent ones, suggesting that the LLM struggles with the conceptualization of this group label as well.</p>
          <p>Consistent labels. As regards the Consistent labels, the model agreed across all the runs for a total of 603 annotations, selecting Rhino’s labels 68.99% of the time, Bear’s 26.37%, and Duck’s 4.64%. Considering the strong reliance on Rhino, we looked more closely at this subset: this result cannot be explained in terms of alignment to the annotators’ conceptualization of the stereotypes, but rather as a preference of the model towards this conceptualization. In fact, the other labels chosen by the same annotator rarely appear in this subset, with Ruin of Italy being totally missing.</p>
          <p>Inconsistent labels. Among the Inconsistent labels, we focused on cases where all runs disagree, resulting in 248 comments where the model chose a different annotator’s label in each run. Table 6 presents the humans’ and the model’s label distributions on this specific subset. The results show that the model leans toward one annotator at a time: Duck, Rhino, and Bear for the first, second, and third run, respectively. To further investigate this pattern, we checked whether the order of the options, randomized for each run, had an influence on this result. We examined how often the selected label appeared in first position, and found that the annotator’s label each run agrees with is almost always ranked first: specifically, 240, 227, and 234 times out of 248 for the three runs, respectively. This highlights that when the model is less confident it presents strong inconsistencies among the runs, and we infer that it relies on the instruction example “Return as output (Output) a single option in the form of a Python list (e.g., [’Option 1’])” (Appendix A.2). These results call for a further analysis of how LLMs handle texts that are challenging to annotate and low-confidence scenarios, which we plan to carry out in the future.</p>
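<p>The first-position check described above can be sketched as follows; this is a minimal illustration under the assumption that, for each run, the randomized option order shown to the model is stored together with its choice (the data layout and function name are hypothetical, not the paper’s code):</p>

```python
# Hypothetical sketch: for each annotation run, count how often the label the
# model selected was the option listed first in the randomized order it saw.
def first_position_counts(runs):
    """runs: list of runs; each run is a list of (options, chosen) pairs."""
    counts = []
    for run in runs:
        matches = sum(1 for options, chosen in run if options[0] == chosen)
        counts.append(matches)
    return counts


# Toy usage with two runs of three prompts each.
runs = [
    [(["a", "b", "c"], "a"), (["b", "a", "c"], "b"), (["c", "b", "a"], "b")],
    [(["a", "b", "c"], "c"), (["b", "c", "a"], "b"), (["a", "c", "b"], "a")],
]
print(first_position_counts(runs))  # [2, 2]
```

<p>Per-run counts close to the total number of items, as in the 240, 227, and 234 matches out of 248 reported above, would indicate a strong position bias.</p>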
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion and Future Work References</title>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The work of A. T. Cignarella is supported by the Eu</title>
        <p>ropean Union’s Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie Actions,</p>
      </sec>
      <sec id="sec-5-2">
        <title>Grant Agreement No. 101146287.</title>
        <p>In this paper, we presented O-Ster, a new corpus of Italian
stereotypes annotated through an experimental
framework. The corpus includes 1, 022 texts annotated at the
span level. Each span has been complemented by a
rationale expressing the individuated stereotype, and
rationales served as a basis for the annotators to create
labels associated with each text. This bottom-up process
of label generation enabled observing how annotators
with diferent backgrounds, and an LLM conceptualize
the phenomenon. Results show a high subjectivity in
the conceptualization of stereotypes by humans and the
alignment of the LLM with certain specific labels in a
zero-shot setting.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Future work will focus on expanding the corpus, in or</title>
        <p>der to better understand how subjectivity afects this
phenomenon and to what extent the annotation procedure
may be generalizable and transferable to other languages
and tasks of abusive language detection.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Speech Tools for Italian, CEUR, 2020, pp. 1–9. Minerva-7B-instruct-v1.0, 2024. Accessed in</title>
        <p>[13] P. Chiril, F. Benamara, V. Moriceau, “Be nice to June 2025.</p>
        <p>your wife! The restaurants are closed”: Can Gender [22] K. Stanczak, I. Augenstein, A Survey on Gender Bias</p>
      </sec>
      <sec id="sec-5-5">
        <title>Stereotype Detection Improve Sexism Classifica- in Natural Language Processing, 2021. URL: http:</title>
        <p>tion?, in: Findings of the Association for Compu- //arxiv.org/abs/2112.14168. doi:10.48550/arXiv.
tational Linguistics: EMNLP 2021, 2021, pp. 2833– 2112.14168, arXiv:2112.14168 [cs].
2844. [23] P. Fortuna, S. Nunes, A Survey on Automatic
De[14] W. S. Schmeisser-Nieto, A. T. Cignarella, tection of Hate Speech in Text, ACM Computing</p>
      </sec>
      <sec id="sec-5-6">
        <title>T. Bourgeade, S. Frenda, A. Ariza-Casabona, Surveys 51 (2019) 1–30. URL: https://dl.acm.org/doi/</title>
        <p>
          M. Laurent, P. G. Cicirelli, A. Marra, G. Corbelli, 10.1145/3232676. doi:10.1145/3232676.
F. Benamara, et al., Stereohoax: a multilingual [24] W. Schmeisser-Nieto, M. Nofre, M. Taulé, Criteria
corpus of racial hoaxes and social media reactions for the annotation of implicit stereotypes, in: N.
Calannotated for stereotypes, Language Resources zolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
and Evaluation (
          <xref ref-type="bibr" rid="ref6">2024</xref>
          ) 1–39. T. Declerck, S. Goggi, H. Isahara, B. Maegaard,
[15] M. Nadeem, A. Bethke, S. Reddy, StereoSet: Mea- J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.),
suring stereotypical bias in pretrained language Proceedings of the Thirteenth Language Resources
models, in: C. Zong, F. Xia, W. Li, R. Nav- and Evaluation Conference, European Language
igli (Eds.), Proceedings of the 59th Annual Meet- Resources Association, Marseille, France, 2022, pp.
ing of the Association for Computational Lin- 753–762. URL: https://aclanthology.org/2022.lrec-1.
guistics and the 11th International Joint Confer- 80/.
ence on Natural Language Processing (Volume [25] C. Bosco, V. Patti, S. Frenda, A. T. Cignarella,
        </p>
      </sec>
      <sec id="sec-5-7">
        <title>1: Long Papers), Association for Computational M. Paciello, F. D’Errico, Detecting racial</title>
      </sec>
      <sec id="sec-5-8">
        <title>Linguistics, Online, 2021, pp. 5356–5371. URL: stereotypes: An Italian social media corpus</title>
        <p>https://aclanthology.org/2021.acl-long.416/. doi:10. where psychology meets NLP, Information
18653/v1/2021.acl-long.416. Processing &amp; Management 60 (2023) 103118.
[16] Z. Wu, S. Bulathwela, M. Pérez-Ortiz, A. S. URL: https://linkinghub.elsevier.com/retrieve/pii/
Koshiyama, Stereotype detection in llms: A S0306457322002199. doi:10.1016/j.ipm.2022.
multiclass, explainable, and benchmark-driven ap- 103118.
proach, 2024. URL: https://api.semanticscholar.org/ [26] T. Bourgeade, A. T. Cignarella, S. Frenda, M.
Lau</p>
      </sec>
      <sec id="sec-5-9">
        <title>CorpusID:268856718. rent, W. Schmeisser-Nieto, F. Benamara, C. Bosco,</title>
        <p>[17] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda, V. Moriceau, V. Patti, M. Taulé, A
multilin</p>
      </sec>
      <sec id="sec-5-10">
        <title>A. Ariza-Casabona, M. Farrús, P. Rosso, M. Taulé, gual dataset of racial stereotypes in social media</title>
      </sec>
      <sec id="sec-5-11">
        <title>Overview of detests-dis at iberlef 2024: Detection conversational threads, in: Findings of the As</title>
        <p>
          and classification of racial stereotypes in spanish- sociation for Computational Linguistics: EACL
learning with disagreement, Procesamiento del 2023, Association for Computational Linguistics,
Lenguaje Natural 73 (
          <xref ref-type="bibr" rid="ref6">2024</xref>
          ) 323–333. Dubrovnik, Croatia, 2023, pp. 686–696. URL: https:
[18] D. Hovy, S. Prabhumoye, Five sources of bias in nat- //aclanthology.org/2023.findings-eacl.51/. doi: 10.
ural language processing, Language and linguistics 18653/v1/2023.findings-eacl.51.
compass 15 (2021) e12432. [27] A. T. Cignarella, A. Giachanou, E. Lefever,
Stereo[19] P. Röttger, B. Vidgen, D. Hovy, J. Pierrehumbert, type Detection in Natural Language
Process
        </p>
      </sec>
      <sec id="sec-5-12">
        <title>Two contrasting data annotation paradigms for sub- ing, 2025. URL: https://arxiv.org/abs/2505.17642.</title>
        <p>jective nlp tasks, in: Proceedings of the 2022 Con- arXiv:2505.17642.
ference of the North American Chapter of the As- [28] S. T. Fiske, A. J. C. Cuddy, P. Glick, J. Xu, A model of
sociation for Computational Linguistics: Human (often mixed) stereotype content: Competence and</p>
      </sec>
      <sec id="sec-5-13">
        <title>Language Technologies, 2022, pp. 175–190. warmth respectively follow from perceived status</title>
        <p>[20] A. Hautli-Janisz, E. Schad, C. Reed, Disagreement and competition, Journal of Personality and Social
space in argument analysis, in: G. Abercrom- Psychology (2002) 878–902.
bie, V. Basile, S. Tonelli, V. Rieser, A. Uma (Eds.), [29] A. Koch, R. Imhof, R. Dotsch, C.
Unkel</p>
      </sec>
      <sec id="sec-5-14">
        <title>Proceedings of the 1st Workshop on Perspectivist bach, H. Alves, The abc of stereotypes</title>
      </sec>
      <sec id="sec-5-15">
        <title>Approaches to NLP @LREC2022, European Lan- about groups: Agency/socioeconomic success,</title>
        <p>guage Resources Association, Marseille, France, conservative-progressive beliefs, and communion.,
2022, pp. 1–9. URL: https://aclanthology.org/2022. Journal of personality and social psychology 110
nlperspectives-1.1/. (2016) 675–709. URL: https://api.semanticscholar.
[21] SapienzaNLP, sapienzanlp/minerva-7b-instruct- org/CorpusID:6287638.</p>
        <p>v1.0, https://huggingface.co/sapienzanlp/ [30] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda,</p>
      </sec>
      <sec id="sec-5-16">
        <title>Annotator 2 self-describes as a 30-year-old white Italian</title>
        <p>woman with no migratory background. While she has
not experienced migration firsthand, she has lived abroad
as an expatriate, navigating international mobility from
a position of privilege. Her native language is Italian.</p>
      </sec>
      <sec id="sec-5-17">
        <title>Annotator 5 self-identifies as a 24-year-old woman</title>
        <p>second-generation migrant. She is a Master’s student
with a background in political activism, she has
volunteered with non-governmental organizations and has
written non-specialist articles about Italian second
generation of migrants.</p>
      </sec>
      <sec id="sec-5-18">
        <title>Presented below is the prompt in Italian:</title>
      </sec>
      <sec id="sec-5-19">
        <title>And its English translation:</title>
        <p>Ti viene fornita in input (Input) una frase
estratta dai social media, insieme a tre
possibili stereotipi (Opzioni). Il tuo compito
è individuare quale stereotipo è implicito
nella frase, scegliendo tra le opzioni
fornite. Restituisci in output (Output) una
singola opzione, sotto forma di lista Python
(es. [’Opzione 1’]).</p>
        <p>Input: Mattinata di ieri passata in un’aula
di tribunale. 23 udienze al ruolo.
Imputati stranieri 19; imputati italiani 4.
Imputati stranieri presenti 0; imputati italiani
presenti 4. Conclusione (del tutto
personale); vengono tutti a delinquere qua e se ne
fregano della nostra giustizia
Opzioni: [Sono subdoli, Sono pericolosi,
Sono una minaccia]
Output:
You are given as input (Input) a sentence
extracted from social media, along with three
possible stereotypes (Options). Your task
is to identify which stereotype is implied
in the sentence by selecting one of the
provided options. Return as output (Output) a
single option in the form of a Python list
(e.g., [’Option 1’]).</p>
        <p>Input: Yesterday morning spent in a courtroom. 23 judicial hearings on the docket. Foreign defendants 19; Italian defendants 4. Foreign defendants present 0; Italian defendants present 4. Conclusion (entirely personal); they all come here to commit crimes and don’t give a damn about our justice</p>
        <p>Options: [Subtle, Dangerous, Threat]</p>
        <p>Output:</p>
      </sec>
      <sec id="sec-5-20">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to improve the writing style and to check grammar and spelling. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
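<p>Since the prompts above ask the model to reply with a Python list literal (e.g., [’Option 1’]), the reply has to be parsed back into a label. A minimal, defensive way to do this with the standard library is sketched below; this is an illustration, not the authors’ pipeline, and the function name is hypothetical:</p>

```python
import ast


# Hypothetical helper: recover the single chosen option from a model reply
# that should look like "['Opzione 1']"; returns None for anything else.
def parse_choice(reply):
    try:
        value = ast.literal_eval(reply.strip())
    except (ValueError, SyntaxError):
        return None
    if isinstance(value, list) and len(value) == 1 and isinstance(value[0], str):
        return value[0]
    return None


print(parse_choice("['Opzione 1']"))  # Opzione 1
print(parse_choice("no list here"))   # None
```

<p>ast.literal_eval only evaluates literal expressions, so malformed or free-form model replies are rejected rather than executed.</p>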
      </sec>
    </sec>
  </body>
</article>