<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Testing LLMs' Sensitivity to Sociodemographics in Offensive Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lia Draetta</string-name>
          <email>lia.draetta@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soda Marem Lo</string-name>
          <email>sodamarem.lo@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele D'Avenia</string-name>
          <email>samuele.davenia@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Basile</string-name>
          <email>valerio.basile@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rossana Damiano</string-name>
          <email>rossana.damiano@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, University of Turin</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent research in text classification increasingly leverages generative Large Language Models (LLMs) to address a wide range of tasks, including those involving highly subjective linguistic phenomena, such as hate speech and offensive language detection, areas closely tied to semantics and pragmatics. A growing body of work in the NLP community is examining how annotators' backgrounds influence labeling decisions, while also studying model biases and alignment with different social groups. A frequently used technique with generative models is sociodemographic prompting, in which LLMs are asked to impersonate individuals based on their known demographic traits. In this work, we further explore this technique and its limitations on a disaggregated offensive speech detection dataset. We selected five models with 7 to 8 billion parameters and prompted them to classify the sentences, providing all possible combinations of the available sociodemographic traits (gender, race and political leaning). Additionally, we prompted the models to provide brief explanations of their choices to investigate their motivations. Through consistent quantitative and qualitative analyses, we observed limitations in their ability to exploit demographic information. The results underscore the need for in-depth analysis going beyond performance metrics when this technique is adopted.</p>
      </abstract>
      <kwd-group>
        <kwd>Data perspectivism</kwd>
        <kwd>Sociodemographic prompting</kwd>
        <kwd>Offensive speech detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The capabilities of Large Language Models (LLMs) are being increasingly assessed across a variety of
tasks, reaching a point where they are now frequently used for annotation purposes as well. Some
studies have shown promising results [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], especially given the challenges of building large annotated
corpora and the high costs associated with manual annotation. On the other hand, the use of LLMs for
tasks involving highly subjective judgments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], such as detecting hate speech and abusive language,
has raised concerns about their ability to align with diverse perspectives and produce annotations
that actually mirror human label choices [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Moreover, Santy et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposed a framework for
quantifying the positionality of datasets and models by computing the correlation with demographic
groups, uncovering a strong alignment with the WEIRD (Western, Educated, Industrialized, Rich,
Democratic) population.
      </p>
      <p>
        Inspired by the perspectivist approach — which treats disagreement as a valuable source of information
rather than noise [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] — the NLP community has become increasingly attentive to the subjectivity inherent
in highly opinionated and context-dependent tasks, such as hate speech detection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To explore how
people’s views shift and what factors might be useful predictors of such diversity in annotators’ labeling,
numerous studies have examined the influence of raters’ sociodemographic characteristics and cultural
background [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While this approach has often proved effective with classification models [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ],
when generative models are prompted with demographic information, results are mixed.
      </p>
      <p>
        Several recent studies have begun to investigate whether LLMs can adopt different perspectives
during annotation, paving the way to a new prompting methodology known as sociodemographic
prompting [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. By guiding the model to simulate the viewpoints of different social groups during
the annotation process, researchers have attempted to uncover inherent biases while also highlighting
both the potential and the limitations of using LLMs for data annotation and generation. There is no
clear consensus about the effectiveness of this strategy [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ], and it remains unclear how models
adjust their outputs based on provided demographic information [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
      </p>
      <p>
        Aiming to address this gap, in this study we present a systematic comparison of five models’ outputs
when prompted with different demographics and their intersections. We evaluated the LLMs on an
off-the-shelf dataset focused on toxic speech detection, specifically for the presence of racist and offensive
speech. The corpus also includes information about annotators’ identities and beliefs, as well as textual
characteristics [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The authors’ results show a strong association between annotator identity and
toxicity ratings, especially with respect to political leaning: more conservative annotators were
more likely to rate African American English (AAE) dialect as toxic. Building on these findings, we
considered this dataset well-suited for our study, as it enables us to investigate both the effectiveness of
sociodemographic prompting and how linguistic features of the text influence model predictions. To
expand on previous studies, we prompt the models not only to produce an annotation for the input
messages but also to generate an explanation for why that specific option was chosen. This allows us to
gain more direct insights into the reasons why models assign particular labels, exploring whether the
demographic information provided influences their interpretation.
      </p>
      <p>
        In Hate Speech Detection, explanations are of crucial importance. In previous work, human
annotators have been asked to provide motivations for why a certain text is offensive, and models’ capacity
to produce such motivations has also been assessed [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Moreover, using them for fine-tuning helps models achieve
better performance [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. A similar strategy is adopted in Humor Understanding, where generative
models have been asked to explain jokes and articulate why specific caption-image combinations are
perceived as funny, or to provide interpretations of memes [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ]. In both fields, the quality of the
explanations is assessed by comparing them to reference human ones. However, in this work, we are
not interested in such a comparison, since we leverage explanations as a first window into the reasons driving
the models’ decisions.
      </p>
      <p>Specifically, we aim to answer the following research questions:
• RQ1: Does providing sociodemographic information to the LLM help in performing classification
tasks linked to a subjective linguistic phenomenon, such as offensive speech?
• RQ2: Which demographic features appear to be most influential in determining the assigned
label for text classification?
• RQ3: Does sociodemographic information in the prompt alter the explanations generated by the
model?</p>
      <p>Our study highlights the challenges faced by the selected models in addressing a complex phenomenon
such as offensive speech in a social media context. On the one hand, we did not observe consistent
performance improvements, regardless of whether sociodemographic traits were provided. On the
other hand, both labeling behavior and generated explanations showed little to no variability across all
settings. When considering labels and explanations together, it is evident that the models struggle to
disambiguate cases where AAE, slurs, or swearwords serve as markers of offensive content from those
where they are used in reclaimed, informal, or ironic ways.</p>
      <p>These findings raise important questions as to whether sociodemographic prompting is an
efective technique to model human annotation behavior in highly subjective and complex
tasks. Finally, they underscore the need for a careful and comprehensive error analysis when such a
technique is adopted.</p>
      <p>The paper is organized as follows: in Section 2 we present a review of the literature; in Section 3
we describe the dataset characteristics and how it has been used and filtered for this study. Section 4
outlines the experimental design and setup, while Section 5 presents and discusses the corresponding
results. Finally, in Section 6, we provide both quantitative and qualitative analyses to assess the models’
capacity to adapt their labeling behavior based on sociodemographic traits, alongside an error analysis
and an examination of the explanations generated by the models.</p>
      <p>The full code is available at the following link: https://github.com/liadraetta/intersectionality-llm.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <sec id="sec-2-1">
        <title>2.1. LLMs alignments</title>
        <p>
          The interest in how language models align with humans based on social and cultural backgrounds
reflects long-standing concerns about bias and representation in NLP systems [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Given the strong
generalization capabilities of language models, several studies have begun to highlight various encoded
biases and their potential to reproduce and amplify such stereotypical associations [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>
          More recently, a line of research has focused on investigating the prevailing worldviews that LLMs
adopt when performing different tasks [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Looking at demographic alignment of LLMs during
default tasks, a recent study assessed whether the predictions of LLMs align more closely with those of
individuals from particular demographic groups when no explicit conditioning is applied [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The
authors concluded that LLMs do not represent all segments of society equally, and that
sociodemographic prompting systematically influences their outputs. Similarly, other studies have examined how
different LLMs process various socio-economic dimensions, such as social class [
          <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
          ] and religion
[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], highlighting that these models often exhibit stratified and biased views. In this context, several
works analyze LLMs’ alignment with specific sociodemographic groups and show that model responses
are biased towards responses by participants from Western countries [
          <xref ref-type="bibr" rid="ref12 ref29 ref6">29, 12, 6</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sociodemographic prompting</title>
        <p>
          Studies in the field of data perspectivism have demonstrated that modeling annotators’ views improves
performance on subjective NLP tasks [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Recently, researchers have begun exploring the influence
of sociodemographic prompting for classification tasks, aiming to understand how such information
affects model behavior. Multiple studies have investigated how models adapt to provided demographic
information, but no unambiguous conclusions have emerged.
        </p>
        <p>
          Schäfer et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] claim that sociodemographic prompting, adopted for offensiveness and politeness
ratings, influences results in a structured way, and their analyses show that LLMs exhibit variations in
annotation based on demographic attributes. Beck et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] found that sociodemographic prompting
can improve zero-shot performance on subjective NLP tasks; however, it does not consistently
outperform standard prompting. Moreover, their results show that model variability was more strongly
influenced by factors such as prompt formulation than by demographic attributes. Despite this, they find
that sociodemographic prompting can be useful for identifying ambiguous instances, thereby supporting
annotation efforts rather than serving as an effective approach to data annotation.
        </p>
        <p>
          When leveraging sociodemographic prompting to compare human biases with those exhibited by
persona-based LLMs, models showed a limited capacity to modify their behavior and adapt to specific
personas [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], reinforcing existing work that questions their ability to faithfully reproduce human
behavior [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], especially sociodemographic behaviors [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. On the other hand, strong alignment with
White participants was observed when nine LLMs were evaluated on two subjective tasks (politeness
and offensiveness) [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. Moreover, the authors reported that sociodemographic prompting led to
inconsistent improvements in the models’ ability to process language from specific sub-populations.
Finally, another study has also raised general doubts about LLMs’ capacity to reflect diverse demographic
traits [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], leading authors to caution against using LLM-based simulations for subjective tasks.
        </p>
        <p>Prior work shows no consensus on the limits and potential of prompting with demographic traits for
subjective tasks. Given these open issues, this study aims to shed light on how such traits influence
model predictions and highlight these differences through explanations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The dataset used for conducting our experiment is the Annotators with Attitudes dataset, designed
to explore the who, why and what of toxicity annotation [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The authors conducted two online
studies, the former on a controlled set of 15 posts, and the latter on a larger set of posts, simulating a
crowd-sourced dataset. For our research, we leveraged the latter to make our results comparable with
other studies in the field.
      </p>
      <p>The corpus consists of 627 texts, annotated by 184 people from the US. The authors collected 3,463
annotations, released in a disaggregated fashion. Each post was annotated by a median of 6 people,
balanced across politics and race: two white conservatives, two white liberals and two black participants.
This distribution reflects the composition of the annotator pool, which was unbalanced with respect to
the race trait.</p>
      <p>Information about annotators’ gender, politics, race and age was released to gain insights into who
annotates. To explore the why of their rating, the authors collected annotators’ beliefs in terms of
seven attitude dimensions. Finally, each text was defined by three categories useful to investigate what
is considered toxic, i.e. characteristics that tend to influence toxicity labeling: anti-black language,
presence of African American English dialect (AAE), and vulgar language (e.g. swearwords, slurs).</p>
      <p>The annotation process consisted of providing a score from 1 to 5 assessing how racist and how
offensive the text was perceived to be, either personally (“to you”) or generally (“to anyone”). Offensiveness
scores were then averaged to produce a single rating. For our study, we focus on offensiveness by
converting the averaged scores into a binary label. Scores below 3 were treated as Not Offensive,
those above 3 as Offensive, while the 287 instances with a score exactly equal to 3 were removed. The
impact of this filtering step was limited, as most annotations were concentrated at the extremes, with
intermediate scores occurring far less frequently. The full distribution of scores is reported in Section A.</p>
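      <p>For illustration, this filtering and binarization step amounts to the following minimal sketch in Python; the column name avg_offensiveness is hypothetical, as the released dataset uses its own schema.</p>
      <p>import pandas as pd
# One row per text-annotator pair; "avg_offensiveness" is a hypothetical
# column holding the averaged 1-5 offensiveness rating.
df = pd.read_csv("annotations.csv")
# Drop the borderline annotations with a score of exactly 3, then binarize:
# below 3 becomes Not Offensive, above 3 becomes Offensive.
df = df[df["avg_offensiveness"].ne(3)].copy()
df["label"] = df["avg_offensiveness"].gt(3).map({True: "Offensive", False: "Not Offensive"})</p>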
      <p>Consistent with the authors’ analysis, we opted to work with three sociodemographic traits: gender
(man, woman), race (black, white) and political leaning (conservative, liberal, neutral).1 The distribution
of annotators’ races was unbalanced: 141 participants self-identified as White, 38 as Black, one person
preferred not to disclose this information, and one each identified as Middle Eastern, Hispanic, Native,
or Other. To address this imbalance, we excluded the 5 participants who did not self-identify as either
White or Black, resulting in a dataset of 176 raters and 627 texts, each annotated by a median of 5 raters.
This yielded a total of 3,094 annotations, with 1,592 labeled as Offensive and 1,502 as Not Offensive.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This work investigates whether different generative LLMs can modulate their predictions based on
sociodemographic information provided as input, and whether such prompting improves their
performance (RQ1). Specifically, each model is asked to decide whether a text is offensive and to justify its
decision with an explanation.</p>
      <p>
        As a first step, inspired by Balestrucci et al. [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], we tested two prompting strategies for our baseline
(without providing demographic data as input). The first required the model to answer by assigning a
binary label and then provide an explanation (denoted “Answer then Explain” A-Ex), while the second
required the model to reason and produce an explanation first and then answer (denoted “Explain
then Answer” Ex-A). Once the stronger approach was identified, we extended the experiments by
introducing varying levels of sociodemographic information into the prompt, enabling us to investigate
improvements in classification performance (RQ1), identify the most influential sociodemographic
traits (RQ2), and examine their use in explanation generation (RQ3).
      </p>
      <p>All experiments were conducted using a few-shot approach, with the model being provided with
some examples with the output in the expected format.2
1In the original dataset, political leaning is represented on a scale from −1 to 1, where −1 indicates a left-leaning orientation
and 1 a right-leaning one. Unlike the analysis in the referenced paper, which distinguishes only between conservative and
liberal, we also assign a value of 0 to denote a neutral political stance.
2The selected texts were removed from the evaluation set.</p>
      <p>
        We selected five decoder-only instruction-tuned models of comparable size (around 7 to 8 billion
parameters): deepseek-llm-7b-base [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], Llama-3.1-8B-Instruct [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], Qwen2-7B-Instruct [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ],
Ministral-8B-Instruct-2410 [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], and gemma-7b-it [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. The models were accessed via the Hugging Face API and
run locally on a single NVIDIA A100 40GB GPU with temperature 0.
      </p>
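      <p>As an illustration, the following is a minimal sketch of such a setup with the transformers library; the loading options (precision, device placement) are our assumptions, and temperature 0 corresponds to deterministic greedy decoding.</p>
      <p>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Any of the five checkpoints can be substituted for this model id.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")
def generate_answer(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=False performs greedy decoding, the equivalent of temperature 0.
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)</p>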
      <sec id="sec-4-1">
        <title>4.1. Ex-A and A-Ex prompt construction</title>
        <p>For the first set of experiments, we ran all the models with the Ex-A and A-Ex approaches to identify
the best one without any sociodemographic variables. When working with the A-Ex approach, the
expected answer format is: “[the sentence is offensive/is not offensive] [because]
[explanation]” while for the Ex-A approach it is: “[reasoning explanation] [so] [the
sentence is offensive/is not offensive]”.</p>
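        <p>A sketch of how such answers can be parsed back into a label and an explanation, here for the A-Ex format; the regular expression is ours, not part of the original setup, and outputs that do not match are treated as invalid.</p>
        <p>import re
A_EX = re.compile(r"the sentence (is|is not) offensive\W*because\W*(.+)",
                  re.IGNORECASE | re.DOTALL)
def parse_a_ex(answer):
    # Returns (label, explanation), or None when the model does not follow
    # the prespecified format (later counted as an incorrect prediction).
    match = A_EX.search(answer)
    if match is None:
        return None
    label = "Not Offensive" if match.group(1).lower() == "is not" else "Offensive"
    return label, match.group(2).strip()</p>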
        <p>The few-shot prompt was designed by including 4 input examples along with the output in the
expected format. The dataset employed in this study is disaggregated, meaning that identical instances
can have differing annotations. To avoid introducing bias by manually selecting the examples, we
focused on sentences annotated by the maximum number of annotators (six people) with unanimous
agreement. Hence, we selected two sentences annotated as Offensive and two
as Not Offensive. Additionally, two of the authors of the article, with a background in Linguistics,
produced the explanations. The few-shot examples for both approaches are included in Appendix B.</p>
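        <p>This selection heuristic can be sketched as follows, assuming the DataFrame from Section 3 with hypothetical "text" and "label" columns; the two examples per class were then drawn from this unanimously labeled pool.</p>
        <p>import pandas as pd
# Keep texts rated by the maximum number of annotators (six) whose
# binary labels are unanimous.
per_text = df.groupby("text")["label"].agg(n="size", n_labels="nunique", label="first")
unanimous = per_text.query("n == 6 and n_labels == 1")
fewshot = pd.concat([
    unanimous[unanimous["label"].eq("Offensive")].head(2),
    unanimous[unanimous["label"].eq("Not Offensive")].head(2),
])</p>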
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation</title>
        <p>Following the Perspectivist approach, we did not harmonize annotations into a gold standard. Instead,
we preserved all individual annotations, treating each data point as a &lt;text, annotator&gt; pair. The
LLMs were asked to assign a label to each pair, and their predictions were evaluated against the
full disaggregated dataset described in Section 3. To measure their performance, we report standard
classification metrics: Precision, Recall and F1 score.</p>
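        <p>In code, this evaluation reduces to standard scikit-learn calls over the disaggregated pairs; the column names below are hypothetical.</p>
        <p>from sklearn.metrics import precision_recall_fscore_support, f1_score
# y_true: the individual annotator's label for each text-annotator pair;
# y_pred: the parsed LLM prediction for that same pair.
y_true = df["label"]
y_pred = df["model_prediction"]
per_class = precision_recall_fscore_support(
    y_true, y_pred, labels=["Offensive", "Not Offensive"], zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)</p>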
        <p>In the following sections, when we refer to the “true labels”, we mean the labels assigned by the
individual annotators to the annotated texts.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Inclusion of sociodemographic information</title>
        <p>Since the A-Ex approach yielded most of the top-performing models, including the best overall model
(Section 5.1), the investigation on the influence of sociodemographic traits is conducted using this
strategy.</p>
        <p>In the second experiment, the annotators’ sociodemographic profiles were included in the prompt,
enabling an analysis of how identity framing may affect the model’s predictions. The gender, race,
and political orientation variables were included in the prompt by asking the model to adopt a specific
identity and respond from that perspective when performing the task. For example, if all three traits
are specified (intersectional model), it takes the structure “You are a [race] [gender] with [political view]
political leaning asked to provide precise information about the offensiveness of a sentence”. If some
variables are not specified, they are dropped from the persona description. In total, we designed 7
prompting conditions per model, corresponding to all the possible combinations of the demographic
traits.</p>
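        <p>The seven conditions can be generated programmatically, as in the sketch below; the persona template mirrors the one reported in Appendix B.4, while the trait encodings are illustrative.</p>
        <p>from itertools import combinations
TRAITS = ("race", "gender", "politics")
def persona(annotator, traits):
    # e.g. persona({"race": "white", "gender": "woman", "politics": "liberal"},
    #              ("race", "politics"))
    parts = ["You are a"]
    if "race" in traits:
        parts.append(annotator["race"])
    parts.append(annotator["gender"] if "gender" in traits else "person")
    if "politics" in traits:
        parts.append("with {} political leaning".format(annotator["politics"]))
    parts.append("asked to provide precise information about offensiveness of a sentence.")
    return " ".join(parts)
# All non-empty subsets of the three traits: 3 + 3 + 1 = 7 conditions.
conditions = [c for n in (1, 2, 3) for c in combinations(TRAITS, n)]</p>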
        <p>The full structure of the prompt, along with some of the generations obtained with varying amounts
of sociodemographic context, is shown in Figure 1, with the full prompt in Appendix B.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents the models’ performance on the classification task, beginning with a preliminary
comparison of the A-Ex and Ex-A strategies, followed by the results obtained using the A-Ex approach
with varying amounts of sociodemographic context included in the model prompt.</p>
      <sec id="sec-5-1">
        <title>5.1. A-Ex vs. Ex-A</title>
        <p>[Table 1: Precision, Recall and F1 scores for each model (Deepseek, Gemma, Qwen, Llama, Ministral) under the A-Ex and Ex-A prompting strategies.3]
3To handle missing values, which occur when the generative model does not produce an answer that follows the prespecified
format, these are treated as incorrect predictions and influence the recall on the corresponding class.</p>
        <p>The bar charts comparing the true label distribution with the predictions generated by each model reveal a
strong tendency across models to over-predict the Offensive label, suggesting a risk of over-moderation,
which is further analyzed in Section 6.2.1. This effect is especially pronounced for Ministral, Deepseek,
and Gemma, where the number of Not Offensive predictions is approximately a quarter of what is
present in the ground-truth distribution. This agrees with the observed low recall on the Not Offensive
class for these three models. Also note that Deepseek and Gemma in some cases fail to produce
valid answers (e.g., generating both Offensive and Not Offensive labels simultaneously or producing
excessively long responses), shown in grey in the bar chart.</p>
        <sec id="sec-5-1-1">
          <title>5.2. Sociodemographic information</title>
          <p>Table 2 reports the precision, recall and F1 scores for all models using various combinations
of sociodemographic traits along with the baseline where no sociodemographic traits are used, which is
highlighted in grey. The final column shows macro-averaged F1 scores, with the highest one for each
model underlined and the overall best in bold. The best performing model remains Llama across all
combinations of sociodemographic variables, while the highest F1 score overall is obtained with Llama
using both race and politics as sociodemographic variables.</p>
          <p>For two out of the five models, namely Ministral and Deepseek, the baseline achieves the highest F1
score, with substantial decreases observed when sociodemographic variables are added. In contrast, for
the remaining three models, incorporating sociodemographic variables results in a modest improvement
in F1 score (ranging from 0.03 to 0.11). Specifically, for Gemma, including only race slightly outperforms
the baseline (McNemar p-value = 0.043); for Qwen, combining gender and politics also yields a slight
improvement over the baseline (McNemar p-value = 0.106); and for Llama, using race and politics shows
similar behavior (McNemar p-value = 0.018). However, none of these improvements remain
significant after applying a Bonferroni correction4 for three tests (resulting in an adjusted threshold
α/3 ≈ 0.017 at a significance level α = 0.05).</p>
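          <p>The paired test can be sketched as follows with statsmodels; the asymptotic variant with continuity correction is our assumption, as the exact setting used is not specified.</p>
          <p>import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
def mcnemar_pvalue(preds_a, preds_b, gold):
    # 2x2 table of correct/incorrect decisions for two prompting conditions
    # evaluated on the same text-annotator pairs.
    a_ok = np.asarray(preds_a) == np.asarray(gold)
    b_ok = np.asarray(preds_b) == np.asarray(gold)
    table = [[np.logical_and(a_ok, b_ok).sum(), np.logical_and(a_ok, ~b_ok).sum()],
             [np.logical_and(~a_ok, b_ok).sum(), np.logical_and(~a_ok, ~b_ok).sum()]]
    return mcnemar(table, exact=False, correction=True).pvalue
# Usage: p = mcnemar_pvalue(baseline_preds, sociodem_preds, true_labels);
# under the Bonferroni correction for three tests, an improvement counts as
# significant only if p falls below 0.05 / 3, i.e. roughly 0.017.</p>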
          <p>When examining class-specific performance, the same conclusions as in the overall results hold, with
lower recall on the Not Offensive class and lower precision on the Offensive one across all options. No
combination of sociodemographic traits consistently improves the model performance, neither overall
nor on either of the two classes.
4The Bonferroni method controls for inflated Type I error when conducting multiple comparisons by dividing the overall
significance level by the number of tests.
[Table 2: Precision, Recall and F1 scores for each model (Deepseek, Gemma, Qwen, Llama, Ministral) under the baseline and all seven combinations of the gender, race and politics traits.]</p>
          <p>Overall, it appears that only small and not statistically significant improvements can be achieved with
the inclusion of sociodemographic variables for this task, while in some cases their inclusion causes
large drops in performance. Notably, results do not appear to change much when
comparing the baseline to the model prompted with sociodemographic information. Whether the model
is capable of introducing some variability on this classification task based on the sociodemographic
variables is further analyzed in Section 6.1.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Analysis</title>
      <p>In this section, we conduct additional analyses to address the questions raised, especially focusing on the
best-performing model, namely Llama with race and political sociodemographic traits provided in the
prompt. We analyzed the extent to which the models change their predictions when sociodemographics
are included in the prompt and the effect of textual variables on the model performance. Finally, a
qualitative analysis was conducted, with a focus on over-moderation and the explanations produced by
the model.</p>
      <sec id="sec-6-1">
        <title>6.1. Sensitivity of predictions to demographic traits</title>
        <p>
          To investigate whether the models change their labeling attribution depending on the
sociodemographic traits provided, we compared the predictions made with the full set of traits against those made
with a reduced set. We report both the percentage of exact matches (i.e., cases where the same label
is assigned) and Cohen’s Kappa scores. Since the models most often predict the label Ofensive, the
values of Cohen’s Kappa can be distorted [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ], so we comment based on the exact match percentage.
Since some models produced missing values, the instances where this occurs are removed from this analysis.
Specifically, for Deepseek, 32.7% of instances had at least one
prompting condition leading to a missing value, while this occurred for 28.1% of instances with Gemma. We
further analyze results by looking at specific sociodemographic groups. Figure 3 contains the results of
this investigation for Llama, which is the overall best performing model, while Figure 8 (Appendix C)
contains the same plot for Deepseek, Qwen, Gemma and Ministral. The percentages of label matches
are reported, with Cohen’s Kappa values shown in brackets.
        </p>
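        <p>The agreement computation itself is straightforward; a minimal sketch, where each argument is the list of labels predicted under one prompting condition on the same instances.</p>
        <p>from sklearn.metrics import cohen_kappa_score
def agreement(preds_full, preds_reduced):
    # Drop instances where either condition produced no valid answer.
    pairs = [(a, b) for a, b in zip(preds_full, preds_reduced)
             if a is not None and b is not None]
    exact = sum(a == b for a, b in pairs) / len(pairs)
    kappa = cohen_kappa_score([a for a, _ in pairs], [b for _, b in pairs])
    return exact, kappa</p>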
        <p>The first column in the heatmap compares predictions from the intersectional model (using all
sociodemographic traits) against those from models prompted with reduced sets of traits, indicated on
the y-axis, across the full dataset. The remaining columns analyze this in more detail by only retaining
the instances which are annotated by individuals with the sociodemographic traits shown on the x-axis.
For example, the entry in the second row, first column compares predictions from the intersectional
model and the gender-only model across the entire dataset. The entry in the last row, second column
compares the model’s predictions for the intersectional and the race-politics models only on the instances
annotated by White conservative men. This means that the intersectional model includes
Man White Conservative in the prompt, while the race-politics model includes White Conservative.
The idea is that if this number is small and there is low agreement, then the presence of gender Man
influences the model predictions. Finally, note that results for some groups (e.g., Nonbinary White
Liberals or Black Women Neutral) are based on small sample sizes (reported in brackets on the x-axis),
and thus no strong conclusions should be drawn for these cases.</p>
        <p>The results indicate that all models tend to predict the same label regardless of
sociodemographic traits. The baseline and the intersectional model produce the same label on 97.5% of instances
for Llama, with similar results for the other models. Similarly, when comparing intersectional model
predictions with those from any other sociodemographic subset on the full dataset, they consistently
match on over 96.5% of instances for Llama, with similar conclusions also when looking at the results
for the other models.</p>
        <p>Focusing on specific sociodemographic groups, the intersectional model does not differ drastically
from the baseline or any other model with a subset of sociodemographic traits. The largest disagreement
appears for the Woman Black Conservative group when comparing the intersectional model to the
race-only model (including only Black in the prompt), with 93% of instances matching. Similar comments
for the other models analyzed are reported in Appendix C.</p>
        <sec id="sec-6-1-1">
          <title>6.1.1. Impact of textual variables on classification performance</title>
          <p>We analyze the influence of the textual variables identified in the original dataset to see how they
affect the model performance. Here we focus on vulgar language, since it turned out to be the most
impactful on model performance. Results on the other two text characteristics (African American
English and anti-black language) are reported in Appendix C. The analysis is conducted on the
best-performing model.</p>
          <p>Table 3 contains the precision, recall and F1 scores for both classes, with the first two rows for text
instances containing vulgar terms and the last two for texts which do not. We note that the model
performance on the negative class is much lower for vulgar texts than for non-vulgar ones, mainly
due to a low recall, which causes a drop of approximately 0.2 in F1 compared to non-vulgar
texts. This low recall, in turn, stems from a large number of false positives on this split: the model
struggles to recognize instances that contain vulgar terms but are not offensive.</p>
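          <p>The split evaluation reuses the metric computation from Section 4.2, applied separately to each value of the textual flag; the boolean "vulgar" column is a hypothetical name for the annotation released with the original dataset.</p>
          <p>from sklearn.metrics import precision_recall_fscore_support
for flag, split in df.groupby("vulgar"):
    p, r, f1, support = precision_recall_fscore_support(
        split["label"], split["model_prediction"],
        labels=["Offensive", "Not Offensive"], zero_division=0)
    print("vulgar =", flag, "per-class F1:", f1)</p>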
          <p>Overall, considering also the results in Table 6 and Table 7 (Appendix C), it appears that the presence
of any of the three variables increases the number of false positives, which points to
over-moderation being a key issue.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Qualitative analysis</title>
        <p>To better understand how the selected models moderate their outputs with respect to demographic
information, we conducted a qualitative analysis. First, using the annotated dataset, we examined the
phenomenon of over-moderation, focusing on instances where the model labeled inputs as Offensive
while human annotators did not (i.e., false positives). Second, we analyzed cases where the model failed
to flag content human annotators considered Offensive (i.e., false negatives). Out of a total of 3,094
annotations, there were 395 false positives (168 texts) and 254 false negatives (146 texts). Given the
large number of experimental configurations and the time-intensive nature of manual evaluation, we
focused our analysis only on the model with the highest performance, namely Llama with race and
political leaning sociodemographics. Finally, we investigated whether the model’s explanations varied
based on the demographic information provided as input across the various Llama configurations.</p>
        <sec id="sec-6-2-1">
          <title>6.2.1. Over-moderation</title>
          <p>The models appear to struggle most with accurately categorizing sentences that contain vulgar language,
as discussed in Section 6.1.1, especially slurs (derogatory terms targeting specific communities). They
show a tendency to label such sentences as Offensive, even when the slurs are used in non-offensive
contexts (e.g., as interjections, in reappropriative usage, or in a neutral manner).</p>
          <p>Consider the sentences in Example 1 and Example 2:
(1) [Tweet] to all my beautiful bitches: i love you all
[Output] the sentence is offensive because it uses derogatory language to address women
(2) [Tweet] I hate being called a bitch
[Output] the sentence is offensive because it conveys a negative emotion towards a gendered slur
Llama misinterpreted the contextual use of the term “bitch”, which in both cases is not directed at a
specific person. In the first example, the word is used colloquially to express affection toward a specific
group of people, while in the second, the speaker reflects on how being called the term makes them
feel. In both instances, the model failed to capture the speaker’s intent, resulting in an incorrect label
assignment. Similarly, the sentence “I hate being called a fag” was flagged as Offensive with
the explanation the sentence is offensive because it contains derogatory language towards the lgbtq+
community, again misinterpreting the polarity of the sentence.</p>
          <p>The model also tends to classify as offensive those examples in which slurs are used to express more
articulated opinions about the use of such terms, for instance:
(3) [Tweet] ’nig*as’ should be used by Blacks only
[Output] the sentence is offensive because it promotes the use of a racial slur and implies that only black
people should use it
(4) [Tweet] I am a transsexual. I am also a transgender person. And a trans person too! But don’t you dare to call
me a tran*y, cause you’ll be in trouble. Especially if you’re a bitchy b3nder.</p>
          <p>
            [Output] the sentence is offensive because it contains derogatory language and a threat
Such outputs can be particularly problematic, as they risk censoring expressions of identity and
discourse within marginalized and underrepresented communities [
            <xref ref-type="bibr" rid="ref41">41</xref>
            ].
          </p>
          <p>Additionally, excessive moderation undermines the ability to deal with derogatory words, even
when they are not directed at humans:</p>
          <p>(5) [Tweet] my basement is full of those nasty cockroaches</p>
          <p>[Output] the sentence is offensive because it contains derogatory language towards a group of insects
In summary, the model appears to over-moderate content, often flagging non-offensive posts
as offensive, particularly when vulgar terms are present. This suggests that it struggles to interpret
such terms in context and may lack the pragmatic and sociocultural knowledge that human annotators
typically rely on to distinguish between genuinely offensive language and reclaimed, informal, or ironic
uses of such expressions.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.2.2. Demographics can be misleading</title>
          <p>When examining cases where the model labeled an input as Not Offensive but human annotators judged
it Offensive, we found that the sentences were often ambiguous and tended to divide annotators as well.
Since the model is asked to examine the same example multiple times with a slight difference in the
identity to impersonate, the possibility that a prediction would be correct for one group of annotators
and incorrect for another was expected. For this reason, we focused on the texts where all the annotators
agreed on labeling them as Offensive (8 out of 146). Interestingly, we noticed that the model prompted
with demographics tended to justify the offensive content as simple opinions and discussions, especially
focusing on the absence of hate speech and explicit language. On the other hand, when looking at
the output of the baseline, in 5 cases it actually provided the correct label and explanation, as in the
following examples:
(6) [Tweet] america is trash, too many immigrants that is why
[Baseline output] the sentence is offensive because it conveys a xenophobic message
[Race-politics output] the sentence is not offensive because it is a political statement that expresses a negative
opinion about a country and its immigration policies
(7) [Tweet] The way these nig*as hate I hope they get a check for it
[Baseline output] the sentence is offensive because it contains a racial slur and promotes hate
[Race-politics output] the sentence is not offensive because it is a phrase used to express frustration and does
not contain hate speech</p>
          <p>A more systematic comparison between the two models shows that, although the demographic-based
model generally produces more correct labels than the baseline, this improvement is driven primarily
by predictions of the Not Offensive class (Table 4, which counts instances where the correct prediction
comes from the baseline only, the race-politics model only, both, or neither). The number of
such cases is not large, but it highlights the importance of conducting a careful error analysis
to assess the risk of demographic features being misleading, particularly when they influence
predictions within a single class.</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>6.2.3. Explanations</title>
          <p>
            To analyze the variation of the explanations among the different configurations of Llama
compared to its baseline, we vectorized only the strict explanation, removing the more stable portion
of the response. Thus, we retained what comes after [the sentence is offensive/is not
offensive][because]. We used BERT-base-uncased5 to generate 768-dimensional vector
representations for the baseline and each configuration individually. We adopted k-means clustering with the
number of clusters chosen to maximize the Silhouette Coefficient [
            <xref ref-type="bibr" rid="ref42">42</xref>
            ], which measures how similar an
object is to its own cluster (cohesion) compared to other clusters (separation), and is bounded between
−1 and +1.
5https://huggingface.co/google-bert/bert-base-uncased
          </p>
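          <p>A minimal sketch of this pipeline follows; the [CLS] pooling choice is our assumption, as the pooling strategy for the 768-dimensional vectors is not specified.</p>
          <p>import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # 768-dimensional [CLS] token representations.
        return bert(**batch).last_hidden_state[:, 0].numpy()
# explanations: the free-text spans retained after "[because]".
X = embed(explanations)
best = max(
    (silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                random_state=0).fit_predict(X)), k)
    for k in range(2, 10))
# best holds (highest Silhouette Coefficient, corresponding number of clusters)</p>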
          <p>When computing the similarity across the full annotated dataset, the obtained clusters were grouped
based on whether the model classified a text as Offensive or Not Offensive, regardless of whether
the explanation was generated by the baseline or by the demographic-based model (Section D). This
demonstrates that the explanations, as well as the label distribution, did not present a strong variation
when demographics were provided.</p>
          <p>In a second step, we filtered explanations based on the assigned Offensive or Not Offensive label
to examine whether specific patterns of the annotated texts were reflected in the explanations. In
these cases, the clusters tend to be even more sparse, remaining homogeneous despite the textual
characteristics, whether vulgar language, African American English (AAE), or content targeting Black
people.</p>
          <p>Overall, the Silhouette Coefficient is close to random in all settings (Table 5), suggesting the absence
of a clear clustering tendency, thus a strong similarity between explanations generated by the
baseline and the demographic-based models.</p>
          <p>[Table 5: Silhouette Coefficients, overall and separately for Offensive and Not Offensive explanations, for each Llama configuration: gender, race, politics, gender race, gender politics, race politics, gender race politics.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Works</title>
      <p>This work analyzed the effect of sociodemographic prompting on small generative LLMs challenged on
a highly subjective linguistic phenomenon and a complex task. We focused not only on the effect of
these traits on model performance, but also on whether they had an impact on labeling behavior and
the generation of explanations. Our analysis employed a combination of quantitative and qualitative
methods, also considering the role of three textual variables.</p>
      <p>To answer RQ1, we found that introducing sociodemographic traits, either individually or in
intersection, did not lead to consistent improvements over the baseline.</p>
      <p>For RQ2, we observed that labeling behavior remained highly stable across conditions, with predictions
agreeing on more than 95% of cases regardless of the traits included in the prompt. As such, no particular
trait appeared to be influential, with only small deviations on specific sociodemographic groups, which
were not consistent across models. This contrasts with the findings presented by the dataset authors,
who observed a correlation between labeling behavior on texts containing AAE and political leaning.
This is not reflected in the LLM annotations, which show little variation regardless of whether the
political leaning is specified in the prompt or not.</p>
      <p>Finally, for RQ3, the additional analyses of the explanations also showed no variability induced by the
sociodemographic prompting. Instead, they tended to group according to the output label. Although
the explanations remained consistent across different traits, they provided valuable insights into the
model’s decision-making process and over-moderation behavior.</p>
      <p>Our results show that small models struggle to exploit sociodemographic information on
a sensitive task such as offensive speech detection, both at the label and explanation level.
Moreover, this work further highlights the importance of thorough analysis when this prompting
technique is used. Specifically, the study demonstrates the importance of going beyond performance
metrics, consistently investigating models’ behaviors, and deeply questioning their reliability.</p>
      <p>In the future, we intend to combine sociodemographic prompting with reasoning (CoT) models
and approaches, which may encourage the model to reflect different perspectives more effectively.
This would be an extension of the Ex-A approach, which was tried in the baseline case but not with
sociodemographic traits. Another possibility would be to adopt group-specific few-shot strategies,
where examples and explanations come from individuals of the relevant demographic group rather
than from consensus cases, to steer the model towards various perspectives.</p>
    </sec>
    <sec id="sec-8">
      <title>Limitation</title>
      <p>We identify and briefly describe the major limitations of this work.</p>
      <p>Models Scale The experiments are conducted using relatively small LLMs (7B and 8B parameters)
due to cost and computational resource constraints. Including results from larger models would allow
for a more comprehensive evaluation and help determine whether more capable systems are better able
to leverage sociodemographic information.</p>
      <p>Single dataset Our analyses were carried out on a single dataset, chosen because of the extensive
quantitative and qualitative evaluation it required and the associated computational limitations.
Extending the study to additional datasets that capture subjective and disaggregated phenomena would be
valuable to assess whether the findings generalize and whether sociodemographic information can play
a meaningful role in broader contexts. Moreover, a focus on datasets with strong sociodemographic
impact on a phenomenon that makes the model less prone to over-moderation would help better assess
whether the LLMs can modulate their predictions based on those.</p>
      <p>Prompt variations While we experimented with some prompts for the task and ways to specify the
sociodemographic information, our exploration was not exhaustive. Different prompt formulations,
demographics specification or ordering of sociodemographic attributes may influence model behavior
in ways we did not capture. As such, our results may partly reflect the specific prompt choices we
adopted rather than generalizable model tendencies.</p>
      <p>Clustering Sentence Embeddings We employed simple techniques to obtain sentence embeddings
and cluster them, to assess whether signals of the inclusion of sociodemographic traits could be
picked up in either positive or negative explanations. Specifically, we relied on BERT embeddings with
k-means clustering, which is straightforward but comes with limitations. The first is that since we use
a general-purpose model, sentences describing similar phenomena may be embedded very closely. An
additional limitation is the k-means assumption of spherical clusters. Hence, this approach should be
viewed more as an exploration to assess whether strong signals could be identified.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT-4 in order to: Grammar and spelling
check. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Information about the dataset</title>
      <p>The annotators had to answer the questions:
• How toxic/hateful/disrespectful or offensive does this post seem to you?
• How toxic/hateful/disrespectful or offensive does this post seem to anyone?</p>
      <p>
        The authors then took the average rating and obtained an offensiveness score [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Its complete
distribution is presented in Figure 4.
      </p>
    </sec>
    <sec id="sec-11">
      <title>B. Prompt Details</title>
      <sec id="sec-11-1">
        <title>B.1. Few-Shot Examples</title>
        <p>Offensive Examples:
• "A girl that looked like she worth the world to ya is around the corner getting fucked for a perk
#Facts"
• "to call black people niggas is just funny"
Not Offensive Examples:
• "as a black person, i disagree with white people saying nigga"
• "bitch it’s fucking spring, why is this weather wildin? :triumph:"</p>
      </sec>
      <sec id="sec-11-2">
        <title>B.2. Full Prompt (A-Ex)</title>
        <p>You are a person asked to provide precise information about offensiveness of a
sentence.
### Instruction:
Given the sentence in input (Input) taken from a social network, your task is to
determine if the text is offensive and why. Some of the sentences may contain
hate or explicit language, please provide an answer in any case.
Provide in output (Output) a single label among the following arrays [’offensive’,
’not offensive’]. Add a brief explanation following the format [the sentence is
offensive/is not offensive] [because] [explanation].</p>
        <p>Example 1:
Input: A girl that looked like she worth the world to ya is around the corner
getting fucked for a perk #Facts
Output: [The sentence is offensive][because][it conveys a stereotypical message
suggesting that women use their bodies to obtain benefits and contains vulgar
language]
Example 2:
Input: as a black person, i disagree with white people saying nigga
Output: [The sentence is not offensive][because][the person is expressing an
opinion about the misuse of the racial slur ’nigga’]
Example 3:
Input: to call black people niggas is just funny
Output: [The sentence is offensive][because][it promotes the use of a racial
slur]
Example 4:
Input: bitch it’s fucking spring, why is this weather wildin? :triumph:
Output:
[The sentence is not offensive][because][it is just a colloquial way to comment
on the weather]
Example to label:
Input: Women are not fucking inferior to men</p>
        <p>Output:</p>
      </sec>
      <sec id="sec-11-3">
        <title>B.3. Full Prompt (Ex-A)</title>
        <p>You are a person asked to provide precise information about offensiveness of a
sentence.
### Instruction:
Given the sentence in input (Input) taken from a social network, your task is to
determine if the text is offensive and why. Some of the sentences may contain
hate or explicit language, please provide an answer in any case.
Provide in output (Output) a single label among the following arrays [’offensive’,
’not offensive’]. Explain your reasoning first, and return a single label.
Follow this format [reasoning explanation] [so] [the sentence is offensive/is
not offensive].</p>
        <p>Example 1:
Input: A girl that looked like she worth the world to ya is around the corner
getting fucked for a perk #Facts
Output: [The sentence conveys a stereotypical message suggesting that women use
their bodies to obtain benefits and contains vulgar language][so][the sentence
is offensive]
Example 2:
Input: as a black person, i disagree with white people saying nigga
Output: [The person is expressing an opinion about the misuse of the racial
slur ’nigga’] [so][the sentence is not offensive]
Example 3:
Input: to call black people niggas is just funny
Output: [The sentence promotes
the use of a racial slur][so][the sentence is offensive]
Example 4:
Input: bitch it’s fucking spring, why is this weather wildin? :triumph:
Output: [The sentence is just a colloquial way to comment on the weather][so]
[the sentence is not offensive]
Example to label:
Input: Women are not fucking inferior to men</p>
        <p>Output:</p>
      </sec>
      <sec id="sec-11-4">
        <title>B.4. Example Prompt for A-Ex approach with sociodemographics</title>
        <p>You are a [[race]] [gender|person] [[with [politics] political leaning]] asked
to provide precise information about offensiveness of a sentence.
### Instruction:
Given the sentence in input (Input) taken from a social network, your
task is to determine if the text is offensive and why. Some of the sentences
may contain hate or explicit language, please provide an answer in any case.
Provide in output (Output) a single label among the following arrays [’offensive’,
’not offensive’]. Add a brief explanation following the format [the sentence is
offensive/is not offensive] [because] [explanation].</p>
        <p>Example 1:
Input: A girl that looked like she worth the world to ya is around the corner
getting fucked for a perk #Facts
Output: [The sentence is offensive][because][it conveys a stereotypical message
suggesting that women use their bodies to obtain benefits and contains vulgar
language]
Example 2:
Input: as a black person, i disagree with white people saying nigga
Output: [The sentence is not offensive][because][the person is expressing an
opinion about the misuse of the racial slur ’nigga’]
Example 3:
Input: to call black people niggas is just funny
Output: [The sentence is offensive][because][it promotes the use of a racial
slur]
Example 4:
Input: bitch it’s fucking spring, why is this weather wildin? :triumph:
Output: [The sentence is not offensive][because][it is just a colloquial way to
comment on the weather]
Example to label:
Input: Women are not fucking inferior to men</p>
        <p>Output:</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>C. Additional results</title>
      <sec id="sec-12-1">
        <title>C.1. Sensitivity of predictions to demographic traits</title>
        <p>Here we briefly comment the results of the analysis conducted in Section 6.1 for the other models
considered, shown in Figure 8.</p>
        <p>For Deepseek, the highest variability occurs for "Man White Neutral" when comparing the
intersectional model to the gender-and-politics model ("Man Neutral" in the prompt), with 90% of predictions
matching. In Qwen, this happens for "Man Black Conservative" when comparing the intersectional
model to the baseline, with agreement on 93% of instances. Gemma shows the largest difference for
"Woman White Neutral" when comparing the intersectional model with the baseline, the politics-only
model (i.e., "Neutral" included in the prompt) and the gender-race model (i.e., "Woman White" included
in the prompt), as well as for "Man Black Conservative" when comparing the intersectional model to
the gender-politics model (i.e., "Man Conservative" included in the prompt), with 93% of exact matches in all cases.
Finally, for Ministral this occurs for "Man White Liberal" compared to the baseline, with 94% of exact
matches; all other comparisons are above 95%.</p>
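        <p>These figures are simple exact-match rates between the predictions of two prompting settings; as a minimal sketch (with toy data and illustrative variable names), such agreement can be computed as:</p>
        <p>def exact_match_rate(preds_a, preds_b):
    """Fraction of instances on which two settings assign the same label."""
    assert len(preds_a) == len(preds_b)
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

# E.g., intersectional ("Man White Neutral") vs. gender-and-politics
# ("Man Neutral"); these prediction lists are toy data.
intersectional = ["offensive", "not offensive", "offensive", "offensive"]
gender_politics = ["offensive", "offensive", "offensive", "offensive"]
print(f"{exact_match_rate(intersectional, gender_politics):.0%}")  # 75%</p>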
      </sec>
      <sec id="sec-12-2">
        <title>C.2. Impact of textual variables on classification performance</title>
        <p>Here we report and briefly comment on the results of the same analysis conducted in Section 6.1.1 on the
other two textual variables of interest, namely whether the text contains AAE or targets Black People.</p>
        <p>Table 6 contains the same metrics, where the first two rows are for texts which contain AAE and
the last two for those that do not. We observe that the model performs much worse in identifying
offensive text when AAE is present, with an F1 score lower by approximately 0.15 points compared
to texts which are not in AAE. This is caused by low precision on this class, due to a large
number of false positives, showing a tendency of the model to classify text containing AAE
as offensive even when this is not the case.</p>
        <p>Table 7 contains model performance results on both classes, where the first two rows are for text
targeting black people (tBP) and the last two for text that does not. On text targeting black people,
model performance is much lower on the negative class, with low precision and low recall caused by
a large number of false positives. This means that the model struggles to correctly identify texts which
target black people in a non-offensive way. For the positive class, instead, model performance on this
split of the dataset is much higher.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>D. Explanation analysis</title>
      <p>We computed the similarity of the explanations between the baseline and each of the demographic-based
settings on the full dataset, to understand whether the explanations separate based on the model or on the predicted
label. Table 8 presents the composition of the obtained clusters. The last two columns detail how
many members of each cluster are explanations generated by the baseline and the demographic-based
models (Model comparison), and how many come from positive and negative predictions (Prediction
comparison). Results show that the obtained clusters tend to be homogeneous with respect to the source
of the explanation, and strongly separated on their predicted label.</p>
      <p>[Table 8: composition of cluster 0 and cluster 1 for each trait combination (Gender; Race; Politics; Gender race; Gender politics; Race politics; Gender race politics), by source model and by predicted label; the numeric counts are not recoverable from the extracted text.]</p>
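      <p>As an illustration of how such a cluster analysis could be set up, the sketch below uses TF-IDF features and k-means with two clusters; these particular choices, and all names in the code, are assumptions rather than the exact procedure used for Table 8:</p>
      <p>from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_composition(texts, sources, labels, k=2, seed=0):
    """Cluster explanation strings and report each cluster's make-up
    by source (baseline vs. demographic-based) and by predicted label,
    as in the 'Model comparison' / 'Prediction comparison' columns."""
    X = TfidfVectorizer(min_df=2).fit_transform(texts)
    assign = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        print(f"cluster {c}",
              "| model:", Counter(sources[i] for i in idx),
              "| prediction:", Counter(labels[i] for i in idx))</p>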
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gilardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kubli</surname>
          </string-name>
          ,
          <article-title>Chatgpt outperforms crowd workers for text-annotation tasks</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>120</volume>
          (
          <year>2023</year>
          )
          <article-title>e2305016120</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          , W. Chen, Annollm:
          <article-title>Making large language models to be better crowdsourced annotators, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          (Volume
          <volume>6</volume>
          :
          <string-name>
            <surname>Industry</surname>
            <given-names>Track)</given-names>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>190</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fornaciari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Plank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uma</surname>
          </string-name>
          ,
          <article-title>We need to consider disagreement in evaluation, in:</article-title>
          K. Church,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liberman</surname>
          </string-name>
          , V. Kordoni (Eds.),
          <source>Proceedings of the 1st Workshop on Benchmarking: Past</source>
          , Present and Future, Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>21</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .bppf-
          <volume>1</volume>
          .3/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . bppf-
          <volume>1</volume>
          .3.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Horych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ruas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          , T. Spinde,
          <article-title>The promises and pitfalls of LLM annotations in dataset labeling: a case study on media bias detection</article-title>
          , in: L.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ritter</surname>
          </string-name>
          , L. Wang (Eds.),
          <source>Findings of the Association for Computational Linguistics: NAACL</source>
          <year>2025</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Albuquerque, New Mexico,
          <year>2025</year>
          , pp.
          <fpage>1370</fpage>
          -
          <lpage>1386</lpage>
          . URL: https://aclanthology.org/
          <year>2025</year>
          .findings-naacl.
          <volume>75</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2025</year>
          . findings-naacl.
          <volume>75</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Miao</surname>
          </string-name>
          ,
          <article-title>Human-llm collaborative annotation through efective verification of llm labels</article-title>
          ,
          <source>in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Santy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Le</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Reinecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          , Nlpositionality:
          <article-title>Characterizing design biases of datasets and models</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>9080</fpage>
          -
          <lpage>9102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Abercrombie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pedrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Panizzon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Marco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          <article-title>Perspectivist approaches to natural language processing: a survey, Language Resources and Evaluation (</article-title>
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fortuna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dominguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wanner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Talat</surname>
          </string-name>
          ,
          <article-title>Directions for NLP practices applied to online hate speech detection</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>11794</fpage>
          -
          <lpage>11805</lpage>
          . URL: https://aclanthology. org/
          <year>2022</year>
          .emnlp-main.
          <volume>809</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .emnlp-main.
          <volume>809</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Mostafazadeh</given-names>
            <surname>Davani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <article-title>On releasing annotator-level labels and information in datasets</article-title>
          , in: C.
          <string-name>
            <surname>Bonial</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop</source>
          , Association for Computational Linguistics, Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>138</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .law-
          <volume>1</volume>
          .14/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .law-
          <volume>1</volume>
          .
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Al Kuwatly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wich</surname>
          </string-name>
          , G. Groh,
          <article-title>Identifying and measuring annotator bias based on annotators' demographic characteristics</article-title>
          , in: S. Akiwowo,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          Waseem (Eds.),
          <source>Proceedings of the Fourth Workshop on Online Abuse and Harms</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>184</fpage>
          -
          <lpage>190</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .alw-
          <volume>1</volume>
          .21/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .alw-
          <volume>1</volume>
          .
          <fpage>21</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Casola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Cignarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <article-title>Confidence-based ensembling of perspective-aware models</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>3496</fpage>
          -
          <lpage>3507</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          . emnlp-main.
          <volume>212</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>212</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , T. Hashimoto,
          <article-title>Whose opinions do language models reflect?</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>29971</fpage>
          -
          <lpage>30004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <article-title>Aligning language models to user opinions</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>5906</fpage>
          -
          <lpage>5919</lpage>
          . URL: https: //aclanthology.org/
          <year>2023</year>
          .findings-emnlp.
          <volume>393</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .findings-emnlp.
          <volume>393</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Argyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Busby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fulda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Gubler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rytting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wingate</surname>
          </string-name>
          ,
          <article-title>Out of one, many: Using language models to simulate human samples</article-title>
          ,
          <source>Political Analysis</source>
          <volume>31</volume>
          (
          <year>2023</year>
          )
          <fpage>337</fpage>
          -
          <lpage>351</lpage>
          . doi:
          <volume>10</volume>
          .1017/ pan.
          <year>2023</year>
          .
          <volume>2</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Dickerson</surname>
          </string-name>
          ,
          <article-title>Large language models that replace human participants can harmfully misportray and flatten identity groups</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schuf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lauscher</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>How (not) to use sociodemographic information for subjective nlp tasks</article-title>
          ,
          <source>arXiv preprint arXiv:2309.07034</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schuf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lauscher</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>Sensitivity, performance, robustness: Deconstructing the efect of sociodemographic prompting</article-title>
          ,
          <source>in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>2589</fpage>
          -
          <lpage>2615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vianna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Annotators with attitudes: How annotator beliefs and identities bias toxic language detection</article-title>
          , in: M.
          <string-name>
            <surname>Carpuat</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-C. de Marnefe</surname>
            ,
            <given-names>I. V.</given-names>
          </string-name>
          <string-name>
            <surname>Meza Ruiz</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the</source>
          <year>2022</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , Seattle, United States,
          <year>2022</year>
          , pp.
          <fpage>5884</fpage>
          -
          <lpage>5906</lpage>
          . URL: https://aclanthology. org/
          <year>2022</year>
          .naacl-main.
          <volume>431</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .naacl-main.
          <volume>431</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sap</surname>
          </string-name>
          , S. Gabriel, L. Qin,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Social bias frames: Reasoning about social and power implications of language</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>5477</fpage>
          -
          <lpage>5490</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          , S.-Y. Yun, Hare:
          <article-title>Explainable hate speech detection with step-by-step reasoning</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>5490</fpage>
          -
          <lpage>5505</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Da</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mankof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Do androids laugh at electric sheep? humor “understanding” benchmarks from the new yorker caption contest</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>688</fpage>
          -
          <lpage>714</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>41</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>41</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hwang</surname>
          </string-name>
          , V. Shwartz,
          <article-title>MemeCap: A dataset for captioning and interpreting memes</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>1433</fpage>
          -
          <lpage>1445</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .emnlp-main.
          <volume>89</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>89</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Blodgett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barocas</surname>
          </string-name>
          , H.
          <string-name>
            <surname>Daumé</surname>
            <given-names>III</given-names>
          </string-name>
          , H. Wallach,
          <article-title>Language (technology) is power: A critical survey of “bias” in NLP</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <article-title>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5454</fpage>
          -
          <lpage>5476</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .acl-main.
          <volume>485</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .acl-main.
          <volume>485</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '21,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2021</year>
          , p.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          . URL: https://doi.org/10.1145/3442188.3445922. doi:
          <volume>10</volume>
          .1145/3442188. 3445922.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Combs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bagdon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Probol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Greschner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Resendiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Velutharambath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wührl</surname>
          </string-name>
          , et al.,
          <article-title>Which demographics do llms default to during annotation?</article-title>
          ,
          <source>arXiv preprint arXiv:2410.08820</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Talat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Classist tools: Social class correlates with performance in nlp, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>12643</fpage>
          -
          <lpage>12655</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Talat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Impoverished language technology: The lack of (social) class in nlp</article-title>
          ,
          <source>in: Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING</article-title>
          <year>2024</year>
          ),
          <year>2024</year>
          , pp.
          <fpage>8675</fpage>
          -
          <lpage>8682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>F.</given-names>
            <surname>Plaza-del Arco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Divine llamas: Bias, stereotypes, stigmatization, and emotion representation of religion in large language models</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>4346</fpage>
          -
          <lpage>4366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. I.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schiefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hatfield-Dodds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Joseph</surname>
          </string-name>
          , et al.,
          <article-title>Towards measuring the representation of subjective global opinions in language models</article-title>
          ,
          <source>arXiv preprint arXiv:2306.16388</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Giorgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Avvenuti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cresci</surname>
          </string-name>
          ,
          <article-title>Human and llm biases in hate speech annotations: A socio-demographic analysis of annotators and targets</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>19</volume>
          ,
          <year>2025</year>
          , pp.
          <fpage>653</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Orlikowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Röttger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurgens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <article-title>Beyond demographics: Finetuning large language models to predict individuals' subjective text perceptions</article-title>
          , in: W. Che,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nabende</surname>
          </string-name>
          , E. Shutova, M. T. Pilehvar (Eds.),
          <source>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Vienna, Austria,
          <year>2025</year>
          , pp.
          <fpage>2092</fpage>
          -
          <lpage>2111</lpage>
          . URL: https://aclanthology.org/
          <year>2025</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>104</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2025</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>104</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurgens</surname>
          </string-name>
          ,
          <article-title>Sociodemographic prompting is not yet an efective approach for simulating subjective judgments with llms, in: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          (Volume
          <volume>2</volume>
          :
          <string-name>
            <surname>Short</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <year>2025</year>
          , pp.
          <fpage>845</fpage>
          -
          <lpage>854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <article-title>Quantifying the persona efect in LLM simulations</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>10289</fpage>
          -
          <lpage>10307</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>554</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>554</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Balestrucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oliverio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Anselma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>When ifgures speak with irony: Investigating the role of rhetorical figures in irony generation with llms (</article-title>
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>X.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Chen,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          , et al.,
          <article-title>Deepseek llm: Scaling open-source language models with longtermism</article-title>
          ,
          <source>arXiv preprint arXiv:2401.02954</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <surname>Meta</surname>
          </string-name>
          ,
          <source>Meta-llama-3</source>
          .
          <fpage>1</fpage>
          <string-name>
            <surname>-</surname>
          </string-name>
          8b-instruct,
          <year>2024</year>
          . URL: https://huggingface.co/meta-llama
          <source>/Llama-3</source>
          .
          <fpage>1</fpage>
          <string-name>
            <surname>-</surname>
          </string-name>
          8B-Instruct.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name><given-names>A.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Hui</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Yu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Bai</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Dang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lu</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Xue</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ni</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Peng</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Men</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Gao</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Bai</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Tan</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Ge</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Deng</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wei</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Fan</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yao</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Wan</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Chu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Cui</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Fan</surname></string-name>
          ,
          <source>Qwen2 technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.10671. arXiv:2407.10671.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Mistral AI</surname>
          </string-name>
          ,
          <source>Ministral-8B-Instruct-2410</source>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/mistralai/Ministral-8B-Instruct-2410.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <surname>Google</surname>
          </string-name>
          ,
          <article-title>Gemma: Introducing new state-of-the-art open models</article-title>
          ,
          <year>2024</year>
          . URL: https://blog.google/technology/developers/gemma-open-models/.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Feinstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Cicchetti</surname>
          </string-name>
          ,
          <article-title>High agreement but low kappa: I. The problems of two paradoxes</article-title>
          ,
          <source>Journal of Clinical Epidemiology</source>
          <volume>43</volume>
          (
          <year>1990</year>
          )
          <fpage>543</fpage>
          -
          <lpage>549</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/089543569090158L. doi:10.1016/0895-4356(90)90158-L.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>L.</given-names>
            <surname>Draetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ferrando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cuccarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <article-title>ReCLAIM project: Exploring Italian slurs reappropriation with large language models</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)</source>
          , CEUR Workshop Proceedings, Pisa, Italy,
          <year>2024</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>342</lpage>
          . URL: https://aclanthology.org/2024.clicit-1.40/.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          ,
          <article-title>Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</article-title>
          ,
          <source>Journal of Computational and Applied Mathematics</source>
          <volume>20</volume>
          (
          <year>1987</year>
          )
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/0377042787901257. doi:10.1016/0377-0427(87)90125-7.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>