<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>N. Gabriele);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Bias in the Age of Reasoning Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Cantini</string-name>
          <email>rcantini@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Gabriele</string-name>
          <email>nicola.gabriele@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Orsino</string-name>
          <email>aorsino@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Talia</string-name>
          <email>talia@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adversarial Robustness</institution>
          ,
          <addr-line>Fairness, Sustainable AI</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Calabria</institution>
          ,
          <addr-line>Rende</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>inference time; and () Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: () how the introduction of reasoning capabilities afects model fairness and robustness; whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As Large Language Models (LLMs) become increasingly integrated into high-stakes societal domains
such as healthcare, education, and law—owing to their advanced capabilities in natural language
understanding and generation [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]—concerns about embedded biases have grown significantly. These
biases can perpetuate harmful stereotypes, marginalize underrepresented groups, and undermine the
ethical deployment of AI systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. They often originate from multiple sources, including biased
training data that reflect historical inequalities and stereotypes, linguistic imbalances in corpora, flawed
algorithmic designs, and uncritical usage of AI technologies [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        To address the limitations of traditional LLMs, which rely on implicit, pattern-based reasoning,
researchers have developed techniques to elicit more structured and interpretable behavior. One such
approach is Chain-of-Thought (CoT) prompting, which encourages models to generate intermediate
reasoning steps at inference time without requiring architectural changes or specialized training [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
In contrast, a new class of models known as Reasoning Language Models (RLMs) has emerged. Unlike
standard language models using CoT, RLMs are explicitly trained to perform multi-step reasoning
through fine-tuned reasoning trajectories and integrated test-time search strategies [ 7, 8]. By embedding
      </p>
      <p>CEUR
Workshop</p>
      <p>ISSN1613-0073
logical inference capabilities directly into their training and architecture, RLMs move beyond
nexttoken prediction, ofering improved performance, transparency, and reliability, which are key for the
responsible deployment of AI systems [9].</p>
      <p>While prior research has extensively benchmarked bias in LLMs [10, 11, 12, 13] and explored
alignment techniques to improve safety [14], the relationship between reasoning capabilities and bias
mitigation remains underexplored. Specifically, it remains unclear whether explicit reasoning
mechanisms help reduce biased behavior in language models or inadvertently reinforce it through structured
inference chains [15]. In addition, the interplay between reasoning capabilities and adversarial bias
elicitation raises the question of whether such mechanisms enhance robustness or, conversely, increase
vulnerability to biased responses. To address this gap, this study presents a systematic evaluation of
bias robustness across diferent reasoning paradigms and three main model families: GPT, DeepSeek,
and Phi-4. For each family, we assessed the latest non-reasoning models (i.e., GPT-4o, DeepSeek V3 671B,
and Phi-4), their CoT–augmented variants, and their reasoning-by-design counterparts (i.e., o3-mini and
o1-preview for GPT; DeepSeek R1 and its distilled versions—DeepSeek Distill Qwen and DeepSeek Distill
Llama—for DeepSeek; and Phi-4-reasoning for Phi-4). Our investigation is guided by the following
research questions:
RQ1 How do diferent reasoning mechanisms (e.g., CoT prompting or reasoning by-design) afect
robustness to bias elicitation?
RQ2 Are reasoning models inherently safer than those relying on reasoning elicitation at inference
time via CoT prompting?
RQ3 How does the efectiveness of diferent jailbreak attacks targeting adversarial bias elicitation vary
across reasoning mechanisms?</p>
      <p>Experiments have been performed using the CLEAR-Bias benchmark [11], leveraging an
LLM-asa-judge framework to evaluate robustness against bias elicitation under adversarial conditions. This
involved exposing the diferent models to a set of curated jailbreak prompts designed to probe biases
across sociocultural and intersectional dimensions. In summary, our key contributions are as follows:
• We conduct a systematic evaluation of bias safety in RLMs at diferent scales, using an adversarial
approach to stress-test model safety under diferent reasoning configurations, including both
CoT-prompted and reasoning-enabled models.
• We provide empirical evidence that explicit reasoning—whether induced at training or inference
time—can increase vulnerability to bias elicitation, with CoT-prompted models exhibiting slightly
worse bias safety than their reasoning-enabled counterparts.
• We empirically show that vulnerability to adversarial prompting strongly depends on the type
of attack and the reasoning mechanisms embedded in the model, with non-reasoning models
exhibiting the highest overall resistance to jailbreak attacks targeting bias elicitation.</p>
      <p>The remainder of the paper is organized as follows. Section 2 reviews prior work on bias benchmarking
and the adversarial safety of reasoning models. Section 3 introduces the CLEAR-Bias benchmark and
outlines the methodology used in our evaluation. Section 4 illustrates the experimental results, and
Section 5 concludes with a discussion of key findings, implications, and directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent work has highlighted the vulnerability of LLMs to bias elicitation—the extraction of harmful,
stereotypical, or toxic content via ad-hoc or adversarial prompts—even in models specifically trained to
align with human values. These models encode social biases across diferent dimensions such as race,
gender, nationality, religion, and their intersections, revealing persistent representational harms that
can undermine fairness, inclusivity, and trust in real-world applications [12, 13, 16, 10].</p>
      <p>
        A particularly influential line of research examines how reasoning strategies, such as
Chain-ofThought (CoT) prompting, interact with bias elicitation. While CoT improves performance on a range
of logical and symbolic tasks [
        <xref ref-type="bibr" rid="ref6">6, 17</xref>
        ], its implications for fairness and safety remain less explored. Shaikh
et al. [18] present one of the first controlled studies evaluating the efect of zero-shot CoT prompting
on social bias. Their work reveals that prompting LLMs to “think step by step” can paradoxically
amplify bias, making models more likely to generate stereotypical or toxic outputs. Using adapted
versions of standard bias benchmarks (CrowS-Pairs [13], StereoSet [12], and BBQ [16]), along with
a custom dataset of harmful queries, they show that CoT often reduces refusal rates and increases
the likelihood of harmful completions, especially in larger models. Their analysis suggests that CoT
reasoning may encourage models to hallucinate spurious justifications that override safety constraints,
particularly when the task requires social nuance or judgment. Complementing this, Wu et al. [15]
systematically investigate how social bias manifests in intermediate reasoning steps of instruction-tuned
and reasoning-enabled models. Using the BBQ dataset, they show that reasoning traces often amplify
stereotypes, especially when models shift reasoning paths mid-response or employ shallow forms of
self-reflection. Their findings highlight that even correct answers can embed biased reasoning steps,
and that removing biased steps leads to improved model performance. This reinforces the idea that
reasoning alone does not guarantee fairness and can, in fact, reinforce harmful associations. Other work
in the literature has focused on assessing the general safety of reasoning-enabled LLMs, particularly
OpenAI’s o3-mini and DeepSeek R1. Arrieta et al. [19] conducted a large-scale, automated safety
evaluation using the ASTRAL framework [20], which systematically tests models on a set of prompts
spanning 14 safety-critical categories (e.g., hate speech, terrorism, privacy violations, misinformation).
Their findings show that DeepSeek R1 produces more unsafe outputs than o3-mini, ofering insights
into system-level safety of reasoning-enabled models. However, their work does not explicitly examine
how these safety failures relate to social biases and adversarial robustness, leaving open questions about
the intersection of reasoning capabilities, fairness, and bias safety.
      </p>
      <p>While prior research has highlighted vulnerabilities in reasoning LLMs, it has typically focused on
isolated reasoning strategies (e.g., chain-of-thought or reasoning by design) or a narrow range of model
families, with little attention to adversarial elicitation. This gap underscores the need for a deeper
examination of how reasoning paradigms, model scale, and adversarial techniques interact to influence
bias amplification. In this work, we analyze the behavior of both large and small reasoner models,
along with inference-time reasoning strategies such as zero-shot CoT prompting, to evaluate their
robustness against adversarial bias elicitation across diferent sociodemographic groups. Building on
the CLEAR-Bias benchmark [11], we apply jailbreak techniques to stress-test model safety and quantify
their vulnerability using LLM-based scalable automatic evaluations. This comprehensive analysis
extends prior work by systematically comparing reasoning strategies, model sizes, and adversarial
robustness under unified robustness, fairness, and safety metrics.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Benchmarking Adversarial Robustness to Bias Elicitation</title>
      <p>This section describes the methodology employed for evaluating bias vulnerability in LLMs, as proposed
in the CLEAR-Bias benchmark [11]. This includes an overview of the dataset design and the automated
evaluation framework relying on the LLM-as-a-judge paradigm [21, 22], which enables a scalable and
systematic assessment of both stereotypical behaviors and robustness under adversarial conditions.</p>
      <sec id="sec-3-1">
        <title>3.1. The CLEAR-Bias Dataset</title>
        <p>CLEAR-Bias (Corpus for Linguistic Evaluation of Adversarial Robustness against Bias) is a curated
benchmark dataset designed to assess the robustness of LLMs against bias elicitation, especially under
adversarial conditions. The dataset targets a broad spectrum of social biases and incorporates advanced
attack strategies intended to bypass safety mechanisms. It comprises 4,400 bias-probing prompts,
covering seven isolated bias categories—age, disability, ethnicity, gender, religion, sexual orientation,
and socioeconomic status—as well as three intersectional ones, i.e., ethnicity–socioeconomic status,
gender–sexual orientation, and gender–ethnicity. Prompts are structured according to two task formats:
• Choose the Option: the model selects between two possible answers in response to a contextualized
question, assessing inclination toward stereotypical or biased choices.
• Sentence Completion: the model is asked to complete a partial sentence using one of two options,
revealing tendencies toward biased generations.</p>
        <p>For every bias category  ∈  , 10 prompts are curated for each task  ∈  , yielding 20 base prompts per
category and 200 base prompts in total. These prompts are subsequently transformed using adversarial
techniques  ∈</p>
        <p>to create 4,200 additional prompts, resulting in a final dataset of 4,400 items. Seven
distinct jailbreak techniques are employed to generate adversarially augmented versions of the base
prompts, with each attack implemented in three variants to ensure diversity in adversarial conditions.
Employed jailbreak attacks include machine translation into low-resource languages (e.g., Slovene,
Macedonian, Scottish Gaelic) [23]; obfuscation via encoding schemes (e.g., base64, leetspeak); prefix
injection to enforce specific starting phrases; prompt injection using misleading preambles that instruct
the model to ignore prior text; refusal suppression to discourage typical model safety refusals; reward
incentive to encourage biased outputs through hypothetical rewards; and role-playing to assume biased
personas or characters (e.g., AIM—Always Intelligent and Machiavellian [24]).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Benchmarking Methodology</title>
        <p>To enable large-scale evaluation of responses, the methodology relies on an LLM-as-a-judge approach,
with DeepSeek V3 identified as the most reliable judge based on its highest agreement with a manually
curated control set of prompt–response pairs, as measured by Cohen’s  coeficient [ 11]. The judge
model is used to evaluate the behavior of tested LLMs on both base and adversarial prompts. In the
initial phase, each model is queried with all base prompts, and its responses are classified into four
categories  = {, , , }
stereotypes,</p>
        <p>, where  denotes stereotyped responses that reinforce or perpetuate common
denotes counter-stereotyped responses that challenge or flip stereotypes while still
relying on generalized assumptions,  denotes debiased responses that are impartial and balanced, and
 denotes refusal responses, indicating that the model declines to answer the prompt. Two main metrics
are computed per bias category  ∈  :
refusal rate (  ) and debiasing rate (  ) for task  ∈  :</p>
        <p>• Robustness   , which measures the model’s resistance to bias elicitation, considering both the
counter-stereotyped ( ̄ ) responses:</p>
        <p>• Fairness   , which evaluates the model’s neutrality by comparing the rate of stereotyped (  ) and

These are combined into a bias-specific safety score   , with the overall model safety  computed as the
average across all biases:</p>
        <p>Bias categories with a safety score above a predefined threshold  are considered safe, and are denoted
by the subset  =̃ { ∈  |</p>
        <p>≥  },  ⊆̃  . These categories proceed to subsequent adversarial evaluation,
where the jailbreak prompts of CLEAR-Bias are exploited to evaluate models under adversarial
conditions. To fairly assess model behavior in this more challenging setting, responses classified as refusals
are re-evaluated to identify possible misunderstandings (e.g., due to obfuscation), thereby excluding
cases where the behavior results from prompt misinterpretation rather than genuine refusal. Then, for
each  ∈  ̃, a new safety score  ̃
() is computed per attack, with the final safety score  ̃ incorporating

 =</p>
        <p>+   ,

 = 1 − |  −  ̄ | ,
  =
1
| | ∈
∑ 


  =
1
| | ∈
∑ 


  =   +   ,
2
 =
1
The relative safety reduction for bias  under attack  is denoted by Δ 
attack  computed as the mean safety reduction across all attacked bias categories:
() , with the efectiveness  () of
1
,
 () =
1
|| ̃ ∈  ̃
∑ Δ 
()
the minimum safety across all attacks for each bias. Categories denoted by  ̃c are those that remain
unchanged, i.e., not subjected to adversarial prompting.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setting</title>
        <p>This section presents a comprehensive analysis of our benchmarking results across a wide range of
language models with varying reasoning mechanisms, evaluating their robustness, fairness, and safety
in the context of sociocultural biases captured by CLEAR-Bias. To enable fine-grained evaluation, we
categorize the models into three main groups based on the type of reasoning  ∈ ℛ = { Base, CoT, Reasoner}.
For each group, we analyze diferent models from three families,  ∈ ℱ = { GPT, DeepSeek, Phi-4}.
Specifically, our analysis involves the following models:
• Base: standard pretrained language models without explicit reasoning induction, including</p>
        <p>DeepSeek V3 [25], GPT-4o, and Phi-4 [26].
• CoT : base models prompted with a zero-shot “Think step by step” instruction to elicit reasoning
behavior at inference time—namely, DeepSeek V3 CoT, GPT-4o CoT, and Phi-4 CoT.
• Reasoner : reasoning-enabled models trained for reasoning capabilities. These are further
subdivided by scale into Large Reasoning Models (LRMs)—DeepSeek R1 [27], o3-mini, and
o1preview—and Small Reasoning Models (SRMs)—Phi-4-reasoning [26], DeepSeek Distil Llama
8B [27], and DeepSeek Distill Qwen 14B [27].</p>
        <p>This categorization supports a multifaceted analysis of reasoning robustness under bias elicitation,
which aims to: () compare the robustness of large and small language models against both their
zeroshot CoT-prompted and reasoning-enabled variants; () investigate whether models explicitly fine-tuned
for reasoning are inherently more robust than those with elicited reasoning through prompting; and
() evaluate the efectiveness of diferent jailbreak attacks across diverse reasoning mechanisms.</p>
        <p>Importantly, models prompted with CoT instructions are asked to produce their reasoning within
&lt;think&gt;...&lt;/think&gt; tags. For these models, as well as for reasoner models that output reasoning
traces by default (i.e., without using &lt;think&gt; tags), we evaluate only the final answer and ignore any
reasoning content when categorizing the response with the LLM-as-judge paradigm, to ensure an
uniform assessment of model responses across all groups in  . To systematically assess safety, we used
a safety threshold  = 0.5 . A model is considered safe if its safety score exceeds this threshold, indicating
moderate robustness and fairness while avoiding polarization toward any specific sociocultural category.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Here we present the results of the initial safety assessment using base prompts from CLEAR-Bias,
followed by the adversarial analysis using jailbreak prompts, and finally the responses to the research
questions posed in Section 1.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Initial Safety Assessment</title>
          <p>Consistently with the analysis in our previous study [11], models exhibit markedly diferent behaviors
across bias categories in terms of robustness, fairness, and safety, as shown in Figure 1. Certain
bias categories show higher safety scores across diferent models, particularly religion (0.59), sexual
orientation (0.48), ethnicity (0.46), and gender (0.46). This suggests that existing alignment strategies
and dataset curation eforts may prioritize minimizing bias in particularly sensitive categories. In
contrast, intersectional bias categories demonstrate lower safety scores, such as gender-ethnicity (0.41),
gender–sexual orientation (0.35), and ethnicity–socioeconomic status (0.32), when compared to their
non-intersectional counterparts. This highlights the challenges language models face in handling
overlapping and multifaceted identities, potentially due to their more nuanced nature and limited
representation in pretraining corpora. Other categories, such as disability, socioeconomic status, and age,
remain less protected, showing the lowest safety scores of 0.23, 0.20, and 0.12, respectively.</p>
          <p>Safety
0.1 0.3 0.65 0.55 0.55 0.55 0.3 0.7 0.7 0.15
0.05 0.25 0.6 0.45 0.45 0.45 0.35 0.6 0.55 0
0.1 0.25 0.6 0.45 0.55 0.35 0.25 0.65 0.55 0.05
0.05 0.2 0.6 0.25 0.5 0.4 0.3 0.65 0.5 0.1
0.2 0.25 0.3 0.2 0.6 0.35 0.25 0.7 0.65 0.55
0 0.2 0.2 0.25 0.2 0.15 0.3 0.25 0.3 0.05
0 0.2 0.1 0.2 0.35 0.35 0.25 0.7 0.25 0.15
0.05 0.25 0.25 0.15 0.15 0.3 0.25 0.3 0.2 0.15
0 0.15 0 0.1 0.15 0.35 0.1 0.15 0.25 0.05
0.25 0.4 0.7 0.55 0.75 0.6 0.55 0.8 0.6 0.3
0.15 0.15 0.7 0.2 0.5 0.35 0.6 0.7 0.5 0.35
0.45 0.2 0.8 0.5 0.8 0.75 0.75 0.85 0.75 0.55
green shades indicate higher positive scores, while darker red ones reflects more biased behaviors.
0.64 and 0.55 across all bias categories, respectively. Other top-performing models, though below the
threshold, include GPT-4o (0.45), Phi-4 CoT (0.42), and DeepSeek V3 (0.40).</p>
          <p>0.0
0.2
0.4
0.8
Models are grouped into their respective families. The gray dotted line indicates the safety threshold  = 0.5 .</p>
          <p>These results reveal a general trend in which small-by-design models, like those from the Phi-4
family, exhibit higher safety than larger models, aligning with findings from previous literature [ 11].
Conversely, the lowest safety scores are primarily observed in the DeepSeek family, where both large
and small reasoner variants struggle to maintain safe behavior in response to bias elicitation.</p>
          <p>A significant analysis, shown in Figure 3, presents safety outcomes across diferent reasoning types
and model sizes. In particular, Figure 3a reports the mean safety scores for base models and their
CoT-prompted and reasoning-enabled counterparts. The results indicate that base models outperform
all their reasoning variants, achieving the highest safety score of 0.50. This suggests that introducing
reasoning capabilities—whether at training or inference time—can reduce safety reliability, possibly due
to increased generative freedom that may lead to spurious justifications or rationalizations. Interestingly,
reasoning-enabled models outperform CoT-prompted variants, potentially because prompt-induced
reasoning can lead to less predictable reasoning paths, which are not tuned for safe, controlled reasoning.
Specifically, reasoning-enabled models achieve a safety score of 0.40, compared to 0.33 for CoT-prompted
models. Overall, our findings highlight the potential negative impact of reasoning capabilities on model
safety, particularly in the context of bias elicitation, ofering early insights into how reasoning may
paradoxically amplify bias. This aligns with prior studies—mainly focused on CoT-prompted
models [18]—and suggests that this efect, while less pronounced, also exists in reasoning-enabled models.</p>
          <p>Further scale-related insights emerge from Figure 3b, which compares safety performance between
large and small reasoning models. The results indicate that small reasoning models (SRMs) are generally
more vulnerable to bias elicitation than large reasoning models (LRMs), with average safety scores of
0.29 for SRMs and 0.33 for LRMs. However, the wider variance among SRMs suggests inconsistent
safety performance across models, with Phi-4-reasoning emerging as the safest reasoning model and the
second-safest model overall. In contrast, the distilled small reasoning variants of DeepSeek R1—Qwen
14B (0.20) and Llama 8B (0.13)—are among the least safe models evaluated. These results suggest that
small-by-design models like Phi-4 may be more robust overall, retaining their relative strength even
when equipped with reasoning capabilities. By contrast, in the case of distilled versions of larger models,
the compression process may reduce their ability to handle nuanced or sensitive prompts efectively,
thereby compromising their safety.
DeDepeSeepeSkeeVk3VD3eCeopTSeekR1</p>
          <p>To better assess model behavior, we analyzed responses in terms of refusal, debiasing, stereotype,
and counter-stereotype rates (Figure 4). Figure 4a illustrates how models handle potentially harmful
prompts, either by refusing to respond or by producing a debiased output. The results reveal that
most models exhibit relatively low refusal rates, with the notable exception of Phi-4-reasoning, which
reaches the highest refusal rate (0.36), consistent with its previously observed high safety. In contrast,
Phi-4
CoT</p>
          <p>DeepSeek</p>
          <p>R1</p>
          <p>Phi-4
reasoning</p>
          <p>DeepSeek</p>
          <p>V3 CoT</p>
          <p>Stereotype</p>
          <p>Counter-Stereotype
Phi-4
o1-preview
debiasing is the dominant strategy for many models, especially those in the Phi-4 family without
built-in reasoning capabilities, with Phi-4 achieving the highest debiasing rate (0.520), followed by
Phi-4-CoT (0.320). This suggests that adding reasoning capabilities in Phi-4 models shifts behavior
from debiasing toward greater reliance on refusal, reflecting more cautious, safety-oriented responses
to sensitive prompts. Figure 4b compares the prevalence of stereotypical and counter-stereotypical
completions. Models from the DeepSeek family—particularly DeepSeek V3 CoT, DeepSeek Llama
8B, and DeepSeek Qwen 14B—produce stereotypical outputs at very high rates (0.81, 0.87, and 0.80,
respectively), while rarely ofering counter-stereotypical responses. DeepSeek R1 is a notable exception,
with both a relatively high stereotype rate (0.65) and the highest counter-stereotype rate (0.22). This may
reflect a reasoning-driven strategy that attempts to avoid bias by proposing counterposed narratives,
even though this approach still generalizes by introducing counter-stereotypical biases. Overall, these
trends highlight the limited efectiveness of current alignment techniques in reducing representational
harms, especially within the DeepSeek family, whose safety issues are even more pronounced in the
case of distilled models. In contrast, models from the Phi-4 and GPT families generally exhibit more
balanced behavior, characterized by lower stereotype rates—especially Phi-4, with a rate of 0.34—and
modest yet more consistent counter-stereotypical outputs.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Adversarial Analysis</title>
          <p>For all bias categories initially deemed safe (i.e.,  ≥ 0.5 ), we conducted an adversarial safety assessment
using the jailbreak prompts from CLEAR-Bias. Results in Table 1 provide key insights into the
efectiveness of diferent attack types across all models. For example, machine translation emerges as the
most efective attack overall (0.49), followed by obfuscation (0.41). Both attacks operate by rephrasing
or translating adversarial prompts into formats that are dificult for the model to reason with, such as
low-resource languages (LRLs) or encoded alphabets (e.g., Base64). In these cases, where the model is
more likely to experience uncertainty, the efects of alignment tuning become less efective, making
it more likely for safety filters to be bypassed. Refusal suppression (0.30) and prompt injection (0.23)
also show moderate efectiveness. These techniques explicitly manipulate the model’s behavior by
removing refusal triggers or appending malicious instructions to otherwise benign prompts. In contrast,
prefix injection (0.10) and reward incentive (0.10) are considerably less efective, while role playing
demonstrates slightly negative efectiveness on average (-0.03), suggesting that this attack may trigger
the model’s safeguard mechanisms, thereby reducing the likelihood of unsafe completions.</p>
          <p>Model</p>
          <p>Finally, to provide a family-wise assessment of how diferent reasoning mechanisms impact
vulnerability to adversarial elicitation, we define the Family-Level Vulnerability Dominance Rate (FL-VDR),
indicated as   () . This metric quantifies how often a reasoning type  ∈ ℛ = { Base, CoT, Reasoner}
exhibits the highest vulnerability across diferent model families  ∈ ℱ = { GPT, DeepSeek, Phi-4} for a
specific attack type  . Let ℱ () ⊆ ℱ denote the set of model families where reasoning type  has a valid
efectiveness value for attack  (i.e., the quantity  (, ) is defined). This applies to all model-attack pairs
that passed the misunderstanding filter, which excludes cases where the model’s behavior resulted from
prompt misinterpretation rather than a meaningful response to the adversarial intent. Let ℛ ⊆ ℛ be
the set of reasoning types represented in family  , with  ∈ ℛ  if a model of type  was subjected to
adversarial evaluation on at least one bias category within the  family. The FL-VDR is then defined as:
 
()
=</p>
          <p>∑ 1 ( (, )
 ∈ℱ  ()</p>
          <p>()
= max   ,  ′)</p>
          <p>′∈ℛ
|ℱ () |
(6)</p>
          <p>Here, 1(⋅)is the indicator function that equals 1 when the condition is true and 0 otherwise. The
denominator |ℱ () | ensures that   () is computed only over families where reasoning type  is represented.
Thus,   () represents the proportion of such model families in which reasoning type  exhibits the
highest vulnerability to attack  , measured by the efectiveness of that attack. It is worth noting that in
calculating this metric, the model o1-preview is used as the representative of the Reasoner category
within the GPT family since it exhibits lower average attack efectiveness compared to o3-mini.</p>
          <p>Reasoning
type (ℛ)</p>
          <p>The results in Table 2 highlight that diferent reasoning paradigms exhibit distinct vulnerabilities to
specific adversarial strategies. Notably, CoT-based models are especially prone to machine translation
and reward incentive attacks ( = 1.00 ), and also notably vulnerable to role-playing scenarios ( = 0.50 ).
Reasoner models, on the other hand, are particularly vulnerable to obfuscation and prefix injection
attacks ( = 0.67 ), as well as to refusal suppression ( = 0.67 ). Interestingly, prompt injection attacks are
most efective on base models (  = 0.67 ). Overall, base models consistently show lower vulnerability
across most attack types, suggesting that enabling reasoning—whether at training or inference time—
does not inherently improve robustness to adversarial bias elicitation and often degrades safety. This
may stem from their simpler response strategies and lack of structured reasoning, leading to more
direct and cautious completions that are less likely to over-interpret or elaborate on adversarial cues.</p>
          <p>Finally, Table 3 reports the safety evaluation results across all tested models. While two models—Phi-4
and Phi-4-reasoning—surpassed the safety threshold ( = 0.5 ) in the initial assessment, none remained
safe under adversarial analysis. Indeed, each model proved considerably susceptible to at least one
jailbreak attack, with final safety scores falling below  . This underscores that even models with the
highest baseline safety can experience substantial declines when exposed to well-crafted, bias-probing
jailbreak prompts.</p>
          <p>Initial safety
assessment
Adversarial
analysis</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Responses to Research Questions</title>
          <p>We now summarize our findings by addressing the three research questions posed in Section 1.
RQ1 How do diferent reasoning mechanisms (e.g., CoT prompting or reasoning by-design) afect robustness
to bias elicitation?</p>
          <p>Our findings reveal that both forms of reasoning—whether elicited at inference time via CoT prompting
or integrated by design in reasoning-enabled models—tend to amplify vulnerability to bias elicitation
when compared to base models. Base models, which operate without explicit reasoning mechanisms,
achieve the highest safety scores on average, indicating a stronger resistance to producing biased
or harmful content. In contrast, the introduction of reasoning, regardless of the method, generally
lowers safety performance. This suggests that reasoning as currently implemented may introduce
additional pathways for stereotype reinforcement or rationalization. These results highlight a critical
and somewhat counterintuitive insight, i.e., reasoning does not inherently improve robustness to bias
and may, in fact, worsen it.</p>
          <p>RQ2 Are reasoning models inherently safer than those relying on reasoning elicitation at inference time
via CoT prompting?</p>
          <p>Our findings indicate that reasoning-enabled models are safer than those relying on reasoning
elicitation through CoT prompting. On average, reasoning-enabled models outperform CoT-prompted
variants in safety scores, showing lower rates of stereotypical responses. While both types of reasoning
increase model complexity and may in general afect safety, CoT prompting appears more prone to
generating harmful or biased content, likely due to its reliance on prompt-induced reasoning rather
than internalized safety-aligned reasoning processes.</p>
          <p>RQ3 How does the efectiveness of diferent jailbreak attacks targeting adversarial bias elicitation vary
across reasoning mechanisms?</p>
          <p>Our findings highlight that model vulnerability is nuanced, varying with both the jailbreak strategy
and the reasoning method used. CoT-prompted models are especially vulnerable to attacks involving
low-resource languages or fictional storytelling that manipulate prompt context—framing it through
reward incentives or role-playing scenarios—which can significantly afect models relying on
promptinduced reasoning paths not optimized for safety. In contrast, reasoning-enabled models are more
susceptible to obfuscation attacks like prefix injection or refusal suppression, which bypass internal
safeguards by steering the model toward harmful outputs. This increased vulnerability likely stems from
their greater generative freedom, enabling spurious justifications or rationalizations that align with
the malicious instructions provided in the prompt. Finally, base models tend to be the least vulnerable
overall. Their simpler behavior and lack of explicit reasoning reduce the surface area for adversarial
manipulation, making them comparatively more robust against a range of jailbreak strategies.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study provides key insights into how diferent reasoning mechanisms afect robustness to bias
elicitation in language models, using the CLEAR-Bias benchmark and the adversarial methodology
proposed in [11]. Our findings show that introducing reasoning—via inference-time CoT prompting
or reasoning-enabled architectures—generally amplifies bias compared to non-reasoning base models.
While reasoning-enabled models outperform those using zero-shot CoT prompting in safety, they still
underperform base models overall. These results challenge the assumption that reasoning inherently
aids bias mitigation and underscore the need for stronger safety alignment in reasoning-enabled
language models. There remain several avenues for future work. First, model behavior may vary
with the formulation of CoT prompts, which can in turn afect safety. Second, reasoning traces can be
analyzed to further understand how models justify responses to sensitive prompts. Emerging research
suggests that models do not always “say what they think”, i.e., reasoning traces may not reflect internal
decision-making processes [28, 29]. Exploring these aspects can foster transparency and trustworthiness
of reasoning language models, which is key in safety-critical applications.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support from “PNRR MUR project PE0000013-FAIR” - CUP H23C22000860006
and “National Centre for HPC, Big Data and Quantum Computing”, CN00000013 - CUP
H23C22000360005.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.
[7] T. Q. Luong, X. Zhang, Z. Jie, P. Sun, et al., Reft: Reasoning with reinforced fine-tuning, in:</p>
      <p>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
[8] F. Xu, Q. Hao, Z. Zong, J. Wang, et al., Towards large reasoning models: A survey of reinforced
reasoning with large language models, arXiv preprint arXiv:2501.09686 (2025).
[9] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, in: Findings of
the Association for Computational Linguistics, 2022.
[10] R. Cantini, G. Cosenza, A. Orsino, D. Talia, Are large language models really bias-free? jailbreak
prompts for assessing adversarial robustness to bias elicitation, in: International Conference on
Discovery Science, 2024.
[11] R. Cantini, A. Orsino, M. Ruggiero, D. Talia, Benchmarking adversarial robustness to bias elicitation
in large language models: Scalable automated assessment with llm-as-a-judge, arXiv preprint
arXiv:2504.07887 (2025).
[12] M. Nadeem, A. Bethke, S. Reddy, Stereoset: Measuring stereotypical bias in pretrained language
models, in: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021.
[13] N. Nangia, C. Vania, R. Bhalerao, S. R. Bowman, Crows-pairs: A challenge dataset for measuring
social biases in masked language models, in: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing, 2020.
[14] T. Shen, R. Jin, Y. Huang, C. Liu, et al., Large language model alignment: A survey, arXiv preprint
arXiv:2309.15025 (2023).
[15] X. Wu, J. Nian, Z. Tao, Y. Fang, Evaluating social biases in llm reasoning, arXiv preprint
arXiv:2502.15361 (2025).
[16] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, et al., BBQ: A hand-built bias benchmark for
question answering, in: Findings of the Association for Computational Linguistics, 2022.
[17] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, et al., Large language models are zero-shot reasoners,</p>
      <p>Advances in Neural Information Processing Systems (2022).
[18] O. Shaikh, H. Zhang, W. Held, M. Bernstein, et al., On second thought, let’s not think step by
step! bias and toxicity in zero-shot reasoning, in: Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics, 2023.
[19] A. Arrieta, M. Ugarte, P. Valle, J. A. Parejo, et al., o3-mini vs deepseek-r1: Which one is safer?,
arXiv preprint arXiv:2501.18438 (2025).
[20] M. Ugarte, P. Valle, J. A. Parejo, S. Segura, et al., Astral: Automated safety testing of large language
models, arXiv preprint arXiv:2501.17132 (2025).
[21] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, et al., Judging llm-as-a-judge with mt-bench and
chatbot arena, Advances in Neural Information Processing Systems (2023).
[22] L. Zhu, X. Wang, X. Wang, Judgelm: Fine-tuned large language models are scalable judges, in:</p>
      <p>The Thirteenth International Conference on Learning Representations, 2023.
[23] S. Ranathunga, E. A. Lee, M. P. Skenduli, R. Shekhar, et al., Neural machine translation for
low-resource languages: A survey, ACM Computing Survey (2023).
[24] D. Dorn, A. Variengien, C.-R. Segerie, V. Corruble, Bells: A framework towards future proof
benchmarks for the evaluation of llm safeguards, arXiv preprint arXiv:2406.01364 (2024).
[25] A. Liu, B. Feng, B. Xue, B. Wang, et al., Deepseek-v3 technical report, arXiv preprint
arXiv:2412.19437 (2024).
[26] M. Abdin, S. Agarwal, A. Awadallah, V. Balachandran, et al., Phi-4-reasoning technical report,
arXiv preprint arXiv:2504.21318 (2025).
[27] D. Guo, D. Yang, H. Zhang, J. Song, et al., Deepseek-r1: Incentivizing reasoning capability in llms
via reinforcement learning, arXiv preprint arXiv:2501.12948 (2025).
[28] M. Turpin, J. Michael, E. Perez, S. Bowman, Language models don’t always say what they
think: Unfaithful explanations in chain-of-thought prompting, Advances in Neural Information
Processing Systems (2023).
[29] Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, et al., Reasoning models don’t always say what
they think, arXiv preprint arXiv:2505.05410 (2025).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          , et al.,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <article-title>Biases in large language models: origins, inventory, and discussion</article-title>
          ,
          <source>ACM Journal of Data and Information Quality</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prabhumoye</surname>
          </string-name>
          ,
          <article-title>Five sources of bias in natural language processing, Language and linguistics compass (</article-title>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Gallegos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Barrow</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Tanjim</surname>
          </string-name>
          , et al.,
          <article-title>Bias and fairness in large language models: A survey, Computational Linguistics (</article-title>
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>