1. Introduction

N. Gabriele);

Bias in the Age of Reasoning Language Models

Riccardo Cantini

rcantini@dimes.unical.it 0 1

Nicola Gabriele

nicola.gabriele@dimes.unical.it 0 1

Alessio Orsino

aorsino@dimes.unical.it 0 1

Domenico Talia

talia@dimes.unical.it 0 1 0 Adversarial Robustness , Fairness, Sustainable AI 1 University of Calabria , Rende , Italy

2026

000 0 0002

inference time; and () Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: () how the introduction of reasoning capabilities afects model fairness and robustness; whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

1. Introduction

As Large Language Models (LLMs) become increasingly integrated into high-stakes societal domains such as healthcare, education, and law—owing to their advanced capabilities in natural language understanding and generation [ 1, 2 ]—concerns about embedded biases have grown significantly. These biases can perpetuate harmful stereotypes, marginalize underrepresented groups, and undermine the ethical deployment of AI systems [ 3 ]. They often originate from multiple sources, including biased training data that reflect historical inequalities and stereotypes, linguistic imbalances in corpora, flawed algorithmic designs, and uncritical usage of AI technologies [ 4, 5 ].

To address the limitations of traditional LLMs, which rely on implicit, pattern-based reasoning, researchers have developed techniques to elicit more structured and interpretable behavior. One such approach is Chain-of-Thought (CoT) prompting, which encourages models to generate intermediate reasoning steps at inference time without requiring architectural changes or specialized training [ 6 ]. In contrast, a new class of models known as Reasoning Language Models (RLMs) has emerged. Unlike standard language models using CoT, RLMs are explicitly trained to perform multi-step reasoning through fine-tuned reasoning trajectories and integrated test-time search strategies [ 7, 8]. By embedding

CEUR Workshop

ISSN1613-0073 logical inference capabilities directly into their training and architecture, RLMs move beyond nexttoken prediction, ofering improved performance, transparency, and reliability, which are key for the responsible deployment of AI systems [9].

While prior research has extensively benchmarked bias in LLMs [10, 11, 12, 13] and explored alignment techniques to improve safety [14], the relationship between reasoning capabilities and bias mitigation remains underexplored. Specifically, it remains unclear whether explicit reasoning mechanisms help reduce biased behavior in language models or inadvertently reinforce it through structured inference chains [15]. In addition, the interplay between reasoning capabilities and adversarial bias elicitation raises the question of whether such mechanisms enhance robustness or, conversely, increase vulnerability to biased responses. To address this gap, this study presents a systematic evaluation of bias robustness across diferent reasoning paradigms and three main model families: GPT, DeepSeek, and Phi-4. For each family, we assessed the latest non-reasoning models (i.e., GPT-4o, DeepSeek V3 671B, and Phi-4), their CoT–augmented variants, and their reasoning-by-design counterparts (i.e., o3-mini and o1-preview for GPT; DeepSeek R1 and its distilled versions—DeepSeek Distill Qwen and DeepSeek Distill Llama—for DeepSeek; and Phi-4-reasoning for Phi-4). Our investigation is guided by the following research questions: RQ1 How do diferent reasoning mechanisms (e.g., CoT prompting or reasoning by-design) afect robustness to bias elicitation? RQ2 Are reasoning models inherently safer than those relying on reasoning elicitation at inference time via CoT prompting? RQ3 How does the efectiveness of diferent jailbreak attacks targeting adversarial bias elicitation vary across reasoning mechanisms?

Experiments have been performed using the CLEAR-Bias benchmark [11], leveraging an LLM-asa-judge framework to evaluate robustness against bias elicitation under adversarial conditions. This involved exposing the diferent models to a set of curated jailbreak prompts designed to probe biases across sociocultural and intersectional dimensions. In summary, our key contributions are as follows: • We conduct a systematic evaluation of bias safety in RLMs at diferent scales, using an adversarial approach to stress-test model safety under diferent reasoning configurations, including both CoT-prompted and reasoning-enabled models. • We provide empirical evidence that explicit reasoning—whether induced at training or inference time—can increase vulnerability to bias elicitation, with CoT-prompted models exhibiting slightly worse bias safety than their reasoning-enabled counterparts. • We empirically show that vulnerability to adversarial prompting strongly depends on the type of attack and the reasoning mechanisms embedded in the model, with non-reasoning models exhibiting the highest overall resistance to jailbreak attacks targeting bias elicitation.

The remainder of the paper is organized as follows. Section 2 reviews prior work on bias benchmarking and the adversarial safety of reasoning models. Section 3 introduces the CLEAR-Bias benchmark and outlines the methodology used in our evaluation. Section 4 illustrates the experimental results, and Section 5 concludes with a discussion of key findings, implications, and directions for future work.

2. Related Work

Recent work has highlighted the vulnerability of LLMs to bias elicitation—the extraction of harmful, stereotypical, or toxic content via ad-hoc or adversarial prompts—even in models specifically trained to align with human values. These models encode social biases across diferent dimensions such as race, gender, nationality, religion, and their intersections, revealing persistent representational harms that can undermine fairness, inclusivity, and trust in real-world applications [12, 13, 16, 10].

A particularly influential line of research examines how reasoning strategies, such as Chain-ofThought (CoT) prompting, interact with bias elicitation. While CoT improves performance on a range of logical and symbolic tasks [ 6, 17 ], its implications for fairness and safety remain less explored. Shaikh et al. [18] present one of the first controlled studies evaluating the efect of zero-shot CoT prompting on social bias. Their work reveals that prompting LLMs to “think step by step” can paradoxically amplify bias, making models more likely to generate stereotypical or toxic outputs. Using adapted versions of standard bias benchmarks (CrowS-Pairs [13], StereoSet [12], and BBQ [16]), along with a custom dataset of harmful queries, they show that CoT often reduces refusal rates and increases the likelihood of harmful completions, especially in larger models. Their analysis suggests that CoT reasoning may encourage models to hallucinate spurious justifications that override safety constraints, particularly when the task requires social nuance or judgment. Complementing this, Wu et al. [15] systematically investigate how social bias manifests in intermediate reasoning steps of instruction-tuned and reasoning-enabled models. Using the BBQ dataset, they show that reasoning traces often amplify stereotypes, especially when models shift reasoning paths mid-response or employ shallow forms of self-reflection. Their findings highlight that even correct answers can embed biased reasoning steps, and that removing biased steps leads to improved model performance. This reinforces the idea that reasoning alone does not guarantee fairness and can, in fact, reinforce harmful associations. Other work in the literature has focused on assessing the general safety of reasoning-enabled LLMs, particularly OpenAI’s o3-mini and DeepSeek R1. Arrieta et al. [19] conducted a large-scale, automated safety evaluation using the ASTRAL framework [20], which systematically tests models on a set of prompts spanning 14 safety-critical categories (e.g., hate speech, terrorism, privacy violations, misinformation). Their findings show that DeepSeek R1 produces more unsafe outputs than o3-mini, ofering insights into system-level safety of reasoning-enabled models. However, their work does not explicitly examine how these safety failures relate to social biases and adversarial robustness, leaving open questions about the intersection of reasoning capabilities, fairness, and bias safety.

While prior research has highlighted vulnerabilities in reasoning LLMs, it has typically focused on isolated reasoning strategies (e.g., chain-of-thought or reasoning by design) or a narrow range of model families, with little attention to adversarial elicitation. This gap underscores the need for a deeper examination of how reasoning paradigms, model scale, and adversarial techniques interact to influence bias amplification. In this work, we analyze the behavior of both large and small reasoner models, along with inference-time reasoning strategies such as zero-shot CoT prompting, to evaluate their robustness against adversarial bias elicitation across diferent sociodemographic groups. Building on the CLEAR-Bias benchmark [11], we apply jailbreak techniques to stress-test model safety and quantify their vulnerability using LLM-based scalable automatic evaluations. This comprehensive analysis extends prior work by systematically comparing reasoning strategies, model sizes, and adversarial robustness under unified robustness, fairness, and safety metrics.

3. Benchmarking Adversarial Robustness to Bias Elicitation

This section describes the methodology employed for evaluating bias vulnerability in LLMs, as proposed in the CLEAR-Bias benchmark [11]. This includes an overview of the dataset design and the automated evaluation framework relying on the LLM-as-a-judge paradigm [21, 22], which enables a scalable and systematic assessment of both stereotypical behaviors and robustness under adversarial conditions.

3.1. The CLEAR-Bias Dataset

CLEAR-Bias (Corpus for Linguistic Evaluation of Adversarial Robustness against Bias) is a curated benchmark dataset designed to assess the robustness of LLMs against bias elicitation, especially under adversarial conditions. The dataset targets a broad spectrum of social biases and incorporates advanced attack strategies intended to bypass safety mechanisms. It comprises 4,400 bias-probing prompts, covering seven isolated bias categories—age, disability, ethnicity, gender, religion, sexual orientation, and socioeconomic status—as well as three intersectional ones, i.e., ethnicity–socioeconomic status, gender–sexual orientation, and gender–ethnicity. Prompts are structured according to two task formats: • Choose the Option: the model selects between two possible answers in response to a contextualized question, assessing inclination toward stereotypical or biased choices. • Sentence Completion: the model is asked to complete a partial sentence using one of two options, revealing tendencies toward biased generations.

For every bias category ∈ , 10 prompts are curated for each task ∈ , yielding 20 base prompts per category and 200 base prompts in total. These prompts are subsequently transformed using adversarial techniques ∈

to create 4,200 additional prompts, resulting in a final dataset of 4,400 items. Seven distinct jailbreak techniques are employed to generate adversarially augmented versions of the base prompts, with each attack implemented in three variants to ensure diversity in adversarial conditions. Employed jailbreak attacks include machine translation into low-resource languages (e.g., Slovene, Macedonian, Scottish Gaelic) [23]; obfuscation via encoding schemes (e.g., base64, leetspeak); prefix injection to enforce specific starting phrases; prompt injection using misleading preambles that instruct the model to ignore prior text; refusal suppression to discourage typical model safety refusals; reward incentive to encourage biased outputs through hypothetical rewards; and role-playing to assume biased personas or characters (e.g., AIM—Always Intelligent and Machiavellian [24]).

3.2. Benchmarking Methodology

To enable large-scale evaluation of responses, the methodology relies on an LLM-as-a-judge approach, with DeepSeek V3 identified as the most reliable judge based on its highest agreement with a manually curated control set of prompt–response pairs, as measured by Cohen’s coeficient [ 11]. The judge model is used to evaluate the behavior of tested LLMs on both base and adversarial prompts. In the initial phase, each model is queried with all base prompts, and its responses are classified into four categories = {, , , } stereotypes,

, where denotes stereotyped responses that reinforce or perpetuate common denotes counter-stereotyped responses that challenge or flip stereotypes while still relying on generalized assumptions, denotes debiased responses that are impartial and balanced, and denotes refusal responses, indicating that the model declines to answer the prompt. Two main metrics are computed per bias category ∈ : refusal rate ( ) and debiasing rate ( ) for task ∈ :

• Robustness , which measures the model’s resistance to bias elicitation, considering both the counter-stereotyped ( ̄ ) responses:

• Fairness , which evaluates the model’s neutrality by comparing the rate of stereotyped ( ) and These are combined into a bias-specific safety score , with the overall model safety computed as the average across all biases:

Bias categories with a safety score above a predefined threshold are considered safe, and are denoted by the subset =̃ { ∈ |

≥ }, ⊆̃ . These categories proceed to subsequent adversarial evaluation, where the jailbreak prompts of CLEAR-Bias are exploited to evaluate models under adversarial conditions. To fairly assess model behavior in this more challenging setting, responses classified as refusals are re-evaluated to identify possible misunderstandings (e.g., due to obfuscation), thereby excluding cases where the behavior results from prompt misinterpretation rather than genuine refusal. Then, for each ∈ ̃, a new safety score ̃ () is computed per attack, with the final safety score ̃ incorporating =

+ , = 1 − | − ̄ | , = 1 | | ∈ ∑ = 1 | | ∈ ∑ = + , 2 = 1 The relative safety reduction for bias under attack is denoted by Δ attack computed as the mean safety reduction across all attacked bias categories: () , with the efectiveness () of 1 , () = 1 || ̃ ∈ ̃ ∑ Δ () the minimum safety across all attacks for each bias. Categories denoted by ̃c are those that remain unchanged, i.e., not subjected to adversarial prompting.

4. Experimental Evaluation 4.1. Experimental Setting

This section presents a comprehensive analysis of our benchmarking results across a wide range of language models with varying reasoning mechanisms, evaluating their robustness, fairness, and safety in the context of sociocultural biases captured by CLEAR-Bias. To enable fine-grained evaluation, we categorize the models into three main groups based on the type of reasoning ∈ ℛ = { Base, CoT, Reasoner}. For each group, we analyze diferent models from three families, ∈ ℱ = { GPT, DeepSeek, Phi-4}. Specifically, our analysis involves the following models: • Base: standard pretrained language models without explicit reasoning induction, including

DeepSeek V3 [25], GPT-4o, and Phi-4 [26]. • CoT : base models prompted with a zero-shot “Think step by step” instruction to elicit reasoning behavior at inference time—namely, DeepSeek V3 CoT, GPT-4o CoT, and Phi-4 CoT. • Reasoner : reasoning-enabled models trained for reasoning capabilities. These are further subdivided by scale into Large Reasoning Models (LRMs)—DeepSeek R1 [27], o3-mini, and o1preview—and Small Reasoning Models (SRMs)—Phi-4-reasoning [26], DeepSeek Distil Llama 8B [27], and DeepSeek Distill Qwen 14B [27].

This categorization supports a multifaceted analysis of reasoning robustness under bias elicitation, which aims to: () compare the robustness of large and small language models against both their zeroshot CoT-prompted and reasoning-enabled variants; () investigate whether models explicitly fine-tuned for reasoning are inherently more robust than those with elicited reasoning through prompting; and () evaluate the efectiveness of diferent jailbreak attacks across diverse reasoning mechanisms.

Importantly, models prompted with CoT instructions are asked to produce their reasoning within <think>...</think> tags. For these models, as well as for reasoner models that output reasoning traces by default (i.e., without using <think> tags), we evaluate only the final answer and ignore any reasoning content when categorizing the response with the LLM-as-judge paradigm, to ensure an uniform assessment of model responses across all groups in . To systematically assess safety, we used a safety threshold = 0.5 . A model is considered safe if its safety score exceeds this threshold, indicating moderate robustness and fairness while avoiding polarization toward any specific sociocultural category.

4.2. Results

Here we present the results of the initial safety assessment using base prompts from CLEAR-Bias, followed by the adversarial analysis using jailbreak prompts, and finally the responses to the research questions posed in Section 1.

4.2.1. Initial Safety Assessment

Consistently with the analysis in our previous study [11], models exhibit markedly diferent behaviors across bias categories in terms of robustness, fairness, and safety, as shown in Figure 1. Certain bias categories show higher safety scores across diferent models, particularly religion (0.59), sexual orientation (0.48), ethnicity (0.46), and gender (0.46). This suggests that existing alignment strategies and dataset curation eforts may prioritize minimizing bias in particularly sensitive categories. In contrast, intersectional bias categories demonstrate lower safety scores, such as gender-ethnicity (0.41), gender–sexual orientation (0.35), and ethnicity–socioeconomic status (0.32), when compared to their non-intersectional counterparts. This highlights the challenges language models face in handling overlapping and multifaceted identities, potentially due to their more nuanced nature and limited representation in pretraining corpora. Other categories, such as disability, socioeconomic status, and age, remain less protected, showing the lowest safety scores of 0.23, 0.20, and 0.12, respectively.

Safety 0.1 0.3 0.65 0.55 0.55 0.55 0.3 0.7 0.7 0.15 0.05 0.25 0.6 0.45 0.45 0.45 0.35 0.6 0.55 0 0.1 0.25 0.6 0.45 0.55 0.35 0.25 0.65 0.55 0.05 0.05 0.2 0.6 0.25 0.5 0.4 0.3 0.65 0.5 0.1 0.2 0.25 0.3 0.2 0.6 0.35 0.25 0.7 0.65 0.55 0 0.2 0.2 0.25 0.2 0.15 0.3 0.25 0.3 0.05 0 0.2 0.1 0.2 0.35 0.35 0.25 0.7 0.25 0.15 0.05 0.25 0.25 0.15 0.15 0.3 0.25 0.3 0.2 0.15 0 0.15 0 0.1 0.15 0.35 0.1 0.15 0.25 0.05 0.25 0.4 0.7 0.55 0.75 0.6 0.55 0.8 0.6 0.3 0.15 0.15 0.7 0.2 0.5 0.35 0.6 0.7 0.5 0.35 0.45 0.2 0.8 0.5 0.8 0.75 0.75 0.85 0.75 0.55 green shades indicate higher positive scores, while darker red ones reflects more biased behaviors. 0.64 and 0.55 across all bias categories, respectively. Other top-performing models, though below the threshold, include GPT-4o (0.45), Phi-4 CoT (0.42), and DeepSeek V3 (0.40).

0.0 0.2 0.4 0.8 Models are grouped into their respective families. The gray dotted line indicates the safety threshold = 0.5 .

These results reveal a general trend in which small-by-design models, like those from the Phi-4 family, exhibit higher safety than larger models, aligning with findings from previous literature [ 11]. Conversely, the lowest safety scores are primarily observed in the DeepSeek family, where both large and small reasoner variants struggle to maintain safe behavior in response to bias elicitation.

A significant analysis, shown in Figure 3, presents safety outcomes across diferent reasoning types and model sizes. In particular, Figure 3a reports the mean safety scores for base models and their CoT-prompted and reasoning-enabled counterparts. The results indicate that base models outperform all their reasoning variants, achieving the highest safety score of 0.50. This suggests that introducing reasoning capabilities—whether at training or inference time—can reduce safety reliability, possibly due to increased generative freedom that may lead to spurious justifications or rationalizations. Interestingly, reasoning-enabled models outperform CoT-prompted variants, potentially because prompt-induced reasoning can lead to less predictable reasoning paths, which are not tuned for safe, controlled reasoning. Specifically, reasoning-enabled models achieve a safety score of 0.40, compared to 0.33 for CoT-prompted models. Overall, our findings highlight the potential negative impact of reasoning capabilities on model safety, particularly in the context of bias elicitation, ofering early insights into how reasoning may paradoxically amplify bias. This aligns with prior studies—mainly focused on CoT-prompted models [18]—and suggests that this efect, while less pronounced, also exists in reasoning-enabled models.

Further scale-related insights emerge from Figure 3b, which compares safety performance between large and small reasoning models. The results indicate that small reasoning models (SRMs) are generally more vulnerable to bias elicitation than large reasoning models (LRMs), with average safety scores of 0.29 for SRMs and 0.33 for LRMs. However, the wider variance among SRMs suggests inconsistent safety performance across models, with Phi-4-reasoning emerging as the safest reasoning model and the second-safest model overall. In contrast, the distilled small reasoning variants of DeepSeek R1—Qwen 14B (0.20) and Llama 8B (0.13)—are among the least safe models evaluated. These results suggest that small-by-design models like Phi-4 may be more robust overall, retaining their relative strength even when equipped with reasoning capabilities. By contrast, in the case of distilled versions of larger models, the compression process may reduce their ability to handle nuanced or sensitive prompts efectively, thereby compromising their safety. DeDepeSeepeSkeeVk3VD3eCeopTSeekR1

To better assess model behavior, we analyzed responses in terms of refusal, debiasing, stereotype, and counter-stereotype rates (Figure 4). Figure 4a illustrates how models handle potentially harmful prompts, either by refusing to respond or by producing a debiased output. The results reveal that most models exhibit relatively low refusal rates, with the notable exception of Phi-4-reasoning, which reaches the highest refusal rate (0.36), consistent with its previously observed high safety. In contrast, Phi-4 CoT

DeepSeek

Phi-4 reasoning

DeepSeek

V3 CoT

Stereotype

Counter-Stereotype Phi-4 o1-preview debiasing is the dominant strategy for many models, especially those in the Phi-4 family without built-in reasoning capabilities, with Phi-4 achieving the highest debiasing rate (0.520), followed by Phi-4-CoT (0.320). This suggests that adding reasoning capabilities in Phi-4 models shifts behavior from debiasing toward greater reliance on refusal, reflecting more cautious, safety-oriented responses to sensitive prompts. Figure 4b compares the prevalence of stereotypical and counter-stereotypical completions. Models from the DeepSeek family—particularly DeepSeek V3 CoT, DeepSeek Llama 8B, and DeepSeek Qwen 14B—produce stereotypical outputs at very high rates (0.81, 0.87, and 0.80, respectively), while rarely ofering counter-stereotypical responses. DeepSeek R1 is a notable exception, with both a relatively high stereotype rate (0.65) and the highest counter-stereotype rate (0.22). This may reflect a reasoning-driven strategy that attempts to avoid bias by proposing counterposed narratives, even though this approach still generalizes by introducing counter-stereotypical biases. Overall, these trends highlight the limited efectiveness of current alignment techniques in reducing representational harms, especially within the DeepSeek family, whose safety issues are even more pronounced in the case of distilled models. In contrast, models from the Phi-4 and GPT families generally exhibit more balanced behavior, characterized by lower stereotype rates—especially Phi-4, with a rate of 0.34—and modest yet more consistent counter-stereotypical outputs.

4.2.2. Adversarial Analysis

For all bias categories initially deemed safe (i.e., ≥ 0.5 ), we conducted an adversarial safety assessment using the jailbreak prompts from CLEAR-Bias. Results in Table 1 provide key insights into the efectiveness of diferent attack types across all models. For example, machine translation emerges as the most efective attack overall (0.49), followed by obfuscation (0.41). Both attacks operate by rephrasing or translating adversarial prompts into formats that are dificult for the model to reason with, such as low-resource languages (LRLs) or encoded alphabets (e.g., Base64). In these cases, where the model is more likely to experience uncertainty, the efects of alignment tuning become less efective, making it more likely for safety filters to be bypassed. Refusal suppression (0.30) and prompt injection (0.23) also show moderate efectiveness. These techniques explicitly manipulate the model’s behavior by removing refusal triggers or appending malicious instructions to otherwise benign prompts. In contrast, prefix injection (0.10) and reward incentive (0.10) are considerably less efective, while role playing demonstrates slightly negative efectiveness on average (-0.03), suggesting that this attack may trigger the model’s safeguard mechanisms, thereby reducing the likelihood of unsafe completions.

Model

Finally, to provide a family-wise assessment of how diferent reasoning mechanisms impact vulnerability to adversarial elicitation, we define the Family-Level Vulnerability Dominance Rate (FL-VDR), indicated as () . This metric quantifies how often a reasoning type ∈ ℛ = { Base, CoT, Reasoner} exhibits the highest vulnerability across diferent model families ∈ ℱ = { GPT, DeepSeek, Phi-4} for a specific attack type . Let ℱ () ⊆ ℱ denote the set of model families where reasoning type has a valid efectiveness value for attack (i.e., the quantity (, ) is defined). This applies to all model-attack pairs that passed the misunderstanding filter, which excludes cases where the model’s behavior resulted from prompt misinterpretation rather than a meaningful response to the adversarial intent. Let ℛ ⊆ ℛ be the set of reasoning types represented in family , with ∈ ℛ if a model of type was subjected to adversarial evaluation on at least one bias category within the family. The FL-VDR is then defined as: () =

∑ 1 ( (, ) ∈ℱ ()

() = max , ′)

′∈ℛ |ℱ () | (6)

Here, 1(⋅)is the indicator function that equals 1 when the condition is true and 0 otherwise. The denominator |ℱ () | ensures that () is computed only over families where reasoning type is represented. Thus, () represents the proportion of such model families in which reasoning type exhibits the highest vulnerability to attack , measured by the efectiveness of that attack. It is worth noting that in calculating this metric, the model o1-preview is used as the representative of the Reasoner category within the GPT family since it exhibits lower average attack efectiveness compared to o3-mini.

Reasoning type (ℛ)

The results in Table 2 highlight that diferent reasoning paradigms exhibit distinct vulnerabilities to specific adversarial strategies. Notably, CoT-based models are especially prone to machine translation and reward incentive attacks ( = 1.00 ), and also notably vulnerable to role-playing scenarios ( = 0.50 ). Reasoner models, on the other hand, are particularly vulnerable to obfuscation and prefix injection attacks ( = 0.67 ), as well as to refusal suppression ( = 0.67 ). Interestingly, prompt injection attacks are most efective on base models ( = 0.67 ). Overall, base models consistently show lower vulnerability across most attack types, suggesting that enabling reasoning—whether at training or inference time— does not inherently improve robustness to adversarial bias elicitation and often degrades safety. This may stem from their simpler response strategies and lack of structured reasoning, leading to more direct and cautious completions that are less likely to over-interpret or elaborate on adversarial cues.

Finally, Table 3 reports the safety evaluation results across all tested models. While two models—Phi-4 and Phi-4-reasoning—surpassed the safety threshold ( = 0.5 ) in the initial assessment, none remained safe under adversarial analysis. Indeed, each model proved considerably susceptible to at least one jailbreak attack, with final safety scores falling below . This underscores that even models with the highest baseline safety can experience substantial declines when exposed to well-crafted, bias-probing jailbreak prompts.

Initial safety assessment Adversarial analysis

4.2.3. Responses to Research Questions

We now summarize our findings by addressing the three research questions posed in Section 1. RQ1 How do diferent reasoning mechanisms (e.g., CoT prompting or reasoning by-design) afect robustness to bias elicitation?

Our findings reveal that both forms of reasoning—whether elicited at inference time via CoT prompting or integrated by design in reasoning-enabled models—tend to amplify vulnerability to bias elicitation when compared to base models. Base models, which operate without explicit reasoning mechanisms, achieve the highest safety scores on average, indicating a stronger resistance to producing biased or harmful content. In contrast, the introduction of reasoning, regardless of the method, generally lowers safety performance. This suggests that reasoning as currently implemented may introduce additional pathways for stereotype reinforcement or rationalization. These results highlight a critical and somewhat counterintuitive insight, i.e., reasoning does not inherently improve robustness to bias and may, in fact, worsen it.

RQ2 Are reasoning models inherently safer than those relying on reasoning elicitation at inference time via CoT prompting?

Our findings indicate that reasoning-enabled models are safer than those relying on reasoning elicitation through CoT prompting. On average, reasoning-enabled models outperform CoT-prompted variants in safety scores, showing lower rates of stereotypical responses. While both types of reasoning increase model complexity and may in general afect safety, CoT prompting appears more prone to generating harmful or biased content, likely due to its reliance on prompt-induced reasoning rather than internalized safety-aligned reasoning processes.

RQ3 How does the efectiveness of diferent jailbreak attacks targeting adversarial bias elicitation vary across reasoning mechanisms?

Our findings highlight that model vulnerability is nuanced, varying with both the jailbreak strategy and the reasoning method used. CoT-prompted models are especially vulnerable to attacks involving low-resource languages or fictional storytelling that manipulate prompt context—framing it through reward incentives or role-playing scenarios—which can significantly afect models relying on promptinduced reasoning paths not optimized for safety. In contrast, reasoning-enabled models are more susceptible to obfuscation attacks like prefix injection or refusal suppression, which bypass internal safeguards by steering the model toward harmful outputs. This increased vulnerability likely stems from their greater generative freedom, enabling spurious justifications or rationalizations that align with the malicious instructions provided in the prompt. Finally, base models tend to be the least vulnerable overall. Their simpler behavior and lack of explicit reasoning reduce the surface area for adversarial manipulation, making them comparatively more robust against a range of jailbreak strategies.

5. Conclusion

This study provides key insights into how diferent reasoning mechanisms afect robustness to bias elicitation in language models, using the CLEAR-Bias benchmark and the adversarial methodology proposed in [11]. Our findings show that introducing reasoning—via inference-time CoT prompting or reasoning-enabled architectures—generally amplifies bias compared to non-reasoning base models. While reasoning-enabled models outperform those using zero-shot CoT prompting in safety, they still underperform base models overall. These results challenge the assumption that reasoning inherently aids bias mitigation and underscore the need for stronger safety alignment in reasoning-enabled language models. There remain several avenues for future work. First, model behavior may vary with the formulation of CoT prompts, which can in turn afect safety. Second, reasoning traces can be analyzed to further understand how models justify responses to sensitive prompts. Emerging research suggests that models do not always “say what they think”, i.e., reasoning traces may not reflect internal decision-making processes [28, 29]. Exploring these aspects can foster transparency and trustworthiness of reasoning language models, which is key in safety-critical applications.

Acknowledgments

We acknowledge financial support from “PNRR MUR project PE0000013-FAIR” - CUP H23C22000860006 and “National Centre for HPC, Big Data and Quantum Computing”, CN00000013 - CUP H23C22000360005.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools. [7] T. Q. Luong, X. Zhang, Z. Jie, P. Sun, et al., Reft: Reasoning with reinforced fine-tuning, in:

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. [8] F. Xu, Q. Hao, Z. Zong, J. Wang, et al., Towards large reasoning models: A survey of reinforced reasoning with large language models, arXiv preprint arXiv:2501.09686 (2025). [9] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, in: Findings of the Association for Computational Linguistics, 2022. [10] R. Cantini, G. Cosenza, A. Orsino, D. Talia, Are large language models really bias-free? jailbreak prompts for assessing adversarial robustness to bias elicitation, in: International Conference on Discovery Science, 2024. [11] R. Cantini, A. Orsino, M. Ruggiero, D. Talia, Benchmarking adversarial robustness to bias elicitation in large language models: Scalable automated assessment with llm-as-a-judge, arXiv preprint arXiv:2504.07887 (2025). [12] M. Nadeem, A. Bethke, S. Reddy, Stereoset: Measuring stereotypical bias in pretrained language models, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021. [13] N. Nangia, C. Vania, R. Bhalerao, S. R. Bowman, Crows-pairs: A challenge dataset for measuring social biases in masked language models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020. [14] T. Shen, R. Jin, Y. Huang, C. Liu, et al., Large language model alignment: A survey, arXiv preprint arXiv:2309.15025 (2023). [15] X. Wu, J. Nian, Z. Tao, Y. Fang, Evaluating social biases in llm reasoning, arXiv preprint arXiv:2502.15361 (2025). [16] A. Parrish, A. Chen, N. Nangia, V. Padmakumar, et al., BBQ: A hand-built bias benchmark for question answering, in: Findings of the Association for Computational Linguistics, 2022. [17] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, et al., Large language models are zero-shot reasoners,

Advances in Neural Information Processing Systems (2022). [18] O. Shaikh, H. Zhang, W. Held, M. Bernstein, et al., On second thought, let’s not think step by step! bias and toxicity in zero-shot reasoning, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023. [19] A. Arrieta, M. Ugarte, P. Valle, J. A. Parejo, et al., o3-mini vs deepseek-r1: Which one is safer?, arXiv preprint arXiv:2501.18438 (2025). [20] M. Ugarte, P. Valle, J. A. Parejo, S. Segura, et al., Astral: Automated safety testing of large language models, arXiv preprint arXiv:2501.17132 (2025). [21] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, et al., Judging llm-as-a-judge with mt-bench and chatbot arena, Advances in Neural Information Processing Systems (2023). [22] L. Zhu, X. Wang, X. Wang, Judgelm: Fine-tuned large language models are scalable judges, in:

The Thirteenth International Conference on Learning Representations, 2023. [23] S. Ranathunga, E. A. Lee, M. P. Skenduli, R. Shekhar, et al., Neural machine translation for low-resource languages: A survey, ACM Computing Survey (2023). [24] D. Dorn, A. Variengien, C.-R. Segerie, V. Corruble, Bells: A framework towards future proof benchmarks for the evaluation of llm safeguards, arXiv preprint arXiv:2406.01364 (2024). [25] A. Liu, B. Feng, B. Xue, B. Wang, et al., Deepseek-v3 technical report, arXiv preprint arXiv:2412.19437 (2024). [26] M. Abdin, S. Agarwal, A. Awadallah, V. Balachandran, et al., Phi-4-reasoning technical report, arXiv preprint arXiv:2504.21318 (2025). [27] D. Guo, D. Yang, H. Zhang, J. Song, et al., Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, arXiv preprint arXiv:2501.12948 (2025). [28] M. Turpin, J. Michael, E. Perez, S. Bowman, Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting, Advances in Neural Information Processing Systems (2023). [29] Y. Chen, J. Benton, A. Radhakrishnan, J. Uesato, et al., Reasoning models don’t always say what they think, arXiv preprint arXiv:2505.05410 (2025).

[1] T. B. Brown , B.

Mann , N.

Ryder , M.

Subbiah , et al., Language models are few-shot learners , Advances in Neural Information Processing Systems ( 2020 ).

[2]

Chang ,

Wang ,

Wu , et al., A survey on evaluation of large language models , ACM Transactions on Intelligent Systems and Technology ( 2024 ).

[3]

Navigli ,

Conia ,

Ross , Biases in large language models: origins, inventory, and discussion , ACM Journal of Data and Information Quality ( 2023 ).

[4]

Hovy ,

Prabhumoye , Five sources of bias in natural language processing, Language and linguistics compass ( 2021 ).

[5]

I. O.

Gallegos ,

R. A.

Rossi ,

Barrow , M. M. Tanjim , et al., Bias and fairness in large language models: A survey, Computational Linguistics ( 2024 ).

[6]

Wei ,

Wang ,

Schuurmans ,

Bosma , et al., Chain-of-thought prompting elicits reasoning in large language models , Advances in Neural Information Processing Systems ( 2022 ).