<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tiziano Labruna</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Gallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Da San Martino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR - ISTI</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn fn-type="corresp"><p>Corresponding author.</p></fn>
        <fn fn-type="equal"><p>These authors contributed equally.</p></fn>
        <fn fn-type="other"><p>tlabruna@unipd.it (T. Labruna); simone.gallo@isti.cnr.it (S. Gallo); dasan@math.unipd.it (G. D. S. Martino)</p></fn>
      </author-notes>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models (LLMs) under varying degrees of answer uncertainty. We adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT question pairs with unequal human-assigned quality scores, and (2) Winning Arguments, where models predict the more persuasive argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order of the “correct” (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness (PF) and Position Consistency (PC). We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially as it becomes harder to determine which option is correct.</p>
      </abstract>
      <kwd-group>
        <kwd>Positional bias</kwd>
        <kwd>question answering</kwd>
        <kwd>large language models</kwd>
        <kwd>answer ordering</kwd>
        <kwd>binary choice evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Additionally, we identified two datasets that involve subjective judgments or nuanced quality comparisons. The first is WebGPT [11], which provides human-rated preferences between pairs of model-generated answers to the same question. The second is Winning Arguments [12], featuring pairs of Reddit r/ChangeMyView responses to a single post, where only one reply earned a “delta” for being deemed more persuasive.</p>
      <p>Across these datasets, we measure positional bias using two complementary metrics (Preference Fairness and Position Consistency) that capture whether and how often a model's decision changes when the order of the candidate answers is swapped. Through the experiments we uncover a clear pattern: positional bias is negligible when uncertainty is low but grows exponentially as uncertainty increases. Moreover, we find that this effect is especially pronounced in tasks requiring subjective judgment, where models frequently default to order-based heuristics in the absence of unambiguous signals.</p>
      <p>The remainder of the paper is organized as follows: Section 2 reviews related work on position bias and fairness in NLP. Section 3 details our dataset construction, experimental protocols, and bias metrics. Section 4 presents quantitative results and discusses the significance of the outcomes. Finally, Section 5 concludes and outlines future directions.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>The growing adoption of Large Language Models (LLMs) in both generation and evaluation tasks has brought increased scrutiny to their fairness, especially in contexts involving binary or pairwise decisions. A prominent concern is positional bias: a systematic preference for one response over another based solely on its position in the prompt, irrespective of content. Our work builds on, and differentiates itself from, a body of literature that has examined this phenomenon under various evaluation and reasoning paradigms.</p>
      <p>The study by Shi et al. [13] offers the most comprehensive exploration of positional bias in LLM-based pairwise evaluation. They introduce three core metrics, Positional Fairness (PF), Positional Consistency (PC), and Repetitional Consistency (RC), to systematically assess how the order of candidate responses affects judgement outcomes. Notably, they find that while most models exhibit high repetitional consistency, i.e., deterministic outputs across repeated trials, positional fairness and consistency vary widely across tasks and models. Their findings demonstrate that positional bias becomes especially pronounced when comparing responses of near-equal quality, an observation that directly informs our own approach of varying answer uncertainty to modulate the ambiguity of binary choices.</p>
      <p>While Shi et al. focus primarily on models acting as evaluators, Wang et al. [14] provide compelling evidence of position-sensitive scoring even in ostensibly objective comparisons. They show that GPT-4 tends to favour the first answer while GPT-3.5 leans toward the second, irrespective of prompt instruction, as also highlighted by similar studies [4]. Their proposed mitigation strategies, including Balanced Position Calibration (BPC) and Multiple Evidence Calibration (MEC), highlight the importance of structural prompt design in mitigating these biases. Our study similarly adopts systematic answer reordering, but unlike Wang et al., we extend the analysis to task formats beyond pairwise model evaluation, such as QA under uncertainty.</p>
      <p>Other work, such as [15], shifts the lens toward multi-option multiple choice settings. The authors distinguish between token bias, a preference for specific answer IDs like “A” or “B”, and position bias, a preference for answers based on ordinal position. Their central claim is that token bias, not positional bias, is the primary cause of inconsistencies in MCQ tasks, and they propose PriDe, a debiasing method based on prior estimation. While they conclude that positional bias is secondary and often overestimated, our findings suggest that under heightened uncertainty, position bias becomes marked, particularly when correct answers are ambiguous and out-of-context.</p>
      <p>The PORTIA framework proposed by another recent study [16] presents an architectural solution to reduce positional dependency by restructuring the input through segmental alignment. Although PORTIA is designed for evaluator settings, its contribution lies in demonstrating that careful content interleaving can dampen reliance on positional heuristics. While our methodology does not employ PORTIA-like restructuring, it shares a core intuition: positional effects intensify when content cues are weak or ill-formed, a condition we explicitly engineer through dataset manipulation.</p>
      <p>The CALM framework [17] offers a general-purpose protocol for quantifying a wide range of biases in LLM-as-a-judge settings. Its automated perturbation method, swapping candidate positions to detect volatility in outcomes, serves as a direct methodological precedent for the Position Consistency metric. Moreover, CALM's observation that positional bias scales with the number of response options aligns with our finding that bias intensifies when answer certainty decreases.</p>
      <p>In contrast to all aforementioned works, our study offers a novel synthesis of two research trajectories: binary positional evaluation under uncertainty and large-scale QA-based benchmarking. By systematically controlling for answer ambiguity across datasets derived from SQuAD-it, WebGPT, and Reddit's r/ChangeMyView (Winning Arguments dataset), we demonstrate that positional bias is not merely an artefact of model prompt formatting or answer labelling conventions. Rather, it reflects a deeper tendency of LLMs to resolve ambiguity through positional priors, a phenomenon that expands the scope of prior observations made in evaluation-only contexts. Furthermore, our work empirically substantiates the claim that positional bias is conditional, not fixed, and emerges as a second-order inference strategy when primary cues are degraded.</p>
      <p>In sum, our contribution lies in bridging the gap between diagnostic evaluator studies and answer-generation tasks, showing that positional bias is neither an isolated nor a negligible phenomenon, but one that is sensitive to context, task framing, and content quality. This dual framing broadens the understanding of bias in LLMs and calls for future work on uncertainty-aware prompt and dataset design.</p>
    </sec>
    <sec id="sec-method">
      <title>3. Methodology</title>
      <p>In this section, we describe the construction of our positional bias benchmarks, the experimental protocol for prompting and evaluation, the set of language models under investigation, and the metrics used to quantify positional bias. Figure 1 shows a visual summary of the methodology.</p>
      <sec id="sec-method-1">
        <title>3.1. Datasets</title>
        <p>To systematically investigate positional bias under varying levels of uncertainty, we constructed a new benchmark suite, SQuAD-it-2, derived from the Italian SQuAD-it dataset [10] and spanning three uncertainty conditions: Low, Medium, and High. In addition, we employed two existing datasets, WebGPT and Winning Arguments, which capture human preference in more subjective decision-making contexts.</p>
        <p>Each dataset is structured around binary-choice instances, represented either as quadruples $(c, q, a_1, a_2)$ or triples $(q, a_1, a_2)$, where $c$ is an optional context, $q$ is a question or prompt, and $(a_1, a_2)$ are candidate answers. One answer is designated as the preferred choice, while the other serves as a distractor.</p>
        <p>SQuAD-it-2 Low Uncertainty. This setting builds upon the SQuAD-it dataset [10], a semi-automatic Italian translation of the original English SQuAD dataset [18]. Each sample in SQuAD-it is structured as a triple $(c, q, a_{\mathrm{corr}})$, where $q$ is a question, $c$ is a supporting context passage, and $a_{\mathrm{corr}}$ is the correct answer, which is always explicitly contained in the context.</p>
        <p>However, for our study on positional bias in binary-choice settings, we needed pairs of answer candidates: one correct and one incorrect. To construct these, we used Gemini-2 to generate a plausible but incorrect answer ($a_{\mathrm{plaus}}$) for each sample in SQuAD-it. Specifically, we prompted Gemini-2 with the context $c$, the question $q$, and the correct answer $a_{\mathrm{corr}}$, instructing it to generate an alternative answer that is plausible, meaning it could conceivably be a correct answer based on the question, but is in fact incorrect. The exact prompt used is included in Appendix A.</p>
        <p>This resulted in a dataset where each instance takes the form $(c, q, a_{\mathrm{corr}}, a_{\mathrm{plaus}})$. The presence of the context $c$ provides strong evidence in favor of the correct answer, minimizing ambiguity and uncertainty in the model's decision. This version is intended to simulate the lowest level of uncertainty, where one answer is clearly supported by the context and the other, while plausible, is not. While we generated and publicly released SQuAD-it-2 for both training and test splits, we consider only the test set, which includes 7,609 samples, for the experiments of this paper.</p>
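        <p>As a concrete illustration, the following is a minimal sketch of this construction step. The call_llm helper and the record field names are hypothetical placeholders (the paper does not specify the client library used for Gemini-2); the instruction text is the prompt reported in Appendix A.</p>
        <preformat># Minimal sketch of the Low Uncertainty construction step.
# `call_llm` is a hypothetical helper standing in for the Gemini-2 API;
# field names ("context", "question", "answer") are assumptions.

PLAUSIBLE_WRONG_PROMPT = (
    "Contesto: {context}\n"
    "Domanda: {question}\n"
    "Risposta corretta: {correct}\n"
    "Fornisci una risposta plausibile ma sbagliata alla domanda sopra, "
    "basandoti sul contesto. Restituisci solo la risposta, senza "
    "spiegazioni o altro."
)

def build_low_uncertainty_instance(sample, call_llm):
    """Extend a SQuAD-it triple (c, q, a_corr) to (c, q, a_corr, a_plaus)."""
    prompt = PLAUSIBLE_WRONG_PROMPT.format(
        context=sample["context"],
        question=sample["question"],
        correct=sample["answer"],
    )
    a_plaus = call_llm(prompt).strip()  # plausible but incorrect answer
    return {**sample, "distractor": a_plaus}</preformat>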
      <sec id="sec-1-1">
        <title>WebGPT. The WebGPT dataset [11] was introduced</title>
        <p>SQuAD-it-2 Medium Uncertainty. In this version, to support research in aligning long-form question
anwe reuse the same set of samples from the Low Uncer- swering systems with human preferences. It consists
tainty setting, including the same plausible incorrect an- of 19,578 comparisons between pairs of answers to the
swers generated by Gemini-2. However, to increase the same open-ended question, each annotated with human
level of uncertainty, we deliberately remove the context preference scores. These answers were originally
gener from each sample. This modification results in in- ated by a GPT-3 model fine-tuned via imitation learning
stances of the form (, corr, plaus), where the model is and further optimized using reinforcement learning from
asked to choose between two answers without access to human feedback (RLHF). Each comparison includes
metathe supporting information. data such as the browsing quotes used to compose the
an</p>
        <p>
          In the absence of context, the task becomes signifi- swers and the associated preference scores, which range
cantly more challenging. While the correct answer re- from − 1 to 1 and indicate which answer is preferred by
mains correct in an absolute sense, the model cannot rely annotators.
on evidence from the passage to make its choice. Some- For our work, we extracted a subset of this dataset
times, the question can still be answered using world focusing on clear preference signals. Specifically, we
knowledge or intuition; other times, it becomes virtually selected only those examples in which the two answers
impossible to determine which answer is correct based received diferent human scores ( (1) ̸= (2)), ensuring
solely on the question. As a result, this version intro- a clear distinction between a preferred answer and a less
duces a medium level of uncertainty, greater than in the preferred (distractor) one. This yielded to a total of 14,346
contextualized setting, but not entirely arbitrary, since samples for our experiments. From each of these selected
one answer is still grounded in the original question. The examples, we constructed input triples (, pref, dist),
dataset comprises 7,609 samples from the test split. where  is the original question, pref is the answer
with the higher human score, and dist is the lower-rated
SQuAD-it-2 High Uncertainty. This version repre- alternative. To standardize the task, we reformulated the
sents the maximum level of uncertainty, simulating a original human instruction, used during annotation to
scenario in which the model must choose between two guide raters in evaluating answer quality, as a prompt
equally ungrounded options. Here, we prompt Gemini-2 question asking the model to choose the better answer.
to generate two completely out-of-context (ooc) answers
for each question . The prompt (included in Appendix Winning Arguments. This dataset [
          <xref ref-type="bibr" rid="ref2">12</xref>
          ] is derived
A) provides the question, the context and the correct from the r/ChangeMyView subreddit, where users post
answer, instructing Gemini-2 to generate two answers their opinions and invite others to persuade them to
that are non-plausible, that is, they should not reasonably change their views. In this setting, the original poster
answer the question and should bear no clear relation to (OP) can award a “delta” (∆ ) to a reply that successfully
the topic. changed their mind. The dataset contains conversation
        </p>
        <p>The resulting instances are structured as threads enriched with metadata indicating which replies
(, (o1oc), (o2oc)), where both answers are distrac- received a delta, making it a valuable resource for
studytors. Since neither candidate is appropriate or grounded ing persuasion and argument quality.
in the question, there is no clear basis for choosing one To construct comparison pairs, the original dataset
over the other. In this setting, the model’s decision is creators used a controlled pairing strategy: each
deltaexpected to approximate random guessing, and the task awarded reply (i.e., persuasive) was matched with the
itself loses semantic validity. Nonetheless, we include most similar reply in the same thread that did not receive
this version to simulate conditions of extreme ambiguity a delta (i.e., less persuasive), based on Jaccard similarity.
and explore how models behave when confronted with This yields pairs of messages that are highly comparable
entirely unsupported, content-free binary choices. This in content but difer in perceived persuasiveness,
allowallows us to probe the outer limits of positional bias, ing fine-grained analysis of what makes one argument
more compelling than another. As with WebGPT, this field of foundation model research, including both
opendataset centers on subjective human preferences, making weight and proprietary providers.
the task inherently uncertain and nuanced.</p>
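        <p>For illustration, a sketch of this pair-selection step under assumed field names (the actual WebGPT record schema may differ):</p>
        <preformat># Sketch of the WebGPT subset extraction: keep only comparisons whose two
# answers received different human scores; the higher-scored answer
# becomes the preferred option. Field names are hypothetical.

def build_webgpt_instances(records):
    instances = []
    for rec in records:
        s1, s2 = rec["score_1"], rec["score_2"]
        if s1 == s2:
            continue  # tie: no clear preference signal, discard
        if s1 > s2:
            pref, dist = rec["answer_1"], rec["answer_2"]
        else:
            pref, dist = rec["answer_2"], rec["answer_1"]
        instances.append({"q": rec["question"], "pref": pref, "dist": dist})
    return instances</preformat>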
        <p>Winning Arguments. This dataset [12] is derived from the r/ChangeMyView subreddit, where users post their opinions and invite others to persuade them to change their views. In this setting, the original poster (OP) can award a “delta” (∆) to a reply that successfully changed their mind. The dataset contains conversation threads enriched with metadata indicating which replies received a delta, making it a valuable resource for studying persuasion and argument quality.</p>
        <p>To construct comparison pairs, the original dataset creators used a controlled pairing strategy: each delta-awarded reply (i.e., persuasive) was matched with the most similar reply in the same thread that did not receive a delta (i.e., less persuasive), based on Jaccard similarity. This yields pairs of messages that are highly comparable in content but differ in perceived persuasiveness, allowing fine-grained analysis of what makes one argument more compelling than another. As with WebGPT, this dataset centers on subjective human preferences, making the task inherently uncertain and nuanced.</p>
        <p>For our experiments, we used only the test set provided with the dataset, consisting of 807 pairs. Each instance was structured as a triple $(p, r_{\mathrm{pref}}, r_{\mathrm{dist}})$, where $p$ is the original post, $r_{\mathrm{pref}}$ is the reply that received the delta, and $r_{\mathrm{dist}}$ is the similar, non-awarded reply. To reproduce a maximum uncertainty setting and prevent models from relying on contextual cues from the original post, we only include the two replies $(r_{\mathrm{pref}}, r_{\mathrm{dist}})$ in each instance, excluding $p$ entirely. This dataset adds a valuable dimension to our evaluation by focusing on real-world argumentative discourse and subjective judgments of persuasive effectiveness.</p>
      </sec>
        <sec id="sec-1-1-1">
          <title>3.2. Experimental Protocol</title>
          <p>We adopt a two-pass prompting strategy to evaluate positional bias across the five datasets introduced in Section 3.1. The Low and Medium Uncertainty versions of SQuAD-it-2 are derived from the same underlying dataset; the difference lies in whether the context is provided: it is included in the Low Uncertainty setting and omitted in the Medium one. All other datasets are evaluated without any context.</p>
          <p>Each instance consists of a prompt $q$, a preferred answer $a_{\mathrm{pref}}$, and a distractor $a_{\mathrm{dist}}$. For every evaluation condition, we proceed as follows (see the sketch at the end of this section):
1. Pass 1 (Original Order). We construct Prompt1, placing $a_{\mathrm{pref}}$ as Option 1 and $a_{\mathrm{dist}}$ as Option 2, alongside the question and, where applicable, the context. The prompt is submitted to the target model, and its response is recorded as $s^{(1)}$.
2. Pass 2 (Swapped Order). We construct Prompt2 by inverting the order of the two answers. The instructional text and context (if any) are kept identical. The model's response is recorded as $s^{(2)}$.</p>
          <p>Prompt phrasing is tailored to the semantics of each dataset and is reported in Appendix B. In all cases, the prompts in Pass 1 and Pass 2 are structurally identical except for the position of the two candidate answers. The model's raw selections $s^{(1)}$ and $s^{(2)}$ are logged without transformation and later used in the analysis of positional bias.</p>
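          <p>A minimal sketch of this two-pass loop, assuming a hypothetical ask_model function that submits a prompt and returns the raw model output:</p>
          <preformat># Two-pass protocol: identical prompts except for the order of the two
# candidate answers. `ask_model` and `template` are hypothetical stand-ins
# for the actual model client and the dataset-specific prompt (Appendix B).

def two_pass_eval(instance, ask_model, template):
    q, pref, dist = instance["q"], instance["pref"], instance["dist"]
    # Pass 1 (original order): preferred answer first.
    s1 = ask_model(template.format(question=q, option_a=pref, option_b=dist))
    # Pass 2 (swapped order): same instructions, options inverted.
    s2 = ask_model(template.format(question=q, option_a=dist, option_b=pref))
    return s1, s2  # raw selections, logged without transformation</preformat>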
        </sec>
        <sec id="sec-1-1-2">
          <title>3.3. Models Evaluated</title>
          <p>We benchmark five state-of-the-art large language models (LLMs), selected to cover a spectrum of architectures, parameter scales, and deployment configurations. All models are developed by leading organisations in the field of foundation model research, including both open-weight and proprietary providers.
• LLaMA-3.1-8B: An 8-billion-parameter open-weight model [19] following the LLaMA architecture, fine-tuned for Italian, and released in late 2024. Its compact size makes it well-suited for downstream use in resource-constrained scenarios.
• Gemma-3-12B (quantized): A 12-billion-parameter open-weight multilingual model [20], quantized to 4-bit precision (Q4_K_M) and retrieved via the Ollama model hub. This quantized variant is employed for efficiency under computational constraints.
• Gemini 1.5: A proprietary multilingual model [21] from Google DeepMind, specifically tailored for QA tasks.
• Gemini 2: The successor to Gemini 1.5, featuring architectural improvements and retraining on updated corpora.
• Phi-4-14B (quantized): A 14-billion-parameter open-weight multilingual model [22], quantized to 4-bit precision (Q4_K_M) and retrieved via the Ollama model hub. Like Gemma-3, this model is used in its quantized form to enable evaluation under limited computational resources.</p>
          <p>Quantized models are adopted primarily due to
hardware and latency constraints. To ensure validity, we
conducted a preliminary test comparing the quantized and
full-precision variants of each model on a 100-instance
subset of the Winning Arguments dataset. The results
showed almost identical accuracy across both versions,
suggesting that quantization does not substantially
affect model preference or correctness in our evaluation
setting.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>3.4. Bias Metrics</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>To quantify positional bias in model preferences, we</title>
        <p>adopt two significant metrics: Preference Fairness (PF),
introduced by Shi et al. [13], and Position Consistency (PC),
a widely adopted measure in the positional bias literature
and also discussed in their work.</p>
        <p>We do not consider Repetitional Consistency (RC) (also
introduced by Shi et al.), which measures model stability
across repeated identical queries, as we believe it is not
suficiently related to positional bias and not computable
under our two-pass evaluation protocol.
3.4.1. Preference Fairness (PF)
PF quantifies directional positional bias: the extent to
which a model favors one answer position (first or
second) independently of content. However, in our setting,
we focus on the magnitude of this bias, regardless of
whether it leans toward the first or second option. To
this end, we report the absolute value of the PF score, so
that it ranges from 0 (no bias) to 1 (maximal bias).</p>
      </sec>
      <sec id="sec-1-3">
        <title>Formally, we compute a raw PF score (PFraw) following Shi et al. [13]:</title>
        <p>where m−in and m+ax are the minimum and maximum
achievable values of PFraw, respectively, under the given
conditions. This normalization centers the scale around
zero and bounds it between − 1 and 1.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Finally, we report the absolute value of the resulting PF score: so that:</title>
        <p>|PF| ∈ [0, 1],
is content-based and consistent).
• |PF| = 0 indicates no positional bias (preference
• |PF| = 1 indicates maximum positional bias
(model always favors one position regardless of
content).
• Intermediate values reflect increasing degrees of
positional influence on preference.
3.4.2. Position Consistency (PC)</p>
      </sec>
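            <p>The computation of $\mathrm{PF}_{\mathrm{raw}}$ itself follows Shi et al. [13] and is not reproduced here; the sketch below covers only the normalization and absolute-value steps described above, under the reading that positive and negative raw scores are rescaled by the achievable maximum and minimum, respectively.</p>
            <preformat># Normalization and absolute-value steps for PF. pf_raw, pf_min and
# pf_max are assumed to be computed following Shi et al. [13]; only the
# rescaling into [-1, 1] and the reported magnitude |PF| are shown.

def normalize_pf(pf_raw: float, pf_min: float, pf_max: float) -> float:
    """Map PF_raw into [-1, 1], keeping PF_raw = 0 at 0 (no bias)."""
    if pf_raw >= 0:
        return pf_raw / pf_max if pf_max else 0.0
    return pf_raw / abs(pf_min) if pf_min else 0.0

def pf_magnitude(pf_raw: float, pf_min: float, pf_max: float) -> float:
    """Absolute PF score in [0, 1], as reported in this paper."""
    return abs(normalize_pf(pf_raw, pf_min, pf_max))</preformat>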
      <sec id="sec-1-5">
        <title>PC assesses stability rather than directionality: it mea</title>
        <p>sures how often the model selects the same answer before
and after the answer order is swapped. Formally:
PC =

=1

1 ∑︁ I(︀ (1) = (2))︀ ,
model at pass , and I(· ) is the indicator function.
where ()</p>
        <p>∈ {A, B} is the option chosen by the
• pcn is the normalized count of times the model
across all model–dataset pairs.</p>
        <p>order.
• A value of PC = 1 indicates full positional
robustness: the model’s choice is unafected by option
• Lower values imply that the model’s preference
changes depending on which position the
answers are presented in.</p>
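            <p>A direct implementation of the PC formula above (the re-mapping of the pass-2 label back to the pass-1 labeling is assumed to have been done upstream):</p>
            <preformat># PC = (1/N) * sum_i I(s_i^(1) == s_i^(2)). `selections` is a list of
# (s1, s2) pairs; s2 is assumed to be re-mapped so that both labels
# refer to the same candidate answer.

def position_consistency(selections):
    n = len(selections)
    if n == 0:
        return 0.0
    agree = sum(1 for s1, s2 in selections if s1 == s2)
    return agree / n</preformat>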
      </sec>
      <sec id="sec-1-6">
        <title>PF and PC capture orthogonal phenomena: PF indi</title>
        <p>cates directional preference bias, while PC reflects
robustness to positional perturbation. We report both metrics</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Results and Discussion</title>
      <p>Table 1 reports the performance of various models on binary QA tasks across datasets with varying levels of uncertainty. Each model is evaluated under two conditions: when the correct answer is presented first and when it is presented second. Additionally, we report the number of invalid responses, i.e., outputs not conforming to the expected binary format. Figure 2 provides a visualization of the magnitude of positional bias, with bars showing the values of PF (reported in absolute value) and PC for every model and dataset evaluated. In this plot, higher PF values indicate stronger positional bias, while lower PC values correspond to reduced position consistency and thus higher bias. While the figure offers an immediate overview of how bias varies across datasets and models, the accuracy table provides more detailed insights into model behavior, revealing specific patterns such as systematic preference for a given position or consistent shifts in performance depending on answer order.</p>
      <p>Figure 2: (a) Preference Fairness; (b) Position Consistency, per model and dataset.</p>
      <p>SQuAD-it-2 Low Uncertainty. Under low uncertainty, performance is high and relatively stable. All models (except Llama) maintain accuracy above 90% across both conditions. This indicates that when questions are clear and straightforward, the models perform robustly and are less sensitive to presentation order.</p>
      <p>SQuAD-it-2 Medium and High Uncertainty. As uncertainty increases, performance drops and order effects become more pronounced. In the medium uncertainty setting, accuracy generally decreases across models, and some models (e.g., Gemini-2 and Phi4-14B-Q) actually perform slightly better when the wrong answer is presented first. This may reflect a shift in reliance from positional bias to internal reasoning mechanisms.</p>
      <p>In the high uncertainty setting, models diverge sharply. For example, Llama-3.1-8B shows a drastic drop in accuracy when the wrong answer is presented first (from 0.648 to 0.108), indicating a strong sensitivity to order under ambiguous conditions. In contrast, Gemma-3-12B-Q improves when the wrong answer is first (from 0.411 to 0.590), suggesting a different processing dynamic. Invalid responses spike in this setting, especially for the Llama and Gemini models, indicating greater difficulty in producing well-formed answers when the uncertainty is higher.</p>
      <p>WebGPT and Winning Arguments. Real-world datasets present an additional layer of complexity. In the WebGPT task, most models follow the trend observed in synthetic settings: higher accuracy when the correct answer comes first. However, Gemini-2 again deviates from this pattern, performing slightly better in the wrong-first condition.</p>
      <p>In the Winning Arguments dataset, which features highly opinionated and subjective content, the reversal is particularly pronounced: all models consistently perform better when the correct answer is presented second. For instance, Gemma-3-12B-Q improves dramatically from 0.302 to 0.823 accuracy in the wrong-first setting. This striking and systematic pattern suggests that models may be influenced not just by answer content but also by presentation dynamics, such as contrastive framing or cumulative reasoning, where the second answer is implicitly treated as a refinement or counterpoint to the first. It is also possible that models trained on internet discussions and dialogues have internalized discourse norms in which stronger or more convincing arguments often follow weaker ones in order to rebut or build upon them. This behavior warrants further investigation, as it may reveal underlying heuristics the models rely on in persuasive or opinionated domains.</p>
      <p>General Trends and Considerations. Across datasets, several consistent patterns emerge, highlighting how model behavior in binary QA tasks is influenced by a complex interplay of input uncertainty, answer ordering, and model architecture. Most models perform better when the correct answer is presented first, particularly under low uncertainty conditions, suggesting a tendency to favor the first option when questions are clear and unambiguous. However, in the Winning Arguments dataset, which involves persuasive argumentation, all models systematically perform better when the correct answer is presented second. The magnitude and consistency of this reversal suggest a strong bias toward the second option in subjective or argumentative contexts, possibly influenced by discourse structure or rhetorical patterns in the training data.</p>
      <p>As uncertainty increases, the impact of answer ordering becomes more marked across models and datasets. While many models demonstrate robustness under low uncertainty, with small performance differences between correct-first and wrong-first conditions, their behavior becomes significantly more unpredictable and order-sensitive with higher uncertainty. This growing sensitivity is particularly evident in Figure 2: as the input becomes more ambiguous or subjective, such as in the SQuAD-it-2 High Uncertainty and Winning Arguments settings, models increasingly deviate from uniform behavior and show strong biases. This trend suggests that models may resort to positional heuristics or discourse-level patterns under stress, rather than relying on semantic fidelity alone. When varying the uncertainty level in SQuAD-it-2 (from Low to High Uncertainty), a clear pattern emerges: the rate of invalid outputs consistently increases, highlighting the difficulty models face in maintaining output consistency and adhering to format constraints as the task becomes less structured.</p>
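      <p>For reference, a sketch of the bookkeeping behind the accuracy and invalid-response columns of Table 1 (record field names are hypothetical):</p>
      <preformat># Per-condition accuracy and invalid-response count, as reported in
# Table 1. Each record holds the gold label ('A' or 'B') for the
# condition and the model's raw choice; anything else counts as invalid.

def condition_stats(results):
    valid = [r for r in results if r["choice"] in ("A", "B")]
    invalid = len(results) - len(valid)
    correct = sum(1 for r in valid if r["choice"] == r["gold"])
    accuracy = correct / len(valid) if valid else 0.0
    return {"accuracy": accuracy, "invalid": invalid}</preformat>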
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion</title>
      <p>In this work, we conducted a systematic investigation of positional bias in large language models using binary-choice prompting. We evaluated five different LLMs across both controlled tasks and real-world datasets, and introduced a novel benchmark, SQuAD-it-2, to study this phenomenon in Italian, an underrepresented language in current LLM evaluation efforts. SQuAD-it-2 includes binary QA tasks at three uncertainty levels, enabling fine-grained analysis of how answer ordering interacts with ambiguity.</p>
      <p>Our findings reveal a clear trend: as input uncertainty increases, so does positional bias. Under low uncertainty, models exhibit high accuracy and almost identical performance whether the correct answer is presented first or second, indicating minimal or no bias in these conditions. However, as uncertainty rises, due to the removal of contextual cues or the subjective nature of the task, models begin to show strong and often inconsistent positional preferences.</p>
      <p>We used two dedicated metrics to quantify these effects: Preference Fairness (PF), which captures how much a model favors one position over another, and Position Consistency (PC), which reflects how stable model decisions are across different answer orderings. Both metrics show clear deterioration as uncertainty increases, confirming that models rely more heavily on position-based heuristics when semantic cues are weak.</p>
      <p>A particularly striking result comes from the Winning Arguments dataset, where all models systematically prefer the second option, even when it is incorrect. This behavior suggests that models may be influenced not only by answer content but also by presentation dynamics, such as contrastive framing or cumulative reasoning, possibly reflecting discourse norms internalized during training, where stronger arguments often follow weaker ones to refine or counter them.</p>
      <p>These results expose a fundamental limitation in current LLMs and highlight the need for robust evaluation and debiasing strategies, especially in high-stakes or subjective scenarios. Our release of SQuAD-it-2 provides a valuable tool for continued research, offering a scalable and controlled benchmark for assessing positional artifacts, particularly in multilingual contexts.</p>
      <p>Future work should explore the mechanisms behind position-based preferences more deeply, with special attention to how models process discourse structure, contrastive reasoning, and pragmatic cues. Better understanding these behaviors will be crucial for developing more interpretable, trustworthy, and bias-resilient models. Additionally, it would be valuable to introduce a third option (e.g., “neither response is valid”) in future evaluations, as we observed that models often implicitly reject both candidates when neither is convincing. Investigating how model behavior changes with the inclusion of such an option could offer further insight into their decision-making strategies under uncertainty.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgements</title>
      <p>This work has been partially funded by the European Union through the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.3 – Call for Tender No. 341 of March 15, 2022, of the Italian Ministry of University and Research – NextGenerationEU; Code PE00000014, Concession Decree No. 1556 of October 11, 2022, CUP D43C22003050001, Project “SEcurity and RIghts in the CyberSpace (SERICS) – Spoke 2 Misinformation and Fakes – DEcision supporT systEm foR cybeR intelligENCE (Deterrence)”.</p>
    </sec>
    <sec id="sec-3">
      <title>A. Prompts for SQuAD-it-2 Dataset</title>
    </sec>
    <sec id="sec-4">
      <title>Generation</title>
      <sec id="sec-4-1">
        <title>A.1. Prompt for Low and Medium</title>
      </sec>
      <sec id="sec-4-2">
        <title>Uncertainty Settings</title>
        <p>For the SQuAD-it-2 Low and Medium Uncertainty
variants, we used a single prompt to generate a plausible but
incorrect answer. The input to the model includes the
original context passage, the question, and the correct
answer. The model is explicitly instructed to generate an
answer that could reasonably be interpreted as correct
(i.e., plausible), while being in fact incorrect. The exact
prompt is shown below:
Contesto: &lt;CONTEXT&gt;
Domanda: &lt;QUESTION&gt;
Risposta corretta: &lt;CORRECT_ANSWER&gt;
Step 2: Second Out-of-Context Answer. The model
is then prompted again with the same context, question,
and correct answer, along with the previously generated
out-of-context wrong answer. This time, it is asked to
produce a second, distinct out-of-context answer. The
corresponding prompt is:
Contesto: &lt;CONTEXT&gt;
Domanda: &lt;QUESTION&gt;
Risposta corretta: &lt;CORRECT_ANSWER&gt;
Risposta errata: &lt;WRONG_ANSWER_1&gt;
Fornisci una risposta completamente fuori
contesto e sbagliata alla domanda sopra.
Assicurati che non sia basata sul
contesto fornito e che sia diversa dalla
risposta errata gia’ presente.</p>
        <p>Restituisci solo la risposta, senza
spiegazioni o altro.</p>
        <p>B, depending on which answer
is better. Do not explain your
choice. Output only A or B.</p>
        <sec id="sec-4-2-1">
          <title>Final Construction. Once both out-of-context an</title>
          <p>swers are generated, we discard the original context and
the correct answer, retaining only the question and the
two OOC distractors. The final dataset entries are
structured as:</p>
        </sec>
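        <p>A sketch of this two-step generation, chaining the two prompts above (call_llm, the templates, and the field names are hypothetical placeholders, as in Section 3.1):</p>
        <preformat># Two-step OOC generation for the High Uncertainty split: the first call
# produces one out-of-context answer, the second call produces a distinct
# one; the context and gold answer are then discarded.

def build_high_uncertainty_instance(sample, call_llm, step1_tmpl, step2_tmpl):
    ooc1 = call_llm(step1_tmpl.format(
        context=sample["context"],
        question=sample["question"],
        correct=sample["answer"],
    )).strip()
    ooc2 = call_llm(step2_tmpl.format(
        context=sample["context"],
        question=sample["question"],
        correct=sample["answer"],
        wrong_answer=ooc1,
    )).strip()
    return {"q": sample["question"], "ooc1": ooc1, "ooc2": ooc2}</preformat>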
      </sec>
    </sec>
    <sec id="sec-5">
      <title>B. Prompt Templates</title>
      <sec id="sec-5-1">
        <title>We report here the prompt templates used in the experiments described in Section 3.2. Each dataset required a prompt adapted to its semantic framing and language.</title>
      </sec>
      <sec id="sec-5-2">
        <title>SQuAD-it-2. These prompts are in Italian. When context is present (Low Uncertainty), it is introduced with “Contesto:”. The rest of the prompt follows this structure:</title>
        <p>Domanda: [Q]
A) [Risposta 1]
B) [Risposta 2]</p>
      </sec>
      <sec id="sec-5-3">
        <title>The final instruction depends on the uncertainty level:</title>
        <p>• Low Uncertainty (with context):</p>
        <p>Scegli la risposta</p>
        <p>Restituisci solo A o B.
• Medium/High Uncertainty (no context):</p>
        <p>Scegli la risposta che reputi più
corretta. Se credi che nessuna sia
corretta, scegli comunque quella che
reputi più plausibile. Restituisci
solo A o B.
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>guage Processing</source>
          ,
          <source>EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Miami</article-title>
          , FL, USA, Fornisci una risposta plausibile ma sbagliata
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>November</surname>
          </string-name>
          12-
          <issue>16</issue>
          ,
          <year>2024</year>
          ,
          <article-title>Association for Compu- alla domanda sopra, basandoti sul</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>tational Linguistics</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>11084</fpage>
          -
          <lpage>11108</lpage>
          . URL: contesto. Restituisci solo la risposta,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>emnlp-main.621. senza spiegazioni o altro</article-title>
          . [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Mo- This prompt ensures that the incorrect answer remains</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>ICLR</source>
          <year>2025</year>
          , Singapore,
          <source>April 24-28</source>
          ,
          <year>2025</year>
          , OpenRe
          <article-title>- A.2. Prompting Strategy for High</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          view.net,
          <year>2025</year>
          . URL: https://openreview.net/forum? Uncertainty Setting
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>id=3GTtZFiajM. For the High Uncertainty setting, we followed a two-step</article-title>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Squad: prompting process to construct a pair of out-of-context</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <volume>100</volume>
          ,000+
          <article-title>questions for machine comprehension of (OOC) incorrect answers. The correct answer is used</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>text</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          ).
          <article-title>during generation but removed from the final</article-title>
          dataset to [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ka- increase ambiguity.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>ten</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <article-title>The llama 3 herd of models, Step 1: First Out-of-Context Answer</article-title>
          . The model
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
          <article-title>receives the context, the question, and the correct an</article-title>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Vieil- swer. It is asked to generate an incorrect answer that is</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          , et al.,
          <article-title>Gemma 3 technical report, arXiv it is not plausible or grounded</article-title>
          .
          <source>The prompt used is:</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ). [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. I. Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burnell</surname>
          </string-name>
          , L. Bai, Contesto: &lt;CONTEXT&gt;
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          et al.,
          <source>Gemini</source>
          <volume>1</volume>
          .
          <article-title>5: Unlocking multimodal under- Risposta corretta: &lt;CORRECT_ANSWER&gt;</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>preprint arXiv:2403.05530</source>
          (
          <year>2024</year>
          ).
          <article-title>Fornisci una risposta completamente fuori [22]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Abouelenin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashfaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          , cAosnstiecsutroatei cshbeaglnioantasiaallbaasdaotmaansdual sopra.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al., Phi-4
          <article-title>-mini risposta, senza spiegazioni o altro</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>preprint arXiv:2503.01743</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>