<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tiziano Labruna</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Gallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Da San Martino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR - ISTI</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn fn-type="corresp"><p>Corresponding author.</p></fn>
        <fn fn-type="equal"><p>These authors contributed equally.</p></fn>
        <fn fn-type="other"><p>tlabruna@unipd.it (T. Labruna); simone.gallo@isti.cnr.it (S. Gallo); dasan@math.unipd.it (G. D. S. Martino)</p></fn>
      </author-notes>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models (LLMs) under varying degrees of answer uncertainty. We adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT question pairs with unequal human-assigned quality scores, and (2) Winning Arguments, where models predict the more persuasive argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order of the “correct” (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness (PF) and Position Consistency (PC). We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially as it becomes harder to determine which option is correct.</p>
      </abstract>
      <kwd-group>
        <kwd>Positional bias</kwd>
        <kwd>question answering</kwd>
        <kwd>large language models</kwd>
        <kwd>answer ordering</kwd>
        <kwd>binary choice evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Additionally, we identified two datasets that involve subjective judgments or nuanced quality comparisons. The first is WebGPT [11], which provides human-rated preferences between pairs of model-generated answers to the same question. The second is Winning Arguments [12], featuring pairs of Reddit r/ChangeMyView responses to a single post, where only one reply earned a “delta” for being deemed more persuasive.</p>
      <p>Across these datasets, we measure positional bias using two complementary metrics (Preference Fairness and Position Consistency) that capture whether and how often a model's decision changes when the order of the candidate answers is swapped. Through the experiments we uncover a clear pattern: positional bias is negligible when uncertainty is low but grows exponentially as uncertainty increases. Moreover, we find that this effect is especially pronounced in tasks requiring subjective judgment, where models frequently default to order-based heuristics in the absence of unambiguous signals.</p>
      <p>The remainder of the paper is organized as follows: Section 2 reviews related work on position bias and fairness in NLP. Section 3 details our dataset construction, experimental protocols, and bias metrics. Section 4 presents quantitative results and discusses the significance of the outcomes. Finally, Section 5 concludes and outlines future directions.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>The growing adoption of Large Language Models (LLMs) in both generation and evaluation tasks has brought increased scrutiny to their fairness, especially in contexts involving binary or pairwise decisions. A prominent concern is positional bias: a systematic preference for one response over another based solely on its position in the prompt, irrespective of content. Our work builds on, and differentiates itself from, a body of literature that has examined this phenomenon under various evaluation and reasoning paradigms.</p>
      <p>The study by Shi et al. [13] offers the most comprehensive exploration of positional bias in LLM-based pairwise evaluation. They introduce three core metrics, Positional Fairness (PF), Positional Consistency (PC), and Repetitional Consistency (RC), to systematically assess how the order of candidate responses affects judgement outcomes. Notably, they find that while most models exhibit high repetitional consistency, i.e., deterministic outputs across repeated trials, positional fairness and consistency vary widely across tasks and models. Their findings demonstrate that positional bias becomes especially pronounced when comparing responses of near-equal quality, an observation that directly informs our own approach of varying answer uncertainty to modulate the ambiguity of binary choices.</p>
      <p>While Shi et al. focus primarily on models acting as evaluators, Wang et al. [14] provide compelling evidence of position-sensitive scoring even in ostensibly objective comparisons. They show that GPT-4 tends to favour the first answer while GPT-3.5 leans toward the second, irrespective of prompt instruction, as also highlighted by similar studies [4]. Their proposed mitigation strategies, including Balanced Position Calibration (BPC) and Multiple Evidence Calibration (MEC), highlight the importance of structural prompt design in mitigating these biases. Our study similarly adopts systematic answer reordering, but unlike Wang et al., we extend the analysis to task formats beyond pairwise model evaluation, such as QA under uncertainty.</p>
      <p>Other work, such as [15], shifts the lens toward multi-option multiple choice settings. The authors distinguish between token bias, a preference for specific answer IDs like “A” or “B”, and position bias, a preference for answers based on ordinal position. Their central claim is that token bias, not positional bias, is the primary cause of inconsistencies in MCQ tasks, and they propose PriDe, a debiasing method based on prior estimation. While they conclude that positional bias is secondary and often overestimated, our findings suggest that under heightened uncertainty, position bias becomes marked, particularly when correct answers are ambiguous and out-of-context.</p>
      <p>The PORTIA framework proposed by another recent study [16] presents an architectural solution to reduce positional dependency by restructuring the input through segmental alignment. Although PORTIA is designed for evaluator settings, its contribution lies in demonstrating that careful content interleaving can dampen reliance on positional heuristics. While our methodology does not employ PORTIA-like restructuring, it shares a core intuition: positional effects intensify when content cues are weak or ill-formed, a condition we explicitly engineer through dataset manipulation.</p>
      <p>The CALM framework [17] offers a general-purpose protocol for quantifying a wide range of biases in LLM-as-a-judge settings. Its automated perturbation method, swapping candidate positions to detect volatility in outcomes, serves as a direct methodological precedent for the Position Consistency metric. Moreover, CALM's observation that positional bias scales with the number of response options aligns with our finding that bias intensifies when answer certainty decreases.</p>
      <p>In contrast to all aforementioned works, our study offers a novel synthesis of two research trajectories: binary positional evaluation under uncertainty and large-scale QA-based benchmarking. By systematically controlling for answer ambiguity across datasets derived from SQuAD-it, WebGPT, and Reddit's r/ChangeMyView (Winning Arguments dataset), we demonstrate that positional bias is not merely an artefact of model prompt formatting or answer labelling conventions. Rather, it reflects a deeper tendency of LLMs to resolve ambiguity through positional priors, a phenomenon that expands the scope of prior observations made in evaluation-only contexts. Furthermore, our work empirically substantiates the claim that positional bias is conditional, not fixed, and emerges as a second-order inference strategy when primary cues are degraded.</p>
      <p>In sum, our contribution lies in bridging the gap between diagnostic evaluator studies and answer-generation tasks, showing that positional bias is neither an isolated nor a negligible phenomenon, but one that is sensitive to context, task framing, and content quality. This dual framing broadens the understanding of bias in LLMs and calls for future work on uncertainty-aware prompt and dataset design.</p>
    </sec>
    <sec id="sec-method">
      <title>3. Methodology</title>
      <p>In this section, we describe the construction of our positional bias benchmarks, the experimental protocol for prompting and evaluation, the set of language models under investigation, and the metrics used to quantify positional bias. Figure 1 shows a visual summary of the methodology.</p>
      <sec id="sec-method-1">
        <title>3.1. Datasets</title>
        <p>To systematically investigate positional bias under varying levels of uncertainty, we constructed a new benchmark suite, SQuAD-it-2, derived from the Italian SQuAD-it dataset [10] and spanning three uncertainty conditions: Low, Medium, and High. In addition, we employed two existing datasets, WebGPT and Winning Arguments, which capture human preference in more subjective decision-making contexts.</p>
        <p>Each dataset is structured around binary-choice instances, represented either as quadruples $(c, q, a_1, a_2)$ or triples $(q, a_1, a_2)$, where $c$ is an optional context, $q$ is a question or prompt, and $(a_1, a_2)$ are candidate answers. One answer is designated as the preferred choice, while the other serves as a distractor.</p>
        <p>SQuAD-it-2 Low Uncertainty. This setting builds upon the SQuAD-it dataset [10], a semi-automatic Italian translation of the original English SQuAD dataset [18]. Each sample in SQuAD-it is structured as a triple $(c, q, a_{\mathrm{corr}})$, where $q$ is a question, $c$ is a supporting context passage, and $a_{\mathrm{corr}}$ is the correct answer, which is always explicitly contained in the context.</p>
        <p>However, for our study on positional bias in binary-choice settings, we needed pairs of answer candidates: one correct and one incorrect. To construct these, we used Gemini-2 to generate a plausible but incorrect answer ($a_{\mathrm{plaus}}$) for each sample in SQuAD-it. Specifically, we prompted Gemini-2 with the context $c$, the question $q$, and the correct answer $a_{\mathrm{corr}}$, instructing it to generate an alternative answer that is plausible, meaning it could conceivably be a correct answer based on the question, but is in fact incorrect. The exact prompt used is included in Appendix A.</p>
        <p>This resulted in a dataset where each instance takes the form $(c, q, a_{\mathrm{corr}}, a_{\mathrm{plaus}})$. The presence of the context $c$ provides strong evidence in favor of the correct answer, minimizing ambiguity and uncertainty in the model's decision. This version is intended to simulate the lowest level of uncertainty, where one answer is clearly supported by the context and the other, while plausible, is not. While we generated and publicly released SQuAD-it-2 for both training and test splits, we consider only the test set, which includes 7,609 samples, for the experiments of this paper.</p>
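        <p>As a concrete illustration, the following is a minimal sketch of this construction step. The call_llm helper and the record field names are hypothetical placeholders (the paper does not specify the client library used for Gemini-2); the instruction text is the prompt reported in Appendix A.</p>
        <preformat># Minimal sketch of the Low Uncertainty construction step.
# `call_llm` is a hypothetical helper standing in for the Gemini-2 API;
# field names ("context", "question", "answer") are assumptions.

PLAUSIBLE_WRONG_PROMPT = (
    "Contesto: {context}\n"
    "Domanda: {question}\n"
    "Risposta corretta: {correct}\n"
    "Fornisci una risposta plausibile ma sbagliata alla domanda sopra, "
    "basandoti sul contesto. Restituisci solo la risposta, senza "
    "spiegazioni o altro."
)

def build_low_uncertainty_instance(sample, call_llm):
    """Extend a SQuAD-it triple (c, q, a_corr) to (c, q, a_corr, a_plaus)."""
    prompt = PLAUSIBLE_WRONG_PROMPT.format(
        context=sample["context"],
        question=sample["question"],
        correct=sample["answer"],
    )
    a_plaus = call_llm(prompt).strip()  # plausible but incorrect answer
    return {**sample, "distractor": a_plaus}</preformat>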
      <sec id="sec-1-1">
        <title>WebGPT. The WebGPT dataset [11] was introduced</title>
        <p>SQuAD-it-2 Medium Uncertainty. In this version, to support research in aligning long-form question
anwe reuse the same set of samples from the Low Uncer- swering systems with human preferences. It consists
tainty setting, including the same plausible incorrect an- of 19,578 comparisons between pairs of answers to the
swers generated by Gemini-2. However, to increase the same open-ended question, each annotated with human
level of uncertainty, we deliberately remove the context preference scores. These answers were originally
gener from each sample. This modification results in in- ated by a GPT-3 model fine-tuned via imitation learning
stances of the form (, corr, plaus), where the model is and further optimized using reinforcement learning from
asked to choose between two answers without access to human feedback (RLHF). Each comparison includes
metathe supporting information. data such as the browsing quotes used to compose the
an</p>
        <p>
          In the absence of context, the task becomes signifi- swers and the associated preference scores, which range
cantly more challenging. While the correct answer re- from − 1 to 1 and indicate which answer is preferred by
mains correct in an absolute sense, the model cannot rely annotators.
on evidence from the passage to make its choice. Some- For our work, we extracted a subset of this dataset
times, the question can still be answered using world focusing on clear preference signals. Specifically, we
knowledge or intuition; other times, it becomes virtually selected only those examples in which the two answers
impossible to determine which answer is correct based received diferent human scores ( (1) ̸= (2)), ensuring
solely on the question. As a result, this version intro- a clear distinction between a preferred answer and a less
duces a medium level of uncertainty, greater than in the preferred (distractor) one. This yielded to a total of 14,346
contextualized setting, but not entirely arbitrary, since samples for our experiments. From each of these selected
one answer is still grounded in the original question. The examples, we constructed input triples (, pref, dist),
dataset comprises 7,609 samples from the test split. where  is the original question, pref is the answer
with the higher human score, and dist is the lower-rated
SQuAD-it-2 High Uncertainty. This version repre- alternative. To standardize the task, we reformulated the
sents the maximum level of uncertainty, simulating a original human instruction, used during annotation to
scenario in which the model must choose between two guide raters in evaluating answer quality, as a prompt
equally ungrounded options. Here, we prompt Gemini-2 question asking the model to choose the better answer.
to generate two completely out-of-context (ooc) answers
for each question . The prompt (included in Appendix Winning Arguments. This dataset [
          <xref ref-type="bibr" rid="ref2">12</xref>
          ] is derived
A) provides the question, the context and the correct from the r/ChangeMyView subreddit, where users post
answer, instructing Gemini-2 to generate two answers their opinions and invite others to persuade them to
that are non-plausible, that is, they should not reasonably change their views. In this setting, the original poster
answer the question and should bear no clear relation to (OP) can award a “delta” (∆ ) to a reply that successfully
the topic. changed their mind. The dataset contains conversation
        </p>
        <p>The resulting instances are structured as threads enriched with metadata indicating which replies
(, (o1oc), (o2oc)), where both answers are distrac- received a delta, making it a valuable resource for
studytors. Since neither candidate is appropriate or grounded ing persuasion and argument quality.
in the question, there is no clear basis for choosing one To construct comparison pairs, the original dataset
over the other. In this setting, the model’s decision is creators used a controlled pairing strategy: each
deltaexpected to approximate random guessing, and the task awarded reply (i.e., persuasive) was matched with the
itself loses semantic validity. Nonetheless, we include most similar reply in the same thread that did not receive
this version to simulate conditions of extreme ambiguity a delta (i.e., less persuasive), based on Jaccard similarity.
and explore how models behave when confronted with This yields pairs of messages that are highly comparable
entirely unsupported, content-free binary choices. This in content but difer in perceived persuasiveness,
allowallows us to probe the outer limits of positional bias, ing fine-grained analysis of what makes one argument
more compelling than another. As with WebGPT, this field of foundation model research, including both
opendataset centers on subjective human preferences, making weight and proprietary providers.
the task inherently uncertain and nuanced.</p>
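        <p>For illustration, a sketch of this pair-selection step under assumed field names (the actual WebGPT record schema may differ):</p>
        <preformat># Sketch of the WebGPT subset extraction: keep only comparisons whose two
# answers received different human scores; the higher-scored answer
# becomes the preferred option. Field names are hypothetical.

def build_webgpt_instances(records):
    instances = []
    for rec in records:
        s1, s2 = rec["score_1"], rec["score_2"]
        if s1 == s2:
            continue  # tie: no clear preference signal, discard
        if s1 > s2:
            pref, dist = rec["answer_1"], rec["answer_2"]
        else:
            pref, dist = rec["answer_2"], rec["answer_1"]
        instances.append({"q": rec["question"], "pref": pref, "dist": dist})
    return instances</preformat>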
        <p>Winning Arguments. This dataset [12] is derived from the r/ChangeMyView subreddit, where users post their opinions and invite others to persuade them to change their views. In this setting, the original poster (OP) can award a “delta” (∆) to a reply that successfully changed their mind. The dataset contains conversation threads enriched with metadata indicating which replies received a delta, making it a valuable resource for studying persuasion and argument quality.</p>
        <p>To construct comparison pairs, the original dataset creators used a controlled pairing strategy: each delta-awarded reply (i.e., persuasive) was matched with the most similar reply in the same thread that did not receive a delta (i.e., less persuasive), based on Jaccard similarity. This yields pairs of messages that are highly comparable in content but differ in perceived persuasiveness, allowing fine-grained analysis of what makes one argument more compelling than another. As with WebGPT, this dataset centers on subjective human preferences, making the task inherently uncertain and nuanced.</p>
        <p>For our experiments, we used only the test set provided with the dataset, consisting of 807 pairs. Each instance was structured as a triple $(p, r_{\mathrm{pref}}, r_{\mathrm{dist}})$, where $p$ is the original post, $r_{\mathrm{pref}}$ is the reply that received the delta, and $r_{\mathrm{dist}}$ is the similar, non-awarded reply. To reproduce a maximum uncertainty setting and prevent models from relying on contextual cues from the original post, we only include the two replies $(r_{\mathrm{pref}}, r_{\mathrm{dist}})$ in each instance, excluding $p$ entirely. This dataset adds a valuable dimension to our evaluation by focusing on real-world argumentative discourse and subjective judgments of persuasive effectiveness.</p>
      </sec>
        <sec id="sec-1-1-1">
          <title>3.2. Experimental Protocol</title>
          <p>We adopt a two-pass prompting strategy to evaluate positional bias across the five datasets introduced in Section 3.1. The Low and Medium Uncertainty versions of SQuAD-it-2 are derived from the same underlying dataset; the difference lies in whether the context is provided: it is included in the Low Uncertainty setting and omitted in the Medium one. All other datasets are evaluated without any context.</p>
          <p>Each instance consists of a prompt $q$, a preferred answer $a_{\mathrm{pref}}$, and a distractor $a_{\mathrm{dist}}$. For every evaluation condition, we proceed as follows (see the sketch at the end of this section):
1. Pass 1 (Original Order). We construct Prompt1, placing $a_{\mathrm{pref}}$ as Option 1 and $a_{\mathrm{dist}}$ as Option 2, alongside the question and, where applicable, the context. The prompt is submitted to the target model, and its response is recorded as $s^{(1)}$.
2. Pass 2 (Swapped Order). We construct Prompt2 by inverting the order of the two answers. The instructional text and context (if any) are kept identical. The model's response is recorded as $s^{(2)}$.</p>
          <p>Prompt phrasing is tailored to the semantics of each dataset and is reported in Appendix B. In all cases, the prompts in Pass 1 and Pass 2 are structurally identical except for the position of the two candidate answers. The model's raw selections $s^{(1)}$ and $s^{(2)}$ are logged without transformation and later used in the analysis of positional bias.</p>
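          <p>A minimal sketch of this two-pass loop, assuming a hypothetical ask_model function that submits a prompt and returns the raw model output:</p>
          <preformat># Two-pass protocol: identical prompts except for the order of the two
# candidate answers. `ask_model` and `template` are hypothetical stand-ins
# for the actual model client and the dataset-specific prompt (Appendix B).

def two_pass_eval(instance, ask_model, template):
    q, pref, dist = instance["q"], instance["pref"], instance["dist"]
    # Pass 1 (original order): preferred answer first.
    s1 = ask_model(template.format(question=q, option_a=pref, option_b=dist))
    # Pass 2 (swapped order): same instructions, options inverted.
    s2 = ask_model(template.format(question=q, option_a=dist, option_b=pref))
    return s1, s2  # raw selections, logged without transformation</preformat>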
        </sec>
        <sec id="sec-1-1-2">
          <title>3.3. Models Evaluated</title>
          <p>We benchmark five state-of-the-art large language models (LLMs), selected to cover a spectrum of architectures, parameter scales, and deployment configurations. All models are developed by leading organisations in the field of foundation model research, including both open-weight and proprietary providers.
• LLaMA-3.1-8B: An 8-billion-parameter open-weight model [19] following the LLaMA architecture, fine-tuned for Italian, and released in late 2024. Its compact size makes it well-suited for downstream use in resource-constrained scenarios.
• Gemma-3-12B (quantized): A 12-billion-parameter open-weight multilingual model [20], quantized to 4-bit precision (Q4_K_M) and retrieved via the Ollama model hub. This quantized variant is employed for efficiency under computational constraints.
• Gemini 1.5: A proprietary multilingual model [21] from Google DeepMind, specifically tailored for QA tasks.
• Gemini 2: The successor to Gemini 1.5, featuring architectural improvements and retraining on updated corpora.
• Phi-4-14B (quantized): A 14-billion-parameter open-weight multilingual model [22], quantized to 4-bit precision (Q4_K_M) and retrieved via the Ollama model hub. Like Gemma-3, this model is used in its quantized form to enable evaluation under limited computational resources.</p>
          <p>Quantized models are adopted primarily due to
hardware and latency constraints. To ensure validity, we
conducted a preliminary test comparing the quantized and
full-precision variants of each model on a 100-instance
subset of the Winning Arguments dataset. The results
showed almost identical accuracy across both versions,
suggesting that quantization does not substantially
affect model preference or correctness in our evaluation
setting.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>3.4. Bias Metrics</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>To quantify positional bias in model preferences, we</title>
        <p>adopt two significant metrics: Preference Fairness (PF),
introduced by Shi et al. [13], and Position Consistency (PC),
a widely adopted measure in the positional bias literature
and also discussed in their work.</p>
        <p>We do not consider Repetitional Consistency (RC) (also
introduced by Shi et al.), which measures model stability
across repeated identical queries, as we believe it is not
suficiently related to positional bias and not computable
under our two-pass evaluation protocol.
3.4.1. Preference Fairness (PF)
PF quantifies directional positional bias: the extent to
which a model favors one answer position (first or
second) independently of content. However, in our setting,
we focus on the magnitude of this bias, regardless of
whether it leans toward the first or second option. To
this end, we report the absolute value of the PF score, so
that it ranges from 0 (no bias) to 1 (maximal bias).</p>
      </sec>
      <sec id="sec-1-3">
        <title>Formally, we compute a raw PF score (PFraw) following Shi et al. [13]:</title>
        <p>where m−in and m+ax are the minimum and maximum
achievable values of PFraw, respectively, under the given
conditions. This normalization centers the scale around
zero and bounds it between − 1 and 1.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Finally, we report the absolute value of the resulting PF score: so that:</title>
        <p>|PF| ∈ [0, 1],
is content-based and consistent).
• |PF| = 0 indicates no positional bias (preference
• |PF| = 1 indicates maximum positional bias
(model always favors one position regardless of
content).
• Intermediate values reflect increasing degrees of
positional influence on preference.
3.4.2. Position Consistency (PC)</p>
      </sec>
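            <p>The computation of $\mathrm{PF}_{\mathrm{raw}}$ itself follows Shi et al. [13] and is not reproduced here; the sketch below covers only the normalization and absolute-value steps described above, under the reading that positive and negative raw scores are rescaled by the achievable maximum and minimum, respectively.</p>
            <preformat># Normalization and absolute-value steps for PF. pf_raw, pf_min and
# pf_max are assumed to be computed following Shi et al. [13]; only the
# rescaling into [-1, 1] and the reported magnitude |PF| are shown.

def normalize_pf(pf_raw: float, pf_min: float, pf_max: float) -> float:
    """Map PF_raw into [-1, 1], keeping PF_raw = 0 at 0 (no bias)."""
    if pf_raw >= 0:
        return pf_raw / pf_max if pf_max else 0.0
    return pf_raw / abs(pf_min) if pf_min else 0.0

def pf_magnitude(pf_raw: float, pf_min: float, pf_max: float) -> float:
    """Absolute PF score in [0, 1], as reported in this paper."""
    return abs(normalize_pf(pf_raw, pf_min, pf_max))</preformat>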
      <sec id="sec-1-5">
        <title>PC assesses stability rather than directionality: it mea</title>
        <p>sures how often the model selects the same answer before
and after the answer order is swapped. Formally:
PC =

=1

1 ∑︁ I(︀ (1) = (2))︀ ,
model at pass , and I(· ) is the indicator function.
where ()</p>
        <p>∈ {A, B} is the option chosen by the
• pcn is the normalized count of times the model
across all model–dataset pairs.</p>
        <p>order.
• A value of PC = 1 indicates full positional
robustness: the model’s choice is unafected by option
• Lower values imply that the model’s preference
changes depending on which position the
answers are presented in.</p>
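            <p>A direct implementation of the PC formula above (the re-mapping of the pass-2 label back to the pass-1 labeling is assumed to have been done upstream):</p>
            <preformat># PC = (1/N) * sum_i I(s_i^(1) == s_i^(2)). `selections` is a list of
# (s1, s2) pairs; s2 is assumed to be re-mapped so that both labels
# refer to the same candidate answer.

def position_consistency(selections):
    n = len(selections)
    if n == 0:
        return 0.0
    agree = sum(1 for s1, s2 in selections if s1 == s2)
    return agree / n</preformat>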
      </sec>
      <sec id="sec-1-6">
        <title>PF and PC capture orthogonal phenomena: PF indi</title>
        <p>cates directional preference bias, while PC reflects
robustness to positional perturbation. We report both metrics</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Results and Discussion</title>
      <p>Table 1 reports the performance of various models on binary QA tasks across datasets with varying levels of uncertainty. Each model is evaluated under two conditions: when the correct answer is presented first and when it is presented second. Additionally, we report the number of invalid responses, i.e., outputs not conforming to the expected binary format. Figure 2 provides a visualization of the magnitude of positional bias, with bars showing the values of PF (reported in absolute value) and PC for every model and dataset evaluated. In this plot, higher PF values indicate stronger positional bias, while lower PC values correspond to reduced position consistency and thus higher bias. While the figure offers an immediate overview of how bias varies across datasets and models, the accuracy table provides more detailed insights into model behavior, revealing specific patterns such as systematic preference for a given position or consistent shifts in performance depending on answer order.</p>
      <p>Figure 2: (a) Preference Fairness; (b) Position Consistency, per model and dataset.</p>
      <p>SQuAD-it-2 Low Uncertainty. Under low uncertainty, performance is high and relatively stable. All models (except Llama) maintain accuracy above 90% across both conditions. This indicates that when questions are clear and straightforward, the models perform robustly and are less sensitive to presentation order.</p>
      <p>SQuAD-it-2 Medium and High Uncertainty. As uncertainty increases, performance drops and order effects become more pronounced. In the medium uncertainty setting, accuracy generally decreases across models, and some models (e.g., Gemini-2 and Phi4-14B-Q) actually perform slightly better when the wrong answer is presented first. This may reflect a shift in reliance from positional bias to internal reasoning mechanisms.</p>
      <p>In the high uncertainty setting, models diverge sharply. For example, Llama-3.1-8B shows a drastic drop in accuracy when the wrong answer is presented first (from 0.648 to 0.108), indicating a strong sensitivity to order under ambiguous conditions. In contrast, Gemma-3-12B-Q improves when the wrong answer is first (from 0.411 to 0.590), suggesting a different processing dynamic. Invalid responses spike in this setting, especially for the Llama and Gemini models, indicating greater difficulty in producing well-formed answers when the uncertainty is higher.</p>
      <p>WebGPT and Winning Arguments. Real-world datasets present an additional layer of complexity. In the WebGPT task, most models follow the trend observed in synthetic settings: higher accuracy when the correct answer comes first. However, Gemini-2 again deviates from this pattern, performing slightly better in the wrong-first condition.</p>
      <p>In the Winning Arguments dataset, which features highly opinionated and subjective content, the reversal is particularly pronounced: all models consistently perform better when the correct answer is presented second. For instance, Gemma-3-12B-Q improves dramatically from 0.302 to 0.823 accuracy in the wrong-first setting. This striking and systematic pattern suggests that models may be influenced not just by answer content but also by presentation dynamics, such as contrastive framing or cumulative reasoning, where the second answer is implicitly treated as a refinement or counterpoint to the first. It is also possible that models trained on internet discussions and dialogues have internalized discourse norms in which stronger or more convincing arguments often follow weaker ones in order to rebut or build upon them. This behavior warrants further investigation, as it may reveal underlying heuristics the models rely on in persuasive or opinionated domains.</p>
      <p>General Trends and Considerations. Across datasets, several consistent patterns emerge, highlighting how model behavior in binary QA tasks is influenced by a complex interplay of input uncertainty, answer ordering, and model architecture. Most models perform better when the correct answer is presented first, particularly under low uncertainty conditions, suggesting a tendency to favor the first option when questions are clear and unambiguous. However, in the Winning Arguments dataset, which involves persuasive argumentation, all models systematically perform better when the correct answer is presented second. The magnitude and consistency of this reversal suggest a strong bias toward the second option in subjective or argumentative contexts, possibly influenced by discourse structure or rhetorical patterns in the training data.</p>
      <p>As uncertainty increases, the impact of answer ordering becomes more marked across models and datasets. While many models demonstrate robustness under low uncertainty, with small performance differences between correct-first and wrong-first conditions, their behavior becomes significantly more unpredictable and order-sensitive with higher uncertainty. This growing sensitivity is particularly evident in Figure 2: as the input becomes more ambiguous or subjective, such as in the SQuAD-it-2 High Uncertainty and Winning Arguments settings, models increasingly deviate from uniform behavior and show strong biases. This trend suggests that models may resort to positional heuristics or discourse-level patterns under stress, rather than relying on semantic fidelity alone. When varying the uncertainty level in SQuAD-it-2 (from Low to High Uncertainty), a clear pattern emerges: the rate of invalid outputs consistently increases, highlighting the difficulty models face in maintaining output consistency and adhering to format constraints as the task becomes less structured.</p>
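      <p>For reference, a sketch of the bookkeeping behind the accuracy and invalid-response columns of Table 1 (record field names are hypothetical):</p>
      <preformat># Per-condition accuracy and invalid-response count, as reported in
# Table 1. Each record holds the gold label ('A' or 'B') for the
# condition and the model's raw choice; anything else counts as invalid.

def condition_stats(results):
    valid = [r for r in results if r["choice"] in ("A", "B")]
    invalid = len(results) - len(valid)
    correct = sum(1 for r in valid if r["choice"] == r["gold"])
    accuracy = correct / len(valid) if valid else 0.0
    return {"accuracy": accuracy, "invalid": invalid}</preformat>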
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion</title>
      <p>In this work, we conducted a systematic investigation of positional bias in large language models using binary-choice prompting. We evaluated five different LLMs across both controlled tasks and real-world datasets, and introduced a novel benchmark, SQuAD-it-2, to study this phenomenon in Italian, an underrepresented language in current LLM evaluation efforts. SQuAD-it-2 includes binary QA tasks at three uncertainty levels, enabling fine-grained analysis of how answer ordering interacts with ambiguity.</p>
      <p>Our findings reveal a clear trend: as input uncertainty increases, so does positional bias. Under low uncertainty, models exhibit high accuracy and almost identical performance whether the correct answer is presented first or second, indicating minimal or no bias in these conditions. However, as uncertainty rises, due to the removal of contextual cues or the subjective nature of the task, models begin to show strong and often inconsistent positional preferences.</p>
      <p>We used two dedicated metrics to quantify these effects: Preference Fairness (PF), which captures how much a model favors one position over another, and Position Consistency (PC), which reflects how stable model decisions are across different answer orderings. Both metrics show clear deterioration as uncertainty increases, confirming that models rely more heavily on position-based heuristics when semantic cues are weak.</p>
      <p>A particularly striking result comes from the Winning Arguments dataset, where all models systematically prefer the second option, even when it is incorrect. This behavior suggests that models may be influenced not only by answer content but also by presentation dynamics, such as contrastive framing or cumulative reasoning, possibly reflecting discourse norms internalized during training, where stronger arguments often follow weaker ones to refine or counter them.</p>
      <p>These results expose a fundamental limitation in current LLMs and highlight the need for robust evaluation and debiasing strategies, especially in high-stakes or subjective scenarios. Our release of SQuAD-it-2 provides a valuable tool for continued research, offering a scalable and controlled benchmark for assessing positional artifacts, particularly in multilingual contexts.</p>
      <p>Future work should explore the mechanisms behind position-based preferences more deeply, with special attention to how models process discourse structure, contrastive reasoning, and pragmatic cues. Better understanding these behaviors will be crucial for developing more interpretable, trustworthy, and bias-resilient models. Additionally, it would be valuable to introduce a third option (e.g., “neither response is valid”) in future evaluations, as we observed that models often implicitly reject both candidates when neither is convincing. Investigating how model behavior changes with the inclusion of such an option could offer further insight into their decision-making strategies under uncertainty.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgements</title>
      <p>This work has been partially funded by the European Union through the National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.3 – Call for Tender No. 341 of March 15, 2022, of the Italian Ministry of University and Research – NextGenerationEU; Code PE00000014, Concession Decree No. 1556 of October 11, 2022, CUP D43C22003050001, Project “SEcurity and RIghts in the CyberSpace (SERICS) – Spoke 2 Misinformation and Fakes – DEcision supporT systEm foR cybeR intelligENCE (Deterrence)”.</p>
    </sec>
    <sec id="sec-3">
      <title>A. Prompts for SQuAD-it-2 Dataset</title>
    </sec>
    <sec id="sec-4">
      <title>Generation</title>
      <sec id="sec-4-1">
        <title>A.1. Prompt for Low and Medium</title>
      </sec>
      <sec id="sec-4-2">
        <title>Uncertainty Settings</title>
        <p>For the SQuAD-it-2 Low and Medium Uncertainty
variants, we used a single prompt to generate a plausible but
incorrect answer. The input to the model includes the
original context passage, the question, and the correct
answer. The model is explicitly instructed to generate an
answer that could reasonably be interpreted as correct
(i.e., plausible), while being in fact incorrect. The exact
prompt is shown below:
Contesto: &lt;CONTEXT&gt;
Domanda: &lt;QUESTION&gt;
Risposta corretta: &lt;CORRECT_ANSWER&gt;
Step 2: Second Out-of-Context Answer. The model
is then prompted again with the same context, question,
and correct answer, along with the previously generated
out-of-context wrong answer. This time, it is asked to
produce a second, distinct out-of-context answer. The
corresponding prompt is:
Contesto: &lt;CONTEXT&gt;
Domanda: &lt;QUESTION&gt;
Risposta corretta: &lt;CORRECT_ANSWER&gt;
Risposta errata: &lt;WRONG_ANSWER_1&gt;
Fornisci una risposta completamente fuori
contesto e sbagliata alla domanda sopra.
Assicurati che non sia basata sul
contesto fornito e che sia diversa dalla
risposta errata gia’ presente.</p>
        <p>Restituisci solo la risposta, senza
spiegazioni o altro.</p>
        <p>B, depending on which answer
is better. Do not explain your
choice. Output only A or B.</p>
        <sec id="sec-4-2-1">
          <title>Final Construction. Once both out-of-context an</title>
          <p>swers are generated, we discard the original context and
the correct answer, retaining only the question and the
two OOC distractors. The final dataset entries are
structured as:</p>
        </sec>
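        <p>A sketch of this two-step generation, chaining the two prompts above (call_llm, the templates, and the field names are hypothetical placeholders, as in Section 3.1):</p>
        <preformat># Two-step OOC generation for the High Uncertainty split: the first call
# produces one out-of-context answer, the second call produces a distinct
# one; the context and gold answer are then discarded.

def build_high_uncertainty_instance(sample, call_llm, step1_tmpl, step2_tmpl):
    ooc1 = call_llm(step1_tmpl.format(
        context=sample["context"],
        question=sample["question"],
        correct=sample["answer"],
    )).strip()
    ooc2 = call_llm(step2_tmpl.format(
        context=sample["context"],
        question=sample["question"],
        correct=sample["answer"],
        wrong_answer=ooc1,
    )).strip()
    return {"q": sample["question"], "ooc1": ooc1, "ooc2": ooc2}</preformat>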
      </sec>
    </sec>
    <sec id="sec-5">
      <title>B. Prompt Templates</title>
      <sec id="sec-5-1">
        <title>We report here the prompt templates used in the experiments described in Section 3.2. Each dataset required a prompt adapted to its semantic framing and language.</title>
      </sec>
      <sec id="sec-5-2">
        <title>SQuAD-it-2. These prompts are in Italian. When context is present (Low Uncertainty), it is introduced with “Contesto:”. The rest of the prompt follows this structure:</title>
        <p>Domanda: [Q]
A) [Risposta 1]
B) [Risposta 2]</p>
      </sec>
      <sec id="sec-5-3">
        <title>The final instruction depends on the uncertainty level:</title>
        <p>• Low Uncertainty (with context):</p>
        <p>Scegli la risposta</p>
        <p>Restituisci solo A o B.
• Medium/High Uncertainty (no context):</p>
        <p>Scegli la risposta che reputi più
corretta. Se credi che nessuna sia
corretta, scegli comunque quella che
reputi più plausibile. Restituisci
solo A o B.
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>guage Processing</source>
          ,
          <source>EMNLP</source>
          <year>2024</year>
          ,
          <article-title>Miami</article-title>
          , FL, USA, Fornisci una risposta plausibile ma sbagliata
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>November</surname>
          </string-name>
          12-
          <issue>16</issue>
          ,
          <year>2024</year>
          ,
          <article-title>Association for Compu- alla domanda sopra, basandoti sul</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>tational Linguistics</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>11084</fpage>
          -
          <lpage>11108</lpage>
          . URL: contesto. Restituisci solo la risposta,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>emnlp-main.621. senza spiegazioni o altro</article-title>
          . [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Mo- This prompt ensures that the incorrect answer remains</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>ICLR</source>
          <year>2025</year>
          , Singapore,
          <source>April 24-28</source>
          ,
          <year>2025</year>
          , OpenRe
          <article-title>- A.2. Prompting Strategy for High</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          view.net,
          <year>2025</year>
          . URL: https://openreview.net/forum? Uncertainty Setting
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>id=3GTtZFiajM. For the High Uncertainty setting, we followed a two-step</article-title>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Squad: prompting process to construct a pair of out-of-context</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <volume>100</volume>
          ,000+
          <article-title>questions for machine comprehension of (OOC) incorrect answers. The correct answer is used</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>text</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          ).
          <article-title>during generation but removed from the final</article-title>
          dataset to [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ka- increase ambiguity.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>ten</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <article-title>The llama 3 herd of models, Step 1: First Out-of-Context Answer</article-title>
          . The model
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
          <article-title>receives the context, the question, and the correct an</article-title>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <article-title>Vieil- swer. It is asked to generate an incorrect answer that is</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          , et al.,
          <article-title>Gemma 3 technical report, arXiv it is not plausible or grounded</article-title>
          .
          <source>The prompt used is:</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ). [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <surname>V. I. Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Burnell</surname>
          </string-name>
          , L. Bai, Contesto: &lt;CONTEXT&gt;
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          et al.,
          <source>Gemini</source>
          <volume>1</volume>
          .
          <article-title>5: Unlocking multimodal under- Risposta corretta: &lt;CORRECT_ANSWER&gt;</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>preprint arXiv:2403.05530</source>
          (
          <year>2024</year>
          ).
          <article-title>Fornisci una risposta completamente fuori [22]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Abouelenin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ashfaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Atkinson</surname>
          </string-name>
          , cAosnstiecsutroatei cshbeaglnioantasiaallbaasdaotmaansdual sopra.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al., Phi-4
          <article-title>-mini risposta, senza spiegazioni o altro</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>preprint arXiv:2503.01743</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>