<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Naveen Lamba</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanju Tiwari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manas Gaur</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Artificial Intelligence in Medicine, Imaging and Forensics, Sharda University</institution>
          ,
          <addr-line>Greater Noida</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Maryland</institution>
          ,
          <addr-line>Baltimore County, Baltimore, MD</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Hallucination in Large Language Models (LLMs) is a well-studied problem. However, the properties that make LLMs intrinsically vulnerable to hallucination have not been identified and studied. This research identifies and characterizes these key properties, allowing us to pinpoint vulnerabilities within the model's internal mechanisms. To isolate these properties, we utilized two established datasets, HaluEval and TruthfulQA, and converted their existing question-answering format into various other formats, narrowing the cause of hallucination down to these properties. Our findings reveal that hallucination percentages across symbolic properties are notably high for Gemma-2-2B, averaging 79.0% across tasks and datasets. With increased model scale, hallucination drops to 73.6% for Gemma-2-9B and 63.9% for Gemma-2-27B, a 15 percentage point reduction overall. Although the hallucination rate decreases as model size increases, a substantial amount of hallucination caused by symbolic properties still persists. This is especially evident for modifiers (ranging from 84.76% to 94.98%) and named entities (ranging from 83.87% to 93.96%) across all Gemma models and both datasets. These findings indicate that symbolic elements continue to confuse the models, pointing to a fundamental weakness in how these LLMs process such inputs, regardless of their scale.</p>
      </abstract>
      <kwd-group>
        <kwd>Hallucination</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Attention</kwd>
        <kwd>Symbolic Triggers</kwd>
        <kwd>Symbolic Properties</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large language models (LLMs) have made significant advancements in various natural language
understanding and generation tasks, including open-domain question answering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], text summarization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
reasoning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and dialogue [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Despite their success, the reliability of LLMs remains a major issue
due to hallucination, which involves confidently generating content that is factually inaccurate or
nonsensical [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>
        While significant research has been conducted on identifying and reducing hallucinations in LLMs
[
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], much of this work has been primarily driven by the development of novel hallucination
benchmarks and their corresponding detection and mitigation approaches [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, the investigation
into the fundamental, intrinsic causes of hallucination phenomena in LLMs remains significantly
underexplored. Understanding these root causes is particularly crucial because they often stem from
limitations in symbolic knowledge representation and reasoning—areas where the NLP community
has extensive expertise [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. These limitations manifest through specific and elemental symbolic
triggers that consistently provoke hallucinations: named entities, negation handling, exception cases, and
others can cause LLMs to generate incorrect information, irrespective of the dataset format or domain.
Figure 1 illustrates two such examples where all Gemma models hallucinate in the presence of symbolic
triggers like modifiers, named entities, numbers, negation, and exceptions. By focusing on these intrinsic
mechanisms, researchers can develop more robust, data-agnostic methodologies that not only localize
the sources of hallucination within LLMs but also provide systematic, long-term solutions rather than
superficial fixes. This deeper understanding would enable the creation of more reliable language models
that can better distinguish between accurate and inaccurate generated content, ultimately leading to
more trustworthy AI systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>This paper addresses this gap by identifying and describing symbolic and interpretable knowledge
properties that reliably trigger hallucination across natural language understanding task types and
model scales. This paper makes the following key contributions:
• Identification of symbolic hallucination triggers: Systematically identified and characterized five
symbolic knowledge properties that reliably trigger hallucination: modifiers (adjectives, adverbs,
verbs), named entities, numbers, negation, and exceptions, and provided a property-focused
evaluation for understanding intrinsic vulnerabilities in LLMs.
• Prompt engineering-driven data transformation for generalization of symbolic triggers: Developed
a systematic evaluation approach that tests hallucination consistency across three critical
dimensions: model scale (Gemma-2-2B, 9B, 27B), task formats (question-answering (QA), multiple
choice questions (MCQ), Odd-One-Out (OOO)), and symbolic property types by converting
existing datasets to isolate specific triggers, which demonstrates that symbolic vulnerabilities are
fundamental architectural issues rather than artifacts of specific experimental conditions.
• Internal activation analysis using symbolic triggers: Conducted attention pattern analysis and
activation-level traces to examine how symbolic properties affect internal model
representations and processing, providing evidence that hallucinations stem from deeper representational
instabilities rather than surface-level generation errors.</p>
      <p>These contributions led us to the following findings:
• Symbolic triggers elicit hallucination across model sizes: Hallucination rates remain substantially
high across all model sizes: 79.0% (Gemma-2-2B), 73.6% (Gemma-2-9B), and 63.9% (Gemma-2-27B),
with only a modest 15 percentage point reduction despite significant model scaling, indicating
that these are structural rather than capacity-related issues that challenge the assumption that
larger models automatically become more reliable.
• Primary symbolic triggers: Modifiers show hallucination rates ranging from 84.76% to 94.98%
across all models while named entities exhibit similarly high rates (83.87% to 93.96%), consistently
emerging as the most problematic symbolic properties and revealing specific linguistic elements
that pose the most significant risk for factual accuracy in LLM outputs.
• Task Format Dependency: QA format produces the highest hallucination rates compared to
MCQ and Odd-One-Out formats, with lower symbolic attention values correlating with higher
hallucination frequency, particularly evident in MCQ tasks, demonstrating that task structure
significantly influences model reliability and suggesting that constrained generation formats may
offer some protection against symbolic confusion.
• Non-monotonic input length effects of symbolic triggers: Symbolic triggers behave differently
across input lengths: modifiers and named entities cause the most hallucinations in
short-to-medium contexts (10-30 tokens) but become more reliable with longer context, while numbers
follow an unpredictable up-and-down pattern, and negation and exceptions consistently cause fewer
problems overall, demonstrating that context length affects each symbolic property uniquely.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent advances in large language models (LLMs) have intensified focus on understanding and mitigating
hallucination—confident outputs that are factually incorrect or logically incoherent. While early
research primarily concentrated on output-level detection and dataset-based evaluation of hallucination
phenomena [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], the deeper representational vulnerabilities of LLMs remain underexplored. Our
study contributes by examining how hallucination manifests across model sizes, under different task
formats, and in response to symbolic properties embedded in inputs.
      </p>
      <p>
        While a growing body of work evaluates LLMs for factual reliability, few studies assess how
hallucination trends evolve with model scale. Notably, works like Yao et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] frame hallucinations as
emergent adversarial phenomena—linked to overconfident generalizations—but do not analyze whether
such tendencies vary with parameter count. Similarly, most hallucination benchmarks focus on a
single model instance rather than conducting comparative analysis across multiple versions of the same
model family. Our work addresses this gap by systematically evaluating hallucination behavior across
Gemma-2-2B, 9B, and 27B, revealing that while hallucination rates reduce with scale, symbolic triggers
remain persistent.
      </p>
      <p>
        Benchmark datasets such as TruthfulQA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and HaluEval [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] have been instrumental in evaluating
LLM hallucinations. These benchmarks typically use open-ended QA to elicit model generations
under minimal constraints, which often reveal factual inconsistencies. However, prior work does not
systematically vary task formats to study how structural differences—like constrained generation in
multiple-choice or odd-one-out tasks—modulate hallucination tendencies. Our study introduces task
format as a key dimension, converting QA data into MCQ and OOO formats to probe whether and how
task structure interacts with hallucination triggers.
      </p>
      <p>
        Numerous research efforts have investigated linguistic and stylistic elements that affect hallucination.
For instance, Rawte et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] demonstrate that the likelihood of hallucinated outputs is influenced by
readability, formality, and concreteness. Some have concentrated on particular symbolic structures.
Negation has specifically been recognized as a continual vulnerability for LLMs, as shown by Varshney
et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and Asher and Bhar [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], who illustrate that models often generate false information even
when negation indicators are straightforward in syntax and clear in logic. These results indicate
that symbolic reasoning continues to be difficult, even if many assessments are limited to individual
signals. Our work broadens this scope by evaluating five symbolic properties—modifiers, named entities,
numbers, negation, and exceptions—as systematic triggers of hallucination. We also extend analysis
beyond surface-level generations, examining how symbolic inputs induce representational instability
across transformer layers.
      </p>
      <p>
        In contrast to prior research, which often isolates one axis of hallucination (model, task, or linguistic
feature), our work offers a three-dimensional assessment across model scale, task format, and symbolic
input structure. We analyze symbolic hallucination in three Gemma models (2B, 9B, 27B), three
reformatted task environments (QA, MCQ, OOO), and five symbolic property types, providing both
quantitative trends and internal activation-level insights [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. This integrative approach reveals that
hallucinations are not just artifacts of generation, but reflect deeper weaknesses in how LLMs process
structurally complex or logically nuanced inputs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        Our approach involved taking existing datasets, converting them into different question formats, and
then testing how three different sizes of Gemma models (2B, 9B, and 27B) responded to questions
containing specific symbolic triggers like modifiers, numbers, and named entities. This study evaluates
different versions of Gemma models, open-source checkpoints released by Google DeepMind [
        <xref ref-type="bibr" rid="ref22">22, 23</xref>
        ].
For consistency across all experiments and to minimize sampling parameter variability, we utilize each
model’s default temperature value, as provided by the model, which is typically a low or zero value for
deterministic generation. This enables us to see the inherent behavior of each model in its recommended
decoding setup without injecting sampling-originating randomness.
      </p>
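      <p>As a concrete illustration, the following is a minimal sketch of one such deterministic evaluation call using the Hugging Face transformers library. Greedy decoding (do_sample=False) stands in here for each model's default low- or zero-temperature setup, and the checkpoint identifier follows the Hugging Face Hub naming for the Gemma-2 models; the prompt is one of the study's QA-format examples.</p>
      <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b-it"  # the 9B and 27B checkpoints are evaluated the same way
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = ("Answer the following question in one short, factual sentence.\n"
          "Which metal is liquid at room temperature?")
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: no sampling randomness enters the evaluation.
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
      </preformat>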
      <p>This research explores the inherent symbolic knowledge characteristics that induce hallucinations in
LLMs across varying instances of the Gemma model family (2B, 9B, and 27B). The approach follows
a systematic, property-focused evaluation pipeline consisting of dataset setup, input transformation,
model selection, controlled prompt creation, and hallucination analysis. We base our methodology
on the assumption that certain symbolic input structures — like modifiers, named entities, negations,
numbers, and exceptions — increase the likelihood that LLMs hallucinate. To empirically test this, we
reformatted typical datasets into task-specific ones and inspected the derived outputs.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Preparation and Task Conversion</title>
        <p>We used two established hallucination evaluation datasets, HaluEval and TruthfulQA, which contain
factual question-answer pairs. To determine whether symbolic triggers cause hallucinations regardless
of task structure, or whether certain formats offer protection against symbolic confusion, we systematically
converted these datasets into three distinct formats that impose different levels of generative constraint
and cognitive demand: (i) the QA format preserves open-ended generation, which may expose maximum
hallucination tendencies since models can freely fabricate plausible-sounding but incorrect responses
when encountering symbolic triggers; (ii) the MCQ format provides constrained multiple-choice selection
that tests whether limiting response options can mitigate symbolic trigger effects by preventing
free-form generation; and (iii) the OOO format tests semantic classification abilities under symbolic influence
to determine whether symbolic triggers disrupt fundamental reasoning processes beyond factual recall.
What we prepared: We systematically transformed 100 samples from each dataset (verified to contain
one or more target symbolic properties) into all three task formats, creating a comprehensive evaluation
framework of 600 total test instances. Each transformation maintained the core symbolic elements
while adapting the response structure to isolate whether symbolic confusion persists across different
cognitive demands and constraint levels. How transformation was achieved: We designed standardized
prompts for each format. QA Prompt: "Answer the following question in one short, factual sentence."
MCQ Prompt: "Consider the following multiple-choice question. Pick the correct answer and explain
your reasoning." Odd One Out Prompt: "Identify the item that does not belong in the list. Explain your
reasoning."</p>
        <p>Prompt design was carefully managed so that hallucinations, when they occur, can be attributed to
model reasoning and symbolic processing rather than prompt ambiguity, with all transformed prompts
annotated for symbolic property analysis.</p>
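        <p>To make the conversion concrete, the sketch below wraps a single annotated record into the three standardized prompts above. The record fields (question, answer, distractors, related_items, outlier) are hypothetical names for the information each format needs, not the actual data schema used in the study.</p>
        <preformat>
import random

# Hypothetical record: a QA pair plus the extra material that the MCQ and
# OOO formats require (two distractors; a related set with one outlier).
sample = {
    "question": "Which metal is liquid at room temperature?",
    "answer": "Mercury",
    "distractors": ["Iron", "Aluminium"],
    "related_items": ["Apple", "Banana", "Mango"],
    "outlier": "Carrot",
}

def to_qa(s):
    return f'Answer the following question in one short, factual sentence.\n{s["question"]}'

def to_mcq(s):
    options = [s["answer"]] + s["distractors"]  # one correct + two distractors
    random.shuffle(options)
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return ("Consider the following multiple-choice question. Pick the correct "
            f"answer and explain your reasoning.\n{s['question']}\n{lettered}")

def to_ooo(s):
    items = ", ".join(s["related_items"] + [s["outlier"]])
    return ("Identify the item that does not belong in the list. "
            f"Explain your reasoning.\n{items}")

print(to_qa(sample), to_mcq(sample), to_ooo(sample), sep="\n\n")
        </preformat>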
        <p>
          Two standard hallucination benchmarking datasets, HaluEval [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and TruthfulQA [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], have been
used for this study. Both datasets include factual QA pairs. To investigate hallucination behavior with
varying knowledge forms, we design three task formulations: (i) QA: preserves the question-answer pair
format of the original dataset; (ii) MCQ: converts every QA pair into a multiple-choice format with one
correct option and two distractors; (iii) Odd-One-Out: presents conceptually related options with one
exception, seeking identification of the semantic outlier. All 600 samples derived from the two datasets
are verified to contain one or more of the five symbolic knowledge properties.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Property Identification and Categorization</title>
        <p>To investigate symbolic triggers of hallucination, all input prompts were annotated for the presence of
five key symbolic properties. These properties were chosen based on their structural role in language.
Each property was identified using linguistic markers and then manually verified to ensure semantic
relevance. Here, we define each category, provide examples, and summarize their contribution to
hallucination (a detection sketch follows the list):
1. Modifiers (adjectives, adverbs, and verbs): These elements introduce subjective or descriptive
information, often adding interpretive flexibility. Example: “Which is the most rapidly growing
city in Europe?”. Modifiers such as “rapidly” or “most” invite vague or ambiguous completions,
increasing the risk of confident but unverifiable assertions. LLMs may hallucinate plausible-sounding
answers even when the modifier-driven nuance is not grounded in training data.
2. Named Entities (persons, organizations, locations): Identified using Named Entity
Recognition (NER) techniques, these refer to proper nouns that often require external knowledge
grounding. Example: “Who is the founder of the fictional company TechNova?”. Due to their
reliance on memorized or incomplete knowledge, LLMs often fabricate facts or assign incorrect
associations when dealing with named entities—especially rare or fictional ones.
3. Numbers (quantitative expressions): These include cardinal numbers, ranges, dates, and
measurements. Example: “How many satellites does Mars currently have?”. LLMs are prone to
imprecision or outright numerical hallucination, either due to outdated training data or due to
overgeneralizing learned patterns. Such prompts demand factual accuracy, making errors more
noticeable.
4. Negation (not, never, none, cannot): Detected via syntactic and semantic analysis, negation
alters the logical polarity of a sentence. Example: “Which of these is not a fruit?”. LLMs frequently
mishandle negation by overlooking or misinterpreting the negative cue, resulting in logically
inverted or irrelevant answers.
5. Exceptions (edge cases, conditional rules): These refer to inputs that challenge the model
to recognize rare cases or counterexamples. Example: “Which metal is liquid at room temperature?”.
Exceptions require deeper contextual reasoning. Since LLMs tend to generalize, they often miss
these special cases, favoring the more common rule rather than the exception.</p>
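        <p>As a rough illustration of the marker-based annotation, the sketch below flags the five properties using spaCy part-of-speech tags, dependency labels, and named entity recognition. The marker word lists are illustrative assumptions rather than the exact lists used in the study, and every automatic flag was still verified manually in our pipeline.</p>
        <preformat>
import spacy

nlp = spacy.load("en_core_web_sm")  # requires the small English spaCy model

NEGATION_MARKERS = {"not", "never", "none", "cannot", "no"}
EXCEPTION_MARKERS = {"except", "unless", "exception", "rare"}

def symbolic_properties(prompt):
    """Heuristically flag which of the five symbolic properties a prompt contains."""
    doc = nlp(prompt)
    props = set()
    for tok in doc:
        if tok.pos_ in {"ADJ", "ADV", "VERB"}:
            props.add("modifier")
        if tok.pos_ == "NUM" or tok.like_num:
            props.add("number")
        if tok.lower_ in NEGATION_MARKERS or tok.dep_ == "neg":
            props.add("negation")
        if tok.lower_ in EXCEPTION_MARKERS:
            props.add("exception")
    if doc.ents:  # persons, organizations, locations, dates, ...
        props.add("named_entity")
    return props

print(symbolic_properties("Which of these is not a fruit?"))
        </preformat>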
        <p>By categorizing prompts along these symbolic dimensions, we aim to isolate specific triggers that
systematically increase hallucination likelihood across tasks and model scales. This property-level lens
provides a more interpretable understanding of why and when LLMs go wrong.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Hallucination Evaluation Strategy</title>
        <p>We employ a three-tier hallucination analysis approach, progressing from overall hallucination rates to
detailed, layer-wise causes, ultimately attributing them to five symbolic triggers.</p>
        <p>Symbolic trigger-based computation of hallucination percentage: To quantify hallucination induced
by symbolic properties, we annotated each input for the presence of one or more symbolic triggers
(modifiers, named entities, numbers, negation, exceptions) and computed the proportion of
hallucinated outputs within each trigger category. A prediction was marked as a hallucination if it was
factually incorrect. This computation was carried out per symbolic property, allowing us to isolate
their individual contribution to hallucination rates. The final hallucination percentage per property
was then calculated as the number of hallucinated instances containing that property divided by the
total instances containing it.</p>
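        <p>The per-property computation reduces to a simple ratio, sketched below; records is a hypothetical list of annotated evaluation instances, each carrying its set of symbolic properties and a factual-correctness judgment of the model output.</p>
        <preformat>
def hallucination_rate_by_property(records):
    """Percentage of hallucinated outputs among inputs containing each property."""
    totals, hallucinated = {}, {}
    for r in records:
        for prop in r["properties"]:   # e.g. {"modifier", "number"}
            totals[prop] = totals.get(prop, 0) + 1
            if r["is_hallucination"]:  # output judged factually incorrect
                hallucinated[prop] = hallucinated.get(prop, 0) + 1
    return {p: 100.0 * hallucinated.get(p, 0) / totals[p] for p in totals}

rates = hallucination_rate_by_property([
    {"properties": {"modifier"}, "is_hallucination": True},
    {"properties": {"modifier", "number"}, "is_hallucination": False},
])
print(rates)  # {'modifier': 50.0, 'number': 0.0}
        </preformat>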
        <p>Symbolic trigger-driven attention analysis of Gemma models: We analyze attention scores to symbolic
tokens at specific transformer layers selected based on prior research patterns. Following Wu et al. [24]’s
approach, which emphasizes mid-to-deeper layers where semantic integration peaks, we examine Layers
10 and 20 for Gemma-2-2B, Layers 20 and 31 for Gemma-2-9B, and Layers 23 and 40 for Gemma-2-27B.
This allows consistent comparison of symbolic attention allocation across model sizes.
Input token length and hallucination percentage analysis: We investigate the relationship between
hallucination rates and input question length by organizing data into token length bins and analyzing
how symbolic property effects vary across different context sizes. This reveals whether symbolic triggers
have consistent effects regardless of the surrounding context or if their impact changes with input
complexity.</p>
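        <p>The attention measurement can be sketched as follows, assuming the Hugging Face transformers API. Eager attention is requested because fused attention implementations do not return per-head weights, and symbolic_positions is assumed to hold the token indices of the annotated trigger span.</p>
        <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")

def symbolic_attention(prompt, symbolic_positions, layers=(10, 20)):
    """Mean attention paid to the symbolic token positions at the probed layers."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    scores = {}
    for layer in layers:
        attn = out.attentions[layer]  # shape: [batch, heads, query, key]
        scores[layer] = attn[0, :, :, symbolic_positions].mean().item()
    return scores

# Example: attention to the token "not" (its index assumed located beforehand).
print(symbolic_attention("Which of these is not a fruit?", [5]))
        </preformat>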
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>This section presents our empirical analysis of hallucination behavior in the Gemma model family
under symbolic property influence. The investigation is organized along three axes: (i) consistency
across model sizes, (ii) variation across task types, and (iii) internal activation responses. The evaluation
spans all five symbolic property types, with hallucination annotated as confident yet factually incorrect
responses.</p>
      <sec id="sec-4-1">
        <title>4.1. Consistency Across Model Variants</title>
        <p>In the QA format, modifiers, named entities, and numbers consistently emerge as the most
hallucination-prone symbolic properties across all three Gemma model sizes. As shown in Table 1, hallucination
percentages for modifiers in the HaluEval dataset remain notably high, decreasing only slightly from
84.76% in Gemma-2-2B to 77.24% in Gemma-2-27B. Named entities follow a similar trend, with a
marginal drop from 83.87% to 76.43%, while numbers stay persistently high at around 83.16%–76.32%
across model scales.</p>
        <p>This pattern is also observed in the TruthfulQA dataset, where modifiers reach up to 94.98% in
Gemma-2-9B and numbers peak at 98.00%, reflecting the models’ continued struggle with these symbolic cues.
On the other hand, while negation and exceptions appear less frequently in HaluEval (e.g., 70.00% and
80.00% in Gemma-2-2B), their hallucination rates remain above 90% in TruthfulQA across all model
sizes.</p>
        <p>These results indicate that scaling up model size ofers only modest reductions in hallucination
rates for symbolic properties, and that the same set of symbolic triggers continues to challenge LLMs,
revealing a persistent internal vulnerability.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Generalization Across Task Formats</title>
        <p>To understand how hallucination behavior generalizes across prompt formats, we analyzed symbolic
token attention across QA, MCQ, and Odd-One-Out (OOO) tasks using the Gemma model family (2B,
9B, 27B). Table 2 presents average attention scores to symbolic tokens across task formats and model
sizes, measured at specific mid-to-deeper layers.</p>
        <p>Following prior layer selection patterns used in Wu et al. [24], which emphasized mid and
postmid transformer layers ( Layers 10 and 20 for Gemma-2-2B and Layers 20 and 31 for Gemma-2-9B),
we chose Layers 23 and 40 for Gemma-2-27B. These lie in the middle-to-late segments of the model,
where semantic integration and abstract token interactions typically peak. This alignment allows for a
consistent and meaningful comparison of symbolic attention across model sizes.</p>
        <p>The results indicate that task format substantially afects both hallucination frequency and attention
allocation, despite using prompts with similar symbolic triggers. Across all model sizes, MCQ prompts
result in consistently higher hallucination frequency than QA, particularly at the 2B scale. This correlates
with lower symbolic attention values for MCQ compared to QA—suggesting reduced grounding or
interpretive focus. For instance, in the 27B model, attention to modifiers in QA is 0.0078 (Layer 23),
dropping to 0.0063 in MCQ, and further varying in OOO (0.0085). This indicates task-specific shifts in
symbolic emphasis, even within the same model.</p>
        <p>Conversely, while OOO prompts show relatively lower symbolic attention, they elicit stronger
semantic hallucination efects, particularly in smaller models (as seen in prior hallucination rate and
efect metrics). Notably, in the 2B model, symbolic attention for named entities drops sharply in
MCQ (0.0177 → 0.0051 from Layer 10 to 20), whereas QA retains higher symbolic focus (0.0147 →
0.0082). The same trend, though attenuated, persists in 27B, showing a consistent symbolic property
ranking: modifiers and named entities receive the highest attention, followed by numbers, negation,
and exceptions.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Activation-Level Traces of Symbolic Instability</title>
        <p>To further probe the internal behavior of LLMs in the presence of symbolic linguistic properties, we
analyzed the relationship between hallucination and input question length. Table 3 illustrates the average
hallucination percentages across symbolic properties (modifiers, named entities, numbers, negation, and
exceptions) as a function of token length, for both the HaluEval and TruthfulQA datasets. We observe that
hallucination induced by symbolic properties like modifiers and named entities remains consistently
high across varying input lengths. For instance, modifiers peaked at nearly 97% hallucination in the 0–29
query token length bracket, while named entities followed a similar trend with a peak around 78%; this
bracket corresponds to the typical length of a layperson’s query. Notably, hallucination rates tend to
decline for longer queries (40+ tokens), potentially due to enhanced contextual grounding, although
the trend is not uniform across all properties. Instances where hallucination percentages drop to 0%
are due to the absence of the corresponding symbolic property in that token-length bracket. However,
as evident from the table, even minimal presence of a property often corresponds with noticeable
hallucination, underscoring a persistent underlying effect.</p>
        <p>These observations suggest that certain symbolic properties evoke unstable internal activations,
especially in shorter to mid-length prompts. The model’s inability to generalize robustly across symbolic
structures, regardless of input size, reveals activation-level fragility tied to linguistic form, rather than
token count alone. This provides evidence that hallucinations are not solely a product of context length,
but of deeper symbolic entanglement. (In Table 3, gaps in the 30–39 bracket are due to limited
occurrences, and for TruthfulQA the hallucination percentage is 0 in the 50+ length bin, as all of its
questions were shorter than 50 tokens.)</p>
        <p>Our findings strongly indicate that symbolic linguistic properties, particularly modifiers, named
entities, and numbers, act as consistent triggers for hallucination across all Gemma model sizes. While
scaling from Gemma-2-2B to 27B reduces hallucination rates modestly (by 15 percentage points),
symbolic hallucinations persist even in the largest models. This persistence highlights that such
hallucinations are not solely a function of model capacity but stem from how these models internally
encode and generalize over symbolic constructs. Additionally, our activation-level analysis reveals
that hallucination rates vary with input length, peaking for mid-range lengths (10–30 tokens). This
suggests that context size interacts nonlinearly with symbolic processing, which may indicate local
representational instability rather than mere underfitting. Across all models and tasks, QA emerges as
the most hallucination-prone format, reinforcing that generative responses under minimal constraints
(unlike MCQ or OOO) expose deeper symbolic weaknesses in LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Directions</title>
      <p>This study presents a focused investigation into the symbolic triggers of hallucination in Gemma
language models. Across tasks and datasets, we consistently observe that hallucinations are most
frequently associated with symbolic linguistic properties—especially modifiers, named entities, and
numbers. While scaling the model from Gemma-2-2B to 27B results in a modest reduction in
hallucination rates, these symbolic vulnerabilities persist regardless of model size, revealing a deeper
representational fragility. Our activation-level analyses further suggest that hallucination is not merely
a product of input length or task format, but is tightly coupled with how LLMs internalize and generalize
over symbolic structures. The persistence of high hallucination rates, particularly in QA tasks, indicates
that symbolic confusion remains a core limitation of current LLM architectures. However, we now have
symbolic knowledge that can help us locate hallucination within the layers of open-source LLMs.</p>
      <p>The future work will focus on two key technical directions: Mechanistic interpretability analysis will
employ activation patching and causal intervention techniques to precisely localize which transformer
layers and attention heads are responsible for symbolic confusion, enabling targeted architectural
improvements. Cross-model generalizability studies will systematically validate these symbolic
vulnerabilities across different model families (LLaMA, Mistral, GPT) to determine whether these represent
universal architectural limitations or model-specific weaknesses. We also aim to extend this analysis
to multilingual and multimodal LLMs to evaluate the generality of symbolic hallucinations across
modalities and languages. Finally, exploring prompt-based interventions may offer practical mitigation
strategies by reducing symbolic ambiguity at inference time.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors gratefully acknowledge the use of the NVIDIA H100 DGX system provided by the Centre
for Artificial Intelligence in Medicine, Imaging and Forensics (CAIMIF), Sharda University, which made
the large-scale model experiments and evaluations possible.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for Grammar and spelling check.
After using this, the author(s) reviewed and edited the content as needed and take(s) full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Online Resources</title>
      <p>The source code and data related to this work are available at:
• GitHub</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] E. Kamalloo, N. Dziri, C. L. Clarke, D. Rafiei, Evaluating open-domain question answering in the era of large language models, Proceedings of the Annual Meeting of the Association for Computational Linguistics 1 (2023) 5591-5606. URL: https://arxiv.org/pdf/2305.06984. doi:10.18653/v1/2023.acl-long.307.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. V. Veen, C. V. Uden, L. Blankemeier, J.-B. Delbrouck, A. Aali, C. Bluethgen, A. Pareek, M. Polacin, E. P. Reis, A. Seehofnerová, N. Rohatgi, P. Hosamani, W. Collins, N. Ahuja, C. P. Langlotz, J. Hom, S. Gatidis, J. Pauly, A. S. Chaudhari, Clinical text summarization: Adapting large language models can outperform human experts, Research Square (2023) rs.3.rs-3483777. URL: https://pmc.ncbi.nlm.nih.gov/articles/PMC10635391/. doi:10.21203/RS.3.RS-3483777/V1.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] D. Yugeswardeenoo, K. Zhu, S. O'Brien, Question-analysis prompting improves llm performance in reasoning tasks (2024). URL: https://arxiv.org/pdf/2407.03624.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Guan, H. Xiong, J. Wang, J. Bian, B. Zhu, J.-G. Lou, Evaluating llm-based agents for multi-turn conversations: A survey (2025). URL: https://arxiv.org/pdf/2503.22458.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, Proceedings of the Annual Meeting of the Association for Computational Linguistics (2020) 1906-1919. URL: https://arxiv.org/pdf/2005.00661. doi:10.18653/v1/2020.acl-main.173.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] P. Govil, H. Jain, V. Bonagiri, A. Chadha, P. Kumaraguru, M. Gaur, S. Dey, Cobias: Assessing the contextual reliability of bias benchmarks for language models, in: Proceedings of the 17th ACM Web Science Conference 2025, 2025, pp. 460-471.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, D. Chen, W. Dai, H. S. Chan, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1-38. URL: http://arxiv.org/abs/2202.03629. doi:10.1145/3571730. arXiv:2202.03629 [cs].</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Huang, X. Feng, B. Qin, T. Liu, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems 1 (2024). doi:10.1145/3703155.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Y. Sun, Z. Yin, Q. Guo, J. Wu, X. Qiu, H. Zhao, Benchmarking hallucination in large language models based on unanswerable math word problem (2024). URL: https://arxiv.org/pdf/2403.03558.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Acharya, A. Velasquez, H. H. Song, A survey on symbolic knowledge distillation of large language models, IEEE Transactions on Artificial Intelligence 5 (2024) 5928-5948. URL: http://arxiv.org/abs/2408.10210. doi:10.1109/TAI.2024.3428519.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. V. Merriënboer, A. Joulin, T. Mikolov, Towards ai-complete question answering: A set of prerequisite toy tasks, 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings (2015). URL: https://arxiv.org/pdf/1502.05698.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, S. Shi, Siren's song in the ai ocean: A survey on hallucination in large language models, 2023. URL: http://arxiv.org/abs/2309.01219. doi:10.48550/arXiv.2309.01219. arXiv:2309.01219 [cs].</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Z. Zhao, S. B. Cohen, B. Webber, Reducing quantity hallucinations in abstractive summarization, Findings of the Association for Computational Linguistics: EMNLP 2020 (2020) 2237-2249. URL: https://arxiv.org/pdf/2009.13312. doi:10.18653/v1/2020.findings-emnlp.203.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. Durmus, H. He, M. Diab, Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization, Proceedings of the Annual Meeting of the Association for Computational Linguistics (2020) 5055-5070. URL: http://arxiv.org/abs/2005.03754. doi:10.18653/v1/2020.acl-main.454.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J.-Y. Yao, K.-P. Ning, Z.-H. Liu, M.-N. Ning, Y.-Y. Liu, L. Yuan, Llm lies: Hallucinations are not bugs, but features as adversarial examples (2023). URL: https://arxiv.org/pdf/2310.01469.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic human falsehoods, Proceedings of the Annual Meeting of the Association for Computational Linguistics 1 (2022) 3214-3252. URL: https://arxiv.org/pdf/2109.07958. doi:10.18653/v1/2022.acl-long.229.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Halueval: A large-scale hallucination evaluation benchmark for large language models</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.11747. doi: 10.48550/arXiv.2305.11747, arXiv:2305.11747 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Rawte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Priya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M. T. I.</given-names>
            <surname>Tonmoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M. M.</given-names>
            <surname>Zaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Exploring the relationship between llm hallucinations and prompt linguistic nuances: Readability, formality, and concreteness</article-title>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/pdf/2309.11064.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Baral</surname>
          </string-name>
          ,
          <article-title>Investigating and addressing hallucinations of llms in tasks involving negation</article-title>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/pdf/2406.05494.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Asher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhar</surname>
          </string-name>
          ,
          <article-title>Strong hallucinations from negation and how to fix them</article-title>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/pdf/2402.10543.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shukla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jhamtani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Modi</surname>
          </string-name>
          ,
          <article-title>Towards robust evaluation of unlearning in llms via data transformations</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2024</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>12100</fpage>
          -
          <lpage>12119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riviere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Sessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hardin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhupatiraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hussenot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shahriari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tafti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Friesen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Casbon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ramos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jerome</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsitsulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stanczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Girgin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Momchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thakoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Grill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Neyshabur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bachem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Walton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parrish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hutchison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdagic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Coenen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laforge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paterson</surname>
          </string-name>
          , et al.,
          <article-title>Gemma 2: Improving open language models at a practical size</article-title>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2408.00118.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>