<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge-Grounded Detection of Factual Hallucinations in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristian Ceccarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Raganato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viviani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano-Bicocca (DISCo - IKR3 Lab)</institution>
          ,
          <addr-line>Edificio U14 (ABACUS), Viale Sarca, 336 - 20126 Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet they remain prone to generating factually incorrect content, known as hallucinations. In this context, this work focuses on factuality hallucinations, offering a comprehensive review of existing detection methods and an empirical evaluation of their effectiveness. In particular, we investigate the role of external knowledge integration by testing hallucination detection approaches that leverage evidence retrieved from a real-world Web search engine. Our experimental analysis compares this knowledge-enhanced strategy with alternative approaches, including uncertainty-based and black-box methods, across multiple benchmark datasets. The results indicate that, while external knowledge generally improves factuality detection, the quality and precision of the retrieval process critically affect performance. Our findings underscore the importance of grounding LLM outputs in verifiable external sources and point to future directions for improving retrieval-augmented hallucination detection systems.</p>
      </abstract>
      <kwd-group>
<kwd>Natural Language Processing (NLP)</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Hallucinations</kwd>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the rapid advancements in technology and the growing availability of data have fostered the emergence of Large Language Models (LLMs). These models, based on the Transformer architecture, exploit attention mechanisms to analyze relationships between textual elements and effectively capture contextual meaning [<xref ref-type="bibr" rid="ref1">1</xref>]. This capability allows LLMs to excel in natural language generation and a wide range of Natural Language Processing (NLP) tasks, including text summarization, machine translation, and conversational AI. Due to their impressive ability to understand, interpret, and generate human-like language, LLMs have become indispensable tools in fields such as education, research, and healthcare.</p>
      <p>However, despite their capabilities and the significant technological advancements they represent, LLMs still face some challenges. A particularly critical issue is their tendency to generate the so-called hallucinations, which are outputs that are plausible but incorrect, under different perspectives [<xref ref-type="bibr" rid="ref2">2</xref>]. The prevalence of such hallucinated outputs is particularly concerning given the increasing integration of LLMs into sensitive domains. The generation of incorrect content can undermine trust in AI systems, limit their practical applicability, and contribute to the spread of misinformation [<xref ref-type="bibr" rid="ref3">3</xref>], especially in critical areas such as journalism, medicine, and scientific research, where factual accuracy is paramount. As such, hallucinations represent a major challenge in the deployment of LLMs. Addressing this issue requires a deeper understanding of its underlying causes and the development of robust detection and mitigation strategies to ensure the reliability and safety of these technologies in real-world applications [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
      <p>In this context, we investigate how incorporating external knowledge can improve the effectiveness of hallucination detection in LLMs. Specifically, we explore the integration of Retrieval-Augmented Generation (RAG) frameworks [<xref ref-type="bibr" rid="ref5">5</xref>] into existing detection pipelines, with the aim of enhancing their ability to identify hallucinated content by accessing verifiable information. Therefore, in this work, we develop an automated knowledge retrieval system that leverages the Google Search API to collect relevant external evidence, which is then integrated through RAG into two distinct hallucination detection methods: (i) a few-shot prompting approach, where an LLM is explicitly instructed to assess the factuality of a given statement, and (ii) SelfCheckGPT [<xref ref-type="bibr" rid="ref6">6</xref>], a state-of-the-art hallucination detection method based on response sampling, which evaluates whether a generated output contains hallucinated content. Finally, the impact of knowledge integration on the effectiveness of hallucination detection approaches is assessed by conducting a comparative evaluation. Specifically, the performance of each approach is measured both with and without the incorporation of external knowledge, using established benchmark datasets for hallucination detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>These approaches can be broadly classified into the fol</title>
        <p>
          lowing categories:
Within the context of LLMs, the term “hallucination”
refers to the generation of content that is either nonsen- • Uncertainty estimation-based: Studies suggest that
outsical or unfaithful to the source content. In the literature, puts produced with high model uncertainty are more
hallucinations are typically categorized into two main prone to hallucinations [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Accordingly, these
methtypes: factuality hallucinations and faithfulness hallucina- ods estimate the LLM’s uncertainty by analyzing its
tions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The remainder of the section therefore provides internal states to infer the likelihood of hallucinated
background on the two distinct concepts, before consid- content. A key advantage of these techniques is their
ering the literature that directly addresses the problem. independence from external knowledge; however, they
require access to the model’s internal representations,
2.1. Factuality Hallucinations which may not be feasible in all settings, especially
with proprietary models;
This category of hallucination encompasses all content
that contradicts established real-world knowledge. It con- • Knowledge retrieval-based: These approaches leverage
stitutes the primary focus of this study, as it is directly external knowledge sources—such as online
encycloassociated with the presence and potential dissemination pedias or structured databases—to verify the factuality
of misinformation. Factuality hallucinations can be fur- of LLM-generated content. While generally reliable
ther classified based on the verifiability of the generated and adaptable across domains, these methods often
content against reliable sources, depending on whether incur high computational costs due to the retrieval and
they are characterized by: processing of external information;
• Factual inconsistency, which refers to cases in which
the output contradicts verifiable information from
reliable sources, thereby generating incorrect content;
• Factual fabrication, which occurs when the generated
output cannot be verified against any reliable source,
indicating the generation of unverifiable or entirely
invented content.
        </p>
        <p>• Zero-resource and black-box: These techniques detect
hallucinations by analyzing output consistency and
model behavior across multiple generations, without
relying on external knowledge or internal model access.</p>
        <p>
          Although these methods are broadly applicable to any
LLM, they may be less efective in scenarios involving
queries with multiple plausible answers or ambiguous
interpretations.
2.2. Faithfulness Hallucinations Belonging to the first category, the work described in
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] argues that when an LLM generates hallucinated
conFaithfulness hallucinations arise when the generated con- tent, it implicitly encodes a degree of uncertainty within
tent is inconsistent with the input or contextual informa- its internal representations. Based on this assumption,
tion provided by the user. This category can be further the authors introduce SAPLMA, a method that aims to
subdivided into three types, depending on whether they determine the factuality of a generated statement by
anare characterized by: alyzing the internal states of the model to estimate its
uncertainty. Since it is not yet fully understood which
• Instruction inconsistency, which occurs when the out- internal layers best capture information relevant to
facput deviates from the explicit instructions given by the tuality, the authors investigate multiple variants of the
user; approach by extracting hidden states from diferent layers
• Context inconsistency, where the generated content is of the model, such as intermediate or final layers. These
misaligned with the contextual information supplied representations are then passed to a shallow neural
clasby the user; sifier, which outputs the probability that the statement is
true or false. Despite the good results, the optimal layer
• Logical inconsistency, which is typically observed in from which to extract internal states remains unclear
reasoning tasks and is characterized by contradictions and appears to be dependent on the specific LLM
emor errors in the reasoning steps of the model. ployed. Furthermore, the evaluation was conducted on
isolated statements classified as true or false, rather than
2.3. Related Work on complete model responses generated in relation to
specific user inputs, thereby limiting the assessment of the
In recent years, numerous studies have investigated the method’s efectiveness in realistic interaction scenarios.
issue of hallucinations in LLMs, proposing a variety of The approach presented in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which belongs to the
detection approaches based on diferent methodologi- second category of approaches, introduces FActScore, a
cal strategies to identify and mitigate this phenomenon. method based on comparison with a reliable external
knowledge source. The procedure begins by decompos- lished benchmark datasets for hallucination detection,
ing the content generated by the LLM into atomic facts, each encompassing a variety of domains. This ensures a
defined as concise and discrete statements. These atomic broader evaluation scope and demonstrates the
robustfacts are then manually verified by human annotators, ness of the method across diverse contexts.
who assess their factuality using English Wikipedia as the
reference source. Each atomic fact is labeled as supported
or unsupported depending on whether it is supported by 3. Methodology
the knowledge base. The overall factuality score of the This section details the methodologies employed for the
content is computed as the proportion of atomic facts that development of the automatic knowledge retrieval
sysare supported by reliable knowledge. While this method tem, alongside the strategies utilized for integrating the
ofers a structured and interpretable evaluation of fac- retrieved knowledge into both: () the few-shot
prompttual accuracy, it presents notable limitations. Specifically, ing approach, and () the SelfCheckGPT framework.
it has been validated exclusively in biographical texts,
domains characterized by objective and easily verifiable
information. 3.1. Knowledge Retrieval System
        </p>
        <p>
          Finally, belonging to the third category of methods, The knowledge retrieval system is built entirely upon
in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] the authors propose SelfCheckGPT, a hallucina- a customized Google Search engine, accessed via the
tion detection method that leverages stochastic sampling Google Search API. In particular, the retrieval process is
of multiple responses generated by an LLM from the organized into the following steps:
same input prompt. The underlying assumption of this
approach is that, when an LLM possesses reliable knowl- • A query is submitted to the search engine;
edge about a given topic, its responses will exhibit a
high degree of consistency; conversely, a lack of knowl- • The search engine communicates with the Web
edge will lead to greater variability among responses. through the API and returns a list of query-relevant
To evaluate the consistency of these sampled outputs, URLs;
the authors introduce five distinct variants of SelfCheck- • The content of the first URL is parsed to extract the
GPT: SelfCheckGPT with BERTScore, which performs main body text from the HTML;
semantic similarity comparisons between responses;
SelfCheckGPT with Question Answering (QA), which gener- • The retrieved textual content is then encoded using
ates questions from the original answer and uses the an embedding model, and its vector representation
sampled responses to answer them; SelfCheckGPT with is stored in a vector database, allowing for eficient
Natural Language Inference (NLI), which applies an NLI retrieval and integration with the LLM.
model to determine whether responses entail or
contradict one another; SelfCheckGPT with -grams, which Figure 1 illustrates the pipeline for the knowledge
reestimates token-level probabilities; and SelfCheckGPT trieval process.
with LLM prompt, which relies on prompting an LLM
to judge the consistency of the sampled outputs.
However, the evaluation of this approach was conducted on a
limited dataset comprising 238 Wikipedia-style articles
synthetically generated by an LLM, with factuality
assessed manually at the sentence level. While this setting
provides initial insights, the scope of the study remains
narrow and could be extended to include more diverse
and conceptually complex content.
        </p>
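        <p>For illustration, the following minimal Python sketch mirrors the retrieval steps above. It is a reconstruction under stated assumptions rather than the authors’ implementation: the Google Custom Search JSON API endpoint is used for the search calls, BeautifulSoup for HTML parsing, and API_KEY and SEARCH_ENGINE_ID are placeholder credentials.</p>
        <preformat># Illustrative sketch of the retrieval steps (not the authors' code).
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_GOOGLE_API_KEY"      # placeholder credential (assumption)
SEARCH_ENGINE_ID = "YOUR_CSE_ID"     # placeholder Custom Search engine id

def search_urls(query):
    """Steps 1-2: submit the query and get back query-relevant URLs."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query},
        timeout=30,
    )
    return [item["link"] for item in resp.json().get("items", [])]

def extract_main_text(url):
    """Step 3: parse the page and keep only the main body text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

def retrieve_knowledge(query):
    """Full flow: search, take the first hit, extract its text."""
    urls = search_urls(query)
    return extract_main_text(urls[0]) if urls else ""</preformat>
        <p>The final step (embedding the extracted text and storing it in a vector database) is sketched together with the experimental configuration in Section 4.1.</p>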
      </sec>
      <sec id="sec-2b-2">
        <title>3.2. Few-Shot Prompting with Knowledge</title>
        <p>Few-shot prompting is a technique in which an LLM is presented with a limited number of task-specific examples to guide its behavior and enhance its ability to perform a given task. However, the model’s responses in this setting are based solely on the knowledge acquired during the pre-training phase. To enhance its performance and expand its informational basis, the framework integrates external knowledge retrieved through the automated retrieval system. This additional context is provided to the model during inference, enabling more accurate and informed task execution. Specifically, the process is structured into the following steps (a code sketch follows the list):</p>
        <p>• The user’s query is encoded using the embedding model;</p>
        <p>• The resulting embedding is used to retrieve relevant information from the vectorized knowledge base;</p>
        <p>• The retrieved knowledge is incorporated into the prompt, together with a set of examples and the question–answer pair to be assessed;</p>
        <p>• The LLM evaluates the factuality of the answer by leveraging both its internal knowledge and the external information, classifying the response as either factual (true) or hallucinated (false).</p>
    <sec id="sec-3">
      <title>4. Experimental Evaluation</title>
      <sec id="sec-3-1">
        <title>This section presents the experimental setup employed</title>
        <p>3.3. SelfCheckGPT with Knowledge to conduct the experiments, describes the datasets and
the metric used for performance evaluation, and provides
The knowledge was also integrated into the SelfCheck- an analysis of the results obtained.
GPT framework to improve the quality of the sampled
responses. The underlying assumption is that providing
the LLM with relevant external information will lead to 4.1. Experimental Setup
the generation of more accurate and reliable responses. All experiments were carried out on the Google Colab
As a result, when these samples are compared with the platform,1 utilizing a Tesla T4 GPU. The LLM employed
target response using one of the SelfCheckGPT variants, for the few-shot prompting approach, response
samit becomes easier to assess whether the target response pling, and the LLM-prompt variant of SelfCheckGPT was
is hallucinated. The process is structured according to
the following steps: 1https://colab.research.google.com/
Llama-3.2-3B-Instruct, accessed using the Transformers For the implementation of SelfCheckGPT, the variants
library of Hugging Face.2 For both approaches, the model employed for evaluation purposes are BERTScore, NLI,
selected for generating semantic embeddings and as a and LLM prompt (see Section 2.3). In accordance with
retriever was jina-embeddings-v3.3 The retrieved knowl- the original SelfCheckGPT configuration, 5 responses
edge was segmented into chunks of 256 characters with per query were sampled using a temperature setting of
an overlap of 25 characters to preserve semantic coher- 1.0 and a maximum output length of 128 tokens.
Figence across segments. The retriever was configured to ure 5 illustrates the prompt provided to the LLM for the
return the top 5 most relevant documents according to generation of these sampled responses.
similarity to the input query.</p>
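        <p>The chunking and retrieval configuration just described can be sketched as follows. This is an illustrative reconstruction: jina-embeddings-v3 is loaded through its Hugging Face interface, and a simple in-memory cosine-similarity search stands in for the vector database actually used.</p>
        <preformat># Illustrative sketch: 256-character chunks, 25-character overlap,
# top-5 retrieval by cosine similarity (configuration from Section 4.1).
import numpy as np
from transformers import AutoModel

encoder = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)

def chunk_text(text, size=256, overlap=25):
    """Split retrieved text into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def top_k_chunks(query, chunks, k=5):
    """Return the k chunks most similar to the query embedding."""
    chunk_vecs = np.asarray(encoder.encode(chunks))    # vectorized knowledge
    query_vec = np.asarray(encoder.encode([query]))[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]</preformat>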
        <p>The few-shot prompting approach was evaluated by providing the model with 1, 5, and 10 examples. To generate the response, the LLM was set to a temperature value equal to 0.001. Figure 4 presents the prompt structure provided to the LLM to classify a given text as either factual or hallucinated.</p>
        <preformat>Prompt for Few-Shot Prompting with Knowledge (Figure 4)

I want you to act as a response judge.
Given a user query, a knowledge, and a response by an LLM, your
objective is to determine if the response is an hallucination or not.
In the context of NLP, an "hallucination" refers to a phenomenon
where the LLM generates text that is incorrect, nonsensical, or not
real. Based on your knowledge, on the knowledge provided, and on the
definition of hallucination provided, analyze the user query and the
response of the LLM, and answer the following question: is the
response factual or not?
BE CAREFUL: sometimes the knowledge may be empty or not useful, in
which case you have to respond based only on your knowledge.
Answer True if you consider the response factual, False otherwise.
You don't have to provide any explanation.
### EXAMPLE 1
User query: [USER QUERY]
Knowledge: [KNOWLEDGE]
LLM response: [LLM RESPONSE]
Answer: [ANSWER]
...
### EXAMPLE N
User query: [USER QUERY]
Knowledge: [KNOWLEDGE]
LLM response: [LLM RESPONSE]
Answer: [ANSWER]
### LLM TURN
User query: [USER QUERY]
Knowledge: [KNOWLEDGE]
LLM response: [LLM RESPONSE]
Answer:</preformat>
        <p>For the implementation of SelfCheckGPT, the variants employed for evaluation purposes are BERTScore, NLI, and LLM prompt (see Section 2.3). In accordance with the original SelfCheckGPT configuration, 5 responses per query were sampled using a temperature setting of 1.0 and a maximum output length of 128 tokens. Figure 5 illustrates the prompt provided to the LLM for the generation of these sampled responses.</p>
        <preformat>Prompt for Generating Sampled Responses with Knowledge (Figure 5)

Based on your knowledge and on the context provided, answer the
following question giving as much detail as you can.
Question: [QUESTION]
Context: [KNOWLEDGE]
Answer:</preformat>
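        <p>The sampling configuration, together with a consistency score in the spirit of the SelfCheckGPT NLI variant, might be sketched as follows; judge is the generation pipeline from the sketch in Section 3.2, and the MNLI model choice is an illustrative assumption rather than the variant’s released implementation.</p>
        <preformat># Illustrative sketch: knowledge-grounded sampling for SelfCheckGPT
# (5 samples, temperature 1.0, 128 new tokens, per Section 4.1).
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def sample_responses(question, knowledge, n=5):
    prompt = (
        "Based on your knowledge and on the context provided, answer the "
        "following question giving as much detail as you can.\n"
        f"Question: {question}\nContext: {knowledge}\nAnswer:"
    )
    outs = judge(prompt, num_return_sequences=n, do_sample=True,
                 temperature=1.0, max_new_tokens=128)
    return [o["generated_text"][len(prompt):].strip() for o in outs]

def nli_inconsistency(target_response, samples):
    """Mean contradiction probability between samples and the target,
    in the spirit of the SelfCheckGPT-NLI variant."""
    scores = []
    for s in samples:
        preds = nli({"text": s, "text_pair": target_response}, top_k=None)
        contra = next(p["score"] for p in preds
                      if p["label"] == "CONTRADICTION")
        scores.append(contra)
    return sum(scores) / len(scores)</preformat>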
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Datasets and Evaluation Metric</title>
        <p>For the experimental evaluation, three benchmark datasets for hallucination detection were selected. Each dataset includes a user query, the corresponding LLM-generated response, and a binary label indicating whether the response is factually accurate. The datasets employed are FactAlign [<xref ref-type="bibr" rid="ref10">10</xref>], FactBench [<xref ref-type="bibr" rid="ref11">11</xref>], and FELM [<xref ref-type="bibr" rid="ref12">12</xref>], all of which are described in detail in the following.</p>
        <p>FactAlign. For this dataset [<xref ref-type="bibr" rid="ref10">10</xref>], only queries formulated as questions were retained, to facilitate more effective knowledge retrieval through the Google Search API and to simplify both the factuality classification task performed by the LLM and the generation of sampled responses within the SelfCheckGPT framework. Following this filtering step, a random sample of 100 questions was selected. This limitation was imposed by constraints on computational resources and time, which required a balance between the number of examples and processing efficiency. Furthermore, to ensure comparability and consistency across the methods and each variant, a fixed random seed was used to guarantee the reproducibility of the 100 instances across all experiments.</p>
        <p>FactBench. This dataset was specifically developed to evaluate FactCheck-GPT, a multi-step framework designed for the detection and correction of factual errors in responses generated by LLMs [<xref ref-type="bibr" rid="ref11">11</xref>]. FactBench was constructed by integrating three distinct benchmark datasets aimed at hallucination detection:</p>
        <p>• Knowledge-based FacTool: Created to assess the performance of the FacTool framework, which evaluates the factual consistency of LLM-generated responses through external knowledge retrieval [<xref ref-type="bibr" rid="ref13">13</xref>]. This dataset was constructed by selecting 50 prompts from FactPrompts and fact-checking datasets such as TruthfulQA [<xref ref-type="bibr" rid="ref14">14</xref>]. For each prompt, responses were generated using ChatGPT and subsequently annotated by human evaluators with binary labels indicating factual correctness;</p>
        <p>• FELM-WK: Subset of the FELM dataset that will be detailed in the next paragraph;</p>
        <p>• HaluEval: This benchmark dataset for hallucination detection was constructed by initially considering 52,000 prompts, followed by a filtering procedure aimed at selecting those most likely to elicit hallucinated responses from an LLM. Specifically, each prompt was submitted to ChatGPT three times, and the average semantic similarity among the generated responses was calculated. The 5,000 prompts with the lowest semantic similarity scores were retained to ensure the dataset included only the most challenging queries. The selected prompts were then resubmitted to ChatGPT to obtain a second set of responses, which were manually annotated as either true or false based on their factual accuracy [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
        <p>FactBench was made publicly available by the authors on GitHub and comprises a total of 4,835 examples, of which 3,838 are labeled as true and 995 as false (https://github.com/yuxiaw/Factcheck-GPT/blob/main/Factbench.jsonl). Each instance includes a user query, the corresponding response generated by an LLM, and a binary factuality label. For evaluation purposes, only the entries corresponding to user queries in the form of questions were retained. Due to computational constraints, a subset of 100 observations was selected. To mitigate the effects of class imbalance, an equal number of true and false instances (50 each) were randomly sampled. A fixed random seed was applied to ensure reproducibility and consistency across all experimental configurations.</p>
        <p>FELM. FELM is a multi-domain benchmark dataset designed for the evaluation of hallucination detection in LLMs, encompassing five distinct domains, each posing specific challenges for the models under analysis [<xref ref-type="bibr" rid="ref12">12</xref>]. The domains are defined as follows:</p>
        <p>• World knowledge: Includes questions related to general cultural and factual knowledge;</p>
        <p>• Science and technology: Comprises statements related to scientific facts or citations across disciplines such as physics and biology;</p>
        <p>• Reasoning: Contains prompts that require multi-step logical reasoning to produce a correct response;</p>
        <p>• Recommendation and writing: Involves open questions requiring the model to provide suggestions or generate creative or structured written content;</p>
        <p>• Math: Encompasses problems that necessitate both logical reasoning and mathematical skills to arrive at correct answers.</p>
        <p>FELM was constructed by aggregating prompts from diverse sources, which were then submitted to ChatGPT operating in a zero-shot configuration. The resulting responses were segmented into sentences, each of which was subsequently evaluated by a team of experts. The factual accuracy of each sentence was assessed based on comparison with reliable sources, and sentences were annotated as either true or false accordingly. A response was labeled as true only if all its sentences were assessed as accurate; otherwise, it was classified as false. The FELM dataset was obtained from Hugging Face and comprises a total of 847 instances (https://huggingface.co/datasets/hkust-nlp/felm). Each instance includes a user prompt, the corresponding response generated by the LLM, and a factuality label. Of these examples, 566 are labeled as factual, while 281 are labeled as non-factual.</p>
        <p>For evaluation, only the World knowledge and Science and technology domains were considered, as the remaining domains presented substantial limitations for the knowledge retrieval approach (e.g., mathematical prompts such as “What is the value of the expression 1! + 2! + 3! + ... + 10!”). As in the previous datasets, only prompts formulated as questions were retained. To mitigate class imbalance and accommodate computational constraints, a balanced subset of 100 samples (comprising 50 factual and 50 non-factual instances) was randomly selected. A fixed random seed was applied to ensure consistency across experiments.</p>
        <p>Evaluation metric. Since all the datasets employed in the evaluation are balanced, Accuracy was adopted as the primary performance metric. It is defined as follows: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP denotes factual responses correctly classified as factual, TN represents hallucinated responses correctly identified as hallucinations, FP corresponds to hallucinated responses incorrectly classified as factual, and FN refers to factual responses mistakenly classified as hallucinations.</p>
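        <p>As a minimal worked example, the metric can be computed from binary factuality labels as follows (True denoting factual, False hallucinated):</p>
        <preformat># Accuracy = (TP + TN) / (TP + TN + FP + FN) over binary labels.
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Example: three of four predictions match the gold labels.
print(accuracy([True, False, True, False],
               [True, False, False, False]))   # 0.75</preformat>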
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Results and Discussion</title>
        <p>To evaluate the impact of knowledge integration, the performance of both SelfCheckGPT and the few-shot prompting approach was evaluated in two configurations: with and without the inclusion of external knowledge. A summary of the comparative results is presented in Table 1. The notation W/O and W denotes whether the evaluated variant operates without or with integrated knowledge, respectively. For each variant and dataset, the version (with or without knowledge) that achieves the highest performance is marked with an asterisk (*); if both versions perform equally, no mark is applied.</p>
        <preformat>Model                Variant      FactAlign      FactBench      FELM
                                  W/O     W      W/O     W      W/O     W
SelfCheckGPT         BERTScore    59.0    61.0*  61.0*   60.0   56.0    59.0*
                     NLI          67.0    67.0   64.0    69.0*  67.0    71.0*
                     LLM Prompt   62.0    65.0*  57.0    63.0*  69.0*   68.0
Few-Shot Prompting   One-shot     50.0    54.0*  59.0    62.0*  56.0    59.0*
                     Five-shot    57.0*   55.0   62.0    64.0*  62.0    63.0*
                     Ten-shot     55.0    59.0*  53.0    65.0*  59.0    62.0*</preformat>
        <p>Table 1: Comparison between methods with and without integrated knowledge, to evaluate its impact on their performance (Accuracy, %).</p>
        <p>As shown in Table 1, the SelfCheckGPT framework consistently outperforms the few-shot prompting approach across all evaluated conditions. This result aligns with expectations, given that SelfCheckGPT is specifically designed for hallucination detection, whereas few-shot prompting is a more general-purpose methodology. Among the SelfCheckGPT variants, the NLI-based method demonstrates the highest overall effectiveness and efficiency, surpassing the LLM prompting variant across all three benchmark datasets. With regard to few-shot prompting, the ten-shot configuration achieves the best performance, followed by the five-shot and one-shot variants, respectively. This trend is consistent with the hypothesis that providing a greater number of examples enables the LLM to better internalize the task structure, thereby improving generalization and overall accuracy. In this regard, the strategy for selecting examples in the few-shot prompting approach could be improved. In the current evaluation, examples were randomly sampled from the datasets, which may result in class imbalance among the examples shown to the LLM, potentially affecting performance. Ensuring a balanced representation of classes in the selected examples would therefore be crucial for enhancing the robustness of the analysis in the few-shot prompting setting.</p>
        <p>Regarding the impact of knowledge integration, on the FactAlign dataset, the only method that underperforms when incorporating external knowledge is few-shot prompting with five examples; all other tested methods either match or surpass the performance of their counterparts without knowledge. A similar trend is observed on FactBench, where all approaches that leverage retrieved knowledge perform at least as well as, and often better than, those without knowledge integration. Finally, in the FELM dataset, incorporating external knowledge generally leads to performance improvements across methods, with the sole exception of SelfCheckGPT using the LLM Prompt, where performance declines by one percentage point after knowledge integration. Overall, these analyses suggest that integrating external knowledge generally enhances the performance of the evaluated approaches across all datasets, with only a few exceptions where a slight decrease in performance was observed. These performance declines may be attributed to limitations in the knowledge retrieval process. Specifically, only the first retrieved URL is considered (typically the most popular, but not necessarily the most informative). Additionally, the retrieval system occasionally fails to access Web content due to anti-bot mechanisms, such as CAPTCHA tests or restrictions on text extraction, which can hinder the acquisition of valuable external knowledge. Nevertheless, on average, approaches augmented with external knowledge outperform their non-augmented counterparts. This suggests that further improvements in the retrieval process could improve the overall effectiveness of these methods and lead to even greater performance gains.</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>5. Conclusions and Perspectives</title>
      <p>In this study, we introduced a fully automated knowledge retrieval framework that leverages a custom search engine interfacing with the Web via the Google Search API to extract relevant external information. The retrieved knowledge was subsequently integrated into two distinct methodologies: (i) few-shot prompting, which consists of providing a set of examples to guide task execution, and (ii) SelfCheckGPT, a hallucination detection framework that generates and compares multiple responses from an LLM to identify factual inconsistencies. The enhanced versions of both approaches, incorporating retrieved knowledge, were evaluated on three benchmark datasets for hallucination detection (FactAlign, FactBench, and FELM) spanning a diverse range of domains. The experimental results indicate that SelfCheckGPT consistently outperforms the few-shot prompting approach, demonstrating strong performance across all three benchmark datasets. Among its variants, the NLI configuration emerges as the most effective and computationally efficient. Moreover, the integration of external knowledge generally enhances the performance of the evaluated approaches compared to their counterparts without such integration. Nonetheless, the observed improvements could be further amplified by refining the knowledge retrieval process in future work. Specifically, challenges such as CAPTCHA mechanisms or site access restrictions that limit automated retrieval should be addressed. Additionally, the quality of the queries submitted to the search engine could be improved by leveraging LLMs to generate more precise and contextually rich queries, thereby yielding more informative results. Moreover, expanding the number of retrieved Web sources may lead to more comprehensive and accurate knowledge; for instance, retrieving the top five results could increase the relevance and diversity of the retrieved information. Finally, future research may also focus on further refining the knowledge integration process by leveraging more advanced and sophisticated RAG techniques [<xref ref-type="bibr" rid="ref5">5</xref>]. Enhancing integration within frameworks such as SelfCheckGPT, which has already demonstrated promising results in hallucination detection, holds significant potential. These advancements could support the development of a reliable, scalable, and efficient multi-domain hallucination detection system.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>This work was partly funded by: the European Union – Next Generation EU, Mission 4, Component 2, CUP: D53D23008480001 (20225WTRFN – KURAMi:</title>
        <p>Knowledge-based, explainable User empowerment in
Releasing private data and Assessing Misinformation in online
environments);7 ATEQC – Progetti di Ricerca di Ateneo –
Quota Competitiva (University Research Projects –
Competitive Funding Scheme) PriQuaDeS: Next-generation
Privacy- and Quality-preserving Decentralized Social Web</p>
      </sec>
      <sec id="sec-4-2">
        <title>7https://kurami.disco.unimib.it/</title>
      </sec>
      <sec id="sec-4-3">
        <title>Applications; the MUR under the grant “Dipartimenti di</title>
        <p>Eccellenza 2023-2027” of the Department of
Informatics, Systems and Communication (DISCo), University of
Milano-Bicocca, Italy. We further acknowledge ISCRA
for awarding this project access to the LEONARDO
supercomputer [16], owned by the EuroHPC Joint Undertaking,
hosted by CINECA (Italy).
Declaration on Generative AI</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is All You Need,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , et al.,
          <source>A Survey on Hallucination in Large Language Models: Principles</source>
          , Taxonomy, Challenges, and Open Questions,
          <source>ACM Transactions on Information Systems</source>
          <volume>43</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <surname>ROMCIR</surname>
          </string-name>
          <year>2025</year>
          :
          <article-title>Overview of the 5th Workshop on Reducing Online Misinformation Through Credible Information Retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>339</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sathe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sandosh</surname>
          </string-name>
          ,
          <article-title>Mitigating Hallucinations in Large Language Models: A Comprehensive Survey on Detection and Reduction Strategies</article-title>
          ,
          <source>in: International Conference on Sustainable Computing and Intelligent Systems</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          Retrieval-Augmented Generation for Large Language Models: A Survey, 2024. URL: https://arxiv.org/abs/2312.10997. arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          , M. Gales,
          <article-title>SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>9004</fpage>
          -
          <lpage>9017</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>On Hallucination and Predictive Uncertainty in Conditional Language Generation</article-title>
          , in: P. Merlo,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , R. Tsarfaty (Eds.),
          <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2734</fpage>
          -
          <lpage>2744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] A. Azaria, T. Mitchell, The Internal State of an LLM Knows When It's Lying, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 967-976.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 12076-12100.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] C.-W. Huang, Y.-N. Chen, FactAlign: Long-form Factuality Alignment of Large Language Models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 16363-16375. URL: https://aclanthology.org/2024.findings-emnlp.955/. doi:10.18653/v1/2024.findings-emnlp.955.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Y. Wang, R. Gangi Reddy, Z. M. Mujahid, A. Arora, A. Rubashevskii, J. Geng, O. Mohammed Afzal, L. Pan, N. Borenstein, A. Pillai, I. Augenstein, I. Gurevych, P. Nakov, Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 14199-14230. URL: https://aclanthology.org/2024.findings-emnlp.830/. doi:10.18653/v1/2024.findings-emnlp.830.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] S. Chen, Y. Zhao, J. Zhang, I.-C. Chern, S. Gao, P. Liu, J. He, FELM: Benchmarking Factuality Evaluation of Large Language Models, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Curran Associates Inc., Red Hook, NY, USA, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] I.-C. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He, G. Neubig, P. Liu, FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, 2023. URL: https://arxiv.org/abs/2307.13528. arXiv:2307.13528.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3214-3252. URL: https://aclanthology.org/2022.acl-long.229/. doi:10.18653/v1/2022.acl-long.229.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] J. Li, X. Cheng, X. Zhao, J.-Y. Nie, J.-R. Wen, HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 6449-6464. URL: https://aclanthology.org/2023.emnlp-main.397/. doi:10.18653/v1/2023.emnlp-main.397.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. Turisini, G. Amati, M. Cestari, LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI applications, Journal of Large-Scale Research Facilities 9 (2024).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>