<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge-Grounded Detection of Factual Hallucinations in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristian Ceccarelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Raganato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Viviani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano-Bicocca (DISCo - IKR3 Lab)</institution>
          ,
          <addr-line>Edificio U14 (ABACUS), Viale Sarca, 336 - 20126 Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet they remain prone to generating factually incorrect content, known as hallucinations. In this context, this work focuses on factuality hallucinations, offering a comprehensive review of existing detection methods and an empirical evaluation of their effectiveness. In particular, we investigate the role of external knowledge integration by testing hallucination detection approaches that leverage evidence retrieved from a real-world Web search engine. Our experimental analysis compares this knowledge-enhanced strategy with alternative approaches, including uncertainty-based and black-box methods, across multiple benchmark datasets. The results indicate that, while external knowledge generally improves factuality detection, the quality and precision of the retrieval process critically affect performance. Our findings underscore the importance of grounding LLM outputs in verifiable external sources and point to future directions for improving retrieval-augmented hallucination detection systems.</p>
      </abstract>
      <kwd-group>
<kwd>Natural Language Processing (NLP)</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Hallucinations</kwd>
        <kwd>Retrieval-Augmented Generation (RAG)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the rapid advancements in technology and the growing availability of data have fostered the emergence of Large Language Models (LLMs). These models, based on the Transformer architecture, exploit attention mechanisms to analyze relationships between textual elements and effectively capture contextual meaning [<xref ref-type="bibr" rid="ref1">1</xref>]. This capability allows LLMs to excel in natural language generation and a wide range of Natural Language Processing (NLP) tasks, including text summarization, machine translation, and conversational AI. Due to their impressive ability to understand, interpret, and generate human-like language, LLMs have become indispensable tools in fields such as education, research, and healthcare.</p>
      <p>However, despite their capabilities and the significant technological advancements they represent, LLMs still face some challenges. A particularly critical issue is their tendency to generate the so-called hallucinations, which are outputs that are plausible but incorrect, under different perspectives [<xref ref-type="bibr" rid="ref2">2</xref>]. The prevalence of such hallucinated outputs is particularly concerning given the increasing integration of LLMs into sensitive domains. The generation of incorrect content can undermine trust in AI systems, limit their practical applicability, and contribute to the spread of misinformation [<xref ref-type="bibr" rid="ref3">3</xref>], especially in critical areas such as journalism, medicine, and scientific research, where factual accuracy is paramount. As such, hallucinations represent a major challenge in the deployment of LLMs. Addressing this issue requires a deeper understanding of its underlying causes and the development of robust detection and mitigation strategies to ensure the reliability and safety of these technologies in real-world applications [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
      <p>In this context, we investigate how incorporating external knowledge can improve the effectiveness of hallucination detection in LLMs. Specifically, we explore the integration of Retrieval-Augmented Generation (RAG) frameworks [<xref ref-type="bibr" rid="ref5">5</xref>] into existing detection pipelines, with the aim of enhancing their ability to identify hallucinated content by accessing verifiable information. Therefore, in this work, we develop an automated knowledge retrieval system that leverages the Google Search API to collect relevant external evidence, which is then integrated through RAG into two distinct hallucination detection methods: (i) a few-shot prompting approach, where an LLM is explicitly instructed to assess the factuality of a given statement, and (ii) SelfCheckGPT [<xref ref-type="bibr" rid="ref6">6</xref>], a state-of-the-art hallucination detection method based on response sampling, which evaluates whether a generated output contains hallucinated content. Finally, the impact of knowledge integration on the effectiveness of hallucination detection approaches is assessed by conducting a comparative evaluation. Specifically, the performance of each approach is measured both with and without the incorporation of external knowledge, using established benchmark datasets for hallucination detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>These approaches can be broadly classified into the fol</title>
        <p>
          lowing categories:
Within the context of LLMs, the term “hallucination”
refers to the generation of content that is either nonsen- • Uncertainty estimation-based: Studies suggest that
outsical or unfaithful to the source content. In the literature, puts produced with high model uncertainty are more
hallucinations are typically categorized into two main prone to hallucinations [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Accordingly, these
methtypes: factuality hallucinations and faithfulness hallucina- ods estimate the LLM’s uncertainty by analyzing its
tions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The remainder of the section therefore provides internal states to infer the likelihood of hallucinated
background on the two distinct concepts, before consid- content. A key advantage of these techniques is their
ering the literature that directly addresses the problem. independence from external knowledge; however, they
require access to the model’s internal representations,
2.1. Factuality Hallucinations which may not be feasible in all settings, especially
with proprietary models;
This category of hallucination encompasses all content
that contradicts established real-world knowledge. It con- • Knowledge retrieval-based: These approaches leverage
stitutes the primary focus of this study, as it is directly external knowledge sources—such as online
encycloassociated with the presence and potential dissemination pedias or structured databases—to verify the factuality
of misinformation. Factuality hallucinations can be fur- of LLM-generated content. While generally reliable
ther classified based on the verifiability of the generated and adaptable across domains, these methods often
content against reliable sources, depending on whether incur high computational costs due to the retrieval and
they are characterized by: processing of external information;
• Factual inconsistency, which refers to cases in which
the output contradicts verifiable information from
reliable sources, thereby generating incorrect content;
• Factual fabrication, which occurs when the generated
output cannot be verified against any reliable source,
indicating the generation of unverifiable or entirely
invented content.
        </p>
        <p>• Zero-resource and black-box: These techniques detect
hallucinations by analyzing output consistency and
model behavior across multiple generations, without
relying on external knowledge or internal model access.</p>
        <p>
          Although these methods are broadly applicable to any
LLM, they may be less efective in scenarios involving
queries with multiple plausible answers or ambiguous
interpretations.
2.2. Faithfulness Hallucinations Belonging to the first category, the work described in
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] argues that when an LLM generates hallucinated
conFaithfulness hallucinations arise when the generated con- tent, it implicitly encodes a degree of uncertainty within
tent is inconsistent with the input or contextual informa- its internal representations. Based on this assumption,
tion provided by the user. This category can be further the authors introduce SAPLMA, a method that aims to
subdivided into three types, depending on whether they determine the factuality of a generated statement by
anare characterized by: alyzing the internal states of the model to estimate its
uncertainty. Since it is not yet fully understood which
• Instruction inconsistency, which occurs when the out- internal layers best capture information relevant to
facput deviates from the explicit instructions given by the tuality, the authors investigate multiple variants of the
user; approach by extracting hidden states from diferent layers
• Context inconsistency, where the generated content is of the model, such as intermediate or final layers. These
misaligned with the contextual information supplied representations are then passed to a shallow neural
clasby the user; sifier, which outputs the probability that the statement is
true or false. Despite the good results, the optimal layer
• Logical inconsistency, which is typically observed in from which to extract internal states remains unclear
reasoning tasks and is characterized by contradictions and appears to be dependent on the specific LLM
emor errors in the reasoning steps of the model. ployed. Furthermore, the evaluation was conducted on
isolated statements classified as true or false, rather than
2.3. Related Work on complete model responses generated in relation to
specific user inputs, thereby limiting the assessment of the
In recent years, numerous studies have investigated the method’s efectiveness in realistic interaction scenarios.
issue of hallucinations in LLMs, proposing a variety of The approach presented in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which belongs to the
detection approaches based on diferent methodologi- second category of approaches, introduces FActScore, a
cal strategies to identify and mitigate this phenomenon. method based on comparison with a reliable external
knowledge source. The procedure begins by decompos- lished benchmark datasets for hallucination detection,
ing the content generated by the LLM into atomic facts, each encompassing a variety of domains. This ensures a
defined as concise and discrete statements. These atomic broader evaluation scope and demonstrates the
robustfacts are then manually verified by human annotators, ness of the method across diverse contexts.
who assess their factuality using English Wikipedia as the
reference source. Each atomic fact is labeled as supported
or unsupported depending on whether it is supported by 3. Methodology
the knowledge base. The overall factuality score of the This section details the methodologies employed for the
content is computed as the proportion of atomic facts that development of the automatic knowledge retrieval
sysare supported by reliable knowledge. While this method tem, alongside the strategies utilized for integrating the
ofers a structured and interpretable evaluation of fac- retrieved knowledge into both: () the few-shot
prompttual accuracy, it presents notable limitations. Specifically, ing approach, and () the SelfCheckGPT framework.
it has been validated exclusively in biographical texts,
domains characterized by objective and easily verifiable
information. 3.1. Knowledge Retrieval System
        </p>
        <p>
          Finally, belonging to the third category of methods, The knowledge retrieval system is built entirely upon
in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] the authors propose SelfCheckGPT, a hallucina- a customized Google Search engine, accessed via the
tion detection method that leverages stochastic sampling Google Search API. In particular, the retrieval process is
of multiple responses generated by an LLM from the organized into the following steps:
same input prompt. The underlying assumption of this
approach is that, when an LLM possesses reliable knowl- • A query is submitted to the search engine;
edge about a given topic, its responses will exhibit a
high degree of consistency; conversely, a lack of knowl- • The search engine communicates with the Web
edge will lead to greater variability among responses. through the API and returns a list of query-relevant
To evaluate the consistency of these sampled outputs, URLs;
the authors introduce five distinct variants of SelfCheck- • The content of the first URL is parsed to extract the
GPT: SelfCheckGPT with BERTScore, which performs main body text from the HTML;
semantic similarity comparisons between responses;
SelfCheckGPT with Question Answering (QA), which gener- • The retrieved textual content is then encoded using
ates questions from the original answer and uses the an embedding model, and its vector representation
sampled responses to answer them; SelfCheckGPT with is stored in a vector database, allowing for eficient
Natural Language Inference (NLI), which applies an NLI retrieval and integration with the LLM.
model to determine whether responses entail or
contradict one another; SelfCheckGPT with -grams, which Figure 1 illustrates the pipeline for the knowledge
reestimates token-level probabilities; and SelfCheckGPT trieval process.
with LLM prompt, which relies on prompting an LLM
to judge the consistency of the sampled outputs.
However, the evaluation of this approach was conducted on a
limited dataset comprising 238 Wikipedia-style articles
synthetically generated by an LLM, with factuality
assessed manually at the sentence level. While this setting
provides initial insights, the scope of the study remains
narrow and could be extended to include more diverse
and conceptually complex content.
        </p>
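        <p>For illustration, the following minimal Python sketch mirrors the retrieval steps above. It is a reconstruction under stated assumptions rather than the authors’ implementation: the Google Custom Search JSON API endpoint is used for the search calls, BeautifulSoup for HTML parsing, and API_KEY and SEARCH_ENGINE_ID are placeholder credentials.</p>
        <preformat># Illustrative sketch of the retrieval steps (not the authors' code).
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_GOOGLE_API_KEY"      # placeholder credential (assumption)
SEARCH_ENGINE_ID = "YOUR_CSE_ID"     # placeholder Custom Search engine id

def search_urls(query):
    """Steps 1-2: submit the query and get back query-relevant URLs."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query},
        timeout=30,
    )
    return [item["link"] for item in resp.json().get("items", [])]

def extract_main_text(url):
    """Step 3: parse the page and keep only the main body text."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

def retrieve_knowledge(query):
    """Full flow: search, take the first hit, extract its text."""
    urls = search_urls(query)
    return extract_main_text(urls[0]) if urls else ""</preformat>
        <p>The final step (embedding the extracted text and storing it in a vector database) is sketched together with the experimental configuration in Section 4.1.</p>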
      </sec>
      <sec id="sec-2b-2">
        <title>3.2. Few-Shot Prompting with Knowledge</title>
        <p>Few-shot prompting is a technique in which an LLM is presented with a limited number of task-specific examples to guide its behavior and enhance its ability to perform a given task. However, the model’s responses in this setting are based solely on the knowledge acquired during the pre-training phase. To enhance its performance and expand its informational basis, the framework integrates external knowledge retrieved through the automated retrieval system. This additional context is provided to the model during inference, enabling more accurate and informed task execution. Specifically, the process is structured into the following steps (a code sketch follows the list):</p>
        <p>• The user’s query is encoded using the embedding model;</p>
        <p>• The resulting embedding is used to retrieve relevant information from the vectorized knowledge base;</p>
        <p>• The retrieved knowledge is incorporated into the prompt, together with a set of examples and the question–answer pair to be assessed;</p>
        <p>• The LLM evaluates the factuality of the answer by leveraging both its internal knowledge and the external information, classifying the response as either factual (true) or hallucinated (false).</p>
    <sec id="sec-3">
      <title>4. Experimental Evaluation</title>
      <sec id="sec-3-1">
        <title>This section presents the experimental setup employed</title>
        <p>3.3. SelfCheckGPT with Knowledge to conduct the experiments, describes the datasets and
the metric used for performance evaluation, and provides
The knowledge was also integrated into the SelfCheck- an analysis of the results obtained.
GPT framework to improve the quality of the sampled
responses. The underlying assumption is that providing
the LLM with relevant external information will lead to 4.1. Experimental Setup
the generation of more accurate and reliable responses. All experiments were carried out on the Google Colab
As a result, when these samples are compared with the platform,1 utilizing a Tesla T4 GPU. The LLM employed
target response using one of the SelfCheckGPT variants, for the few-shot prompting approach, response
samit becomes easier to assess whether the target response pling, and the LLM-prompt variant of SelfCheckGPT was
is hallucinated. The process is structured according to
the following steps: 1https://colab.research.google.com/
Llama-3.2-3B-Instruct, accessed using the Transformers For the implementation of SelfCheckGPT, the variants
library of Hugging Face.2 For both approaches, the model employed for evaluation purposes are BERTScore, NLI,
selected for generating semantic embeddings and as a and LLM prompt (see Section 2.3). In accordance with
retriever was jina-embeddings-v3.3 The retrieved knowl- the original SelfCheckGPT configuration, 5 responses
edge was segmented into chunks of 256 characters with per query were sampled using a temperature setting of
an overlap of 25 characters to preserve semantic coher- 1.0 and a maximum output length of 128 tokens.
Figence across segments. The retriever was configured to ure 5 illustrates the prompt provided to the LLM for the
return the top 5 most relevant documents according to generation of these sampled responses.
similarity to the input query.</p>
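        <p>The chunking and retrieval configuration just described can be sketched as follows. This is an illustrative reconstruction: jina-embeddings-v3 is loaded through its Hugging Face interface, and a simple in-memory cosine-similarity search stands in for the vector database actually used.</p>
        <preformat># Illustrative sketch: 256-character chunks, 25-character overlap,
# top-5 retrieval by cosine similarity (configuration from Section 4.1).
import numpy as np
from transformers import AutoModel

encoder = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)

def chunk_text(text, size=256, overlap=25):
    """Split retrieved text into overlapping character chunks."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def top_k_chunks(query, chunks, k=5):
    """Return the k chunks most similar to the query embedding."""
    chunk_vecs = np.asarray(encoder.encode(chunks))    # vectorized knowledge
    query_vec = np.asarray(encoder.encode([query]))[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [chunks[i] for i in np.argsort(-sims)[:k]]</preformat>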
        <p>The few-shot prompting approach was evaluated by providing the model with 1, 5, and 10 examples. To generate the response, the LLM was set to a temperature value equal to 0.001. Figure 4 presents the prompt structure provided to the LLM to classify a given text as either factual or hallucinated.</p>
        <preformat>Prompt for Few-Shot Prompting with Knowledge (Figure 4)

I want you to act as a response judge.
Given a user query, a knowledge, and a response by an LLM, your
objective is to determine if the response is an hallucination or not.
In the context of NLP, an "hallucination" refers to a phenomenon
where the LLM generates text that is incorrect, nonsensical, or not
real. Based on your knowledge, on the knowledge provided, and on the
definition of hallucination provided, analyze the user query and the
response of the LLM, and answer the following question: is the
response factual or not?
BE CAREFUL: sometimes the knowledge may be empty or not useful, in
which case you have to respond based only on your knowledge.
Answer True if you consider the response factual, False otherwise.
You don't have to provide any explanation.
### EXAMPLE 1
User query: [USER QUERY]
Knowledge: [KNOWLEDGE]
LLM response: [LLM RESPONSE]
Answer: [ANSWER]
...
### EXAMPLE N
User query: [USER QUERY]
Knowledge: [KNOWLEDGE]
LLM response: [LLM RESPONSE]
Answer: [ANSWER]
### LLM TURN
User query: [USER QUERY]
Knowledge: [KNOWLEDGE]
LLM response: [LLM RESPONSE]
Answer:</preformat>
        <p>For the implementation of SelfCheckGPT, the variants employed for evaluation purposes are BERTScore, NLI, and LLM prompt (see Section 2.3). In accordance with the original SelfCheckGPT configuration, 5 responses per query were sampled using a temperature setting of 1.0 and a maximum output length of 128 tokens. Figure 5 illustrates the prompt provided to the LLM for the generation of these sampled responses.</p>
        <preformat>Prompt for Generating Sampled Responses with Knowledge (Figure 5)

Based on your knowledge and on the context provided, answer the
following question giving as much detail as you can.
Question: [QUESTION]
Context: [KNOWLEDGE]
Answer:</preformat>
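        <p>The sampling configuration, together with a consistency score in the spirit of the SelfCheckGPT NLI variant, might be sketched as follows; judge is the generation pipeline from the sketch in Section 3.2, and the MNLI model choice is an illustrative assumption rather than the variant’s released implementation.</p>
        <preformat># Illustrative sketch: knowledge-grounded sampling for SelfCheckGPT
# (5 samples, temperature 1.0, 128 new tokens, per Section 4.1).
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def sample_responses(question, knowledge, n=5):
    prompt = (
        "Based on your knowledge and on the context provided, answer the "
        "following question giving as much detail as you can.\n"
        f"Question: {question}\nContext: {knowledge}\nAnswer:"
    )
    outs = judge(prompt, num_return_sequences=n, do_sample=True,
                 temperature=1.0, max_new_tokens=128)
    return [o["generated_text"][len(prompt):].strip() for o in outs]

def nli_inconsistency(target_response, samples):
    """Mean contradiction probability between samples and the target,
    in the spirit of the SelfCheckGPT-NLI variant."""
    scores = []
    for s in samples:
        preds = nli({"text": s, "text_pair": target_response}, top_k=None)
        contra = next(p["score"] for p in preds
                      if p["label"] == "CONTRADICTION")
        scores.append(contra)
    return sum(scores) / len(scores)</preformat>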
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Datasets and Evaluation Metric</title>
        <p>For the experimental evaluation, three benchmark datasets for hallucination detection were selected. Each dataset includes a user query, the corresponding LLM-generated response, and a binary label indicating whether the response is factually accurate. The datasets employed are FactAlign [<xref ref-type="bibr" rid="ref10">10</xref>], FactBench [<xref ref-type="bibr" rid="ref11">11</xref>], and FELM [<xref ref-type="bibr" rid="ref12">12</xref>], all of which are described in detail in the following.</p>
        <p>FactAlign. For this dataset [<xref ref-type="bibr" rid="ref10">10</xref>], only queries formulated as questions were retained, to facilitate more effective knowledge retrieval through the Google Search API and to simplify both the factuality classification task performed by the LLM and the generation of sampled responses within the SelfCheckGPT framework. Following this filtering step, a random sample of 100 questions was selected. This limitation was imposed by constraints on computational resources and time, which required a balance between the number of examples and processing efficiency. Furthermore, to ensure comparability and consistency across the methods and each variant, a fixed random seed was used to guarantee the reproducibility of the 100 instances across all experiments.</p>
        <p>FactBench. This dataset was specifically developed to evaluate FactCheck-GPT, a multi-step framework designed for the detection and correction of factual errors in responses generated by LLMs [<xref ref-type="bibr" rid="ref11">11</xref>]. FactBench was constructed by integrating three distinct benchmark datasets aimed at hallucination detection:</p>
        <p>• Knowledge-based FacTool: Created to assess the performance of the FacTool framework, which evaluates the factual consistency of LLM-generated responses through external knowledge retrieval [<xref ref-type="bibr" rid="ref13">13</xref>]. This dataset was constructed by selecting 50 prompts from FactPrompts and fact-checking datasets such as TruthfulQA [<xref ref-type="bibr" rid="ref14">14</xref>]. For each prompt, responses were generated using ChatGPT and subsequently annotated by human evaluators with binary labels indicating factual correctness;</p>
        <p>• FELM-WK: Subset of the FELM dataset that will be detailed in the next paragraph;</p>
        <p>• HaluEval: This benchmark dataset for hallucination detection was constructed by initially considering 52,000 prompts, followed by a filtering procedure aimed at selecting those most likely to elicit hallucinated responses from an LLM. Specifically, each prompt was submitted to ChatGPT three times, and the average semantic similarity among the generated responses was calculated. The 5,000 prompts with the lowest semantic similarity scores were retained to ensure the dataset included only the most challenging queries. The selected prompts were then resubmitted to ChatGPT to obtain a second set of responses, which were manually annotated as either true or false based on their factual accuracy [<xref ref-type="bibr" rid="ref15">15</xref>].</p>
        <p>FactBench was made publicly available by the authors on GitHub and comprises a total of 4,835 examples, of which 3,838 are labeled as true and 995 as false (https://github.com/yuxiaw/Factcheck-GPT/blob/main/Factbench.jsonl). Each instance includes a user query, the corresponding response generated by an LLM, and a binary factuality label. For evaluation purposes, only the entries corresponding to user queries in the form of questions were retained. Due to computational constraints, a subset of 100 observations was selected. To mitigate the effects of class imbalance, an equal number of true and false instances (50 each) were randomly sampled. A fixed random seed was applied to ensure reproducibility and consistency across all experimental configurations.</p>
        <p>FELM. FELM is a multi-domain benchmark dataset designed for the evaluation of hallucination detection in LLMs, encompassing five distinct domains, each posing specific challenges for the models under analysis [<xref ref-type="bibr" rid="ref12">12</xref>]. The domains are defined as follows:</p>
        <p>• World knowledge: Includes questions related to general cultural and factual knowledge;</p>
        <p>• Science and technology: Comprises statements related to scientific facts or citations across disciplines such as physics and biology;</p>
        <p>• Reasoning: Contains prompts that require multi-step logical reasoning to produce a correct response;</p>
        <p>• Recommendation and writing: Involves open questions requiring the model to provide suggestions or generate creative or structured written content;</p>
        <p>• Math: Encompasses problems that necessitate both logical reasoning and mathematical skills to arrive at correct answers.</p>
        <p>FELM was constructed by aggregating prompts from diverse sources, which were then submitted to ChatGPT operating in a zero-shot configuration. The resulting responses were segmented into sentences, each of which was subsequently evaluated by a team of experts. The factual accuracy of each sentence was assessed based on comparison with reliable sources, and sentences were annotated as either true or false accordingly. A response was labeled as true only if all its sentences were assessed as accurate; otherwise, it was classified as false. The FELM dataset was obtained from Hugging Face and comprises a total of 847 instances (https://huggingface.co/datasets/hkust-nlp/felm). Each instance includes a user prompt, the corresponding response generated by the LLM, and a factuality label. Of these examples, 566 are labeled as factual, while 281 are labeled as non-factual.</p>
        <p>For evaluation, only the World knowledge and Science and technology domains were considered, as the remaining domains presented substantial limitations for the knowledge retrieval approach (e.g., mathematical prompts such as “What is the value of the expression 1! + 2! + 3! + ... + 10!”). As in the previous datasets, only prompts formulated as questions were retained. To mitigate class imbalance and accommodate computational constraints, a balanced subset of 100 samples (comprising 50 factual and 50 non-factual instances) was randomly selected. A fixed random seed was applied to ensure consistency across experiments.</p>
        <p>Evaluation metric. Since all the datasets employed in the evaluation are balanced, Accuracy was adopted as the primary performance metric. It is defined as follows: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP denotes factual responses correctly classified as factual, TN represents hallucinated responses correctly identified as hallucinations, FP corresponds to hallucinated responses incorrectly classified as factual, and FN refers to factual responses mistakenly classified as hallucinations.</p>
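        <p>As a minimal worked example, the metric can be computed from binary factuality labels as follows (True denoting factual, False hallucinated):</p>
        <preformat># Accuracy = (TP + TN) / (TP + TN + FP + FN) over binary labels.
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Example: three of four predictions match the gold labels.
print(accuracy([True, False, True, False],
               [True, False, False, False]))   # 0.75</preformat>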
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Results and Discussion</title>
        <p>To evaluate the impact of knowledge integration, the performance of both SelfCheckGPT and the few-shot prompting approach was evaluated in two configurations: with and without the inclusion of external knowledge. A summary of the comparative results is presented in Table 1. The notation W/O and W denotes whether the evaluated variant operates without or with integrated knowledge, respectively. For each variant and dataset, the version (with or without knowledge) that achieves the highest performance is marked with an asterisk (*); if both versions perform equally, no mark is applied.</p>
        <preformat>Model                Variant      FactAlign      FactBench      FELM
                                  W/O     W      W/O     W      W/O     W
SelfCheckGPT         BERTScore    59.0    61.0*  61.0*   60.0   56.0    59.0*
                     NLI          67.0    67.0   64.0    69.0*  67.0    71.0*
                     LLM Prompt   62.0    65.0*  57.0    63.0*  69.0*   68.0
Few-Shot Prompting   One-shot     50.0    54.0*  59.0    62.0*  56.0    59.0*
                     Five-shot    57.0*   55.0   62.0    64.0*  62.0    63.0*
                     Ten-shot     55.0    59.0*  53.0    65.0*  59.0    62.0*</preformat>
        <p>Table 1: Comparison between methods with and without integrated knowledge, to evaluate its impact on their performance (Accuracy, %).</p>
        <p>As shown in Table 1, the SelfCheckGPT framework consistently outperforms the few-shot prompting approach across all evaluated conditions. This result aligns with expectations, given that SelfCheckGPT is specifically designed for hallucination detection, whereas few-shot prompting is a more general-purpose methodology. Among the SelfCheckGPT variants, the NLI-based method demonstrates the highest overall effectiveness and efficiency, surpassing the LLM prompting variant across all three benchmark datasets. With regard to few-shot prompting, the ten-shot configuration achieves the best performance, followed by the five-shot and one-shot variants, respectively. This trend is consistent with the hypothesis that providing a greater number of examples enables the LLM to better internalize the task structure, thereby improving generalization and overall accuracy. In this regard, the strategy for selecting examples in the few-shot prompting approach could be improved. In the current evaluation, examples were randomly sampled from the datasets, which may result in class imbalance among the examples shown to the LLM, potentially affecting performance. Ensuring a balanced representation of classes in the selected examples would therefore be crucial for enhancing the robustness of the analysis in the few-shot prompting setting.</p>
        <p>Regarding the impact of knowledge integration, on the FactAlign dataset, the only method that underperforms when incorporating external knowledge is few-shot prompting with five examples; all other tested methods either match or surpass the performance of their counterparts without knowledge. A similar trend is observed on FactBench, where all approaches that leverage retrieved knowledge perform at least as well as, and often better than, those without knowledge integration. Finally, in the FELM dataset, incorporating external knowledge generally leads to performance improvements across methods, with the sole exception of SelfCheckGPT using the LLM Prompt, where performance declines by one percentage point after knowledge integration. Overall, these analyses suggest that integrating external knowledge generally enhances the performance of the evaluated approaches across all datasets, with only a few exceptions where a slight decrease in performance was observed. These performance declines may be attributed to limitations in the knowledge retrieval process. Specifically, only the first retrieved URL is considered (typically the most popular, but not necessarily the most informative). Additionally, the retrieval system occasionally fails to access Web content due to anti-bot mechanisms, such as CAPTCHA tests or restrictions on text extraction, which can hinder the acquisition of valuable external knowledge. Nevertheless, on average, approaches augmented with external knowledge outperform their non-augmented counterparts. This suggests that further improvements in the retrieval process could improve the overall effectiveness of these methods and lead to even greater performance gains.</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>5. Conclusions and Perspectives</title>
      <p>In this study, we introduced a fully automated knowledge retrieval framework that leverages a custom search engine interfacing with the Web via the Google Search API to extract relevant external information. The retrieved knowledge was subsequently integrated into two distinct methodologies: (i) few-shot prompting, which consists of providing a set of examples to guide task execution, and (ii) SelfCheckGPT, a hallucination detection framework that generates and compares multiple responses from an LLM to identify factual inconsistencies. The enhanced versions of both approaches, incorporating retrieved knowledge, were evaluated on three benchmark datasets for hallucination detection (FactAlign, FactBench, and FELM) spanning a diverse range of domains. The experimental results indicate that SelfCheckGPT consistently outperforms the few-shot prompting approach, demonstrating strong performance across all three benchmark datasets. Among its variants, the NLI configuration emerges as the most effective and computationally efficient. Moreover, the integration of external knowledge generally enhances the performance of the evaluated approaches compared to their counterparts without such integration. Nonetheless, the observed improvements could be further amplified by refining the knowledge retrieval process in future work. Specifically, challenges such as CAPTCHA mechanisms or site access restrictions that limit automated retrieval should be addressed. Additionally, the quality of the queries submitted to the search engine could be improved by leveraging LLMs to generate more precise and contextually rich queries, thereby yielding more informative results. Moreover, expanding the number of retrieved Web sources may lead to more comprehensive and accurate knowledge; for instance, retrieving the top five results could increase the relevance and diversity of the retrieved information. Finally, future research may also focus on further refining the knowledge integration process by leveraging more advanced and sophisticated RAG techniques [<xref ref-type="bibr" rid="ref5">5</xref>]. Enhancing integration within frameworks such as SelfCheckGPT, which has already demonstrated promising results in hallucination detection, holds significant potential. These advancements could support the development of a reliable, scalable, and efficient multi-domain hallucination detection system.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <title>This work was partly funded by: the European Union – Next Generation EU, Mission 4, Component 2, CUP: D53D23008480001 (20225WTRFN – KURAMi:</title>
        <p>Knowledge-based, explainable User empowerment in
Releasing private data and Assessing Misinformation in online
environments);7 ATEQC – Progetti di Ricerca di Ateneo –
Quota Competitiva (University Research Projects –
Competitive Funding Scheme) PriQuaDeS: Next-generation
Privacy- and Quality-preserving Decentralized Social Web</p>
      </sec>
      <sec id="sec-4-2">
        <title>7https://kurami.disco.unimib.it/</title>
      </sec>
      <sec id="sec-4-3">
        <title>Applications; the MUR under the grant “Dipartimenti di</title>
        <p>Eccellenza 2023-2027” of the Department of
Informatics, Systems and Communication (DISCo), University of
Milano-Bicocca, Italy. We further acknowledge ISCRA
for awarding this project access to the LEONARDO
supercomputer [16], owned by the EuroHPC Joint Undertaking,
hosted by CINECA (Italy).
Declaration on Generative AI</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention is All You Need,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , et al.,
          <source>A Survey on Hallucination in Large Language Models: Principles</source>
          , Taxonomy, Challenges, and Open Questions,
          <source>ACM Transactions on Information Systems</source>
          <volume>43</volume>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Viviani</surname>
          </string-name>
          ,
          <string-name>
            <surname>ROMCIR</surname>
          </string-name>
          <year>2025</year>
          :
          <article-title>Overview of the 5th Workshop on Reducing Online Misinformation Through Credible Information Retrieval</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>339</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sathe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sandosh</surname>
          </string-name>
          ,
          <article-title>Mitigating Hallucinations in Large Language Models: A Comprehensive Survey on Detection and Reduction Strategies</article-title>
          ,
          <source>in: International Conference on Sustainable Computing and Intelligent Systems</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          Retrieval-Augmented Generation for Large Language Models: A Survey, 2024. URL: https://arxiv.org/abs/2312.10997. arXiv:2312.10997.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          , M. Gales,
          <article-title>SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>9004</fpage>
          -
          <lpage>9017</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>On Hallucination and Predictive Uncertainty in Conditional Language Generation</article-title>
          , in: P. Merlo,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          , R. Tsarfaty (Eds.),
          <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2734</fpage>
          -
          <lpage>2744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] A. Azaria, T. Mitchell, The Internal State of an LLM Knows When It's Lying, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 967-976.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 12076-12100.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] C.-W. Huang, Y.-N. Chen, FactAlign: Long-form Factuality Alignment of Large Language Models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 16363-16375. URL: https://aclanthology.org/2024.findings-emnlp.955/. doi:10.18653/v1/2024.findings-emnlp.955.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Y. Wang, R. Gangi Reddy, Z. M. Mujahid, A. Arora, A. Rubashevskii, J. Geng, O. Mohammed Afzal, L. Pan, N. Borenstein, A. Pillai, I. Augenstein, I. Gurevych, P. Nakov, Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 14199-14230. URL: https://aclanthology.org/2024.findings-emnlp.830/. doi:10.18653/v1/2024.findings-emnlp.830.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] S. Chen, Y. Zhao, J. Zhang, I.-C. Chern, S. Gao, P. Liu, J. He, FELM: Benchmarking Factuality Evaluation of Large Language Models, in: Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Curran Associates Inc., Red Hook, NY, USA, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] I.-C. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He, G. Neubig, P. Liu, FacTool: Factuality Detection in Generative AI - A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios, 2023. URL: https://arxiv.org/abs/2307.13528. arXiv:2307.13528.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] S. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring How Models Mimic Human Falsehoods, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 3214-3252. URL: https://aclanthology.org/2022.acl-long.229/. doi:10.18653/v1/2022.acl-long.229.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] J. Li, X. Cheng, X. Zhao, J.-Y. Nie, J.-R. Wen, HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 6449-6464. URL: https://aclanthology.org/2023.emnlp-main.397/. doi:10.18653/v1/2023.emnlp-main.397.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. Turisini, G. Amati, M. Cestari, LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI applications, Journal of Large-Scale Research Facilities 9 (2024).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>