<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Channdeth Sok</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Luz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yacine Haddam</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ENSAE Paris, Institut Polytechnique de Paris</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Forvia Paris Tech Center, GIT, Immeuble Lumière, 40 avenue des Terroirs de France</institution>
          ,
          <addr-line>75012 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Identity-Aware AI workshop at 28th European Conference on Artificial Intelligence</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Keywords: Large Language Models, Retrieval-Augmented Generation, Hallucination Detection, Metamorphic Testing, Trust-</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and enterprise deployments. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. MetaRAG further localizes unsupported claims at the span level, enabling transparent visualization of potentially hallucinated segments and supporting configurable safeguards in sensitive use cases. Experiments on a proprietary enterprise dataset demonstrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073)</p>
      <p>Hallucinations are commonly divided into two types:
• Intrinsic hallucination: fabricated or contradictory information relative to the model’s internal knowledge.
• Extrinsic hallucination: generated information that conflicts with, misrepresents, or disregards externally provided context or retrieved documents.</p>
      <p>Retrieval-Augmented Generation (RAG) [8] aims to mitigate hallucinations by grounding model outputs in retrieved, up-to-date documents, as illustrated in Figure 1. By injecting retrieved text from reliable external sources and proprietary documents into the prompt, RAG improves factuality and domain relevance. While effective against intrinsic hallucinations, RAG remains susceptible to extrinsic hallucinations, especially when retrieved evidence is ignored, misinterpreted, or insufficient [9].</p>
      <p>
        Detecting hallucinations is particularly challenging in real-world settings, where RAG-based chatbots must respond to queries about unseen, proprietary, or confidential content for which gold-standard references are typically unavailable [10]. Many existing hallucination detection methods rely on gold-standard reference answers [5, 11], annotated datasets [12], or access to model internals such as hidden states or token log-probabilities [13, 1<xref ref-type="bibr" rid="ref4">4</xref>]. However, in enterprise settings, such internals are often inaccessible: many state-of-the-art LLMs (e.g., GPT-4, Claude) are proprietary and only accessible via APIs that expose the final output text but not intermediate computations, limiting the feasibility of these methods in practice [10].
      </p>
      <p>To address these challenges, we introduce MetaRAG: a metamorphic testing framework for detecting hallucinations in RAG-based conversational agents. MetaRAG is a zero-resource, black-box framework that decomposes answers into atomic factoids, applies controlled mutations (e.g., synonym and antonym substitutions), and verifies each mutated factoid against the retrieved context. Synonyms are expected to be entailed, while antonyms are expected to be contradicted. Hallucinations are flagged when outputs violate these well-defined metamorphic relations (MRs). Unlike prior approaches, MetaRAG does not require ground-truth labels, annotated corpora, or access to model internals, making it suitable for deployment in proprietary settings.</p>
      <p>We evaluate MetaRAG on a proprietary corpus, thus unseen during model training. Our results show
that MetaRAG reliably detects hallucinations, providing actionable insights for enhancing chatbot
reliability and trustworthiness. These results establish MetaRAG as a practical tool for reliable deployment,
and its span-level detection opens the door to identity-aware safeguards.</p>
      <p>Our contributions include:
• We introduce MetaRAG, a reference-free, black-box metamorphic testing framework for hallucination detection in RAG systems. It decomposes answers into factoids, applies linguistic transformations (synonym and antonym), and verifies them against retrieved context to produce a hallucination score.
• We implement a prototype and evaluate MetaRAG on a proprietary dataset, demonstrating its effectiveness in detecting hallucinations that occur when segments of generated responses diverge from the retrieved context.
• We analyze the performance–latency/cost trade-offs of MetaRAG and provide a consistency analysis to guide future research and practical deployment.
• We outline identity-aware safeguards (topic-aware thresholds, forced citation, escalation) that can consume MetaRAG’s scores; these safeguards are a deployment design and are not part of our empirical evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Definitions of Hallucination</title>
        <p>
          The term hallucination has been used with varying scope across natural language generation tasks. Some studies emphasize factuality, describing hallucinations as outputs that contradict established facts, i.e., inconsistencies with world knowledge or external ground truth [15, 16]. Others highlight faithfulness, where hallucinations occur when generated responses deviate from the user instruction or a reference text, often producing plausible but ungrounded statements, particularly in source-conditioned tasks such as summarization or question answering [17]. Beyond these two dimensions, researchers also note cases of incoherent or nonsensical text that cannot be clearly attributed to factuality or faithfulness criteria [6<xref ref-type="bibr" rid="ref5">, 5</xref>].
        </p>
        <p>Alternative terms have also been introduced. Confabulation draws on psychology to describe fluent but fabricated content arising from model priors [18], while fabrication is preferred by some to avoid anthropomorphic connotations [19, 20]. More recently, Chakraborty et al. [21] propose a flexible definition tailored to deployment settings, defining a hallucination as a generated output that conflicts with constraints or deviates from desired behavior in actual deployment, while remaining syntactically plausible under the circumstances.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Hallucination Detection in LLMs</title>
        <p>Building on these definitions, hallucinations have been recognized as a major challenge in text generation. Early work in machine translation and abstractive summarization described them as outputs that are not grounded in the input source [5, 11, 22], motivating the development of evaluation metrics and detection methods for faithfulness and factual consistency across natural language generation tasks.</p>
        <p>More recent reference-free (unsupervised or zero-reference) methods aim to detect hallucinations without gold-standard labels by analyzing the model’s own outputs. A prominent method is SelfCheckGPT [23], a zero-resource, black-box approach that queries the LLM multiple times with the same prompt and measures semantic consistency across responses. The intuition is that hallucinated content often leads to instability under stochastic re-generation; true facts remain stable, while fabricated ones diverge. Manakul et al. show that SelfCheckGPT achieves strong performance in sentence-level hallucination detection compared to gray-box methods, and emphasize that it requires no external database or access to model internals [23]. However, SelfCheckGPT may struggle when deterministic decoding or high model confidence leads to repeating the same incorrect output.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Metamorphic Testing</title>
        <p>Metamorphic Testing (MT) [24] was originally proposed in software engineering to address the oracle problem, in which the correct output is unknown. MT relies on metamorphic relations (MRs): transformations of the input with predictable effects on outputs, enabling error detection without access to ground truth [25]. In machine learning, MT has been applied to validate models in computer vision [26] (e.g., rotating an image should not change its predicted class) and NLP [27].</p>
        <p>
          In hallucination detection for LLMs, MetaQA [
          <xref ref-type="bibr" rid="ref12">28</xref>
          ] leverages MRs by generating paraphrased or antonym-based question variants and verifying whether answers satisfy expected semantic or logical constraints. Relying purely on prompt mutations and consistency checks, MetaQA achieves higher precision and recall than SelfCheckGPT on open-domain QA.
        </p>
        <p>
          Researchers have also adapted MT for more complex conversational and reasoning settings. MORTAR [
          <xref ref-type="bibr" rid="ref13">29</xref>
          ] applies dialogue-level perturbations and knowledge-graph-based inference to multi-turn systems, detecting up to four times more unique bugs than single-turn MT. Drowzee [
          <xref ref-type="bibr" rid="ref14">30</xref>
          ] uses logic programming to construct temporal and logical rules from Wikipedia, generating fact-conflicting test cases and revealing rates of 24.7% to 59.8% across six LLMs in nine domains [28].
        </p>
        <p>These works highlight the promise of MT for hallucination detection, but they primarily target open-book QA or multi-turn dialogue, often over short, single-sentence outputs. Prior studies have not addressed hallucination detection in retrieval-augmented generation (RAG) scenarios over proprietary corpora, a setting in which ground-truth references are unavailable and model internals are inaccessible. MetaRAG builds on MT by decomposing answers into factoids and designing MRs tailored to factual consistency against retrieved evidence in a zero-resource, black-box setting.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. MetaRAG: Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>
          Building on the metamorphic testing (MT) methodology to detect hallucinations in LLMs introduced
by MetaQA [
          <xref ref-type="bibr" rid="ref12">28</xref>
          ], MetaRAG advances this approach to detect hallucinations in retrieval-augmented
generation (RAG) settings by introducing a context-based verification stage. A metamorphic testing
layer operates on top of the standard RAG pipeline to automatically detect hallucinated responses.
Figure 2 outlines the workflow.
        </p>
        <p>Given a user query q, the system retrieves the top-m most relevant chunks from a database, forming the context C = {c1, c2, …, cm}. The LLM generates an initial answer a using (q, C) as input. MetaRAG then decomposes a into factoids, applies controlled metamorphic transformations to produce variants (synonym and antonym), verifies each variant against C, and aggregates the results into a hallucination score (Algorithm 1).</p>
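        <p>As an illustrative sketch (not the authors' implementation), the four-stage loop can be expressed in Python; here decompose, mutate, and verify are hypothetical stand-ins for the LLM calls described in Steps 1–3:</p>

```python
def metarag_score(answer, context, decompose, mutate, verify, k=2):
    """Compute factoid scores and the response-level score h(q, C, a).

    decompose(answer)            -> list of factoid strings (Step 1)
    mutate(factoid, relation, k) -> k "syn" or "ant" variants (Step 2)
    verify(variant, context)     -> "yes", "no", or "not_sure" (Step 3)
    """
    # Penalty per variant: synonyms should be entailed, antonyms contradicted.
    penalty = {
        ("syn", "yes"): 0.0, ("syn", "no"): 1.0, ("syn", "not_sure"): 0.5,
        ("ant", "no"): 0.0, ("ant", "yes"): 1.0, ("ant", "not_sure"): 0.5,
    }
    factoid_scores = []
    for factoid in decompose(answer):
        penalties = [
            penalty[(relation, verify(variant, context))]
            for relation in ("syn", "ant")
            for variant in mutate(factoid, relation, k)
        ]
        factoid_scores.append(sum(penalties) / len(penalties))  # Step 4, Eq. (1)
    return max(factoid_scores), factoid_scores                  # Step 4, Eq. (2)
```

        <p>With real LLM-backed callables, the returned maximum is compared against a threshold to flag the response.</p>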
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Step 1: Factoid decomposition</title>
        <p>Given an answer a, we first decompose it into a set of factoids, defined as atomic, independently verifiable facts, denoted by F = {f1, …, fn}. Each factoid corresponds to a single factual statement that cannot be further divided without losing meaning, such as a subject-predicate-object triple or a scoped numerical or temporal claim. Representing an answer a at the factoid level enables fine-grained verification in subsequent steps, allowing localized hallucinations to be marked inside longer answers.</p>
        <p>We obtain F using an LLM-based extractor with a fixed prompt that enforces one proposition per line, prohibits paraphrasing or inference beyond a, and performs co-reference resolution. The full prompt template is provided in the supplementary material.</p>
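        <p>A minimal sketch of this extraction step, assuming a hypothetical prompt wording (the authors' exact template is in the supplementary material) and a one-proposition-per-line output format:</p>

```python
# Illustrative prompt; the actual template used by the authors is in the
# supplementary material and may differ in wording.
FACTOID_PROMPT = (
    "Split the answer into atomic factoids, one per line. "
    "Do not paraphrase or infer beyond the answer; resolve all pronouns.\n"
    "Answer: {answer}\nFactoids:"
)

def parse_factoids(llm_output):
    """Parse one-proposition-per-line extractor output into F = {f1, ..., fn}."""
    return [line.strip(" -\t") for line in llm_output.splitlines() if line.strip()]
```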
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Step 2: Mutation Generation</title>
        <p>
          For each factoid (hereafter fact) from Step 1, MetaRAG applies metamorphic mutations to generate perturbed variants of the original claim. This step is grounded in the principle of metamorphic testing, where controlled semantic transformations are used to probe model consistency and expose hallucinations [2<xref ref-type="bibr" rid="ref8">8</xref>].
        </p>
        <p>Formally, for each factoid fi ∈ {f1, …, fn}, we construct variants using two relations:</p>
        <p>• Synonym Mutation: This relation substitutes key terms in fi with appropriate synonyms, yielding paraphrased factoids fi,j^syn that preserve the original semantic meaning. These assess the model’s ability to recognize reworded yet factually equivalent statements.
• Antonym Mutation: This relation replaces key terms in fi with antonyms or negations, producing factoids fi,j^ant that are semantically opposed to the original. These serve as adversarial tests to ensure the model does not support clearly contradictory information.</p>
        <p>Let k denote the number of mutations generated by each relation. The mutation set for fi is therefore F_i = {fi,1^syn, …, fi,k^syn, fi,1^ant, …, fi,k^ant}.</p>
        <p>By construction, if fi is correct and supported by the retrieved context C, then each fi,j^syn should be entailed by C, whereas each fi,j^ant should be contradicted by C.</p>
        <p>Mutations are generated by prompting an LLM with templates that explicitly instruct synonymous
or contradictory outputs while preserving atomicity and relevance; the exact prompt templates appear
in the supplementary material.</p>
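        <p>A sketch of this mutation step under assumed prompt wording (the exact templates are in the supplementary material); generate stands in for a single LLM call:</p>

```python
# Illustrative templates; the authors' exact prompts are in the supplementary material.
MUTATION_PROMPTS = {
    "syn": "Rewrite the factoid using synonyms, preserving its exact meaning:\n{factoid}",
    "ant": "Rewrite the factoid so it asserts the opposite (antonym or negation):\n{factoid}",
}

def mutation_set(factoid, generate, k):
    """Build F_i: k synonym and k antonym variants of a single factoid."""
    return {
        relation: [generate(MUTATION_PROMPTS[relation].format(factoid=factoid))
                   for _ in range(k)]
        for relation in ("syn", "ant")
    }
```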
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Step 3: Factoid Verification</title>
        <p>Each mutated factoid fi,j^syn and fi,j^ant is then verified by an LLM conditioned on the context C (treated as ground truth). The LLM returns one of three decisions: Yes (entailed by C), No (contradicted by C), or Not sure (insufficient evidence). We then assign a penalty score p ∈ {0, 0.5, 1} based on the decision and the mutation type:</p>
        <p>This penalty assignment quantifies semantic (in)consistency at the variant level: correct entailment for synonyms and correct contradiction for antonyms yield zero penalty, while the opposite yields the maximal penalty. In Step 4, we aggregate these penalties over all variants of each factoid to compute a fact-level hallucination score.</p>
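        <p>The penalty table itself did not survive into this text; the mapping below is our reading of the {0, 0.5, 1} scheme described above (zero for the expected verdict, one for the opposite, 0.5 for “Not sure”):</p>

```python
def penalty(mutation_type, decision):
    """Penalty p in {0, 0.5, 1} for one verified variant.

    Synonym variants are expected to be entailed ("yes"), antonym variants
    contradicted ("no"); the opposite verdict earns the maximal penalty.
    """
    if decision == "not sure":
        return 0.5
    expected = "yes" if mutation_type == "syn" else "no"
    return 0.0 if decision == expected else 1.0
```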
        <sec id="sec-3-4-1">
          <title>Algorithm 1: MetaRAG Hallucination Detection</title>
          <p>1: Input: generated answer a, query q, context C, number of mutations k, threshold τ
2: Output: hallucination score h(q, C, a), factoid scores {si}
3: Factoid Extraction:
4: F ← FactoidsDecomposition(a)
5: for each factoid fi in F do
6: Mutation Generation: …</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Step 4: Score Calculation</title>
        <p>To quantify hallucination risk, we calculate a hallucination score si for each factoid fi. This yields a granular diagnostic that pinpoints which claims are potentially unreliable. The score for each factoid is defined as the average penalty across the 2k metamorphic transformations (synonym and antonym) of fi:
si = (1/(2k)) ( ∑_{j=1}^{k} pi,j^syn + ∑_{j=1}^{k} pi,j^ant ),   (1)</p>
        <p>where pi,j^syn and pi,j^ant are the penalties assigned in Step 3 to the j-th synonym and antonym variants of the i-th factoid. A score si = 0 indicates full consistency and thus no hallucination, while si = 1 indicates a highly probable hallucination.</p>
        <p>Response Hallucination score: Instead of a simple average, the hallucination score for the entire
response is defined as the maximum score found among all the individual factoids. This metric
ensures that a single, severe hallucination in any part of the response will result in a high overall score,
accurately reflecting the unreliability of the entire answer.</p>
        <p>h(q, C, a) = max_{1 ≤ i ≤ n} si,   (2)</p>
        <p>
          where n is the number of decomposed factoids. A response can be flagged as containing hallucination if h(q, C, a) exceeds a predefined confidence threshold τ ∈ [0, 1] (e.g., 0.5).
        </p>
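        <p>Equations (1) and (2) translate directly into code; a minimal sketch:</p>

```python
def factoid_score(syn_penalties, ant_penalties):
    """Eq. (1): s_i, the average penalty over the 2k variants of one factoid."""
    k = len(syn_penalties)
    assert len(ant_penalties) == k, "expect k synonym and k antonym penalties"
    return (sum(syn_penalties) + sum(ant_penalties)) / (2 * k)

def response_score(factoid_scores):
    """Eq. (2): h(q, C, a) is the maximum score over all n factoids."""
    return max(factoid_scores)

def is_hallucinated(factoid_scores, tau=0.5):
    """Flag the response once h(q, C, a) reaches the confidence threshold tau."""
    return response_score(factoid_scores) >= tau
```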
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Identity-Aware Safeguards for Deployment</title>
        <p>While MetaRAG is a general-purpose hallucination detector, its factoid-level scores can be directly integrated into identity-aware deployment policies. Importantly, no protected attributes are inferred or stored; instead, only the topic of the query or retrieved context (e.g., pregnancy, refugee rights, labor eligibility) is used as a deployment signal. Scope: the safeguards described here represent a deployment design that consumes MetaRAG’s scores; they are not part of the empirical evaluation reported in Section 4.</p>
        <p>
          Each factoid receives a score si ∈ [0, 1], where si = 0 indicates full consistency with the retrieved context and si = 1 indicates strong evidence of hallucination. The overall response score h(q, C, a) thus represents the risk level of the most unreliable claim: higher values correspond to higher hallucination risk.
        </p>
        <p>These scores could enable deployment-time safeguards through the following hooks:
1. Topic detection. A lightweight topic classifier or rule-based tagger can assign coarse domain labels (e.g., healthcare, migration, labor) to the query or retrieved context.
2. Topic-aware thresholds. A response is flagged if h(q, C, a) ≥ τ. Thresholds can be adapted by domain, e.g., τ_general = 0.5 for generic queries, and a stricter τ_identity = 0.3 for sensitive domains.
3. Span highlighting and forced citation. For flagged responses, MetaRAG highlights unsupported spans and enforces inline citations to retrieved evidence, to improve transparency and calibrate user trust.
4. Escalation. If hallucinations persist above threshold in identity-sensitive domains, the system may abstain, regenerate with a stricter prompt, or escalate to human review.
5. Auditing. Logs of flagged spans, hallucination scores, and topic labels can be maintained for post-hoc fairness, compliance, and safety audits.</p>
        <p>In this way, higher hallucination scores are systematically translated into stronger protective actions,
with more conservative safeguards applied whenever queries touch on identity-sensitive contexts.</p>
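        <p>The hooks can be combined into a small dispatch policy. The sketch below is a hypothetical wiring of hooks 2–4; the topic tags, threshold values, and action names are illustrative, following the examples given in the text:</p>

```python
THRESHOLDS = {"general": 0.5, "identity": 0.3}               # topic-aware thresholds
SENSITIVE_TOPICS = {"pregnancy", "asylum/refugee", "labor"}  # illustrative tags

def safeguard_action(score, topic):
    """Map a response-level score and a topic tag to a protective action."""
    tau = THRESHOLDS["identity" if topic in SENSITIVE_TOPICS else "general"]
    if score < tau:
        return "serve"              # below threshold: return the answer as-is
    if topic in SENSITIVE_TOPICS:
        return "escalate"           # abstain / regenerate / human review
    return "highlight_and_cite"     # flag spans and force inline citations
```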
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We conducted experiments to evaluate MetaRAG on its ability to detect hallucinations in retrieval-augmented generation (RAG). The evaluation simulates a realistic enterprise deployment setting, in which a chatbot serves responses generated from internal documentation. Our focus is on the detection stage, that is, identifying when an answer contains unsupported or fabricated information. Prevention and mitigation are important, but they are outside the scope of this work.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The evaluation dataset is a proprietary collection of 23 internal enterprise documents, including policy manuals, procedural guidelines, and analytical reports, none of which were seen during LLM training. Each document was segmented into chunks of a few hundred tokens, and retrieval used cosine similarity over text-embedding-3-large embeddings, with the top m = 3 chunks appended to each query.</p>
        <p>We then collected a set of user queries and corresponding chatbot answers. Each response was labeled by human annotators as either hallucinated or not, using the retrieved context as the reference. The final evaluation set contains 67 responses, of which 36 are labeled as not hallucinated and 31 as hallucinated.</p>
        <p>To preserve confidentiality, we do not release the full annotated dataset. However, the complete
annotation guidelines are included in the supplementary material.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation Protocol</title>
        <p>MetaRAG produces fine-grained, factoid-level hallucination scores, whereas the available ground-truth labels are at the response level. To align with these existing labels, we evaluate MetaRAG as a binary classifier by thresholding the hallucination score h(q, C, a) at τ = 0.5. We report standard classification metrics: Precision, Recall, F1 score, and Accuracy. Latency is also recorded to assess feasibility for real-time deployment.
4.2.1. Case Studies in Identity-Sensitive Domains
Beyond quantitative evaluation, we also provide qualitative illustrations of MetaRAG in identity-sensitive scenarios. To illustrate how MetaRAG’s span-level scores can enable identity-aware safeguards without inferring protected attributes, we present two stylized examples. These are not part of the quantitative evaluation in Section 4, but highlight potential deployment scenarios.</p>
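        <p>The thresholded evaluation can be sketched as follows; the convention that a label of 1 means hallucinated is an assumption of this sketch:</p>

```python
def classification_metrics(scores, labels, tau=0.5):
    """Evaluate response-level scores as a binary classifier at threshold tau."""
    preds = [1 if s >= tau else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return precision, recall, f1, accuracy
```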
        <p>Healthcare (pregnancy). A user asks: “Can pregnant women take ibuprofen for back pain?” The model answers: “Yes, ibuprofen is safe throughout pregnancy.” However, the retrieved context specifies that ibuprofen is contraindicated in the third trimester. MetaRAG flags the span “safe throughout pregnancy” with a high factoid score (s = 0.92), yielding a response-level score of 0.92. Under the policy hooks described in Section 3.6, the topic tag pregnancy triggers a stricter threshold (τ_identity = 0.3, lower than the general case), span highlighting, a forced citation requirement, and possible escalation to human review.</p>
        <p>Migration (refugee rights). A user asks: “Do LGBTQ+ refugees automatically receive protection in country X?” The model claims that such protection is automatic, but the retrieved legal text provides no evidence of this. MetaRAG flags the unsupported claim “automatically receive protection” with a moderate score (s = 0.5), yielding a response-level score of 0.5. Although this score would sit at the decision boundary under a general threshold (τ_general = 0.5), the stricter identity-aware threshold (τ_identity = 0.3) ensures it is flagged for this case. Under the policy hooks, the topic tag asylum/refugee enforces citation and may escalate the response to a human reviewer. In a chatbot deployment, the system would abstain from returning the unsupported answer and instead notify the user that expert verification is required.</p>
        <p>These qualitative vignettes complement our quantitative evaluation by showing how MetaRAG’s flagged spans can be turned into concrete safeguards in identity-sensitive deployments.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Ablation Study</title>
      <p>To understand the contribution of individual design choices, we perform a set of ablation experiments
using the private dataset.</p>
      <sec id="sec-5-1">
        <title>5.1. Ablation Study Design</title>
        <p>We evaluate 26 configurations of MetaRAG, each defined by a combination of:
• Number of variants per relation: k ∈ {2, 5}
• Factoid-decomposition model: gpt-4.1 or gpt-4.1-mini from OpenAI
• Temperature for mutation generation: {0.0, 0.7}
• Mutation-generation model: gpt-4.1 or gpt-4.1-mini
• Verifier model: gpt-4.1-mini, gpt-4.1, or the multi ensemble (gpt-4.1-nano, gpt-4.1-mini, gpt-4.1, Claude Sonnet 4)</p>
        <p>Since the evaluation task is binary classification, we report Precision, Recall, F1 score, and Accuracy, along with latency (lower is better).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>To provide a comprehensive view of performance trade-offs, we report the top-4 configurations separately for each of three primary metrics: F1 score, Precision, and Recall (Table 2). The configuration notation follows the format:</p>
        <p>Decomposition Model / Generation Model / Verifier /  / Temperature.</p>
        <p>For example, mini/41/multi/2/0 indicates that the factoid decomposition model is “mini”, the variant generation model is “41”, the verifier is “multi”, there are k = 2 variants per relation, and the temperature is 0.0.</p>
        <p>Several configurations appear in more than one top-4 list, reflecting balanced performance across metrics. For instance, ID 5 (mini/41/multi/2/0) ranks first in both F1 score and Recall, while maintaining competitive Precision.</p>
        <p>The most promising configurations are further examined in Section 5.3 to verify stability under multiple seeds.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Consistency Check</title>
        <p>To verify the robustness of our results, each top configuration (selected based on F1 score, Precision, Recall, and token usage) is rerun under identical conditions using five different random seeds. This procedure serves three purposes:
• To ensure that high performance is not attributable to random initialization or favorable seeds.
• To quantify variability across runs with the same configuration by reporting the standard deviation for each metric.
• To assess stability using the coefficient of variation (CV), defined as the ratio of the standard deviation to the mean (CV = σ/μ), where lower values indicate greater consistency.</p>
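        <p>The stability statistic reduces to a few lines with Python's standard library:</p>

```python
import statistics

def coefficient_of_variation(values):
    """CV = sigma / mu over repeated runs; lower values mean greater consistency."""
    return statistics.stdev(values) / statistics.mean(values)
```

        <p>For example, five F1 readings such as [0.90, 0.90, 0.90, 0.92, 0.88] (illustrative values, not taken from the paper) give a CV of about 1.6%, below the 2% level reported for most top configurations.</p>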
        <p>Across all metrics, the top configurations demonstrate strong reproducibility, with the majority
exhibiting a CV below 2%. In particular, configurations 18 and 16 achieve both high F1 scores and low
variability, indicating that they are not only accurate but also stable across repeated trials.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Pareto Front Analysis</title>
        <p>Following the consistency check (Section 5.3), we restrict the Pareto front analysis to the four most stable top-performing configurations selected by F1 score. We analyze the trade-off between hallucination detection performance and efficiency using Pareto frontiers. A configuration is Pareto-optimal if no other configuration achieves strictly higher F1 while being no worse in cost metrics; the same applies to the precision–recall trade-off.</p>
        <p>Figure 4 presents the Pareto fronts for our primary detection metric (F1 score) with respect to (i) average token usage, (ii) average total execution time (seconds), and (iii) the precision–recall trade-off. The Pareto front highlights configurations that offer the best possible balance between accuracy and efficiency, enabling deployment choices aligned with cost or latency constraints.</p>
        <p>Several top-ranked configurations (IDs 5, 18, 19, 16) lie on the Pareto front across these views, indicating that they offer competitive accuracy without excessive cost. The corresponding Pareto analyses for precision and recall metrics are provided in the Supplementary Material.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Practical Implications</title>
        <p>Integrating hallucination detection into enterprise RAG systems offers several advantages:
• Risk Mitigation: Early detection of unsupported answers mitigates the spread of misinformation in both customer-facing and internal applications.
• Regulatory Compliance: Many industries, such as healthcare and finance, require verifiable information; automated detection supports regulatory compliance.
• Operational Efficiency: Detecting hallucinations simultaneously with content delivery reduces the need for costly downstream human verification.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Ethical Considerations</title>
        <p>Beyond technical performance, hallucination detection intersects directly with questions of fairness,
accountability, and identity harms. Hallucinations in chatbot systems pose risks that extend beyond
factual inaccuracies: they can reinforce harmful stereotypes, undermine user trust, and misrepresent
marginalized communities in identity-sensitive contexts.</p>
        <p>
          • Reinforced stereotypes: Language models are known to reproduce and amplify societal biases, as demonstrated by benchmarks such as StereoSet [31] and WinoBias [3<xref ref-type="bibr" rid="ref2">2</xref>]. In identity-sensitive deployments, hallucinated outputs risk reinforcing these biases in subtle but harmful ways.
• Trust erosion: Chatbots are only adopted at scale in high-stakes domains if users trust their outputs. Surveys on hallucination consistently highlight that exposure to unsupported or fabricated content undermines user trust in LLM systems [6, 7].
• Identity harms: Misrepresentations in generated responses may distort personal narratives or marginalize underrepresented groups, aligning with broader critiques that technical systems can reproduce social inequities if identity considerations are overlooked [33, 34].
        </p>
        <p>By detecting hallucinations in a black-box, reference-free manner, MetaRAG can support safer deployment of RAG-based systems, particularly in settings where fairness, identity, and user well-being are at stake.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Limitations and Future Work</title>
        <p>While MetaRAG demonstrates strong hallucination detection performance, several limitations remain:
• Dataset Scope: The study relies on a private, domain-specific dataset, which may limit external
validity. Future work should focus on curating or constructing public benchmarks designed to avoid
overlap with LLM pretraining corpora, enabling more robust generalization.
• Annotation Granularity: We lack factoid-level ground truth, which reduces our ability to assess
fine-grained reasoning accuracy. Providing such annotations in future datasets would support deeper
consistency evaluations.
• Policy Hooks Not Evaluated: The identity-aware deployment hooks introduced in Section 3.6
are presented only as a design concept. In our implementation, we used a fixed decision threshold of
0.5 across all queries. Future research should implement and measure the effectiveness of
topic-aware thresholds, forced citation, and escalation strategies in real-world chatbot deployments.
• Topic as Proxy (Design Limitation): In Section 3.6, we suggest topic tags (e.g., pregnancy,
asylum, labor) as privacy-preserving signals for stricter safeguards, rather than inferring protected
attributes. This was not implemented in our experiments. As a design idea, it may also miss cases
where risk is identity-conditioned but the query appears generic. Future work should explore how
to operationalize such topic-aware safeguards and investigate richer, privacy-preserving signals that
better capture identity-sensitive risks.
• Model Dependency: Current findings hinge on specific LLMs (GPT-4.1 variants). As models
evolve, the behavior of MetaRAG may shift. Future efforts should validate MetaRAG across
open-source and emerging models to reinforce its robustness.
• Efficiency and Cost: The verification steps add computational overhead, possibly impacting
deployment in latency-sensitive environments. Investigating lighter-weight verification strategies
and adaptive scheduling techniques could help mitigate this trade-off.
• Context Modality: Our current formulation assumes that the retrieved context is textual,
enabling direct comparison through language-based verification. However, RAG pipelines
increasingly operate over multimodal contexts such as tables, structured knowledge bases, or images.
Future work should extend MetaRAG to handle non-textual evidence, requiring modality-specific
verification strategies (e.g., table grounding, multimodal alignment).</p>
        <p>Together, these limitations highlight both immediate boundaries and promising future directions for
enhancing MetaRAG’s reliability, fairness, and efficiency.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Hallucinations in RAG-based conversational agents remain a significant barrier to trustworthy
deployment in real-world applications. We introduced MetaRAG, a metamorphic testing framework for
hallucination detection in retrieval-augmented generation (RAG) that operates without requiring ground
truth references or access to model internals. Our experiments show that MetaRAG achieves strong
detection performance on a challenging proprietary dataset. Beyond general reliability, MetaRAG’s
factoid-level localization further supports identity-aware deployment by surfacing unsupported claims
in sensitive domains (e.g., healthcare, migration, labor). Looking ahead, we see MetaRAG as a step
toward safer and fairer conversational AI, where hallucinations are not only detected but also connected
to safeguards that protect users in identity-sensitive contexts. This connection to identity-aware AI
ensures that hallucination detection does not treat all users as homogeneous but provides safeguards
that reduce disproportionate risks for identity-specific groups.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>This work was carried out during Channdeth’s internship at Forvia. The authors thank the Forvia team
in Bercy, Paris, for their guidance and support.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly to improve the clarity
and grammar of certain sentences and to rephrase text for better readability. After using these tools, the
authors reviewed, edited, and verified all content to ensure accuracy and originality, and they take full
responsibility for the publication’s content.
</p>
      <p>[12] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large
language models for news summarization, Transactions of the Association for Computational
Linguistics 12 (2024) 39–57. URL: https://aclanthology.org/2024.tacl-1.3/. doi:10.1162/tacl_a_00632.
[13] H. Rashkin, V. Nikolaev, M. Lamm, L. Aroyo, M. Collins, D. Das, S. Petrov, G. S. Tomar, I. Turc,
D. Reitter, Measuring attribution in natural language generation models, Computational Linguistics
49 (2023) 777–840. URL: https://aclanthology.org/2023.cl-4.2/. doi:10.1162/coli_a_00486.
[14] N. Dziri, E. Kamalloo, S. Milton, O. Zaiane, M. Yu, E. M. Ponti, S. Reddy, FaithDial: A faithful
benchmark for information-seeking dialogue, Transactions of the Association for Computational
Linguistics 10 (2022) 1473–1490. URL: https://aclanthology.org/2022.tacl-1.84/. doi:10.1162/tacl_a_00529.
[15] Wang et al., Factuality of large language models: A survey, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen
(Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 19519–19529. URL:
https://aclanthology.org/2024.emnlp-main.1088/. doi:10.18653/v1/2024.emnlp-main.1088.
[16] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin,
T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges,
and open questions, ACM Transactions on Information Systems 43 (2025) 1–55. URL:
http://dx.doi.org/10.1145/3703155. doi:10.1145/3703155.
[17] V. Rawte, A. Sheth, A. Das, A survey of hallucination in large foundation models, 2023. URL:
https://arxiv.org/abs/2309.05922. arXiv:2309.05922.
[18] S. et al., Confabulation: The surprising value of large language model hallucinations, in:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 14274–14284.</p>
      <p>URL: https://aclanthology.org/2024.acl-long.770/. doi:10.18653/v1/2024.acl-long.770.
[19] R. Azamfirei, S. R. Kudchadkar, J. Fackler, Large language models and the perils of their
hallucinations, Critical Care 27 (2023).
[20] N. Maleki, B. Padmanabhan, K. Dutta, AI hallucinations: A misnomer worth clarifying, 2024. URL:
https://arxiv.org/abs/2401.06796. arXiv:2401.06796.
[21] N. Chakraborty, M. Ornik, K. Driggs-Campbell, Hallucination detection in foundation models for
decision-making: A flexible definition and review of the state of the art, ACM Comput. Surv. 57
(2025). URL: https://doi.org/10.1145/3716846. doi:10.1145/3716846.
[22] P. Koehn, R. Knowles, Six challenges for neural machine translation, in: T. Luong, A. Birch,
G. Neubig, A. Finch (Eds.), Proceedings of the First Workshop on Neural Machine Translation,
Association for Computational Linguistics, Vancouver, 2017, pp. 28–39. URL: https://aclanthology.org/W17-3204/. doi:10.18653/v1/W17-3204.
[23] P. Manakul, A. Liusie, M. J. F. Gales, SelfCheckGPT: Zero-resource black-box hallucination
detection for generative large language models, 2023. URL: https://arxiv.org/abs/2303.08896.
arXiv:2303.08896.
[24] T. Y. Chen, S. C. Cheung, S. M. Yiu, Metamorphic testing: A new approach for generating next test
cases, 2020. URL: https://arxiv.org/abs/2002.12543. arXiv:2002.12543.
[25] S. Segura, G. Fraser, A. B. Sanchez, A. Ruiz-Cortés, A survey on metamorphic testing, IEEE</p>
      <p>Transactions on Software Engineering 42 (2016) 805–824. doi:10.1109/TSE.2016.2532875.
[26] A. Dwarakanath, M. Ahuja, S. Sikand, R. M. Rao, R. P. J. C. Bose, N. Dubash, S. Podder, Identifying
implementation bugs in machine learning based image classifiers using metamorphic testing,
in: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and
Analysis, ISSTA ’18, ACM, 2018, pp. 118–128. URL: http://dx.doi.org/10.1145/3213846.3213858.
doi:10.1145/3213846.3213858.
[27] M. T. Ribeiro, T. Wu, C. Guestrin, S. Singh, Beyond accuracy: Behavioral testing of NLP models
with CheckList, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, Association for Computational
Linguistics, Online, 2020, pp. 4902–4912. URL: https://aclanthology.org/2020.acl-main.442/. doi:10.18653/v1/2020.acl-main.442.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Prompt Templates</title>
      <sec id="sec-10-1">
        <title>A.1. Factoid Decomposition Prompt</title>
        <p>We provide the exact template used to extract atomic factoids from model responses.
export const factExtractionPrompt = (inputText: string) =&gt; `
You are a fact extraction assistant.</p>
        <p>Your task is to extract all specific factual propositions from the given text.</p>
        <p>Instructions:
1. Extract every distinct factual statement present in the input, even if the statement is incorrect, ambiguous,
or nonsensical.
2. Each extracted proposition must be a complete, standalone sentence.
3. Each sentence must express only one atomic fact. (An atomic fact cannot be split into simpler factual
statements.)
4. If a sentence contains multiple facts, split them into multiple atomic fact sentences.
5. Do not paraphrase, rewrite, summarize, interpret, infer, or judge any part of the input. Only extract and
restate what is explicitly written.
6. Do not omit or correct any statements, regardless of their factual accuracy.
7. Output your answer as a JSON array of strings, with each element being one atomic factual sentence.
Example:
Input:
Marie Curie discovered polonium and radium, and Albert Einstein developed the theory of relativity in 1905.
Output:
[
"Marie Curie discovered polonium.",
"Marie Curie discovered radium.",
"Albert Einstein developed the theory of relativity in 1905."
]
Now, extract atomic facts from this text:
Input:
${inputText}
`;</p>
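        <p>As an illustration of how the template's output could be consumed, the sketch below parses the requested JSON array of atomic factoids. The parser (`parseFactoids`) and its tolerance for surrounding prose are our assumptions, not part of the released implementation.</p>
        <p>
```typescript
// Hypothetical parser for the JSON-array output requested by the
// decomposition prompt above (illustrative, not the paper's code).
function parseFactoids(raw: string): string[] {
  // Models sometimes wrap the array in prose or code fences, so slice
  // from the first "[" to the last "]" before parsing.
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start === -1 || end <= start) throw new Error("no JSON array found");
  const parsed = JSON.parse(raw.slice(start, end + 1));
  if (!Array.isArray(parsed)) throw new Error("expected a JSON array");
  return parsed.map(String);
}

const reply =
  'Here are the facts:\n["Marie Curie discovered polonium.", "Marie Curie discovered radium."]';
const factoids = parseFactoids(reply); // two atomic factoids
```
</p>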
      </sec>
      <sec id="sec-10-2">
        <title>A.2. Mutation Generation Prompts</title>
        <p>We provide both synonym and antonym generation templates
export const antonymPrompt = (
count: Integer,
question: string,
factoid: string
) =&gt; `
You will be given a question and a factual answer (factoid).</p>
        <p>Your task is to generate ${count} *negations* (contradictory statements) of the factoid, based on the context of
the question.</p>
        <p>Instructions:
- Each negation must directly contradict the factoid, focusing on what the question asks.
- Do not add new information not present in the factoid or question.
- Do not use double negations or wording that preserves the original meaning.
- Each negation must be a meaningful, grammatically correct sentence.
- Do not introduce unrelated facts.
- Ensure that each negation is relevant to the question’s context.
- **Just return the sentences, one per line, without numbers or bullets, and nothing else.**
Example:
Question: Where was Einstein born?
Factoid: Einstein was born in Germany.</p>
        <p>Good Antonym: Einstein was not born in Germany.</p>
        <p>Bad Antonym: Einstein visited Germany. (not a contradiction)
Bad Antonym: Einstein was born in Austria. (adds new information)
Bad Antonym: Einstein was not not born in Germany. (double negation)
Bad Antonym: Was not born in Germany. (missing subject)
Question: ${question}
Factoid: ${factoid}
`;
export const synonymPrompt = (
count: number,
question: string,
factoid: string
) =&gt; `
You will be given a question and a factual answer (factoid).</p>
        <p>Your task is to generate ${count} *synonyms* (paraphrased statements with the same meaning) of the factoid, based
on the context of the question.</p>
        <p>Instructions:
- Each output must be a single atomic factual claim (cannot be split into smaller facts).
- Use only information explicitly present in the question or factoid. Do not invent, infer, or add external
knowledge.
- A correct synonym is a statement that means exactly the same thing as the factoid, even if the wording is
different.
- Do not output partial phrases, keywords, or combine/split facts.
- Each synonym must be a complete, grammatically correct sentence.
- Just return the sentences, one per line, without numbers, bullets, or any other output.</p>
        <p>Example:
Question: Where was Einstein born?
Factoid: Einstein was born in Germany.</p>
        <p>Good Synonym: Germany is the country where Einstein was born.</p>
        <p>Bad Synonym: Einstein visited Germany. (not equivalent)
Bad Synonym: Einstein was born. (incomplete)
Question: ${question}
Factoid: ${factoid}
`;</p>
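        <p>The two templates above can be paired per factoid to form metamorphic checks, as in the sketch below. The abbreviated prompt builders and the `MutationRequest` shape are illustrative stand-ins for the full templates, not the paper's exact code.</p>
        <p>
```typescript
// Illustrative pairing of synonym/antonym mutations into metamorphic
// checks: synonyms should remain supported by the context (expected YES),
// antonyms should be refuted (expected NO).
type Expected = "YES" | "NO";
interface MutationRequest {
  kind: "synonym" | "antonym";
  prompt: string;
  expected: Expected;
}

// Abbreviated stand-ins for the full templates in this appendix.
const synonymPrompt = (count: number, question: string, factoid: string) =>
  `Generate ${count} synonyms of "${factoid}" for: ${question}`;
const antonymPrompt = (count: number, question: string, factoid: string) =>
  `Generate ${count} negations of "${factoid}" for: ${question}`;

function buildMutationRequests(
  question: string,
  factoid: string,
  count: number
): MutationRequest[] {
  return [
    { kind: "synonym", prompt: synonymPrompt(count, question, factoid), expected: "YES" },
    { kind: "antonym", prompt: antonymPrompt(count, question, factoid), expected: "NO" },
  ];
}

const requests = buildMutationRequests(
  "Where was Einstein born?",
  "Einstein was born in Germany.",
  3
);
```
</p>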
      </sec>
      <sec id="sec-10-3">
        <title>A.3. Factoid Verification Prompt</title>
        <p>The verification step compares mutated factoids against retrieved context using
entailment/contradiction/neutral checks.
export const verifyPrompt = (
statement: string,
context: string
) =&gt; `
You will be given a statement and passages that represent the ground truth.</p>
        <p>Determine if the statement is supported by the passage, either explicitly or through clear implication.</p>
        <p>Answer with one of the following **only**:
- YES: if the statement is clearly and completely supported by the passages.
- NO: if the statement is contradicted or directly refuted by the passages.
- NOT SURE: if the passage does not contain enough information to confirm or deny the statement.</p>
        <p>Respond with YES, NO, or NOT SURE. Then, in one short sentence, explain the reason for your answer.</p>
        <p>Examples:
Passages (Ground Truth): "Alice was born in Paris and moved to New York at the age of five."
Statement: "Alice spent her early childhood in France."
Answer: YES. The passage states Alice was born in Paris, which is in France.</p>
        <p>Passages (Ground Truth): "Bob has never visited Japan but plans to travel there next summer."
Statement: "Bob visited Japan last year."
Answer: NO. The passage says Bob has never visited Japan
Passages (Ground Truth): "Carol enjoys outdoor activities like hiking and cycling."
Statement: "Carol loves swimming."
Answer: NOT SURE. There is no information in the passages about Carol and swimming.</p>
        <p>All verification LLMs were run with temperature = 0.0 to ensure determinism.</p>
        <p>Now, perform the task:
Passages (Ground Truth): ${context}
Statement: ${statement}
Answer:`;</p>
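        <p>The verifier's reply can be mapped to a label and aggregated into a per-response score, as in the sketch below. The YES / NO / NOT SURE labels come from the prompt above; the `Check` shape and the fraction-based scoring rule are simplified illustrations rather than the exact MetaRAG metric, with the 0.5 decision threshold mirroring the fixed threshold noted in Section 6.3.</p>
        <p>
```typescript
// Illustrative post-processing of verifier replies (assumption: a
// simplified scoring rule, not the exact MetaRAG metric).
type Verdict = "YES" | "NO" | "NOT_SURE";

function parseVerdict(answer: string): Verdict {
  const head = answer.trim().toUpperCase();
  // Check "NOT SURE" before "NO", since "NOT SURE" also starts with "NO".
  if (head.startsWith("NOT SURE")) return "NOT_SURE";
  if (head.startsWith("YES")) return "YES";
  if (head.startsWith("NO")) return "NO";
  return "NOT_SURE"; // unparseable replies are treated as inconclusive
}

interface Check { verdict: Verdict; expected: "YES" | "NO"; }

// Consistency score: fraction of checks whose verdict matches the
// metamorphic expectation (synonyms entailed, antonyms refuted).
function consistencyScore(checks: Check[]): number {
  if (checks.length === 0) return 1;
  return checks.filter(c => c.verdict === c.expected).length / checks.length;
}

const flagAsHallucination = (score: number, threshold = 0.5) => score < threshold;
```
</p>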
      </sec>
    </sec>
    <sec id="sec-11">
      <title>B. Dataset and Annotation</title>
      <sec id="sec-11-1">
        <title>B.1. Dataset</title>
        <p>Our evaluation dataset consists of 23 internal enterprise documents, unseen during LLM training. Each
document was segmented into chunks of approximately a few hundred tokens, and we retrieved up to
k = 3 chunks per query. Retrieval used cosine similarity over OpenAI text-embedding-3-large.</p>
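        <p>The retrieval step can be sketched as cosine-similarity top-k selection over chunk embeddings. In practice the vectors would come from text-embedding-3-large; plain number arrays stand in here, and the `Chunk` shape is an assumption.</p>
        <p>
```typescript
// Minimal sketch of cosine-similarity retrieval with k = 3 by default.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

interface Chunk { id: string; emb: number[]; }

// Score every chunk against the query embedding and keep the top k.
function retrieveTopK(query: number[], chunks: Chunk[], k = 3) {
  return chunks
    .map(c => ({ id: c.id, score: cosine(query, c.emb) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const hits = retrieveTopK(
  [1, 0],
  [
    { id: "a", emb: [1, 0] }, // score 1.0
    { id: "b", emb: [0, 1] }, // score 0.0
    { id: "c", emb: [1, 1] }, // score ~0.707
  ],
  2
);
```
</p>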
        <p>Figure 5 further illustrates the token length distributions of generated answers and retrieved contexts.
Generated answers are typically shorter (median ≈ 83 tokens), while retrieved contexts are longer
(median ≈ 572 tokens), reflecting the compression and grounding challenges faced by the RAG system.</p>
      </sec>
      <sec id="sec-11-2">
        <title>B.2. Human Annotation Protocol</title>
        <p>The annotation dataset was constructed in three steps. First, we collected responses produced by the
RAG system on enterprise documents. Second, we used an LLM-based verifier to provide an initial
label (faithful orhallucinated) for each response based on its retrieved context. Finally, human
annotators reviewed the RAG responses together with their retrieved evidence and assigned gold labels.
Annotators were instructed to:
• Mark each response as faithful or hallucinated.
• Consider a response hallucinated if any atomic factoid was not supported by retrieved evidence.
• Resolve ambiguous cases by majority vote.</p>
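        <p>The tie-break rule above can be sketched as a simple majority vote over annotator labels. The label names follow the protocol; defaulting ties to faithful is our assumption, as the protocol does not specify tie handling.</p>
        <p>
```typescript
// Illustrative majority vote over annotator labels (ties default to
// "faithful" here, an assumption not stated in the protocol).
type Label = "faithful" | "hallucinated";

function majorityVote(labels: Label[]): Label {
  const hallucinated = labels.filter(l => l === "hallucinated").length;
  return hallucinated * 2 > labels.length ? "hallucinated" : "faithful";
}
```
</p>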
        <p>To ensure class balance across conditions, a subset of responses was lightly edited (e.g., by introducing
or removing unsupported factual details) so that hallucinated and non-hallucinated examples were more
evenly represented. These edits were applied before annotation, and annotators labeled both original
and modified responses using the same guidelines. Figure 6 illustrates the final label distribution in our
dataset, confirming that hallucinated and non-hallucinated cases are reasonably balanced.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>C. Extended Results</title>
      <sec id="sec-12-1">
        <title>C.1. Result</title>
        <p>Table 4 reports the full ablation results across prompt settings, mutation counts, and verifier models:
complete MetaRAG ablation results across 26 configurations. Configuration format: Decomposition
Model / Mutation Generation Model / Verifier Model / Number of Mutations / Temperature. Total
(avg) denotes the average execution time, and Cost (avg) denotes the average token usage per run.</p>
      </sec>
      <sec id="sec-12-2">
        <title>C.2. Consistency Study</title>
        <p>To assess the robustness of MetaRAG to random initialization, we report the mean and standard deviation
of the main evaluation metrics over 5 different random seeds for the top configurations. We also include
the coefficient of variation (CV) for Precision and Recall, which provides a normalized measure of
variability relative to the mean.</p>
        <p>Table 5: Run-to-run consistency for top precision configurations (mean ± standard deviation over 5 seeds) and
coefficient of variation (CV) for Precision.</p>
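        <p>For reference, the coefficient of variation reported here is the standard deviation normalized by the mean over per-seed metric values; the sketch below uses the population (divide-by-n) estimator, an assumption since the estimator is not specified.</p>
        <p>
```typescript
// Coefficient of variation over per-seed metric values (illustrative;
// population standard deviation is an assumption).
function mean(xs: number[]): number {
  return xs.reduce((s, x) => s + x, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((s, x) => s + (x - m) ** 2, 0) / xs.length);
}

const coefficientOfVariation = (xs: number[]) => stdDev(xs) / mean(xs);
```
</p>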
        <p>Table 6: Run-to-run consistency for top recall configurations (mean ± standard deviation over 5 seeds) and
coefficient of variation (CV) for Recall.</p>
        <p>These results (Table 5 and Table 6) demonstrate that MetaRAG maintains stable performance across
random seeds, particularly for high-precision configurations (e.g., ID 24) and high-recall configurations
(e.g., IDs 1, 4, and 5). This stability supports the reliability of the Pareto front analysis presented in the
following section.</p>
        <p>Figure 7: (a) Precision Pareto front (vs. cost/latency); (b) Recall Pareto front (vs. cost/latency).</p>
      </sec>
      <sec id="sec-12-3">
        <title>C.3. Pareto Front Analysis</title>
        <p>We further analyze robustness and metric-specific trade-offs in this Supplementary Material. A
configuration is Pareto-optimal if no other configuration achieves strictly higher performance while being
no worse in the cost metrics. Figures 7a and 7b present the corresponding Pareto fronts for Precision
and Recall. These analyses confirm that the same top-ranked configurations (IDs 5, 18, 19, and 16)
consistently offer strong performance–efficiency trade-offs across multiple evaluation criteria.</p>
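        <p>The Pareto-optimality rule stated above can be sketched as a dominance filter; the `Config` shape with a single scalar cost is an illustrative simplification of the multi-metric cost used in the paper.</p>
        <p>
```typescript
// Sketch of the Pareto rule: keep a configuration unless some other
// configuration has strictly higher performance at no higher cost.
interface Config { id: number; perf: number; cost: number; }

// `a` dominates `b`: strictly better performance, no worse cost.
const dominates = (a: Config, b: Config) => a.perf > b.perf && a.cost <= b.cost;

function paretoFront(configs: Config[]): Config[] {
  return configs.filter(c => !configs.some(o => dominates(o, c)));
}

const front = paretoFront([
  { id: 1, perf: 0.9, cost: 10 },
  { id: 2, perf: 0.8, cost: 5 },
  { id: 3, perf: 0.7, cost: 6 }, // dominated by id 2
]);
```
</p>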
        <p>In our setting, the positive class corresponds to hallucinations, while the negative class corresponds
to faithful responses (no hallucinations). Hence, high Precision means that flagged hallucinations are
rarely false positives, which is critical in safety-critical and trustworthy applications. Conversely, high
Recall ensures that most hallucinations are detected, though at the cost of occasionally misclassifying
faithful responses. Such recall-oriented configurations may be advantageous in exploratory or diagnostic
scenarios. In practice, high-precision operating points (e.g., IDs 18 and 19) reduce false alarms in
safety-critical pipelines, while high-recall points (e.g., IDs 1 and 4) maximize coverage in exploratory settings.
This mirrors standard alert-system trade-offs and clarifies how MetaRAG can be tuned for different
deployment objectives. Selections based on F1 score represent a balanced compromise suitable for
general-purpose use cases.</p>
      </sec>
      <sec id="sec-12-4">
        <title>D. Implementation Notes</title>
        <p>All MetaRAG experiments were implemented in TypeScript with asynchronous API calls to the LLMs,
allowing multiple requests to be processed concurrently. This parallelization reduced the average
end-to-end execution time per run without affecting accuracy metrics. The reported runtime and cost
results in Table 4 are therefore representative of a practical and scalable setup.</p>
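        <p>The concurrency pattern described above can be sketched with `Promise.all`: verification calls are issued asynchronously and awaited together, so wall-clock time is bounded by the slowest request rather than the sum. `verifyOne` below is a stand-in for a real LLM API round-trip, not the actual client used.</p>
        <p>
```typescript
// Concurrent verification sketch: all requests are in flight at once,
// and results preserve input order.
async function verifyOne(statement: string): Promise<string> {
  return `verified:${statement}`; // placeholder for an API round-trip
}

async function verifyAll(statements: string[]): Promise<string[]> {
  return Promise.all(statements.map(verifyOne));
}
```
</p>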
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>O.</surname>
          </string-name>
          et al., GPT-4
          <source>technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>A. G.</surname>
          </string-name>
          et al.,
          <source>The llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adeli</surname>
          </string-name>
          , et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2108.07258</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          , et al.,
          <article-title>Large language models encode clinical knowledge</article-title>
          ,
          <source>Nature</source>
          <volume>620</volume>
          (
          <year>2023</year>
          )
          <fpage>472</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bohnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>On faithfulness and factuality in abstractive summarization</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <article-title>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>1906</fpage>
          -
          <lpage>1919</lpage>
          . URL: https://aclanthology.org/2020.acl-main.173/. doi:10.18653/v1/2020.acl-main.173.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          )
          . URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Trustworthy and responsible large language models: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2402.00176</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Rag can still hallucinate: Faithfulness evaluation for retrievalaugmented generation</article-title>
          ,
          <source>arXiv preprint arXiv:2304.09848</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>Holistic evaluation of language models</article-title>
          (
          <year>2023</year>
          ). URL: https://arxiv.org/abs/2211.09110. arXiv:2211.09110.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryscinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Evaluating the factual consistency of abstractive text summarization</article-title>
          , in: B.
          <string-name>
            <surname>Webber</surname>
            , T. Cohn,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>9332</fpage>
          -
          <lpage>9346</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.750/. doi:10.18653/v1/2020.emnlp-main.750.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A. A.</given-names>
            <surname>Mamun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Uddin</surname></string-name>
          ,
          <article-title>Hallucination detection in large language models with metamorphic relations</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.15844. arXiv:2502.15844.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>G.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aleti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Neelofar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tantithamthavorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name><given-names>T. Y.</given-names> <surname>Chen</surname></string-name>
          ,
          <article-title>Mortar: Multi-turn metamorphic testing for LLM-based dialogue systems</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.15557. arXiv:2412.15557.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.00648. arXiv:2405.00648.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nadeem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bethke</surname>
          </string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Reddy</surname></string-name>
          ,
          <article-title>StereoSet: Measuring stereotypical bias in pretrained language models</article-title>
          , in:
          <string-name><given-names>C.</given-names> <surname>Zong</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Xia</surname></string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>
          (Eds.),
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          . URL: https://aclanthology.org/2021.acl-long.416/. doi:10.18653/v1/2021.acl-long.416.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Gender bias in coreference resolution: Evaluation and debiasing methods</article-title>
          , in:
          <string-name><given-names>M.</given-names> <surname>Walker</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Ji</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Stent</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)</source>
          , Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>20</lpage>
          . URL: https://aclanthology.org/N18-2003/. doi:10.18653/v1/N18-2003.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vecchione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Daumé III</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Crawford</surname></string-name>
          ,
          <article-title>Datasheets for datasets</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>64</volume>
          (
          <year>2021</year>
          )
          <fpage>86</fpage>
          -
          <lpage>92</lpage>
          . URL: https://doi.org/10.1145/3458723. doi:10.1145/3458723.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [34]
          <string-name><given-names>A. D.</given-names> <surname>Selbst</surname></string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Boyd</surname></string-name>
          ,
          <string-name><given-names>S. A.</given-names> <surname>Friedler</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Venkatasubramanian</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Vertesi</surname></string-name>
          ,
          <article-title>Fairness and abstraction in sociotechnical systems</article-title>
          , in:
          <source>Proceedings of the Conference on Fairness, Accountability, and Transparency</source>
          , FAT* '19, Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          . URL: https://doi.org/10.1145/3287560.3287598. doi:10.1145/3287560.3287598.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>