<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hannah Sansford</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicholas Richardson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hermina Petric Maretic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juba Nait Saada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Knowledge Graphs, Hallucination Detection, Hallucination Correction</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bristol</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge, are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions and to systematically check all pieces of information in the response, and they are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Hallucination Detection</kwd>
        <kwd>Hallucination Correction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the size and power of LLMs have drastically increased over recent years, so has the number of potential applications. Arguably, one of the biggest blockers to implementing these models in practice is their tendency to hallucinate - returning seemingly plausible, but untrue, responses. Here, we focus on the problem of detecting hallucinations with respect to the provided context that the LLM should use as its source of knowledge; detecting hallucinations that have deviated from the LLM's original training data is out of the scope of this work. In applications where certainty in a response is critical, such as medical diagnosis, the existence of hallucinations that arise from a given context is especially limiting. Therefore, it is of utmost importance to develop successful methods to detect these hallucinations and, when it is of interest to address or correct them, provide clarity on which aspect of the response is likely a hallucination.</p>
      <p>The importance of this issue is reflected in the amount of research being published on the topic - see Ji et al. [1] for a recent survey of this area.</p>
      <p>Performing evaluation on natural language is a challenging task that researchers have been interested in long before hallucinations were at the forefront of the problem. Methods have evolved a great deal from traditional N-gram based metrics, such as BLEU [2] and ROUGE [3], to much more intricate LLM-based evaluation metrics with user-defined evaluation criteria, such as G-Eval [4]. More recently, techniques to mitigate the prevalence of hallucinations in generated outputs leveraging Retrieval Augmented Generation (RAG) [5] and reasoning on knowledge graphs (KGs) [6, 7] have been proposed. The former suggested the concatenation of relevant textual data into the prompt to ground the LLM response, while the latter enforced a more robust reasoning process through providing grounding information in KG structures [8]. As successful as these approaches have been, they do not fully circumvent the need to evaluate LLM outputs.</p>
      <p>Inspired by current research harnessing KGs to provide grounded LLM responses, we propose GraphEval - a hallucination detection framework based on the representation of information in KG structures. To the best of our knowledge, we are the first to apply KGs to an LLM-based hallucination evaluation framework, and in doing so we provide a higher level of insight into where in the output a hallucination has occurred than any previous metrics. Additionally, we demonstrate how using our method in conjunction with current state-of-the-art hallucination detection methods improves their classification accuracy on various benchmarks. Finally, we consider the problem of hallucination correction and we introduce GraphCorrect, showcasing how GraphEval can effectively be extended to rectify a significant proportion of hallucinations present in LLM outputs.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Problem statement</title>
      <p>In this work we focus on the closed-domain hallucination detection problem: the situation where we have a textual output from an LLM which is generated using some grounding context included in the prompt. In this case, the goal is for the LLM to use the provided context as its only source of knowledge. The open-domain problem, which is with respect to all factual knowledge in the world, is not explored here but is briefly discussed in Section 8.</p>
      <p>We consider hallucination detection to be a binary classification problem, with 0 corresponding to the LLM output being factually consistent given the provided context, and 1 corresponding to the output containing at least one inconsistency. We can assess hallucination evaluation methods using a benchmarking dataset containing ground-truth labels (usually human-annotated) to determine whether a given context-output pair contains factual inconsistencies. Throughout the paper we use the terms factual, consistent, grounded and faithful interchangeably to mean containing no hallucinations with respect to the context.</p>
      <p>Finally, we explore the problem of hallucination correction, wherein we do not use any directly labeled dataset. Instead, we utilize hallucination detection frameworks to first identify hallucinations to correct, and subsequently repurpose them to evaluate the corrected outputs. It is important to note that our exploration of hallucination correction only serves as an extension to our evaluation framework and is not the primary focus of this study.</p>
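      <p>Concretely, any hallucination evaluation method can be viewed as a function from a context-output pair to a binary label, and scored against the human annotations of a benchmark. The sketch below illustrates this framing; the names are illustrative placeholders rather than an existing implementation.</p>
      <preformat>
# Illustrative framing of closed-domain hallucination detection as binary
# classification (names are placeholders, not an existing API).
from typing import Callable, Iterable, List, Tuple

# A detector returns 1 if it believes the output contains at least one
# inconsistency with respect to the context, and 0 otherwise.
Detector = Callable[[str, str], int]

def run_benchmark(detector: Detector,
                  examples: Iterable[Tuple[str, str, int]]) -> Tuple[List[int], List[int]]:
    """examples yields (context, llm_output, human_label), with label 1 = hallucinated.
    Returns (ground_truth_labels, predictions) for downstream scoring (Section 7.4)."""
    y_true, y_pred = [], []
    for context, llm_output, label in examples:
        y_true.append(label)
        y_pred.append(detector(context, llm_output))
    return y_true, y_pred
      </preformat>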
    </sec>
    <sec id="sec-2">
      <title>3. Related work</title>
      <p>Historically, N-gram based metrics such as BLEU [2]
and ROUGE [3] have been the most widely used metrics
for natural language evaluation. However, these
metrics have been shown to perform poorly at the task of
factual inconsistency detection [9, 10]. In more recent
years, embedding-based metrics such as BERTScore [11]
have been favoured over N-gram based metrics. These
methods measure the similarity between two pieces of
text by comparing the contextualised embedding from a
transformer model, such as BERT [12].</p>
      <p>Both N-gram and embedding-based metrics base their scores on how similar the text to be evaluated is to some reference text. This similarity objective often fails to capture the intricacies of the hallucination detection problem. Therefore, researchers have begun to develop new methods that are more acutely tuned to detecting inconsistencies between an LLM output and its grounding context.</p>
      <p>Maynez et al. [9] identified the crossover between the textual entailment score in NLI tasks and consistency prediction. This was a breakthrough at the time, producing higher correlation with faithfulness than any previous metrics, and paved the way for further research that capitalised on NLI data and models [13, 14, 15].</p>
      <p>Very recently, attention has turned to leveraging LLMs themselves to evaluate the consistency of LLM outputs. SelfCheckGPT [16] and ChatProtect [17] approach the problem by considering the self-consistency within sampled outputs. Since they require the generation of a large number of responses from the LLM, many consider these methods prohibitively computationally expensive.</p>
      <p>Other LLM-based hallucination evaluation methods, such as G-Eval [4] and GPTScore [18], employ a different LLM for evaluation than the one used to generate the LLM response that needs to be evaluated. G-Eval allows user-defined evaluation criteria and uses automated chain-of-thought prompting and form-filling to assign scores. GPTScore treats the task as conditional generation, leveraging models like GPT-3 to assign higher probabilities to high-quality outputs by prepending evaluation instructions to the LLM prompt. Unlike NLI models trained on binary classification data, these methods produce scores that are harder to interpret as probabilities and often require additional steps for inconsistency classification.</p>
      <p>Recent hallucination detection methods, such as FactScore [19] and SAFE [20], utilize large language models to break down the response into atomic or individual facts for evaluation. These approaches have enabled precise identification of where hallucinations occur within the LLM response. Each fact is automatically verified against a comprehensive knowledge source like Wikipedia or scientific literature in the case of FactScore, or through the use of a search engine in the case of SAFE.</p>
      <p>FactGraph [21] is the only factuality evaluation method we are aware of that utilises graph-like structures. The method is focused solely on the detection of inconsistencies in the summarization problem, decomposing both the summary and the supporting documents into what they call structured meaning representations (MRs). These MRs describe the core semantic concepts and relations, which the authors claim to be more suitable for factuality evaluation than the raw text.</p>
    </sec>
    <sec id="sec-3">
      <title>4. GraphEval: Our evaluation method</title>
      <p>GraphEval is based around the idea of representing information in a structured manner through KGs, and aims to address the lack of explainability of previous hallucination detection approaches, i.e. which concrete pieces of information in particular are inconsistent.</p>
      <p>Formally, a KG is a collection of triples {(e<sub>1</sub>, r, e<sub>2</sub>)} ⊆ ℰ × ℛ × ℰ, where ℰ and ℛ denote the set of entities and relationships, respectively. In the GraphEval setting, both entities and relationships are simply pieces of text. We do not make use of common extensions to this simple setting, such as entity and relationship types, or attached properties.</p>
      <p>Our GraphEval metric consists of a two-stage procedure:</p>
      <p>Stage 1 - Construct a KG from the LLM output to be evaluated.</p>
      <p>Stage 2 - Iterate through each of the triples in the KG, identifying whether they are factually consistent given the provided context.</p>
      <p>The output is considered factually inconsistent if any of the triples in stage 2 are identified as not grounded in the context. The inconsistent triple(s) may also be returned to provide explainability by highlighting where in the output the hallucination(s) has occurred. We provide a visualisation of this process in Figure 1 using a real example from one of the benchmarks described in Section 7.1.</p>
      <p>Regarding stage 1, we provide a short review of LLM-based KG construction methods in Section 5, along with results from our implementation. For stage 2, we leverage existing techniques and employ an out-of-the-box NLI model for this task. A benefit of this approach is that it gives us the opportunity to make a direct comparison between the performance of the raw NLI model and the model supplemented with our KG approach. In essence, our method is a pre-processing step, the output of which can be fed into any hallucination detection method; we choose NLI models as they are computationally cheap compared to LLM-based models, yet still achieve state-of-the-art results. By feeding each triple into an NLI model, along with the grounding context, we obtain a probability of containing a hallucination for each triple. Finally, we classify the example as inconsistent if at least one triple produces a probability greater than 0.5.</p>
      <p>Similar approaches to ours have been proposed in recent literature. SummaC [14] also uses NLI-based models to detect inconsistencies in LLM-generated summaries. However, it distinguishes itself by segmenting both the context and the summary into their respective sentences, and then by passing each context-summary pair into the NLI model. This approach presents challenges in maintaining entity references across sentences; for instance, "John Doe" may only be referred to as "he" in another sentence. Similarly, FactScore [19] faces the same limitation. Our method circumvents this issue by organising entity relationships with a KG.</p>
      <p>While FactGraph [21] also makes use of graph structures in their consistency evaluation process, the method differs from GraphEval in a few major respects. Firstly, their approach can only be applied to the summarisation problem, whereas GraphEval can easily be applied to various domains such as Summarisation, Question Answering, Common Sense Reasoning and many others. Secondly, FactGraph does not employ LLMs anywhere in their framework, missing out on recent advances in the field. Finally, their approach aims to decompose both the LLM output and the provided context into the underlying core semantic concepts and relations, before comparing each of the graph structures. GraphEval, on the other hand, only represents the LLM output as a KG and aims to preserve as much of the information contained in the raw text as possible.</p>
      <p>To summarise the advantages of GraphEval over previous methods:</p>
      <p>• We present a systematic way of checking all pieces of information contained in the LLM output.</p>
      <p>• Our method only requires one call to an LLM, in the KG construction phase, and does not require the (usually) large context documents to be input, as in all previous LLM-based metrics. This makes GraphEval less computationally expensive than other LLM-based methods.</p>
      <p>• Our method returns the specific triples that are not grounded in the context, providing explainability for the decision and identifying which section of the output should not be trusted. We leverage this feature for hallucination correction and propose a new method called GraphCorrect, described in Section 6.</p>
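      <p>The decision rule above can be summarised in a short sketch. Here construct_kg stands for the prompt-based KG construction of Section 5 and nli_inconsistency_prob for one of the NLI models of Section 7.2; both are placeholder callables rather than the exact implementation used in our experiments.</p>
      <preformat>
# Sketch of the two-stage GraphEval decision rule.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # ("entity 1", "relation 1-2", "entity 2")

def graph_eval(llm_output: str,
               context: str,
               construct_kg: Callable[[str], List[Triple]],
               nli_inconsistency_prob: Callable[[str, str], float],
               threshold: float = 0.5) -> Tuple[int, List[Triple]]:
    """Return (label, flagged_triples); label is 1 if any triple looks hallucinated."""
    triples = construct_kg(llm_output)                 # Stage 1: KG from the LLM output
    flagged = []
    for e1, rel, e2 in triples:                        # Stage 2: check each triple
        claim = f"{e1} {rel} {e2}"                     # concatenated triple read as a sentence
        if nli_inconsistency_prob(context, claim) > threshold:
            flagged.append((e1, rel, e2))
    return int(bool(flagged)), flagged
      </preformat>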
    </sec>
    <sec id="sec-4">
      <title>5. Construction of KGs using LLMs</title>
      <p>Constructing KGs from unstructured textual data involves identifying the set of entities within the text and the relationships between them, resulting in a structured representation of the information contained within the text. The process can be divided into three main stages:</p>
      <p>1. Entity detection - the process of identifying and extracting entities from text.</p>
      <p>2. Coreference resolution - the process of finding all expressions (also called mentions) in the text that refer to the same entity.</p>
      <p>3. Relation extraction - the process of identifying semantic relationships between entities.</p>
      <p>Previously, researchers addressed each stage individually, but with the increasing power of LLMs, there has been a shift towards end-to-end systems. Kumar et al. [22] suggest employing two LLM components: one for named entity recognition and another one for both relation classification and direction. Similarly, Grapher [23] utilizes a pre-trained LLM for entity extraction and relation prediction. However, these methods require users to provide possible relations. More recent methods like PiVE [24] and AutoKG [25] use LLM prompting strategies for KG construction without additional user input.</p>
      <p>The aforementioned methods do not make use of some of the emergent abilities of LLMs, such as in-context learning and the chain-of-thought prompting strategy. We decide to leverage these emergent abilities, and take a simple prompt engineering approach to our KG construction step. The techniques used can be summarised as the following:</p>
      <p>• Chain-of-thought (CoT) prompting strategy. Providing intermediate reasoning steps in the prompt to enable LLMs to solve more complex tasks.</p>
      <p>• In-context learning. A method of prompt engineering where one provides several task demonstrations within the prompt, circumventing the need for fine-tuning.</p>
      <p>The final prompt used in our experiments can be found in Appendix A. We highlight to the reader that our KG construction method is not the main contribution of our work, which is rather the application of KG construction to the hallucination detection problem. The major benefit of our KG construction approach is its ease of implementation with any LLM. Furthermore, it is less computationally intensive than methods like PiVE, which performs multiple iterations of improvements to the generated KG.</p>
      <p>Of course, users may conduct the KG construction stage of GraphEval using their method of choice; the experiments in this paper exhibit the capability of a simple prompting strategy.</p>
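      <p>A minimal sketch of this stage is shown below, assuming a generic chat-completion client call_llm (for example a thin wrapper around the API used in Section 7.3); the prompt text is abridged from Appendix A and the reply is parsed from the &lt;python&gt;&lt;/python&gt; tags that the prompt requests.</p>
      <preformat>
# Sketch of stage 1: prompt an LLM for triples and parse the reply.
# call_llm is a hypothetical client; the prompt is abridged from Appendix A.
import ast
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

SYSTEM_PROMPT = (
    "You are an expert at extracting information in structured formats to build "
    "a knowledge graph. Return the knowledge graph as a list of triples, i.e. "
    '["entity 1", "relation 1-2", "entity 2"], in Python code.'
)

def construct_kg(text: str, call_llm) -> List[Triple]:
    user_prompt = (
        "Use the given format to extract information from the following input: "
        f"&lt;input&gt;{text}&lt;/input&gt;. Skip the preamble and output the result as a "
        "list within &lt;python&gt;&lt;/python&gt; tags."
    )
    reply = call_llm(system=SYSTEM_PROMPT, user=user_prompt)
    match = re.search(r"&lt;python&gt;(.*?)&lt;/python&gt;", reply, flags=re.DOTALL)
    triples = ast.literal_eval(match.group(1).strip()) if match else []
    return [tuple(t) for t in triples]
      </preformat>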
    </sec>
    <sec id="sec-5">
      <title>7. Experiments</title>
    </sec>
    <sec id="sec-6">
      <title>6. GraphCorrect: Correction of hallucinations with GraphEval</title>
      <p>While the primary focus of this work lies in hallucination detection, GraphEval's breakdown of LLM outputs into triples easily allows for its extension to correct hallucinations within the given context. To achieve this, we first identify all triples within the KG that are likely to contain hallucinations (i.e. those with a probability greater than 0.5, if any). We then employ the following two-step procedure on each identified triple:</p>
      <p>Step 1 - Input the given triple along with the context into an LLM to correct for the potential hallucinations within the triple. This results in a newly generated corrected triple.</p>
      <p>Step 2 - Input the identified triple, its corrected counterpart and the initial LLM output. Selectively replace the information from the original (hallucination-containing) triple with the information from the new triple in the initial LLM output.</p>
      <p>We name this LLM hallucination correction method GraphCorrect. The final prompts used in our experiments for step 1 and step 2 can be found in Appendix B and Appendix C respectively. This systematic approach to hallucination correction offers several benefits. First, it tackles each identified hallucination separately, increasing the chances of all perceived hallucinations being corrected. Furthermore, it offers the advantage of exclusively altering the segments of the original text that are suspected to contain a hallucination, leaving other elements untouched and ensuring overall high similarity with the original text. Finally, breaking down the entire process into intermediate steps ensures that the original context and the initial LLM output never undergo simultaneous processing within an LLM. This guarantees safeguards against both the addition of extra information and the loss of information in the LLM output.</p>
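      <p>The sketch below shows how these two steps compose over the triples flagged by GraphEval; call_llm is again a hypothetical chat client, and the embedded instructions are abridged paraphrases of the prompts in Appendix B and Appendix C rather than the exact prompts.</p>
      <preformat>
# Sketch of GraphCorrect: correct each flagged triple (step 1) and splice the
# corrected information back into the output (step 2). call_llm is hypothetical.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def graph_correct(llm_output: str,
                  context: str,
                  flagged_triples: List[Triple],
                  call_llm: Callable[..., str]) -> str:
    corrected = llm_output
    for triple in flagged_triples:
        # Step 1: correct the triple against the grounding context.
        new_triple = call_llm(
            system="The following triple contains factually incorrect information. "
                   "Correct it based on the provided context. Only return the corrected triple.",
            user=f"&lt;triple&gt;{list(triple)}&lt;/triple&gt;\n&lt;context&gt;{context}&lt;/context&gt;")
        # Step 2: replace only the information coming from the old triple,
        # leaving the rest of the text untouched.
        corrected = call_llm(
            system="Replace the information of the old triple with the information of "
                   "the new one. Do not make any other modification. Only return the new text.",
            user=f"&lt;context&gt;{corrected}&lt;/context&gt;\n"
                 f"&lt;old_triple&gt;{list(triple)}&lt;/old_triple&gt;\n"
                 f"&lt;new_triple&gt;{new_triple}&lt;/new_triple&gt;")
    return corrected
      </preformat>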
    </sec>
    <sec id="sec-5">
      <title>7. Experiments</title>
      <p>We conducted two sets of experiments: one focusing on hallucination detection to highlight GraphEval's performance and another on hallucination correction to showcase the advantages of GraphCorrect. For both scenarios, we utilized the SummEval [26], QAGS-C and QAGS-X [27] benchmarks - currently the most prevalent benchmarks in relevant academic literature. All three are concerned with detecting hallucinations in LLM-generated summaries and are human-annotated for factual consistency with respect to the grounding context. Table 1 contains some statistics pertaining to each of these datasets.</p>
      <sec id="sec-5-1">
        <title>7.1. Benchmarks</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Statistics of the three hallucination benchmarks: number of examples, label ratio, and average output and context lengths (in words).</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Benchmark</th>
                <th>No. of Examples</th>
                <th>Label Ratio</th>
                <th>Avg Output len.</th>
                <th>Avg Context len.</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>SummEval</td><td>1,600</td><td>33.2%</td><td>63</td><td>359</td></tr>
              <tr><td>QAGS-C</td><td>235</td><td>48.1%</td><td>49</td><td>383</td></tr>
              <tr><td>QAGS-X</td><td>239</td><td>48.5%</td><td>18</td><td>318</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>SummEval The SummEval dataset consists of human evaluations on 16 summarization model outputs from 100 articles from the CNN/DailyMail dataset [28]. Each summary is labelled on a Likert scale from 1-5 on 4 categories: consistency, coherence, fluency and relevance. We follow the TRUE benchmark [13] in taking the consistency scores and mapping a score of 5 to being fully consistent, and anything lower to being inconsistent.</p>
        <p>QAGS The QAGS-C and QAGS-X datasets are built from the CNN/DailyMail and the XSum [29] datasets, respectively. The human annotators examined the summaries one sentence at a time, and determined the factual consistency of each sentence by comparing it to the original article. Three annotators assessed each sentence and the majority decision was recorded. Again, we follow the TRUE benchmark in considering a summary to be factually consistent if and only if all sentences are considered consistent.</p>
      </sec>
      <sec id="sec-5-2">
        <title>7.2. NLI models in GraphEval</title>
        <p>As mentioned in Section 4, we employ NLI models to perform the second stage of GraphEval - checking the consistency of each individual triple with respect to the context. We conduct experiments using the three most popular NLI-based hallucination detection models available on HuggingFace (https://huggingface.co).</p>
        <p>HHEM Based on the DeBERTaV3 model [30] and initially trained on NLI data, the hallucination evaluation model created by Vectara (https://huggingface.co/vectara/hallucination_evaluation_model) is further fine-tuned on datasets annotated for consistency. The datasets used for fine-tuning were: FEVER [31], Vitamin C [32] and PAWS [33]. This model is considerably smaller than the following two models, requiring only 738 MB of memory, and thus has a significantly shorter run-time.</p>
        <p>TRUE The TRUE model is based on a T5-XXL model [34] and is trained similarly to the model described in the TRUE paper [13]. Instead of the ANLI dataset used in that paper, this model is trained on the same datasets as HHEM, plus the following: SNLI [35], MNLI [36] and SciTail [37]. This model requires 45.5 GB of memory.</p>
        <p>TrueTeacher Gekhman et al. [15] leverage the ability of LLMs to evaluate hallucinations by generating synthetic data through annotating model-generated summaries. They then use this synthetic data to further fine-tune the model from [13], leading to state-of-the-art performance on the TRUE benchmark. This model is the same size as the TRUE model.</p>
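        <p>For illustration only, the stage-2 call pattern can be reproduced with a generic MNLI cross-encoder from the HuggingFace Hub; this is not one of the three models above, and the exact interface of each model differs, but it shows how a per-triple inconsistency probability can be obtained.</p>
        <preformat>
# Illustrative per-triple scoring with a generic MNLI cross-encoder
# (microsoft/deberta-large-mnli), NOT one of the three models evaluated above.
# Inconsistency probability is taken as 1 - P(entailment); long contexts may
# need truncation or chunking in practice.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def nli_inconsistency_prob(context: str, claim: str) -> float:
    scores = nli({"text": context, "text_pair": claim}, top_k=None)
    if scores and isinstance(scores[0], list):  # unwrap if the pipeline batches
        scores = scores[0]
    p_entail = next(s["score"] for s in scores if s["label"].upper() == "ENTAILMENT")
    return 1.0 - p_entail
        </preformat>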
      </sec>
      <sec id="sec-5-3">
        <title>7.3. Experimental settings</title>
        <p>In all experiments conducted in this study necessitating the utilization of an LLM, we use Claude 2 (https://www.anthropic.com/news/claude-2), an LLM from Anthropic, through the Amazon Bedrock API (https://aws.amazon.com/bedrock/claude/). We use the default settings for the LLM: temperature = 1, top_p = 1, top_k = 250. We also refer the reader to the Appendix for the prompts used in this work.</p>
      </sec>
      <sec id="sec-5-4">
        <title>7.4. Results</title>
        <sec id="sec-5-4-1">
          <title>7.4.1. Hallucination detection with GraphEval</title>
          <p>We present our results of hallucination detection for the three NLI models, and their GraphEval counterparts, in Table 2. We report the balanced accuracy as our evaluation metric, which corrects for the class imbalance in the SummEval benchmark. In the case of using the NLI model directly, we classify the example as containing a hallucination if the NLI model returns a probability of more than 0.5. When combining the NLI model with GraphEval, we classify the example as containing a hallucination if at least one triple fed to the NLI model returns a probability of more than 0.5. We see that adding the GraphEval pre-processing step to each of the NLI models almost always improves the balanced accuracy score, sometimes by a considerable amount, such as the results for the SummEval and QAGS-C benchmarks in Table 2. On average (weighting by the number of samples in each dataset), adding the GraphEval pre-processing step improves the balanced accuracy by 6.2 (SE=1.3).</p>
          <p>It should be noted that even when the results for GraphEval are comparable to the baseline methods, the benefit of using GraphEval is the identification of the specific triple(s) that are inconsistent with the provided context.</p>
          <p>We hypothesise that the negligible difference between the base NLI model and the model supplemented with GraphEval for the QAGS-X dataset is due to the average length of the generated text (only 18 words, compared with 49 and 63 for QAGS-C and SummEval respectively, see Table 1). This highlights an important aspect of where the most value can be found in our method. When the LLM output is very short, there are less likely to be multiple facts that need to be checked for consistency (which can easily be done without the use of a KG) and the intricacies of the short sentence might even be lost in the KG construction phase. On the other hand, when the LLM output is very long, current methods struggle to test each individual fact against the context, and this is when GraphEval thrives.</p>
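          <p>Balanced accuracy is the mean of per-class recall over the consistent and inconsistent classes. The following sketch shows one way to compute it from the binary decisions described above, using scikit-learn; it is illustrative rather than the evaluation code used for Table 2.</p>
          <preformat>
# Illustrative scoring of binary hallucination decisions with balanced accuracy
# (mean of recall over the two classes), which corrects for class imbalance.
from sklearn.metrics import balanced_accuracy_score

# y_true: human labels, 1 = contains a hallucination; y_pred: detector decisions,
# e.g. from the raw NLI rule (p > 0.5 on the whole output) or the GraphEval rule
# (any triple with p > 0.5).
y_true = [1, 0, 0, 1, 0]   # placeholder labels
y_pred = [1, 0, 1, 1, 0]   # placeholder predictions
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 * (recall_class0 + recall_class1)
          </preformat>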
        </sec>
        <sec id="sec-5-4-2">
          <title>7.4.2. Hallucination correction with GraphCorrect</title>
          <p>Identifying the particular triple(s) likely to harbor a hallucination enables straightforward correction using GraphCorrect, as described in Section 6. For each of the evaluation frameworks proposed here (HHEM + GraphEval, TRUE + GraphEval, and TrueTeacher + GraphEval), we compared GraphCorrect to a basic prompting strategy for hallucination correction, serving as a baseline. The prompt used in this baseline approach, referred to as the Direct Prompt henceforth, is provided in Appendix D.</p>
          <p>For each framework, we initially identify hallucinations, correct only the LLM outputs suspected of containing hallucinations using either GraphCorrect or Direct Prompt, and then reapply the evaluation framework to detect hallucinations in the corrected LLM outputs. Note that this procedure only allows us to measure what we presume to be corrected hallucinations, given the potential for errors in the evaluation frameworks utilized here. We report the percentage of believed corrected hallucinations in Table 4. A score of 0% suggests no corrected hallucinations according to the given framework, while a score of 100% indicates correction of all hallucinations as per the given framework. GraphCorrect outperforms the prompting strategy proposed here by correcting for significantly more hallucinations on all tasks apart from two related to the QAGS-X dataset. As on the hallucination detection task, we hypothesise these results are correlated with the average length of the text, with GraphCorrect bringing most value in longer texts with a more complex structure to unravel and correct.</p>
          <p>Additionally, as previously stated, GraphCorrect offers the advantage of only modifying the segments of text in the LLM outputs susceptible to hallucinations, while leaving other sections unaltered, thereby maintaining high overall similarity with the original text. This characteristic is illustrated in Table 3 by assessing the ROUGE-1, ROUGE-2, and ROUGE-L metrics between the original and corrected LLM outputs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>8. Discussion</title>
      <p>We observe that, in the knowledge graph construction phase of our procedure, it is possible that some information loss may occur. However, as shown by the results in Section 7.4, our method rarely leads to a reduction in balanced accuracy. Furthermore, when it is comparable to the baseline methods, we have the added explainability of identifying the specific triples where the hallucination has occurred.</p>
      <p>We expect that in the near future, more research will be conducted on the construction of KGs from unstructured text, which will provide improvements to the first stage of our procedure and ultimately the evaluation performance. Even as LLMs alone become more powerful, this will continue to contribute to improvements in GraphEval's performance.</p>
      <p>Our work focuses on detection of hallucinations in closed-domain tasks, where we are interested only in
consistency with respect to the provided context. The
GraphEval framework could be extended to open-domain
hallucination detection by employing agents, as in AutoKG [25],
to first retrieve relevant external sources as the grounding
information to check against.</p>
    </sec>
    <sec id="sec-8">
      <title>9. Conclusion</title>
      <p>We introduce GraphEval, a simple and effective pre-processing step for improving the explainability and performance of LLM hallucination detection metrics. Our method leverages LLMs' ability to extract information from unstructured text and construct knowledge graphs, whose triples can be fed into out-of-the-box hallucination detection methods.</p>
      <p>We demonstrate that GraphEval in conjunction with state-of-the-art NLI models leads to an average improvement in balanced accuracy of 6.2 (SE=1.3) on three popular hallucination benchmarks. Furthermore, our method indicates which triples, in the KG representation of the LLM output, are inconsistent. To the best of our knowledge, this is the first application of KGs to an LLM-based hallucination evaluation framework and we believe the success of GraphEval will only grow as KG construction methods also improve.</p>
      <p>Finally, we examined the issue of hallucination correction and showed that GraphCorrect can effectively address the majority of hallucinations found in LLM outputs while maintaining extremely high similarity with the original texts.</p>
    </sec>
    <sec id="sec-8a">
      <title>A. KG Construction Prompt</title>
      <preformat>
("system",
"""
You are an expert at extracting information in structured formats to build a
knowledge graph.

Step 1 - Entity detection: Identify all entities in the raw text. Make sure
not to miss any out. Entities should be basic and simple, they are akin to
Wikipedia nodes.

Step 2 - Coreference resolution: Find all expressions in the text that refer
to the same entity. Make sure entities are not duplicated. In particular do
not include entities that are more specific versions of themselves, e.g.
"a detailed view of jupiter's atmosphere" and "jupiter's atmosphere", only
include the most specific version of the entity.

Step 3 - Relation extraction: Identify semantic relationships between the
entities you have identified.

Format: Return the knowledge graph as a list of triples, i.e.
["entity 1", "relation 1-2", "entity 2"], in Python code.
""",
),
("human",
"Use the given format to extract information from the following input:
&lt;input&gt;{input}&lt;/input&gt;. Skip the preamble and output the result as a list
within &lt;python&gt; &lt;/python&gt; tags.",
),
("human",
"""Important Tips:
1. Make sure all information is included in the knowledge graph.
2. Each triple must only contain three strings! None of the strings should
   be empty.
3. Do not split up related information into separate triples because this
   could change the meaning.
4. Make sure all brackets and quotation marks are matched.
5. Before adding a triple to the knowledge graph, check the concatenated
   triple makes sense as a sentence. If not, discard it.
""",
),
("human",
"""Here are some example input and output pairs.

## Example 1.
Input:
"The Walt Disney Company, commonly known as Disney, is an American
multinational mass media and entertainment conglomerate that is
headquartered at the Walt Disney Studios complex in Burbank, California."
Output:
&lt;python&gt;
[["The Walt Disney Company", "headquartered at", "Walt Disney Studios complex
in Burbank, California"],
["The Walt Disney Company", "commonly known as", "Disney"],
["The Walt Disney Company", "instance of", "American multinational mass media
and entertainment conglomerate"]]
&lt;/python&gt;

## Example 2.
Input:
"Amanda Jackson was born in Springfield, Ohio, USA on June 1, 1985. She was
a basketball player for the U.S. women's team."
Output:
&lt;python&gt;
[["Amanda Jackson", "born in", "Springfield, Ohio, USA"],
["Amanda Jackson", "born on", "June 1, 1985"],
["Amanda Jackson", "occupation", "basketball player"],
["Amanda Jackson", "played for", "U.S. women's basketball team"]]
&lt;/python&gt;

## Example 3.
Input:
"Music executive Darius Van Arman was born in Pennsylvania. He attended
Gonzaga College High School and is a human being."
Output:
&lt;python&gt;
[["Darius Van Arman", "occupation", "Music executive"],
["Darius Van Arman", "born in", "Pennsylvania"],
["Darius Van Arman", "attended", "Gonzaga College High School"],
["Darius Van Arman", "instance of", "human being"]]
&lt;/python&gt;

## Example 4.
Input:
"Italy had 3.6x times more cases of coronavirus than China."
Output:
&lt;python&gt;
[["Italy", "had 3.6x times more cases of coronavirus than", "China"]]
&lt;/python&gt;
""",
),
      </preformat>
    </sec>
    <sec id="sec-9">
      <title>B. Hallucination correction (step 1)</title>
      <p>" " "
You a r e an e x p e r t a t e x t r a c t i n g
i n f o r m a t i o n i n s t r u c t u r e d f o r m a t s
from t e x t .</p>
      <p>The f o l l o w i n g t r i p l e c o n t a i n s
f a c t u a l l y i n c o r r e c t i n f o r m a t i o n .
C o r r e c t i t b a s e d on t h e p r o v i d e d
c o n t e x t ,
I m p o r t a n t T i p s :
1 . A t r i p l e i s d e f i n e d a s [ "
e n t i t y 1 " , " r e l a t i o n 1 − 2 " , "
e n t i t y 2 " ] .
2 . A t r i p l e must only c o n t a i n
t h r e e s t r i n g s ! None o f t h e
s t r i n g s s h o u l d be empty .
3 . The c o n c a t e n a t e d t r i p l e must
make s e n s e a s a s e n t e n c e .
4 . Only r e t u r n t h e c o r r e c t e d
t r i p l e , n o t h i n g e l s e .
&lt; t r i p l e &gt; { t r i p l e } &lt; / t r i p l e &gt;
&lt; c o n t e x t &gt; { c o n t e x t } &lt; / c o n t e x t &gt;
Remember , i t i s i m p o r t a n t t h a t you
only r e t u r n t h e c o r r e c t e d t r i p l e .
" " "</p>
    </sec>
    <sec id="sec-10">
      <title>C. Hallucination correction (step 2)</title>
      <p>" " "
I n t h e f o l l o w i n g c o n t e x t , r e p l a c e t h e
i n f o r m a t i o n o f t h e o l d t r i p l e
with t h e i n f o r m a t i o n o f t h e new
one .</p>
      <p>Do not make any o t h e r m o d i f i c a t i o n t o
t h e c o n t e x t .</p>
      <p>Only r e t u r n t h e new c o n t e x t .
&lt; c o n t e x t &gt; { summary } &lt; / c o n t e x t &gt;
&lt; o l d _ t r i p l e &gt; { o l d _ t r i p l e } &lt; / o l d _ t r i p l e &gt;
&lt; n e w _ t r i p l e &gt; { n e w _ t r i p l e } &lt; / n e w _ t r i p l e &gt;
" " "</p>
    </sec>
    <sec id="sec-11">
      <title>D. Hallucination correction without a KG</title>
      <p>" " "
The f o l l o w i n g summary c o n t a i n s
f a c t u a l l y i n c o r r e c t i n f o r m a t i o n .
C o r r e c t i t b a s e d on t h e c o n t e x t , but
don ’ t change o t h e r p a r t s o f t h e
summary .</p>
      <p>Only r e t u r n t h e c o r r e c t e d summary ,
n o t h i n g e l s e .
&lt;summary &gt; { summary } &lt; / summary&gt;
&lt; c o n t e x t &gt; { c o n t e x t } &lt; / c o n t e x t &gt;
Remember , do minimal changes t o t h e
o r i g i n a l summary , don ’ t make i t
l o n g e r and keep a s much o f i t a s
you can e x a c t l y t h e same .</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref12"><label>[12]</label><mixed-citation>J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, Y. Matias, TRUE: Re-evaluating factual consistency evaluation, in: S. Feng, H. Wan, C. Yuan, H. Yu (Eds.), Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 161–175. URL: https://aclanthology.org/2022.dialdoc-1.19. doi:10.18653/v1/2022.dialdoc-1.19.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, SummaC: Re-visiting NLI-based models for inconsistency detection in summarization, Transactions of the Association for Computational Linguistics 10 (2022) 163–177. URL: https://aclanthology.org/2022.tacl-1.10. doi:10.1162/tacl_a_00453.</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>Z. Gekhman, J. Herzig, R. Aharoni, C. Elkind, I. Szpektor, TrueTeacher: Learning factual consistency evaluation with large language models, 2023. arXiv:2305.11171.</mixed-citation></ref>
      <ref id="ref16"><label>[16]</label><mixed-citation>P. Manakul, A. Liusie, M. J. Gales, SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint arXiv:2303.08896 (2023).</mixed-citation></ref>
      <ref id="ref17"><label>[17]</label><mixed-citation>N. Mündler, J. He, S. Jenko, M. Vechev, Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=EmQSOi1X2f.</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>J. Fu, S.-K. Ng, Z. Jiang, P. Liu, GPTScore: Evaluate as you desire, 2023. arXiv:2302.04166.</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, FactScore: Fine-grained atomic evaluation of factual precision in long form text generation, arXiv preprint arXiv:2305.14251 (2023).</mixed-citation></ref>
      <ref id="ref20"><label>[20]</label><mixed-citation>J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, et al., Long-form factuality in large language models, arXiv preprint arXiv:2403.18802 (2024).</mixed-citation></ref>
      <ref id="ref21"><label>[21]</label><mixed-citation>L. F. R. Ribeiro, M. Liu, I. Gurevych, M. Dreyer, M. Bansal, FactGraph: Evaluating factuality in summarization with semantic graph representations, 2022. arXiv:2204.06508.</mixed-citation></ref>
      <ref id="ref22"><label>[22]</label><mixed-citation>A. Kumar, A. Pandey, R. Gadia, M. Mishra, Building knowledge graph using pre-trained language model for learning entity-aware relationships, in: 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), 2020, pp. 310–315. doi:10.1109/GUCON48875.2020.9231227.</mixed-citation></ref>
      <ref id="ref23"><label>[23]</label><mixed-citation>I. Melnyk, P. Dognin, P. Das, Grapher: Multi-stage knowledge graph construction using pretrained language models, in: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL: https://openreview.net/forum?id=N2CFXG8-pRd.</mixed-citation></ref>
      <ref id="ref24"><label>[24]</label><mixed-citation>J. Han, N. Collier, W. Buntine, E. Shareghi, PiVE: Prompting with iterative verification improving graph-based generative capability of LLMs, arXiv preprint arXiv:2305.12392 (2023).</mixed-citation></ref>
      <ref id="ref25"><label>[25]</label><mixed-citation>Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities, arXiv preprint arXiv:2305.13168 (2023).</mixed-citation></ref>
      <ref id="ref26"><label>[26]</label><mixed-citation>A. R. Fabbri, W. Kryscinski, B. McCann, R. Socher, D. R. Radev, SummEval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2020) 391–409. URL: https://api.semanticscholar.org/CorpusID:220768873.</mixed-citation></ref>
      <ref id="ref27"><label>[27]</label><mixed-citation>A. Wang, K. Cho, M. Lewis, Asking and answering questions to evaluate the factual consistency of summaries, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5008–5020. URL: https://aclanthology.org/2020.acl-main.450. doi:10.18653/v1/2020.acl-main.450.</mixed-citation></ref>
      <ref id="ref28"><label>[28]</label><mixed-citation>K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, Advances in Neural Information Processing Systems 28 (2015).</mixed-citation></ref>
      <ref id="ref29"><label>[29]</label><mixed-citation>S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1797–1807. URL: https://aclanthology.org/D18-1206. doi:10.18653/v1/D18-1206.</mixed-citation></ref>
      <ref id="ref30"><label>[30]</label><mixed-citation>P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.</mixed-citation></ref>
      <ref id="ref31"><label>[31]</label><mixed-citation>J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), 2018.</mixed-citation></ref>
      <ref id="ref32"><label>[32]</label><mixed-citation>T. Schuster, A. Fisch, R. Barzilay, Get your vitamin C! Robust fact verification with contrastive evidence, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 624–643. URL: https://aclanthology.org/2021.naacl-main.52. doi:10.18653/v1/2021.naacl-main.52.</mixed-citation></ref>
      <ref id="ref33"><label>[33]</label><mixed-citation>Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase Adversaries from Word Scrambling, in: Proc. of NAACL, 2019.</mixed-citation></ref>
      <ref id="ref34"><label>[34]</label><mixed-citation>C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation></ref>
      <ref id="ref35"><label>[35]</label><mixed-citation>S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: https://aclanthology.org/D15-1075. doi:10.18653/v1/D15-1075.</mixed-citation></ref>
      <ref id="ref36"><label>[36]</label><mixed-citation>A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. URL: https://aclanthology.org/N18-1101. doi:10.18653/v1/N18-1101.</mixed-citation></ref>
      <ref id="ref37"><label>[37]</label><mixed-citation>T. Khot, A. Sabharwal, P. Clark, SciTail: A textual entailment dataset from science question answering, in: AAAI, 2018.</mixed-citation></ref>
    </ref-list>
  </back>
</article>