=Paper=
{{Paper
|id=Vol-3894/paper5
|storemode=property
|title=GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework
|pdfUrl=https://ceur-ws.org/Vol-3894/paper5.pdf
|volume=Vol-3894
|authors=Hannah Sansford,Nicholas Richardson,Hermina Petric Maretic,Juba Nait Saada
|dblpUrl=https://dblp.org/rec/conf/kil/SansfordRMS24
}}
==GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework==
Hannah Sansford¹ *, Nicholas Richardson², Hermina Petric Maretic² and Juba Nait Saada²
¹ University of Bristol, UK
² Amazon Science
Abstract
Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge, are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions and to systematically check all pieces of information in the response, and they are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.
Keywords
Large Language Models, Knowledge Graphs, Hallucination Detection, Hallucination Correction
KiL'24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
* Work done during an internship with Amazon.
Emails: hannah.sansford@bristol.ac.uk (H. Sansford); nchls@amazon.es (N. Richardson); maretich@amazon.co.uk (H. Petric Maretic); jubans@amazon.co.uk (J. Nait Saada)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

As the size and power of LLMs have drastically increased over recent years, so has the number of potential applications. Arguably, one of the biggest blockers to implementing these models in practice is their tendency to hallucinate - returning seemingly plausible, but untrue, responses. Here, we focus on the problem of detecting hallucinations with respect to the provided context that the LLM should use as its source of knowledge; detecting hallucinations that have deviated from the LLM's original training data is out of the scope of this work. In applications where certainty in a response is critical, such as medical diagnosis, the existence of hallucinations that arise from a given context is especially limiting. Therefore, it is of utmost importance to develop successful methods to detect these hallucinations and, when it is of interest to address or correct them, provide clarity on which aspect of the response is likely a hallucination. The importance of this issue is reflected in the amount of research being published on the topic - see Ji et al. [1] for a recent survey of this area.

Performing evaluation on natural language is a challenging task that researchers have been interested in long before hallucinations were at the forefront of the problem. Methods have evolved a great deal from traditional N-gram based metrics, such as BLEU [2] and ROUGE [3], to much more intricate LLM-based evaluation metrics with user-defined evaluation criteria, such as G-Eval [4]. More recently, techniques to mitigate the prevalence of hallucinations in generated outputs leveraging Retrieval Augmented Generation (RAG) [5] and reasoning on knowledge graphs (KGs) [6, 7] have been proposed. The former suggested the concatenation of relevant contextual data into the prompt to ground the LLM response, while the latter enforced a more robust reasoning process through providing grounding information in KG structures [8]. As successful as these approaches have been, they do not fully circumvent the need to evaluate LLM outputs.

Inspired by current research harnessing KGs to provide grounded LLM responses, we propose GraphEval - a hallucination detection framework based on the representation of information in KG structures. To the best of our knowledge, we are the first to apply KGs to an LLM-based hallucination evaluation framework, and in doing so we provide a higher level of insight into where in the output a hallucination has occurred than any previous metrics. Additionally, we demonstrate how using our method in conjunction with current state-of-the-art hallucination detection methods improves their classification accuracy on various benchmarks. Finally, we consider the problem of hallucination correction and we introduce GraphCorrect, showcasing how GraphEval can effectively be extended to rectify a significant proportion of hallucinations present in LLM outputs.

2. Problem statement

In this work we focus on the closed-domain hallucination detection problem: the situation where we have a textual output from an LLM which is generated using some grounding context included in the prompt. In this case, the goal is for the LLM to use the provided context as its only source of knowledge. The open-domain problem, which is with respect to all factual knowledge in the world, is not explored here but is briefly discussed in Section 8.

We consider hallucination detection to be a binary classification problem, with 0 corresponding to the LLM output being factually consistent given the provided context, and 1 corresponding to the output containing at least one inconsistency. We can assess hallucination evaluation methods using a benchmarking dataset containing ground-truth labels (usually human-annotated) to determine whether a given context-output pair contains factual inconsistencies. Throughout the paper we use the terms factual, consistent, grounded and faithful interchangeably to mean containing no hallucinations with respect to the context.

Finally, we explore the problem of hallucination correction, wherein we do not use any directly labeled dataset. Instead, we utilize hallucination detection frameworks to first identify hallucinations to correct, and subsequently repurpose them to evaluate the corrected outputs. It is important to note that our exploration of hallucination correction only serves as an extension to our evaluation framework and is not the primary focus of this study.

3. Related work

Historically, N-gram based metrics such as BLEU [2] and ROUGE [3] have been the most widely used metrics for natural language evaluation. However, these metrics have been shown to perform poorly at the task of factual inconsistency detection [9, 10]. In more recent years, embedding-based metrics such as BERTScore [11] have been favoured over N-gram based metrics. These methods measure the similarity between two pieces of text by comparing the contextualised embeddings from a transformer model, such as BERT [12].

Both N-gram and embedding-based metrics base their scores on how similar the text to be evaluated is to some reference text. This similarity objective often fails to capture the intricacies of the hallucination detection problem. Therefore, researchers have begun to develop new methods that are more acutely tuned to detecting inconsistencies between an LLM output and its grounding context. Maynez et al. [9] identified the crossover between the textual entailment score in NLI tasks and consistency prediction. This was a breakthrough at the time, producing higher correlation with faithfulness than any previous metrics, and paved the way for further research that capitalised on NLI data and models [13, 14, 15].

Very recently, attention has turned to leveraging LLMs themselves to evaluate the consistency of LLM outputs. SelfCheckGPT [16] and ChatProtect [17] approach the problem by considering the self-consistency within sampled outputs. Since they require the generation of a large number of responses from the LLM, many consider these methods prohibitively computationally expensive.

Other LLM-based hallucination evaluation methods, such as G-Eval [4] and GPTScore [18], employ a different LLM for evaluation than the one used to generate the LLM response that needs to be evaluated. G-Eval allows user-defined evaluation criteria and uses automated chain-of-thought prompting and form-filling to assign scores. GPTScore treats the task as conditional generation, leveraging models like GPT-3 to assign higher probabilities to high-quality outputs by prepending evaluation instructions to the LLM prompt. Unlike NLI models trained on binary classification data, these methods produce scores that are harder to interpret as probabilities and often require additional steps for inconsistency classification.

Recent hallucination detection methods, such as FactScore [19] and SAFE [20], utilize large language models to break down the response into atomic or individual facts for evaluation. These approaches have enabled precise identification of where hallucinations occur within the LLM response. Each fact is automatically verified against a comprehensive knowledge source like Wikipedia or scientific literature in the case of FactScore, or through the use of a search engine in the case of SAFE.

FactGraph [21] is the only factuality evaluation method we are aware of that utilises graph-like structures. The method is focused solely on the detection of inconsistencies in the summarization problem, decomposing both the summary and the supporting documents into what they call structured meaning representations (MRs). These MRs describe the core semantic concepts and relations, which the authors claim to be more suitable for factuality evaluation than the raw text.
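To make the binary classification framing of Section 2 concrete, the following is a minimal Python sketch of the evaluation protocol; the detector interface and the benchmark record layout are illustrative assumptions, not part of any released GraphEval code.

```python
from typing import Callable, List, Tuple
from sklearn.metrics import balanced_accuracy_score

# A detector maps a (context, output) pair to a binary decision:
# 0 = factually consistent, 1 = contains at least one inconsistency.
Detector = Callable[[str, str], int]

def evaluate_detector(detector: Detector,
                      benchmark: List[Tuple[str, str, int]]) -> float:
    """Score a detector on (context, output, label) examples.

    Balanced accuracy is the metric used later in the paper (Section 7.4),
    since it corrects for class imbalance in benchmarks such as SummEval.
    """
    labels = [label for _, _, label in benchmark]
    predictions = [detector(context, output) for context, output, _ in benchmark]
    return balanced_accuracy_score(labels, predictions)
```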
Figure 1: A visualisation of the GraphEval approach. First, the LLM output is fed into the KG construction prompt to produce
the KG depicted on the right. Next, each individual triple in the KG is fed into an out-of-the-box hallucination detection
method, such as an NLI model, and compared to the provided context for inconsistencies. Finally, any triples that are flagged
as inconsistent are returned to the user, along with the overall hallucination decision.
4. GraphEval: Our evaluation method

GraphEval is based around the idea of representing information in a structured manner through KGs, and aims to address the lack of explainability of previous hallucination detection approaches, i.e. which concrete pieces of information in particular are inconsistent.

Formally, a KG is a collection of triples 𝒦𝒢 = {(e₁, r, e₂)} ⊆ ℰ × ℛ × ℰ, where ℰ and ℛ denote the set of entities and relationships, respectively. In the GraphEval setting, both entities and relationships are simply pieces of text. We do not make use of common extensions to this simple setting, such as entity and relationship types, or attached properties.

Our GraphEval metric consists of a two-stage procedure:

Stage 1 - Construct a KG from the LLM output to be evaluated.
Stage 2 - Iterate through each of the triples in the KG, identifying whether they are factually consistent given the provided context.

The output is considered factually inconsistent if any of the triples in stage 2 are identified as not grounded in the context. The inconsistent triple(s) may also be returned to provide explainability by highlighting where in the output the hallucination(s) has occurred. We provide a visualisation of this process in Figure 1 using a real example from one of the benchmarks described in Section 7.1.

Regarding stage 1, we provide a short review of LLM-based KG construction methods in Section 5, along with results from our implementation. For stage 2, we leverage existing techniques and employ an out-of-the-box NLI model for this task. A benefit of this approach is that it gives us the opportunity to make a direct comparison between the performance of the raw NLI model and the model supplemented with our KG approach. In essence, our method is a pre-processing step, the output of which can be fed into any hallucination detection method; we choose NLI models as they are computationally cheap compared to LLM-based models, yet still achieve state-of-the-art results. By feeding each triple into an NLI model, along with the grounding context, we obtain a probability of containing a hallucination for each triple. Finally, we classify the example as inconsistent if at least one triple produces a probability greater than 0.5.

Similar approaches to ours have been proposed in recent literature. SummaC [14] also uses NLI-based models to detect inconsistencies in LLM-generated summaries. However, it distinguishes itself by segmenting both the context and the summary into their respective sentences, and then by passing each context-summary pair into the NLI model. This approach presents challenges in maintaining entity references across sentences; for instance, "John Doe" may only be referred to as "he" in another sentence. Similarly, FactScore [19] faces the same limitation. Our method circumvents this issue by organising entity relationships with a KG.

While FactGraph [21] also makes use of graph structures in their consistency evaluation process, the method differs from GraphEval in a few major respects. Firstly, their approach can only be applied to the summarisation problem, whereas GraphEval can easily be applied to various domains such as summarisation, question answering, common sense reasoning and many others. Secondly, FactGraph does not employ LLMs anywhere in their framework, missing out on recent advances in the field. Finally, their approach aims to decompose both the LLM output and the provided context into the underlying core semantic concepts and relations, before comparing each of the graph structures. GraphEval, on the other hand, only represents the LLM output as a KG and aims to preserve as much of the information contained in the raw text as possible.

To summarise the advantages of GraphEval over previous methods:

• We present a systematic way of checking all pieces of information contained in the LLM output.
• Our method only requires one call to an LLM, in the KG construction phase, and does not require the (usually) large context documents to be input, as in all previous LLM-based metrics. This makes GraphEval less computationally expensive than other LLM-based methods.
• Our method returns the specific triples that are not grounded in the context, providing explainability for the decision and identifying which section of the output should not be trusted. We leverage this feature for hallucination correction and propose a new method called GraphCorrect, described in Section 6.

Benchmark   No. of Examples   Label Ratio   Avg Output len.   Avg Context len.
SummEval    1,600             33.2%         63                359
QAGS-C      235               48.1%         49                383
QAGS-X      239               48.5%         18                318

Table 1
Statistics relating to the evaluation benchmarks used. The label ratio is the ratio of factually consistent examples to inconsistent examples. The average output and context length are the average number of words in each.

5. Construction of KGs using LLMs

Constructing KGs from unstructured textual data involves identifying the set of entities within the text and the relationships between them, resulting in a structured representation of the information contained within the text. The process can be divided into three main stages:

1. Entity detection - the process of identifying and extracting entities from text.
2. Coreference resolution - the process of finding all expressions (also called mentions) in the text that refer to the same entity.
3. Relation extraction - the process of identifying semantic relationships between entities.

Previously, researchers addressed each stage individually, but with the increasing power of LLMs, there's been a shift towards end-to-end systems. Kumar et al. [22] suggest employing two LLM components: one for named entity recognition and another one for both relation classification and direction. Similarly, Grapher [23] utilizes a pre-trained LLM for entity extraction and relation prediction. However, these methods require users to provide possible relations. More recent methods like PiVE [24] and AutoKG [25] use LLM prompting strategies for KG construction without additional user input.

The aforementioned methods do not make use of some of the emergent abilities of LLMs, such as in-context learning and the chain-of-thought prompting strategy. We decide to leverage these emergent abilities, and take a simple prompt engineering approach to our KG construction step. The techniques used can be summarised as the following:

• Chain-of-thought (CoT) prompting strategy. Providing intermediate reasoning steps in the prompt to enable LLMs to solve more complex tasks.
• In-context learning. A method of prompt engineering where one provides several task demonstrations within the prompt, circumventing the need for fine-tuning.

The final prompt used in our experiments can be found in the Appendix. We highlight to the reader that our KG construction method is not the main contribution of our work, which is rather the application of KG construction to the hallucination detection problem. The major benefit of our KG construction approach is its ease of implementation with any LLM. Furthermore, it is less computationally intensive than methods like PiVE, which performs multiple iterations of improvements to the generated KG. Of course, users may conduct the KG construction stage of GraphEval using their method of choice; the experiments in this paper exhibit the capability of a simple prompting strategy.
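As an illustration of stage 1 under this prompting approach, here is a minimal sketch. `call_llm` is a placeholder for whatever client is used (the paper uses Claude 2 through Amazon Bedrock), and the parsing assumes the model follows the Appendix A instruction to return the triple list inside <python> tags.

```python
import ast
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM client call (the paper uses Claude 2 on Bedrock)."""
    raise NotImplementedError

def construct_kg(llm_output: str, kg_prompt_template: str) -> List[Triple]:
    """Stage 1: ask an LLM to express the output under test as KG triples.

    `kg_prompt_template` stands for the Appendix A prompt, which instructs
    the model to return a Python list of ["entity 1", "relation", "entity 2"]
    triples wrapped in <python>...</python> tags.
    """
    response = call_llm(kg_prompt_template.format(input=llm_output))
    match = re.search(r"<python>(.*?)</python>", response, re.DOTALL)
    if match is None:
        raise ValueError("LLM response did not contain a <python> block")
    return [tuple(t) for t in ast.literal_eval(match.group(1).strip())]
```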
6. GraphCorrect: Correction of hallucinations with GraphEval

While the primary focus of this work lies in hallucination detection, GraphEval's breakdown of LLM outputs into triples easily allows for its extension to correct hallucinations within the given context. To achieve this, we first identify all triples within the KG that are likely to contain hallucinations (i.e. those with a probability greater than 0.5, if any). We then employ the following two-step procedure on each identified triple:

Step 1 - Input the given triple along with the context into an LLM to correct for the potential hallucinations within the triple. This results in a newly generated corrected triple.
Step 2 - Input the identified triple, its corrected counterpart and the initial LLM output. Selectively replace the information from the original (hallucination-containing) triple with the information from the new triple in the initial LLM output.

We name this LLM hallucination correction method GraphCorrect. The final prompts used in our experiments for step 1 and step 2 can be found in Appendices B and C respectively. This systematic approach to hallucination correction offers several benefits. First, it tackles each identified hallucination separately, increasing the chances of all perceived hallucinations being corrected. Furthermore, it offers the advantage of exclusively altering the segments of the original text that are suspected to contain a hallucination, leaving other elements untouched and ensuring overall high similarity with the original text. Finally, breaking down the entire process into intermediate steps ensures that the original context and the initial LLM output never undergo simultaneous processing within an LLM. This safeguards against both the addition of extra information and the loss of information in the LLM output.
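The two-step procedure can be sketched as follows, reusing the hypothetical `call_llm` and `construct_kg` helpers from the stage-1 sketch in Section 5; `nli_hallucination_prob` and the `KG_PROMPT` constant (the Appendix A template) are likewise placeholders, and the template fields mirror the Appendix B and C prompts.

```python
from typing import Tuple

Triple = Tuple[str, str, str]

def nli_hallucination_prob(context: str, triple: Triple) -> float:
    """Placeholder: probability that `triple` is not grounded in `context`."""
    raise NotImplementedError

def graph_correct(llm_output: str, context: str,
                  correct_triple_prompt: str, replace_prompt: str) -> str:
    """Sketch of GraphCorrect.

    `correct_triple_prompt` and `replace_prompt` stand for the Appendix B
    and Appendix C templates respectively.
    """
    corrected = llm_output
    for triple in construct_kg(llm_output, KG_PROMPT):      # stage 1 of GraphEval
        if nli_hallucination_prob(context, triple) > 0.5:   # suspect triple
            # Step 1: correct the triple against the context.
            new_triple = call_llm(correct_triple_prompt.format(
                triple=list(triple), context=context))
            # Step 2: splice the corrected information back into the output.
            corrected = call_llm(replace_prompt.format(
                summary=corrected, old_triple=list(triple), new_triple=new_triple))
    return corrected
```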
7. Experiments

7.1. Benchmarks

We conducted two sets of experiments: one focusing on hallucination detection to highlight GraphEval's performance and another on hallucination correction to showcase the advantages of GraphCorrect. For both scenarios, we utilized the SummEval [26], QAGS-C and QAGS-X [27] benchmarks - currently the most prevalent benchmarks in relevant academic literature. All three are concerned with detecting hallucinations in LLM-generated summaries and are human-annotated for factual consistency with respect to the grounding context. Table 1 contains some statistics pertaining to each of these datasets.

SummEval. The SummEval dataset consists of human evaluations on 16 summarization model outputs from 100 articles from the CNN/DailyMail dataset [28]. Each summary is labelled on a Likert scale from 1-5 on 4 categories: consistency, coherence, fluency and relevance. We follow the TRUE benchmark [13] in taking the consistency scores and mapping a score of 5 to being fully consistent, and anything lower to being inconsistent.

QAGS. The QAGS-C and QAGS-X datasets are built from the CNN/DailyMail and the XSum [29] datasets, respectively. The human annotators examined the summaries one sentence at a time, and determined the factual consistency of each sentence by comparing it to the original article. Three annotators assessed each sentence and the majority decision was recorded. Again, we follow the TRUE benchmark in considering a summary to be factually consistent if and only if all sentences are considered consistent.

7.2. NLI models in GraphEval

As mentioned in Section 4, we employ NLI models to perform the second stage of GraphEval - checking the consistency of each individual triple with respect to the context. We conduct experiments using the three most popular NLI-based hallucination detection models available on HuggingFace¹.

HHEM. Based on the DeBERTaV3 model [30] and initially trained on NLI data, the hallucination evaluation model created by Vectara² is further fine-tuned on datasets annotated for consistency. The datasets used for fine-tuning were: FEVER [31], Vitamin C [32] and PAWS [33]. This model is considerably smaller than the following two models, requiring only 738 MB of memory, and thus has a significantly shorter run-time.

TRUE. The TRUE model is based on a T5-XXL model [34] and is trained similarly to the model described in the TRUE paper [13]. Instead of the ANLI dataset used in that paper, this model is trained on the same datasets as HHEM, plus the following: SNLI [35], MNLI [36] and Scitail [37]. This model requires 45.5 GB of memory.

TrueTeacher. Gekhman et al. [15] leverage the ability of LLMs to evaluate hallucinations by generating synthetic data through annotating model-generated summaries. They then use this synthetic data to further fine-tune the model from [13], leading to state-of-the-art performance on the TRUE benchmark. This model is the same size as the TRUE model.

¹ https://huggingface.co
² https://huggingface.co/vectara/hallucination_evaluation_model
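For illustration, here is a schematic of how a cross-encoder style consistency model can score one triple against its context. Checkpoints differ in their heads and post-processing (the Vectara model, for instance, ships custom model code), so treat this as a generic sketch under the stated assumptions rather than a drop-in for any of the three models above.

```python
import torch
from typing import Tuple
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "vectara/hallucination_evaluation_model"  # or any NLI-style checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, trust_remote_code=True)

def hallucination_prob(context: str, triple: Tuple[str, str, str]) -> float:
    """Probability that a triple is NOT supported by the context.

    The triple is verbalised by simple concatenation; the model scores the
    (premise, hypothesis) pair. We assume a single consistency logit, which
    is an assumption about the checkpoint's head.
    """
    hypothesis = " ".join(triple)
    inputs = tokenizer(context, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    consistency = torch.sigmoid(logits.squeeze()).item()  # 1 = consistent
    return 1.0 - consistency
```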
7.3. Experimental settings

In all experiments conducted in this study necessitating the utilization of an LLM, we use Claude 2³, an LLM from Anthropic, through the Amazon Bedrock API⁴. We use the default settings for the LLM: temperature = 1, top_p = 1, top_k = 250. We also refer the reader to the Appendix for the prompts used in this work.

³ https://www.anthropic.com/news/claude-2
⁴ https://aws.amazon.com/bedrock/claude/

7.4. Results

7.4.1. Hallucination detection with GraphEval

We present our results of hallucination detection for the three NLI models, and their GraphEval counterparts, in Table 2. We report the balanced accuracy as our evaluation metric, which corrects for the class imbalance in the SummEval benchmark. In the case of using the NLI model directly, we classify the example as containing a hallucination if the NLI model returns a probability of more than 0.5. When combining the NLI model with GraphEval, we classify the example as containing a hallucination if at least one triple fed to the NLI model returns a probability of more than 0.5. We see that adding the GraphEval pre-processing step to each of the NLI models almost always improves the balanced accuracy score, sometimes by a considerable amount, such as the results for the SummEval and QAGS-C benchmarks in Table 2. On average (weighting by the number of samples in each dataset), adding the GraphEval pre-processing step improves the balanced accuracy by 6.2 (SE=1.3).

Model                      SummEval   QAGS-C   QAGS-X
HHEM                       66.0       63.5     75.5
HHEM + GraphEval           71.5       72.2     75.2
TRUE                       61.3       61.8     72.6
TRUE + GraphEval           72.4       71.7     73.9
TrueTeacher                74.9       75.6     79.0
TrueTeacher + GraphEval    79.2       78.1     79.6

Table 2
Balanced accuracy scores for hallucination detection of NLI models (HHEM, TRUE, TrueTeacher) and their GraphEval counterparts on the SummEval, QAGS-C and QAGS-X benchmarks.

We hypothesise that the negligible difference between the base NLI model and the model supplemented with GraphEval for the QAGS-X dataset is due to the average length of the generated text (only 18 words, compared with 49 and 63 for QAGS-C and SummEval respectively, see Table 1). This highlights an important aspect of where the most value can be found in our method. When the LLM output is very short, it is less likely that there are multiple facts that need to be checked for consistency (which can easily be done without the use of a KG) and the intricacies of the short sentence might even be lost in the KG construction phase. On the other hand, when the LLM output is very long, current methods struggle to test each individual fact against the context, and this is when GraphEval thrives.

It should be noted that even when the results for GraphEval are comparable to the baseline methods, the benefit of using GraphEval is the identification of the specific triple(s) that are inconsistent with the provided context.
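As a sanity check on the quoted average, the snippet below reproduces the 6.2 figure from Tables 1 and 2 under one plausible reading: a sample-size-weighted mean of the per-benchmark gains across all three models.

```python
import numpy as np

# Balanced-accuracy gains (GraphEval minus base) read off Table 2, ordered
# SummEval, QAGS-C, QAGS-X for HHEM, TRUE and TrueTeacher in turn, and
# weighted by the benchmark sizes given in Table 1.
sizes = np.array([1600, 235, 239] * 3)
gains = np.array([5.5, 8.7, -0.3,    # HHEM
                  11.1, 9.9, 1.3,    # TRUE
                  4.3, 2.5, 0.6])    # TrueTeacher
print(round(np.average(gains, weights=sizes), 1))  # 6.2
```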
Detection                  Dataset     ROUGE-1 (DP / GC)   ROUGE-2 (DP / GC)   ROUGE-L (DP / GC)
HHEM + GraphEval           SummEval    0.827 / 0.915       0.772 / 0.879       0.796 / 0.910
                           QAGS-C      0.800 / 0.893       0.735 / 0.841       0.769 / 0.885
                           QAGS-X      0.649 / 0.821       0.495 / 0.734       0.606 / 0.815
TRUE + GraphEval           SummEval    0.781 / 0.880       0.707 / 0.833       0.746 / 0.871
                           QAGS-C      0.840 / 0.894       0.780 / 0.848       0.808 / 0.886
                           QAGS-X      0.651 / 0.805       0.505 / 0.706       0.613 / 0.795
TrueTeacher + GraphEval    SummEval    0.781 / 0.884       0.703 / 0.839       0.737 / 0.876
                           QAGS-C      0.809 / 0.889       0.743 / 0.837       0.781 / 0.881
                           QAGS-X      0.643 / 0.797       0.486 / 0.694       0.598 / 0.784

Table 3
Average ROUGE-1, ROUGE-2 and ROUGE-L scores measuring similarity between original and corrected summaries using Direct Prompt (DP) and GraphCorrect (GC) across different datasets and hallucination detection frameworks.
7.4.2. Hallucination correction with GraphCorrect

Identifying the particular triple(s) likely to harbor a hallucination enables straightforward correction using GraphCorrect, as described in Section 6. For each of the evaluation frameworks proposed here (HHEM + GraphEval, TRUE + GraphEval, and TrueTeacher + GraphEval), we compared GraphCorrect to a basic prompting strategy for hallucination correction, serving as a baseline. The prompt used in this baseline approach, referred to as the Direct Prompt henceforth, is provided in Appendix D.

For each framework, we initially identify hallucinations, correct only the LLM outputs suspected of containing hallucinations using either GraphCorrect or Direct Prompt, and then reapply the evaluation framework to detect hallucinations in the corrected LLM outputs. Note that this procedure only allows us to measure what we presume to be corrected hallucinations, given the potential for errors in the evaluation frameworks utilized here. We report the percentage of believed corrected hallucinations in Table 4. A score of 0% suggests no corrected hallucinations according to the given framework, while a score of 100% indicates correction of all hallucinations as per the given framework. GraphCorrect outperforms the prompting strategy proposed here by correcting significantly more hallucinations on all tasks apart from two related to the QAGS-X dataset. As on the hallucination detection task, we hypothesise these results are correlated with the average length of the text, with GraphCorrect bringing most value in longer texts with a more complex structure to unravel and correct.

Additionally, as previously stated, GraphCorrect offers the advantage of only modifying the segments of text in the LLM outputs susceptible to hallucinations, while leaving other sections unaltered, thereby maintaining high overall similarity with the original text. This characteristic is illustrated in Table 3 by assessing the ROUGE-1, ROUGE-2, and ROUGE-L metrics between the original summaries and the corrected versions for both GraphCorrect and Direct Prompt across all experimental scenarios examined in this study. GraphCorrect systematically generates texts that are closer in similarity to the original LLM outputs compared to its counterpart.
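The similarity scores of Table 3 can be reproduced in spirit with the rouge-score package; the toy strings below are illustrative, and the exact preprocessing behind the paper's numbers is not specified.

```python
from rouge_score import rouge_scorer

# Illustrative only: measures how close a corrected summary stays to the
# original, as done for Table 3.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

original = "The company reported record profits of 3.2 billion dollars in 2019."
corrected = "The company reported record profits of 2.3 billion dollars in 2019."

for name, score in scorer.score(original, corrected).items():
    print(name, round(score.fmeasure, 3))
```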
Detection & Evaluation     Dataset     Direct Prompt   GraphCorrect
HHEM + GraphEval           SummEval    48.6            55.1
                           QAGS-C      38.5            58.7
                           QAGS-X      63.2            69.5
TRUE + GraphEval           SummEval    49.6            59.5
                           QAGS-C      42.7            53.7
                           QAGS-X      70.8            66.7
TrueTeacher + GraphEval    SummEval    53.1            59.8
                           QAGS-C      47.1            59.6
                           QAGS-X      71.1            69.3

Table 4
Percentage of believed corrected hallucinations using a direct prompting strategy and GraphCorrect on the SummEval, QAGS-C and QAGS-X benchmarks. The hallucinations were first detected by HHEM + GraphEval, TRUE + GraphEval and TrueTeacher + GraphEval respectively, and then corrections were evaluated by the same metric.

8. Discussion

Our work focuses on detection of hallucinations in closed-domain tasks, where we are interested only in consistency with respect to the provided context. The GraphEval framework could be extended to open-domain hallucination detection by employing agents, as in AutoKG [25], to first retrieve relevant external sources as the grounding information to check against.

We expect that in the near future, more research will be conducted on the construction of KGs from unstructured text, which will provide improvements to the first stage of our procedure and ultimately the evaluation performance. Even as LLMs alone become more powerful, this will continue to contribute to improvements in GraphEval's performance.

We observe that, in the knowledge graph construction phase of our procedure, it is possible that some information loss may occur. However, as shown by the results in Section 7.4, our method rarely leads to a reduction in balanced accuracy. Furthermore, when it is comparable to the baseline methods, we have the added explainability of identifying the specific triples where the hallucination has occurred.

We believe our hallucination correction framework (GraphCorrect) shows promise and is an interesting avenue for future work. However, the effectiveness of the approach described in this work should be assessed manually, rather than relying on the convoluted use of hallucination evaluation frameworks (which only yield measurements of believed corrected hallucinations).
9. Conclusion

We introduce GraphEval, a simple and effective pre-processing step for improving the explainability and performance of LLM hallucination detection metrics. Our method leverages the ability of LLMs to extract information from unstructured text and construct knowledge graphs, whose triples can be fed into out-of-the-box hallucination detection methods.

We demonstrate that GraphEval in conjunction with state-of-the-art NLI models leads to an average improvement in balanced accuracy of 6.2 (SE=1.3) on three popular hallucination benchmarks. Furthermore, our method indicates which triples, in the KG representation of the LLM output, are inconsistent. To the best of our knowledge, this is the first application of KGs to an LLM-based hallucination evaluation framework and we believe the success of GraphEval will only grow as KG construction methods also improve.

Finally, we examined the issue of hallucination correction and showed that GraphCorrect can effectively address the majority of hallucinations found in LLM outputs while maintaining extremely high similarity with the original texts.

References

[1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[2] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
[3] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[4] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: NLG evaluation using gpt-4 with better human alignment, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 2511–2522. URL: https://aclanthology.org/2023.emnlp-main.153. doi:10.18653/v1/2023.emnlp-main.153.
[5] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[6] L. Luo, Y.-F. Li, G. Haffari, S. Pan, Reasoning on graphs: Faithful and interpretable large language model reasoning, arXiv preprint arXiv:2310.01061 (2023).
[7] L. Yang, H. Chen, Z. Li, X. Ding, X. Wu, Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling, IEEE Transactions on Knowledge and Data Engineering (2024).
[8] G. Agrawal, T. Kumarage, Z. Alghamdi, H. Liu, Can knowledge graphs reduce hallucinations in llms?: A survey, 2024. arXiv:2311.07914.
[9] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1906–1919. URL: https://aclanthology.org/2020.acl-main.173. doi:10.18653/v1/2020.acl-main.173.
[10] O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, O. Abend, Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7856–7870. URL: https://aclanthology.org/2021.emnlp-main.619. doi:10.18653/v1/2021.emnlp-main.619.
[11] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[13] O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, Y. Matias, TRUE: Re-evaluating factual consistency evaluation, in: S. Feng, H. Wan, C. Yuan, H. Yu (Eds.), Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 161–175. URL: https://aclanthology.org/2022.dialdoc-1.19. doi:10.18653/v1/2022.dialdoc-1.19.
[14] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, SummaC: Re-visiting NLI-based models for inconsistency detection in summarization, Transactions of the Association for Computational Linguistics 10 (2022) 163–177. URL: https://aclanthology.org/2022.tacl-1.10. doi:10.1162/tacl_a_00453.
[15] Z. Gekhman, J. Herzig, R. Aharoni, C. Elkind, I. Szpektor, Trueteacher: Learning factual consistency evaluation with large language models, 2023. arXiv:2305.11171.
[16] P. Manakul, A. Liusie, M. J. Gales, Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint arXiv:2303.08896 (2023).
[17] N. Mündler, J. He, S. Jenko, M. Vechev, Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=EmQSOi1X2f.
[18] J. Fu, S.-K. Ng, Z. Jiang, P. Liu, Gptscore: Evaluate as you desire, 2023. arXiv:2302.04166.
[19] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, Factscore: Fine-grained atomic evaluation of factual precision in long form text generation, arXiv preprint arXiv:2305.14251 (2023).
[20] J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, et al., Long-form factuality in large language models, arXiv preprint arXiv:2403.18802 (2024).
[21] L. F. R. Ribeiro, M. Liu, I. Gurevych, M. Dreyer, M. Bansal, Factgraph: Evaluating factuality in summarization with semantic graph representations, 2022. arXiv:2204.06508.
[22] A. Kumar, A. Pandey, R. Gadia, M. Mishra, Building knowledge graph using pre-trained language model for learning entity-aware relationships, in: 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), 2020, pp. 310–315. doi:10.1109/GUCON48875.2020.9231227.
[23] I. Melnyk, P. Dognin, P. Das, Grapher: Multi-stage knowledge graph construction using pretrained language models, in: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL: https://openreview.net/forum?id=N2CFXG8-pRd.
[24] J. Han, N. Collier, W. Buntine, E. Shareghi, Pive: Prompting with iterative verification improving graph-based generative capability of llms, arXiv preprint arXiv:2305.12392 (2023).
[25] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities, arXiv preprint arXiv:2305.13168 (2023).
[26] A. R. Fabbri, W. Kryscinski, B. McCann, R. Socher, D. R. Radev, Summeval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2020) 391–409. URL: https://api.semanticscholar.org/CorpusID:220768873.
[27] A. Wang, K. Cho, M. Lewis, Asking and answering questions to evaluate the factual consistency of summaries, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5008–5020. URL: https://aclanthology.org/2020.acl-main.450. doi:10.18653/v1/2020.acl-main.450.
[28] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, Advances in neural information processing systems 28 (2015).
[29] S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1797–1807. URL: https://aclanthology.org/D18-1206. doi:10.18653/v1/D18-1206.
[30] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[31] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), 2018.
[32] T. Schuster, A. Fisch, R. Barzilay, Get your vitamin C! robust fact verification with contrastive evidence, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 624–643. URL: https://aclanthology.org/2021.naacl-main.52. doi:10.18653/v1/2021.naacl-main.52.
[33] Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase Adversaries from Word Scrambling, in: Proc. of NAACL, 2019.
[34] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[35] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: https://aclanthology.org/D15-1075. doi:10.18653/v1/D15-1075.
[36] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. URL: https://aclanthology.org/N18-1101. doi:10.18653/v1/N18-1101.
[37] T. Khot, A. Sabharwal, P. Clark, SciTail: A textual entailment dataset from science question answering, in: AAAI, 2018.
A. KG Construction Prompt

("system",
"""
You are an expert at extracting information in structured formats to build a knowledge graph.
Step 1 - Entity detection: Identify all entities in the raw text. Make sure not to miss any out. Entities should be basic and simple, they are akin to Wikipedia nodes.
Step 2 - Coreference resolution: Find all expressions in the text that refer to the same entity. Make sure entities are not duplicated. In particular do not include entities that are more specific versions of themselves, e.g. "a detailed view of jupiter's atmosphere" and "jupiter's atmosphere", only include the most specific version of the entity.
Step 3 - Relation extraction: Identify semantic relationships between the entities you have identified.

Format: Return the knowledge graph as a list of triples, i.e. ["entity 1", "relation 1-2", "entity 2"], in Python code.
""",
),
("human",
"Use the given format to extract information from the following input: <input>{input}</input>. Skip the preamble and output the result as a list within <python></python> tags.",
),
("human",
"""Important Tips:
1. Make sure all information is included in the knowledge graph.
2. Each triple must only contain three strings! None of the strings should be empty.
3. Do not split up related information into separate triples because this could change the meaning.
4. Make sure all brackets and quotation marks are matched.
5. Before adding a triple to the knowledge graph, check the concatenated triple makes sense as a sentence. If not, discard it.
""",
),
("human",
"""Here are some example input and output pairs.

## Example 1.
Input:
"The Walt Disney Company, commonly known as Disney, is an American multinational mass media and entertainment conglomerate that is headquartered at the Walt Disney Studios complex in Burbank, California."
Output:
<python>
[["The Walt Disney Company", "headquartered at", "Walt Disney Studios complex in Burbank, California"],
["The Walt Disney Company", "commonly known as", "Disney"],
["The Walt Disney Company", "instance of", "American multinational mass media and entertainment conglomerate"]]
</python>

## Example 2.
Input:
"Amanda Jackson was born in Springfield, Ohio, USA on June 1, 1985. She was a basketball player for the U.S. women's team."
Output:
<python>
[["Amanda Jackson", "born in", "Springfield, Ohio, USA"],
["Amanda Jackson", "born on", "June 1, 1985"],
["Amanda Jackson", "occupation", "basketball player"],
["Amanda Jackson", "played for", "U.S. women's basketball team"]]
</python>

## Example 3.
Input:
"Music executive Darius Van Arman was born in Pennsylvania. He attended Gonzaga College High School and is a human being."
Output:
<python>
[["Darius Van Arman", "occupation", "Music executive"],
["Darius Van Arman", "born in", "Pennsylvania"],
["Darius Van Arman", "attended", "Gonzaga College High School"],
["Darius Van Arman", "instance of", "human being"]]
</python>

## Example 4.
Input: "Italy had 3.6x times more cases of coronavirus than China."
Output:
<python>
[["Italy", "had 3.6x times more cases of coronavirus than", "China"]]
</python>
""",
),

B. Hallucination correction (step 1)

"""
You are an expert at extracting information in structured formats from text.
The following triple contains factually incorrect information. Correct it based on the provided context.
Important Tips:
1. A triple is defined as ["entity 1", "relation 1-2", "entity 2"].
2. A triple must only contain three strings! None of the strings should be empty.
3. The concatenated triple must make sense as a sentence.
4. Only return the corrected triple, nothing else.

<triple>{triple}</triple>
<context>{context}</context>

Remember, it is important that you only return the corrected triple.
"""

C. Hallucination correction (step 2)

"""
In the following context, replace the information of the old triple with the information of the new one.
Do not make any other modification to the context.
Only return the new context.
<context>{summary}</context>
<old_triple>{old_triple}</old_triple>
<new_triple>{new_triple}</new_triple>
"""

D. Hallucination correction without a KG

"""
The following summary contains factually incorrect information.
Correct it based on the context, but don't change other parts of the summary.
Only return the corrected summary, nothing else.
<summary>{summary}</summary>
<context>{context}</context>
Remember, do minimal changes to the original summary, don't make it longer and keep as much of it as you can exactly the same.
"""