<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hannah</forename><surname>Sansford</surname></persName>
							<email>hannah.sansford@bristol.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bristol</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicholas</forename><surname>Richardson</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Amazon Science</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hermina</forename><forename type="middle">Petric</forename><surname>Maretic</surname></persName>
							<email>maretich@amazon.co.uk</email>
							<affiliation key="aff1">
								<orgName type="institution">Amazon Science</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Juba</forename><forename type="middle">Nait</forename><surname>Saada</surname></persName>
							<email>jubans@amazon.co.uk</email>
							<affiliation key="aff1">
								<orgName type="institution">Amazon Science</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0493B68747775A0C6079A4C691695B7B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Knowledge Graphs</term>
					<term>Hallucination Detection</term>
					<term>Hallucination Correction</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions and to systematically check all pieces of information in the response, and they are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As the size and power of LLMs have drastically increased over recent years, so has the number of potential applications. Arguably, one of the biggest blockers to implementing these models in practice is their tendency to hallucinate: returning seemingly plausible, but untrue, responses. Here, we focus on the problem of detecting hallucinations with respect to the provided context that the LLM should use as its source of knowledge; detecting hallucinations that have deviated from the LLM's original training data is out of the scope of this work. In applications where certainty in a response is critical, such as medical diagnosis, the existence of hallucinations that arise from a given context is especially limiting. Therefore, it is of utmost importance to develop successful methods to detect these hallucinations and, when it is of interest to address or correct them, provide clarity on which aspect of the response is likely a hallucination. The importance of this issue is reflected in the amount of research being published on the topic; see Ji et al. <ref type="bibr" target="#b0">[1]</ref> for a recent survey of this area. Evaluation of generated text has a long history, beginning before hallucinations were at the forefront of the problem. Methods have evolved a great deal from traditional N-gram based metrics, such as BLEU <ref type="bibr" target="#b1">[2]</ref> and ROUGE <ref type="bibr" target="#b2">[3]</ref>, to much more intricate LLM-based evaluation metrics with user-defined evaluation criteria, such as G-Eval <ref type="bibr" target="#b3">[4]</ref>. More recently, techniques to mitigate the prevalence of hallucinations in generated outputs leveraging Retrieval Augmented Generation (RAG) <ref type="bibr" target="#b4">[5]</ref> and reasoning on knowledge graphs (KGs) <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref> have been proposed. 
The former suggested the concatenation of relevant contextual data into the prompt to ground the LLM response, while the latter enforced a more robust reasoning process by providing grounding information in KG structures <ref type="bibr" target="#b7">[8]</ref>. As successful as these approaches have been, they do not fully circumvent the need to evaluate LLM outputs.</p><p>Inspired by current research harnessing KGs to provide grounded LLM responses, we propose GraphEval, a hallucination detection framework based on the representation of information in KG structures. To the best of our knowledge, we are the first to apply KGs to an LLM-based hallucination evaluation framework, and in doing so we provide a higher level of insight into where in the output a hallucination has occurred than any previous metrics. Additionally, we demonstrate how using our method in conjunction with current state-of-the-art hallucination detection methods improves their classification accuracy on various benchmarks. Finally, we consider the problem of hallucination correction and introduce GraphCorrect, showcasing how GraphEval can effectively be extended to rectify a significant proportion of hallucinations present in LLM outputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Problem statement</head><p>In this work we focus on the closed-domain hallucination detection problem: the situation where we have a textual output from an LLM which is generated using some grounding context included in the prompt. In this case, the goal is for the LLM to use the provided context as its only source of knowledge. The open-domain problem, which concerns consistency with respect to all factual knowledge in the world, is not explored here but is briefly discussed in Section 8.</p><p>We consider hallucination detection to be a binary classification problem, with 0 corresponding to the LLM output being factually consistent given the provided context, and 1 corresponding to the output containing at least one inconsistency. We can assess hallucination evaluation methods using a benchmarking dataset containing ground-truth labels (usually human-annotated) to determine whether a given context-output pair contains factual inconsistencies. Throughout the paper we use the terms factual, consistent, grounded and faithful interchangeably to mean containing no hallucinations with respect to the context. Finally, we explore the problem of hallucination correction, wherein we do not use any directly labeled dataset. Instead, we utilize hallucination detection frameworks to first identify hallucinations to correct, and subsequently repurpose them to evaluate the corrected outputs. It is important to note that our exploration of hallucination correction only serves as an extension to our evaluation framework and is not the primary focus of this study.</p></div>
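The binary-classification framing above can be sketched in a few lines. This is an illustrative sketch only: `evaluate_detector` and the `detector` callable are hypothetical names standing in for any hallucination evaluation method scored against a labelled benchmark, not an interface defined in this paper.

```python
# Closed-domain hallucination detection as binary classification over
# labelled (context, output) pairs. `detector` is any method returning
# 1 for "contains an inconsistency" and 0 for "factually consistent".

def evaluate_detector(detector, benchmark):
    """benchmark: iterable of (context, llm_output, label), label in {0, 1}."""
    predictions = [(detector(context, output), label)
                   for context, output, label in benchmark]
    # Plain accuracy; balanced accuracy (Section 7.4) additionally
    # corrects for class imbalance.
    correct = sum(pred == label for pred, label in predictions)
    return correct / len(predictions)
```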
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related work</head><p>Historically, N-gram based metrics such as BLEU <ref type="bibr" target="#b1">[2]</ref> and ROUGE <ref type="bibr" target="#b2">[3]</ref> have been the most widely used metrics for natural language evaluation. However, these metrics have been shown to perform poorly at the task of factual inconsistency detection <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>. In more recent years, embedding-based metrics such as BERTScore <ref type="bibr" target="#b10">[11]</ref> have been favoured over N-gram based metrics. These methods measure the similarity between two pieces of text by comparing the contextualised embedding from a transformer model, such as BERT <ref type="bibr" target="#b11">[12]</ref>.</p><p>Both N-gram and embedding-based metrics base their scores on how similar the text to be evaluated is to some reference text. This similarity objective often fails to capture the intricacies of the hallucination detection problem.</p><p>Therefore, researchers have begun to develop new methods that are more acutely tuned to detecting inconsistencies between an LLM output and its grounding context. Maynez et al. <ref type="bibr" target="#b8">[9]</ref> identified the crossover between the textual entailment score in NLI tasks and consistency prediction. This was a breakthrough at the time, producing higher correlation with faithfulness than any previous metrics, and paved the way for further research that capitalised on NLI data and models <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>Very recently, attention has turned to leveraging LLMs themselves to evaluate the consistency of LLM outputs. SelfCheckGPT <ref type="bibr" target="#b15">[16]</ref> and ChatProtect <ref type="bibr" target="#b16">[17]</ref> approach the problem by considering the self-consistency within sampled outputs. 
Since they require the generation of a large number of responses from the LLM, many consider these methods prohibitively computationally expensive.</p><p>Other LLM-based hallucination evaluation methods, such as G-Eval <ref type="bibr" target="#b3">[4]</ref> and GPTScore <ref type="bibr" target="#b17">[18]</ref>, employ a different LLM for evaluation than the one used to generate the LLM response that needs to be evaluated. G-Eval allows user-defined evaluation criteria and uses automated chain-of-thought prompting and form-filling to assign scores. GPTScore treats the task as conditional generation, leveraging models like GPT-3 to assign higher probabilities to high-quality outputs by prepending evaluation instructions to the LLM prompt. Unlike NLI models trained on binary classification data, these methods produce scores that are harder to interpret as probabilities and often require additional steps for inconsistency classification.</p><p>Recent hallucination detection methods, such as FactScore <ref type="bibr" target="#b18">[19]</ref> and SAFE <ref type="bibr" target="#b19">[20]</ref>, utilize large language models to break down the response into atomic or individual facts for evaluation. These approaches have enabled precise identification of where hallucinations occur within the LLM response. Each fact is automatically verified against a comprehensive knowledge source like Wikipedia or scientific literature in the case of FactScore, or through the use of a search engine in the case of SAFE.</p><p>FactGraph <ref type="bibr" target="#b20">[21]</ref> is the only factuality evaluation method we are aware of that utilises graph-like structures. The method is focused solely on the detection of inconsistencies in the summarization problem, decomposing both the summary and the supporting documents into what they call structured meaning representations (MRs). 
These MRs describe the core semantic concepts and relations, which the authors claim to be more suitable for factuality evaluation than the raw text. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">GraphEval: Our evaluation method</head><p>GraphEval is based around the idea of representing information in a structured manner through KGs, and aims to address the lack of explainability of previous hallucination detection approaches, i.e. their inability to indicate which concrete pieces of information in particular are inconsistent.</p><p>Formally, a KG is a collection of triples 𝒦𝒢 = {(𝑒₁, 𝑟, 𝑒₂)} ⊆ ℰ × ℛ × ℰ, where ℰ and ℛ denote the set of entities and relationships, respectively. In the GraphEval setting, both entities and relationships are simply pieces of text. We do not make use of common extensions to this simple setting, such as entity and relationship types, or attached properties.</p><p>Our GraphEval metric consists of a two-stage procedure:</p><p>Stage 1: Construct a KG from the LLM output to be evaluated. Stage 2: Iterate through each of the triples in the KG, identifying whether they are factually consistent given the provided context.</p><p>The output is considered factually inconsistent if any of the triples in stage 2 are identified as not grounded in the context. The inconsistent triple(s) may also be returned to provide explainability by highlighting where in the output the hallucination(s) has occurred. We provide a visualisation of this process in Figure <ref type="figure" target="#fig_0">1</ref> using a real example from one of the benchmarks described in Section 7.1.</p><p>Regarding stage 1, we provide a short review of LLM-based KG construction methods in Section 5, along with results from our implementation. For stage 2, we leverage existing techniques and employ an out-of-the-box NLI model for this task. A benefit of this approach is that it gives us the opportunity to make a direct comparison between the performance of the raw NLI model and the model supplemented with our KG approach. 
In essence, our method is a pre-processing step, the output of which can be fed into any hallucination detection method; we choose NLI models as they are computationally cheap compared to LLM-based models, yet still achieve state-of-the-art results. By feeding each triple into an NLI model, along with the grounding context, we obtain a probability of containing a hallucination for each triple. Finally, we classify the example as inconsistent if at least one triple produces a probability greater than 0.5.</p><p>Similar approaches to ours have been proposed in recent literature. SummaC <ref type="bibr" target="#b13">[14]</ref> also uses NLI-based models to detect inconsistencies in LLM-generated summaries. However, it distinguishes itself by segmenting both the context and the summary into their respective sentences, and then by passing each context-summary pair into the NLI model. This approach presents challenges in maintaining entity references across sentences; for instance, "John Doe" may only be referred to as "he" in another sentence. Similarly, FactScore <ref type="bibr" target="#b18">[19]</ref> faces the same limitation. Our method circumvents this issue by organising entity relationships with a KG.</p><p>While FactGraph <ref type="bibr" target="#b20">[21]</ref> also makes use of graph structures in their consistency evaluation process, the method differs from GraphEval in a few major respects. Firstly, their approach can only be applied to the summarisation problem, whereas GraphEval can easily be applied to various domains such as summarisation, question answering, common sense reasoning and many others. Secondly, FactGraph does not employ LLMs anywhere in their framework, missing out on recent advances in the field. Finally, their approach aims to decompose both the LLM output and the provided context into the underlying core semantic concepts and relations, before comparing each of the graph structures. 
GraphEval, on the other hand, only represents the LLM output as a KG and aims to preserve as much of the information contained in the raw text as possible.</p><p>To summarise the advantages of GraphEval over previous methods:</p><p>• We present a systematic way of checking all pieces of information contained in the LLM output. • Our method only requires one call to an LLM, in the KG construction phase, and does not require the (usually) large context documents to be input, as in all previous LLM-based metrics. This makes GraphEval less computationally expensive than other LLM-based methods. • Our method returns the specific triples that are not grounded in the context, providing explainability for the decision and identifying which section of the output should not be trusted. We leverage this feature for hallucination correction and propose a new method called GraphCorrect, described in Section 6.</p></div>
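The two-stage procedure can be sketched as follows. This is a minimal illustration, assuming two hypothetical callables not named in the paper: `extract_triples` (Stage 1, e.g. an LLM-based KG constructor) and `nli_hallucination_prob` (Stage 2, e.g. an NLI model such as HHEM returning a hallucination probability per triple).

```python
# Sketch of GraphEval's two-stage procedure.

def graph_eval(output_text, context, extract_triples, nli_hallucination_prob,
               threshold=0.5):
    """Return (is_hallucinated, offending_triples) for an LLM output."""
    # Stage 1: build a KG from the LLM output as (entity, relation, entity) triples.
    triples = extract_triples(output_text)

    # Stage 2: check each triple against the grounding context with an NLI model.
    offending = [t for t in triples
                 if nli_hallucination_prob(context, t) > threshold]

    # The output is inconsistent if any triple is not grounded in the context;
    # the offending triples provide the explainability.
    return len(offending) > 0, offending
```

Note that the context is passed only to the (cheap) NLI model, never to the LLM, which is what keeps the approach comparatively inexpensive.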
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Construction of KGs using LLMs</head><p>Constructing KGs from unstructured textual data involves identifying the set of entities within the text and the relationships between them, resulting in a structured representation of the information contained within the text. The process can be divided into three main stages:</p><p>1. Entity detection: the process of identifying and extracting entities from text. 2. Coreference resolution: the process of finding all expressions (also called mentions) in the text that refer to the same entity. 3. Relation extraction: the process of identifying semantic relationships between entities.</p><p>Previously, researchers addressed each stage individually, but with the increasing power of LLMs, there has been a shift towards end-to-end systems. Kumar et al. <ref type="bibr" target="#b21">[22]</ref> suggest employing two LLM components: one for named entity recognition and another for both relation classification and direction. Similarly, Grapher <ref type="bibr" target="#b22">[23]</ref> utilizes a pre-trained LLM for entity extraction and relation prediction. However, these methods require users to provide possible relations. More recent methods like PiVE <ref type="bibr" target="#b23">[24]</ref> and AutoKG <ref type="bibr" target="#b24">[25]</ref> use LLM prompting strategies for KG construction without additional user input.</p><p>The aforementioned methods do not make use of some of the emergent abilities of LLMs, such as in-context learning and the chain-of-thought prompting strategy. We decide to leverage these emergent abilities, and take a simple prompt engineering approach to our KG construction step. The techniques used can be summarised as the following:</p><p>• Chain-of-thought (CoT) prompting strategy. Providing intermediate reasoning steps in the prompt to enable LLMs to solve more complex tasks. • In-context learning. 
A method of prompt engineering where one provides several task demonstrations within the prompt, circumventing the need for fine-tuning.</p><p>The final prompt used in our experiments can be found in the Appendix. We highlight to the reader that our KG construction method is not the main contribution of our work, which is rather the application of KG construction to the hallucination detection problem. The major benefit of our KG construction approach is its ease of implementation with any LLM. Furthermore, it is less computationally intensive than methods like PiVE, which performs multiple iterations of improvements to the generated KG.</p><p>Of course, users may conduct the KG construction stage of GraphEval using their method of choice; the experiments in this paper exhibit the capability of a simple prompting strategy.</p></div>
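The prompting approach can be sketched as below. The actual prompt used in the experiments is given in the paper's Appendix; the instruction text, the few-shot example, and the `call_llm` wrapper here are all illustrative assumptions, not the paper's prompt.

```python
# Sketch of Stage 1 KG construction via prompt engineering: few-shot
# demonstrations (in-context learning) plus an instruction to reason in
# intermediate steps (chain-of-thought).

FEW_SHOT = [
    ("John Doe, a physician, lives in London.",
     [("John Doe", "profession", "physician"),
      ("John Doe", "lives in", "London")]),
]

def build_kg_prompt(text):
    lines = [
        "Extract (entity, relation, entity) triples from the text.",
        "Think step by step: first list the entities, then resolve mentions "
        "that refer to the same entity, then extract the relations.",
        "",
    ]
    # In-context learning: show the model worked demonstrations.
    for example_text, triples in FEW_SHOT:
        lines.append(f"Text: {example_text}")
        lines.append(f"Triples: {triples}")
    lines.append(f"Text: {text}")
    lines.append("Triples:")
    return "\n".join(lines)

def construct_kg(text, call_llm):
    # `call_llm` is any chat/completion wrapper; its raw string output is
    # parsed into triples downstream.
    return call_llm(build_kg_prompt(text))
```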
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">GraphCorrect: Correction of hallucinations with GraphEval</head><p>While the primary focus of this work lies in hallucination detection, GraphEval's breakdown of LLM outputs into triples easily allows for its extension to correct hallucinations within the given context. To achieve this, we first identify all triples within the KG that are likely to contain hallucinations (i.e. those with a probability greater than 0.5, if any). We then employ the following two-step procedure on each identified triple:</p><p>Step 1: Input the given triple along with the context into an LLM to correct for the potential hallucinations within the triple. This results in a newly generated corrected triple.</p><p>Step 2: Input the identified triple, its corrected counterpart and the initial LLM output. Selectively replace the information from the original (hallucination-containing) triple with the information from the new triple in the initial LLM output.</p><p>We name this hallucination correction method GraphCorrect. The final prompts used in our experiments for steps 1 and 2 can be found in Appendices B and C respectively. This systematic approach to hallucination correction offers several benefits. First, it tackles each identified hallucination separately, increasing the chances of all perceived hallucinations being corrected. Furthermore, it offers the advantage of exclusively altering the segments of the original text that are suspected to contain a hallucination, leaving other elements untouched and ensuring overall high similarity with the original text. Finally, breaking down the entire process into intermediate steps ensures that the original context and the initial LLM output never undergo simultaneous processing within an LLM. This safeguards against both the addition of extra information and the loss of information in the LLM output.</p></div>
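The two-step procedure can be sketched as follows. The real prompts are in Appendices B and C of the paper; the prompt strings and the `call_llm` wrapper below are illustrative placeholders only.

```python
# Sketch of GraphCorrect: per-triple correction (Step 1) followed by
# splicing the corrected triple back into the output (Step 2).

def graph_correct(output_text, context, flagged_triples, call_llm):
    corrected = output_text
    for triple in flagged_triples:
        # Step 1: correct the triple against the context. Only the context
        # and the triple are shown to the LLM here.
        fixed_triple = call_llm(
            f"Context: {context}\nTriple: {triple}\n"
            "Rewrite the triple so it is consistent with the context.")
        # Step 2: replace the original triple's information in the output.
        # Only the output and the two triples are shown, so the context and
        # the full output are never processed simultaneously.
        corrected = call_llm(
            f"Text: {corrected}\nOriginal triple: {triple}\n"
            f"Corrected triple: {fixed_triple}\n"
            "Replace only the information from the original triple with the "
            "corrected triple, leaving the rest of the text unchanged.")
    return corrected
```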
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Benchmarks</head><p>We conducted two sets of experiments: one focusing on hallucination detection to highlight GraphEval's performance and another on hallucination correction to showcase the advantages of GraphCorrect. For both scenarios, we utilized the SummEval <ref type="bibr" target="#b25">[26]</ref>, QAGS-C and QAGS-X <ref type="bibr" target="#b26">[27]</ref> benchmarks, currently the most prevalent benchmarks in relevant academic literature. All three are concerned with detecting hallucinations in LLM-generated summaries and are human-annotated for factual consistency with respect to the grounding context. Table <ref type="table" target="#tab_1">1</ref> contains some statistics pertaining to each of these datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SummEval</head><p>The SummEval dataset consists of human evaluations on 16 summarization model outputs from 100 articles from the CNN/DailyMail dataset <ref type="bibr" target="#b27">[28]</ref>. Each summary is labelled on a Likert scale from 1 to 5 across four categories: consistency, coherence, fluency and relevance. We follow the TRUE benchmark <ref type="bibr" target="#b12">[13]</ref> in taking the consistency scores and mapping a score of 5 to being fully consistent, and anything lower to being inconsistent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QAGS</head><p>The QAGS-C and QAGS-X datasets are built from the CNN/DailyMail and the XSum <ref type="bibr" target="#b28">[29]</ref> datasets, respectively. The human annotators examined the summaries one sentence at a time, and determined the factual consistency of each sentence by comparing it to the original article. Three annotators assessed each sentence and the majority decision was recorded. Again, we follow the TRUE benchmark in considering a summary to be factually consistent if and only if all sentences are considered consistent.</p></div>
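The QAGS labelling scheme just described reduces to two small aggregation rules, sketched here for clarity (function names are ours, not from the benchmark):

```python
# QAGS labelling: three annotators vote per sentence (1 = consistent),
# the majority decision is recorded, and a summary is factually consistent
# if and only if every sentence is.

def sentence_label(votes):
    """votes: e.g. [1, 1, 0] from three annotators."""
    return sum(votes) >= 2  # majority of three

def summary_is_consistent(per_sentence_votes):
    return all(sentence_label(v) for v in per_sentence_votes)
```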
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">NLI models in GraphEval</head><p>As mentioned in Section 4, we employ NLI models to perform the second stage of GraphEval: checking the consistency of each individual triple with respect to the context. We conduct experiments using the three most popular NLI-based hallucination detection models available on HuggingFace<ref type="foot" target="#foot_0">1</ref>.</p><p>HHEM Based on the DeBERTaV3 model <ref type="bibr" target="#b29">[30]</ref> and initially trained on NLI data, the hallucination evaluation model created by Vectara is further fine-tuned on datasets annotated for consistency. The datasets used for fine-tuning were: FEVER <ref type="bibr" target="#b30">[31]</ref>, Vitamin C <ref type="bibr" target="#b31">[32]</ref> and PAWS <ref type="bibr" target="#b32">[33]</ref>. This model is considerably smaller than the following two models, requiring only 738 MB of memory, and thus has a significantly shorter run-time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TRUE</head><p>The TRUE model is based on a T5-XXL model <ref type="bibr" target="#b33">[34]</ref> and is trained similarly to the model described in the TRUE paper <ref type="bibr" target="#b12">[13]</ref>. Instead of the ANLI dataset used in that paper, this model is trained on the same datasets as HHEM, plus the following: SNLI <ref type="bibr" target="#b34">[35]</ref>, MNLI <ref type="bibr" target="#b35">[36]</ref> and Scitail <ref type="bibr" target="#b36">[37]</ref>. This model requires 45.5 GB of memory.</p><p>TrueTeacher Gekhman et al. <ref type="bibr" target="#b14">[15]</ref> leverage the ability of LLMs to evaluate hallucinations by generating synthetic data through annotating model-generated summaries. They then use this synthetic data to further fine-tune the model from <ref type="bibr" target="#b12">[13]</ref>, leading to state-of-the-art performance on the TRUE benchmark. This model is the same size as the TRUE model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Experimental settings</head><p>All experiments in this study that require an LLM use Claude 2 <ref type="foot" target="#foot_2">3</ref>, an LLM from Anthropic, accessed through the Amazon Bedrock API <ref type="foot" target="#foot_3">4</ref>. We use the default settings for the LLM: temperature = 1, top_p = 1, top_k = 250. We also refer the reader to the Appendix for the prompts used in this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.1.">Hallucination detection with GraphEval</head><p>We present our results of hallucination detection for the three NLI models, and their GraphEval counterparts, in Table <ref type="table">2</ref>. We report the balanced accuracy as our evaluation metric, which corrects for the class imbalance in the SummEval benchmark. In the case of using the NLI model directly, we classify the example as containing a hallucination if the NLI model returns a probability of more than 0.5. When combining the NLI model with GraphEval, we classify the example as containing a hallucination if at least one triple fed to the NLI model returns a probability of more than 0.5. We see that adding the GraphEval pre-processing step to each of the NLI models almost always improves the balanced accuracy score, sometimes by a considerable amount, such as the results for the SummEval and QAGS-C benchmarks in Table <ref type="table">2</ref>. On average (weighting by the number of samples in each dataset), adding the GraphEval pre-processing step improves the balanced accuracy by 6.2 (SE=1.3). </p></div>
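Balanced accuracy, as reported above, is the mean of the per-class recalls. A minimal reference implementation (equivalent to, e.g., scikit-learn's `balanced_accuracy_score`) is:

```python
# Balanced accuracy: average the recall obtained on each class, which
# corrects for class imbalance such as that in the SummEval benchmark.

def balanced_accuracy(y_true, y_pred):
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```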
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Balanced accuracy scores for hallucination detection of NLI models (HHEM, TRUE, TrueTeacher) and their GraphEval counterparts on the SummEval, QAGS-C and QAGS-X benchmarks.</p><p>We hypothesise that the negligible difference between the base NLI model and the model supplemented with GraphEval for the QAGS-X dataset is due to the average length of the generated text (only 18 words, compared with 49 and 63 for QAGS-C and SummEval respectively, see Table <ref type="table" target="#tab_1">1</ref>). This highlights an important aspect of where the most value can be found in our method. When the LLM output is very short, it is less likely to contain multiple facts that need to be checked for consistency (which can easily be done without the use of a KG), and the intricacies of the short sentence might even be lost in the KG construction phase. On the other hand, when the LLM output is very long, current methods struggle to test each individual fact against the context, and this is when GraphEval thrives.</p><p>It should be noted that even when the results for GraphEval are comparable to the baseline methods, the benefit of using GraphEval is the identification of the specific triple(s) that are inconsistent with the provided context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.2.">Hallucination correction with GraphCorrect</head><p>Identifying the particular triple(s) likely to harbor a hallucination enables straightforward correction using GraphCorrect, as described in Section 6. For each of the evaluation frameworks proposed here (HHEM + GraphEval, TRUE + GraphEval, and TrueTeacher + GraphEval), we compared GraphCorrect to a basic prompting strategy for hallucination correction, serving as a baseline. The prompt used in this baseline approach, referred to as the Direct Prompt henceforth, is provided in Appendix D.</p><p>For each framework, we initially identify hallucinations, correct only the LLM outputs suspected of containing hallucinations using either GraphCorrect or Direct Prompt, and then reapply the evaluation framework to detect hallucinations in the corrected LLM outputs. Note that this procedure only allows us to measure what we presume to be corrected hallucinations, given the potential for errors in the evaluation frameworks utilized here. We report the </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Average ROUGE-1, ROUGE-2 and ROUGE-L scores measuring similarity between original and corrected summaries using Direct Prompt and GraphCorrect across different datasets and hallucination detection frameworks.</p><p>percentage of believed corrected hallucinations in Table <ref type="table">4</ref>. A score of 0% suggests no corrected hallucinations according to the given framework, while a score of 100% indicates correction of all hallucinations as per the given framework. GraphCorrect outperforms the prompting strategy proposed here by correcting significantly more hallucinations on all tasks apart from two related to the QAGS-X dataset. As in the hallucination detection task, we hypothesise these results are correlated with the average length of the text, with GraphCorrect bringing most value in longer texts with a more complex structure to unravel and correct.</p><p>Additionally, as previously stated, GraphCorrect offers the advantage of only modifying the segments of text in the LLM outputs susceptible to hallucinations, while leaving other sections unaltered, thereby maintaining high overall similarity with the original text. This characteristic is illustrated in Table <ref type="table">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Percentage of believed corrected hallucinations using a direct prompting strategy and GraphCorrect on the SummEval, QAGS-C and QAGS-X benchmarks. The hallucinations were first detected by HHEM + GraphEval, TRUE + GraphEval and TrueTeacher + GraphEval respectively, and then corrections were evaluated by the same metric.</p><p>Table <ref type="table">3</ref> compares the original summaries and the corrected versions for both GraphCorrect and Direct Prompt across all experimental scenarios examined in this study. GraphCorrect systematically generates texts that are closer in similarity to the original LLM outputs compared to its counterpart.</p></div>
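The ROUGE similarity used in Table 3 can be illustrated with a simplified, set-based ROUGE-1 F1 between an original and a corrected summary. This is a sketch for intuition only (standard ROUGE uses clipped n-gram counts, and the paper's scores would come from a standard implementation such as the `rouge-score` package):

```python
# Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall,
# computed over unique tokens (a simplification of count-based ROUGE).

def rouge1_f1(reference, candidate):
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```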
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Discussion</head><p>Our work focuses on the detection of hallucinations in closed-domain tasks, where we are interested only in consistency with respect to the provided context. The GraphEval framework could be extended to open-domain hallucination detection by employing agents, as in AutoKG <ref type="bibr" target="#b24">[25]</ref>, to first retrieve relevant external sources as the grounding information to check against.</p><p>We expect that in the near future, more research will be conducted on the construction of KGs from unstructured text, which will improve the first stage of our procedure and ultimately the evaluation performance. Even as LLMs alone become more powerful, such advances will continue to improve GraphEval's performance.</p><p>We observe that some information loss may occur in the knowledge graph construction phase of our procedure. However, as shown by the results in Section 7.4, our method rarely leads to a reduction in balanced accuracy. Furthermore, even when it is merely comparable to the baseline methods, it adds the explainability of identifying the specific triples where the hallucination has occurred.</p><p>We believe our hallucination correction framework, GraphCorrect, shows promise and presents an interesting avenue for future work. However, the effectiveness of the approach described in this work should be assessed manually, rather than relying on the convoluted use of hallucination evaluation frameworks (which only yield measurements of believed corrected hallucinations).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Conclusion</head><p>We introduce GraphEval, a simple and effective preprocessing step that improves the explainability and performance of LLM hallucination detection metrics. Our method leverages LLMs' ability to extract information from unstructured text and construct knowledge graphs, whose triples can be fed into out-of-the-box hallucination detection methods.</p><p>We demonstrate that GraphEval in conjunction with state-of-the-art NLI models leads to an average improvement in balanced accuracy of 6.2 (SE=1.3) on three popular hallucination benchmarks. Furthermore, our method indicates which triples in the KG representation of the LLM output are inconsistent. To the best of our knowledge, this is the first application of KGs to an LLM-based hallucination evaluation framework, and we believe the success of GraphEval will only grow as KG construction methods improve.</p><p>Finally, we examined the issue of hallucination correction and showed that GraphCorrect can effectively address the majority of hallucinations found in LLM outputs while maintaining extremely high similarity with the original texts.</p></div>
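Balanced accuracy, the metric quoted above, is the mean of per-class recall, which makes it robust to the uneven label ratios of the benchmarks (Table 1). A minimal sketch of the computation for binary consistency labels:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (intended to match the usual definition,
    e.g. scikit-learn's balanced_accuracy_score, for binary labels where
    1 = factually consistent and 0 = hallucinated)."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        # Recall for this class: correctly predicted members / all members.
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

On a benchmark where only a third of examples are consistent (e.g. SummEval's 33.2% label ratio), plain accuracy would reward always predicting "hallucinated"; balanced accuracy does not.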
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. KG Construction Prompt</head><p>("system", """You are an expert at extracting information in structured formats to build a knowledge graph.
Step 1 - Entity detection: Identify all entities in the raw text. Make sure not to miss any out. Entities should be basic and simple, they are akin to Wikipedia nodes.
Step 2 - Coreference resolution: Find all expressions in the text that refer to the same entity. Make sure entities are not duplicated. In particular do not include entities that are more specific versions of themselves, e.g. "a detailed view of jupiter's atmosphere" and "jupiter's atmosphere", only include the most specific version of the entity.
Step 3 - Relation extraction: Identify semantic relationships between the entities you have identified.</p><p>Format: Return the knowledge graph as a list of triples, i.e. ["entity 1", "relation 1-2", "entity 2"].
[...] ["Darius Van Arman", "occupation", "Music executive"], ["Darius Van Arman", "born in", "Pennsylvania"], ["Darius Van Arman", "attended", "Gonzaga College High School"], ["Darius Van Arman", "instance of", "human being"]] &lt;/python&gt;
## Example 4. Input: "Italy had 3.6x times more cases of coronavirus than China." Output: &lt;python&gt; [["Italy", "had 3.6x times more cases of coronavirus than", "China"]] &lt;/python&gt; """,),</p></div>
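The prompt above asks the model to return the triple list wrapped in &lt;python&gt; tags. A small parser for such responses might look like the following; it is an illustrative assumption, not code from the paper.

```python
import ast
import re

def parse_triples(llm_response: str):
    """Extract the [entity1, relation, entity2] triples that the
    KG-construction prompt asks the model to wrap in <python> tags.

    Raises ValueError if no tagged block is present; silently drops
    malformed triples (anything that is not three non-empty strings).
    """
    match = re.search(r"<python>(.*?)</python>", llm_response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <python>...</python> block in response")
    triples = ast.literal_eval(match.group(1).strip())
    return [t for t in triples
            if len(t) == 3 and all(isinstance(s, str) and s for s in t)]
```

Using `ast.literal_eval` rather than `eval` keeps the parse safe even if the model returns unexpected content inside the tags.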
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Hallucination correction (step 1)</head><p>"""You are an expert at extracting information in structured formats from text. The following triple contains factually incorrect information. Correct it based on the provided context.</p><p>Important Tips:
1. A triple is defined as ["entity 1", "relation 1-2", "entity 2"].
2. A triple must only contain three strings! None of the strings should be empty.
3. The concatenated triple must make sense as a sentence.
4. Only return the corrected triple, nothing else.
&lt;triple&gt; {triple} &lt;/triple&gt; &lt;context&gt; {context} &lt;/context&gt;
Remember, it is important that you only return the corrected triple."""</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Hallucination correction (step 2)</head><p>"""In the following context, replace the information of the old triple with the information of the new one. Do not make any other modification to the context. Only return the new context.
&lt;context&gt; {summary} &lt;/context&gt; &lt;old_triple&gt; {old_triple} &lt;/old_triple&gt; &lt;new_triple&gt; {new_triple} &lt;/new_triple&gt;"""</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Hallucination correction without a KG</head><p>"""The following summary contains factually incorrect information. Correct it based on the context, but don't change other parts of the summary. Only return the corrected summary, nothing else.
&lt;summary&gt; {summary} &lt;/summary&gt; &lt;context&gt; {context} &lt;/context&gt;
Remember, do minimal changes to the original summary, don't make it longer and keep as much of it as you can exactly the same."""</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A visualisation of the GraphEval approach. First, the LLM output is fed into the KG construction prompt to produce the KG depicted on the right. Next, each individual triple in the KG is fed into an out-of-the-box hallucination detection method, such as an NLI model, and compared to the provided context for inconsistencies. Finally, any triples that are flagged as inconsistent are returned to the user, along with the overall hallucination decision.</figDesc><graphic coords="3,89.29,84.19,416.68,209.04" type="bitmap" /></figure>
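The end-to-end flow described in the Figure 1 caption can be sketched as below. Here `extract_triples` and `nli_consistent` are placeholder callables (an LLM-driven KG constructor and an NLI-style consistency checker), not a fixed interface from the paper.

```python
def grapheval(llm_output, context, extract_triples, nli_consistent):
    """Sketch of the GraphEval flow: build KG triples from the LLM output,
    check each triple against the context with an NLI-style model, and
    return the overall decision plus the inconsistent triples."""
    triples = extract_triples(llm_output)  # KG construction via an LLM prompt
    # Flag any triple the checker cannot ground in the provided context.
    flagged = [t for t in triples
               if not nli_consistent(context, " ".join(t))]
    return {"hallucination": bool(flagged), "inconsistent_triples": flagged}
```

Returning the flagged triples, not just a binary verdict, is what gives the framework its triple-level explainability.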
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Input: "Amanda Jackson was born in Springfield, Ohio, USA on June 1, 1985. She was a basketball player for the U.S. women's team." Output: &lt;python&gt; [["Amanda Jackson", "born in", "Springfield, Ohio, USA"], ["Amanda Jackson", "born on", "June 1, 1985"], ["Amanda Jackson", "occupation", "basketball player"], ["Amanda Jackson", "played for", "U.S. women's basketball team"]] &lt;/python&gt; ## Example 3. Input: "Music executive Darius Van Arman was born in Pennsylvania. He attended Gonzaga College High School and is a human being." Output: &lt;python&gt; [["Darius Van Arman", "</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Benchmark No. of Examples Label Ratio Avg Output len. Avg Context len.</head><label></label><figDesc></figDesc><table><row><cell>SummEval</cell><cell>1,600</cell><cell>33.2%</cell><cell>63</cell><cell>359</cell></row><row><cell>QAGS-C</cell><cell>235</cell><cell>48.1%</cell><cell>49</cell><cell>383</cell></row><row><cell>QAGS-X</cell><cell>239</cell><cell>48.5%</cell><cell>18</cell><cell>318</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Statistics relating to the evaluation benchmarks used. The label ratio is the ratio of factually consistent examples to inconsistent examples. The average output and context length are the average number of words in each.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://huggingface.co</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/vectara/hallucination_evaluation_model</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.anthropic.com/news/claude-2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://aws.amazon.com/bedrock/claude/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
		<ptr target="https://doi.org/10.1145/3571730" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</editor>
		<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">G-eval: NLG evaluation using gpt-4 with better human alignment</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.emnlp-main.153</idno>
		<ptr target="https://aclanthology.org/2023.emnlp-main.153" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2511" to="2522" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Retrieval-augmented generation for knowledge-intensive nlp tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="9459" to="9474" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Haffari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.01061</idno>
		<title level="m">Reasoning on graphs: Faithful and interpretable large language model reasoning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kumarage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alghamdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.07914</idno>
		<title level="m">Can knowledge graphs reduce hallucinations in llms? : A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">On faithfulness and factuality in abstractive summarization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bohnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mcdonald</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.173</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.173" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1906" to="1919" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering</title>
		<author>
			<persName><forename type="first">O</forename><surname>Honovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Choshen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Neeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szpektor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Abend</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.619</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.619" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7856" to="7870" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Bertscore: Evaluating text generation with bert</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=SkeHuCVFDr" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">TRUE: Re-evaluating factual consistency evaluation</title>
		<author>
			<persName><forename type="first">O</forename><surname>Honovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Taitelbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kukliansy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szpektor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hassidim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matias</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.dialdoc-1.19</idno>
		<ptr target="https://aclanthology.org/2022.dialdoc-1.19" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Feng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Yuan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</editor>
		<meeting>the Second DialDoc Workshop on Documentgrounded Dialogue and Conversational Question Answering, Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="161" to="175" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SummaC: Re-visiting NLI-based models for inconsistency detection in summarization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Laban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schnabel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00453</idno>
		<ptr target="https://aclanthology.org/2022.tacl-1.10" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="163" to="177" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Gekhman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Elkind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szpektor</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.11171</idno>
		<title level="m">Trueteacher: Learning factual consistency evaluation with large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Manakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liusie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08896</idno>
		<title level="m">Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Selfcontradictory hallucinations of large language models: Evaluation, detection and mitigation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mündler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vechev</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=EmQSOi1X2f" />
	</analytic>
	<monogr>
		<title level="m">The Twelfth International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-K</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.04166</idno>
		<title level="m">GPTScore: Evaluate as you desire</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Koh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14251</idno>
		<title level="m">FActScore: Fine-grained atomic evaluation of factual precision in long-form text generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Du</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.18802</idno>
		<title level="m">Long-form factuality in large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F R</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dreyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.06508</idno>
		<title level="m">FactGraph: Evaluating factuality in summarization with semantic graph representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Building knowledge graph using pre-trained language model for learning entity-aware relationships</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pandey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gadia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mishra</surname></persName>
		</author>
		<idno type="DOI">10.1109/GUCON48875.2020.9231227</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Computing, Power and Communication Technologies (GUCON)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="310" to="315" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Grapher: Multi-stage knowledge graph construction using pretrained language models</title>
		<author>
			<persName><forename type="first">I</forename><surname>Melnyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dognin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Das</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=N2CFXG8-pRd" />
	</analytic>
	<monogr>
		<title level="m">NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Buntine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shareghi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.12392</idno>
		<title level="m">PiVe: Prompting with iterative verification improving graph-based generative capability of LLMs</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.13168</idno>
		<title level="m">LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">SummEval: Re-evaluating summarization evaluation</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Fabbri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kryscinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mccann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Radev</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:220768873" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="391" to="409" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Asking and answering questions to evaluate the factual consistency of summaries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.450</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.450" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5008" to="5020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Teaching machines to read and comprehend</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Hermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kocisky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espeholt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suleyman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blunsom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Don&apos;t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D18-1206</idno>
		<ptr target="https://aclanthology.org/D18-1206" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Riloff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Chiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Hockenmaier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tsujii</surname></persName>
		</editor>
		<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1797" to="1807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">DeBERTa: Decoding-enhanced BERT with disentangled attention</title>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=XPZIaotutsD" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">The FEVER2.0 shared task</title>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Cocarascu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)</title>
				<meeting>the Second Workshop on Fact Extraction and VERification (FEVER)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Get your vitamin C! Robust fact verification with contrastive evidence</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Barzilay</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.52</idno>
		<ptr target="https://aclanthology.org/2021.naacl-main.52" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="624" to="643" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">PAWS: Paraphrase Adversaries from Word Scrambling</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Baldridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of NAACL</title>
				<meeting>of NAACL</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v21/20-074.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="1" to="67" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">A large annotated corpus for learning natural language inference</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Angeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D15-1075</idno>
		<ptr target="https://aclanthology.org/D15-1075" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Su</surname></persName>
		</editor>
		<meeting>the 2015 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="632" to="642" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">A broad-coverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1101</idno>
		<ptr target="https://aclanthology.org/N18-1101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Walker</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Ji</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Stent</surname></persName>
		</editor>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">SciTail: A textual entailment dataset from science question answering</title>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
