CEUR Workshop Proceedings (ceur-ws.org, ISSN 1613-0073), Vol-3894, paper 5. PDF: https://ceur-ws.org/Vol-3894/paper5.pdf. DBLP: https://dblp.org/rec/conf/kil/SansfordRMS24
                                GraphEval: A Knowledge-Graph Based LLM Hallucination
                                Evaluation Framework
                                Hannah Sansford1,* , Nicholas Richardson2 , Hermina Petric Maretic2 and Juba Nait Saada2
                                1
                                    University of Bristol, UK
                                2
                                    Amazon Science


                                                   Abstract
                                                   Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with
                                                   respect to the provided knowledge, are becoming increasingly important for LLM applications. Current metrics fall short in
                                                   their ability to provide explainable decisions, systematically check all pieces of information in the response, and are often too
                                                   computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on
                                                   representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are
                                                   prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all,
                                                   than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference
                                                   (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw
                                                   NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a
                                                   method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.

                                                   Keywords
                                                   Large Language Models, Knowledge Graphs, Hallucination Detection, Hallucination Correction



KiL'24: Workshop on Knowledge-infused Learning co-located with 30th ACM KDD Conference, August 26, 2024, Barcelona, Spain
* Work done during an internship with Amazon.
$ hannah.sansford@bristol.ac.uk (H. Sansford); nchls@amazon.es (N. Richardson); maretich@amazon.co.uk (H. Petric Maretic); jubans@amazon.co.uk (J. Nait Saada)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

As the size and power of LLMs have drastically increased over recent years, so has the number of potential applications. Arguably, one of the biggest blockers to implementing these models in practice is their tendency to hallucinate - returning seemingly plausible, but untrue, responses. Here, we focus on the problem of detecting hallucinations with respect to the provided context that the LLM should use as its source of knowledge; detecting hallucinations that deviate from the LLM's original training data is out of the scope of this work. In applications where certainty in a response is critical, such as medical diagnosis, the existence of hallucinations that arise from a given context is especially limiting. It is therefore of utmost importance to develop successful methods to detect these hallucinations and, when it is of interest to address or correct them, to provide clarity on which aspect of the response is likely a hallucination. The importance of this issue is reflected in the amount of research being published on the topic - see Ji et al. [1] for a recent survey of the area.

Performing evaluation on natural language is a challenging task that researchers have been interested in long before hallucinations were at the forefront of the problem. Methods have evolved a great deal from traditional N-gram based metrics, such as BLEU [2] and ROUGE [3], to much more intricate LLM-based evaluation metrics with user-defined evaluation criteria, such as G-Eval [4]. More recently, techniques to mitigate the prevalence of hallucinations in generated outputs have been proposed, leveraging Retrieval Augmented Generation (RAG) [5] and reasoning on knowledge graphs (KGs) [6, 7]. The former concatenates relevant contextual data into the prompt to ground the LLM response, while the latter enforces a more robust reasoning process by providing grounding information in KG structures [8]. As successful as these approaches have been, they do not fully circumvent the need to evaluate LLM outputs.

Inspired by current research harnessing KGs to provide grounded LLM responses, we propose GraphEval - a hallucination detection framework based on the representation of information in KG structures. To the best of our knowledge, we are the first to apply KGs to an LLM-based hallucination evaluation framework, and in doing so we provide a higher level of insight into where in the output a hallucination has occurred than any previous metrics. Additionally, we demonstrate how using our method in conjunction with current state-of-the-art hallucination detection methods improves their classification accuracy on various benchmarks. Finally, we consider the problem of hallucination correction and introduce GraphCorrect, showcasing how GraphEval can




effectively be extended to rectify a significant proportion of hallucinations present in LLM outputs.


2. Problem statement

In this work we focus on the closed-domain hallucination detection problem: the situation where we have a textual output from an LLM which was generated using some grounding context included in the prompt. In this case, the goal is for the LLM to use the provided context as its only source of knowledge. The open-domain problem, which is posed with respect to all factual knowledge in the world, is not explored here but is briefly discussed in Section 8.

We consider hallucination detection to be a binary classification problem, with 0 corresponding to the LLM output being factually consistent given the provided context, and 1 corresponding to the output containing at least one inconsistency. We can assess hallucination evaluation methods using a benchmarking dataset containing ground-truth labels (usually human-annotated) that determine whether a given context-output pair contains factual inconsistencies. Throughout the paper we use the terms factual, consistent, grounded and faithful interchangeably to mean containing no hallucinations with respect to the context.

Finally, we explore the problem of hallucination correction, wherein we do not use any directly labelled dataset. Instead, we utilize hallucination detection frameworks to first identify hallucinations to correct, and subsequently repurpose them to evaluate the corrected outputs. It is important to note that our exploration of hallucination correction only serves as an extension to our evaluation framework and is not the primary focus of this study.


3. Related work

Historically, N-gram based metrics such as BLEU [2] and ROUGE [3] have been the most widely used metrics for natural language evaluation. However, these metrics have been shown to perform poorly at the task of factual inconsistency detection [9, 10]. In more recent years, embedding-based metrics such as BERTScore [11] have been favoured over N-gram based metrics. These methods measure the similarity between two pieces of text by comparing contextualised embeddings from a transformer model, such as BERT [12].

Both N-gram and embedding-based metrics base their scores on how similar the text to be evaluated is to some reference text. This similarity objective often fails to capture the intricacies of the hallucination detection problem. Therefore, researchers have begun to develop new methods that are more acutely tuned to detecting inconsistencies between an LLM output and its grounding context. Maynez et al. [9] identified the crossover between the textual entailment score in NLI tasks and consistency prediction. This was a breakthrough at the time, producing higher correlation with faithfulness than any previous metrics, and it paved the way for further research that capitalised on NLI data and models [13, 14, 15].

Very recently, attention has turned to leveraging LLMs themselves to evaluate the consistency of LLM outputs. SelfCheckGPT [16] and ChatProtect [17] approach the problem by considering the self-consistency within sampled outputs. Since they require the generation of a large number of responses from the LLM, many consider these methods prohibitively computationally expensive.

Other LLM-based hallucination evaluation methods, such as G-Eval [4] and GPTScore [18], employ a different LLM for evaluation than the one used to generate the response being evaluated. G-Eval allows user-defined evaluation criteria and uses automated chain-of-thought prompting and form-filling to assign scores. GPTScore treats the task as conditional generation, leveraging models like GPT-3 to assign higher probabilities to high-quality outputs by prepending evaluation instructions to the LLM prompt. Unlike NLI models trained on binary classification data, these methods produce scores that are harder to interpret as probabilities and often require additional steps for inconsistency classification.

Recent hallucination detection methods, such as FactScore [19] and SAFE [20], utilize large language models to break down the response into atomic or individual facts for evaluation. These approaches have enabled precise identification of where hallucinations occur within the LLM response. Each fact is automatically verified against a comprehensive knowledge source, such as Wikipedia or scientific literature in the case of FactScore, or through the use of a search engine in the case of SAFE.

FactGraph [21] is the only factuality evaluation method we are aware of that utilises graph-like structures. The method is focused solely on the detection of inconsistencies in the summarization problem, decomposing both the summary and the supporting documents into what they call structured meaning representations (MRs). These MRs describe the core semantic concepts and relations, which the authors claim to be more suitable for factuality evaluation than the raw text.
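Since the task formulated in Section 2 is plain binary classification, any detector can be scored in the same way. The sketch below is our own illustration, not the paper's code; `detector` and `benchmark` are hypothetical objects, and the metric is the balanced accuracy reported later in the experiments.

```python
from typing import Callable, List, Tuple

def balanced_accuracy(labels: List[int], preds: List[int]) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)

def evaluate_detector(
    detector: Callable[[str, str], int],           # (context, output) -> 0 or 1
    benchmark: List[Tuple[str, str, int]],         # (context, output, gold label)
) -> float:
    """Score a hallucination detector on a labelled benchmark."""
    preds = [detector(ctx, out) for ctx, out, _ in benchmark]
    labels = [label for _, _, label in benchmark]
    return balanced_accuracy(labels, preds)
```

Balanced accuracy is the natural choice here because the benchmarks have unequal label ratios (see Table 1), which plain accuracy would reward trivially.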
Figure 1: A visualisation of the GraphEval approach. First, the LLM output is fed into the KG construction prompt to produce
the KG depicted on the right. Next, each individual triple in the KG is fed into an out-of-the-box hallucination detection
method, such as an NLI model, and compared to the provided context for inconsistencies. Finally, any triples that are flagged
as inconsistent are returned to the user, along with the overall hallucination decision.
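The flow depicted in Figure 1 - build a KG from the output, score each triple against the context, flag triples above a threshold - can be sketched as a thin wrapper around two pluggable components. The names below are our own illustration under that reading, not the paper's implementation; `construct_kg` and `nli_prob` stand in for the LLM-based KG construction and the NLI model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (entity_1, relation, entity_2), all plain text

@dataclass
class GraphEvalResult:
    hallucinated: bool            # overall binary decision
    flagged_triples: List[Triple] # triples deemed not grounded in the context

def graph_eval(
    llm_output: str,
    context: str,
    construct_kg: Callable[[str], List[Triple]],  # stage 1: triple extraction
    nli_prob: Callable[[str, str], float],        # stage 2: P(hallucination | context, claim)
    threshold: float = 0.5,
) -> GraphEvalResult:
    """Flag the output as inconsistent if any triple exceeds the threshold."""
    triples = construct_kg(llm_output)
    flagged = []
    for e1, r, e2 in triples:
        claim = f"{e1} {r} {e2}"  # verbalise the triple for the NLI model
        if nli_prob(context, claim) > threshold:
            flagged.append((e1, r, e2))
    return GraphEvalResult(hallucinated=bool(flagged), flagged_triples=flagged)
```

Because the two stages are passed in as callables, the same harness can wrap any KG constructor or downstream detector, mirroring the pre-processing framing described in Section 4.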



4. GraphEval: Our evaluation method

GraphEval is based around the idea of representing information in a structured manner through KGs, and aims to address the lack of explainability of previous hallucination detection approaches, i.e. their inability to indicate which concrete pieces of information are inconsistent.

Formally, a KG is a collection of triples 𝒦𝒢 = {(e₁, r, e₂)} ⊆ ℰ × ℛ × ℰ, where ℰ and ℛ denote the sets of entities and relationships, respectively. In the GraphEval setting, both entities and relationships are simply pieces of text. We do not make use of common extensions to this simple setting, such as entity and relationship types, or attached properties.

Our GraphEval metric consists of a two-stage procedure:

    Stage 1 - Construct a KG from the LLM output to be evaluated.
    Stage 2 - Iterate through each of the triples in the KG, identifying whether they are factually consistent given the provided context.

The output is considered factually inconsistent if any of the triples in stage 2 are identified as not grounded in the context. The inconsistent triple(s) may also be returned to provide explainability by highlighting where in the output the hallucination(s) occurred. We provide a visualisation of this process in Figure 1 using a real example from one of the benchmarks described in Section 7.1.

Regarding stage 1, we provide a short review of LLM-based KG construction methods in Section 5, along with results from our implementation. For stage 2, we leverage existing techniques and employ an out-of-the-box NLI model. A benefit of this approach is that it gives us the opportunity to make a direct comparison between the performance of the raw NLI model and the model supplemented with our KG approach. In essence, our method is a pre-processing step, the output of which can be fed into any hallucination detection method; we choose NLI models as they are computationally cheap compared to LLM-based models, yet still achieve state-of-the-art results. By feeding each triple into an NLI model, along with the grounding context, we obtain a probability of containing a hallucination for each triple. Finally, we classify the example as inconsistent if at least one triple produces a probability greater than 0.5.

Similar approaches to ours have been proposed in recent literature. SummaC [14] also uses NLI-based models to detect inconsistencies in LLM-generated summaries. However, it distinguishes itself by segmenting both the context and the summary into their respective sentences, and then passing each context-summary sentence pair into the NLI model. This approach presents challenges in maintaining entity references across sentences; for instance, "John Doe" may only be referred to as "he" in another sentence. Similarly, FactScore [19] faces the same limitation. Our method circumvents this issue by organising entity relationships within a KG.

While FactGraph [21] also makes use of graph structures in its consistency evaluation process, the method differs from GraphEval in a few major respects. Firstly, their approach can only be applied to the summarisation problem, whereas GraphEval can easily be applied to various domains such as summarisation, question answering, common sense reasoning and many others. Secondly, FactGraph does not employ LLMs anywhere in its framework, missing out on recent advances in the field. Finally, their approach aims to decompose both the LLM output and the provided context into the underlying core semantic concepts and relations, before comparing the two graph structures. GraphEval, on the other hand, only represents the LLM output as a KG and aims to preserve as much of the information contained in the raw text as possible.

To summarise the advantages of GraphEval over previous methods:

    • We present a systematic way of checking all pieces of information contained in the LLM output.
    • Our method requires only one call to an LLM, in the KG construction phase, and does not require the (usually large) context documents to be input, as in all previous LLM-based metrics. This makes GraphEval less computationally expensive than other LLM-based methods.
    • Our method returns the specific triples that are not grounded in the context, providing explainability for the decision and identifying which section of the output should not be trusted. We leverage this feature for hallucination correction and propose a new method called GraphCorrect, described in Section 6.

       Benchmark    No. of Examples    Label Ratio    Avg Output len.    Avg Context len.
       SummEval          1,600            33.2%              63                 359
       QAGS-C              235            48.1%              49                 383
       QAGS-X              239            48.5%              18                 318

Table 1
Statistics relating to the evaluation benchmarks used. The label ratio is the percentage of examples labelled factually consistent. The average output and context lengths are the average number of words in each.


5. Construction of KGs using LLMs

Constructing KGs from unstructured textual data involves identifying the set of entities within the text and the relationships between them, resulting in a structured representation of the information the text contains. The process can be divided into three main stages:

    1. Entity detection - the process of identifying and extracting entities from text.
    2. Coreference resolution - the process of finding all expressions (also called mentions) in the text that refer to the same entity.
    3. Relation extraction - the process of identifying semantic relationships between entities.

Previously, researchers addressed each stage individually, but with the increasing power of LLMs, there has been a shift towards end-to-end systems. Kumar et al. [22] suggest employing two LLM components: one for named entity recognition and another for both relation classification and direction. Similarly, Grapher [23] utilizes a pre-trained LLM for entity extraction and relation prediction. However, these methods require users to provide the set of possible relations. More recent methods like PiVE [24] and AutoKG [25] use LLM prompting strategies for KG construction without additional user input.

The aforementioned methods do not make use of some of the emergent abilities of LLMs, such as in-context learning and chain-of-thought prompting. We decide to leverage these emergent abilities, and take a simple prompt engineering approach to our KG construction step. The techniques used can be summarised as follows:

    • Chain-of-thought (CoT) prompting. Providing intermediate reasoning steps in the prompt to enable LLMs to solve more complex tasks.
    • In-context learning. A method of prompt engineering where one provides several task demonstrations within the prompt, circumventing the need for fine-tuning.
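As an illustration only, a prompt combining these two techniques might be assembled and its response parsed as below. The demonstration, wording and triple format here are our own hypothetical choices, not the prompt given in the Appendix.

```python
from typing import List, Tuple

# Hypothetical few-shot demonstration (in-context learning).
DEMO = (
    "Text: Marie Curie won the Nobel Prize in Physics in 1903.\n"
    "Triples:\n"
    "(Marie Curie | won | Nobel Prize in Physics)\n"
    "(Marie Curie | won prize in | 1903)\n"
)

def build_kg_prompt(llm_output: str) -> str:
    """Combine a step-by-step (CoT-style) instruction with a demonstration."""
    return (
        "Extract a knowledge graph from the text. First list the entities, "
        "then resolve coreferences, then write one (entity | relation | entity) "
        "triple per line.\n\n" + DEMO + f"\nText: {llm_output}\nTriples:\n"
    )

def parse_triples(llm_response: str) -> List[Tuple[str, str, str]]:
    """Parse '(e1 | r | e2)' lines from the model's response, skipping other text."""
    triples = []
    for line in llm_response.splitlines():
        line = line.strip()
        if line.startswith("(") and line.endswith(")") and line.count("|") == 2:
            e1, r, e2 = (part.strip() for part in line[1:-1].split("|"))
            triples.append((e1, r, e2))
    return triples
```

A permissive parser of this kind also tolerates any intermediate reasoning the model emits before or between triples, which is what makes the CoT instruction safe to include.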
                                                                    The final prompt used in our experiments can be found
                                                                    in the Appendix. We highlight to the reader that our KG
                                                                    construction method is not the main contribution of our
The final prompt used in our experiments can be found in the Appendix. We highlight to the reader that our KG construction method is not the main contribution of our work, which is rather the application of KG construction to the hallucination detection problem. The major benefit of our KG construction approach is its ease of implementation with any LLM. Furthermore, it is less computationally intensive than methods like PiVE, which performs multiple iterations of improvements to the generated KG. Of course, users may conduct the KG construction stage of GraphEval using their method of choice; the experiments in this paper exhibit the capability of a simple prompting strategy.


6. GraphCorrect: Correction of hallucinations with GraphEval

While the primary focus of this work lies in hallucination detection, GraphEval's breakdown of LLM outputs into triples readily allows for its extension to correcting hallucinations with respect to the given context. To achieve this, we first identify all triples within the KG that are likely to contain hallucinations (i.e. those with a probability greater than 0.5, if any). We then employ the following two-step procedure on each identified triple:

    Step 1 - Input the given triple along with the context into an LLM to correct the potential hallucinations within the triple. This results in a newly generated, corrected triple.
    Step 2 - Input the identified triple, its corrected counterpart and the initial LLM output. Selectively replace the information from the original (hallucination-containing) triple with the information from the new triple in the initial LLM output.

We name this LLM hallucination correction method GraphCorrect. The final prompts used in our experiments for steps 1 and 2 can be found in Appendices B and C respectively. This systematic approach to hallucination correction offers several benefits. First, it tackles each identified hallucination separately, increasing the chances of all perceived hallucinations being corrected. Furthermore, it offers the advantage of exclusively altering the segments of the original text that are suspected to contain a hallucination, leaving other elements untouched and ensuring overall high similarity with the original text. Finally, breaking down the entire process into intermediate steps ensures that the original context and the initial LLM output never undergo simultaneous processing within an LLM. This safeguards against both the addition of extra information to, and the loss of information from, the LLM output.


7. Experiments

7.1. Benchmarks

We conducted two sets of experiments: one focusing on hallucination detection to highlight GraphEval's performance, and another on hallucination correction to showcase the advantages of GraphCorrect. For both scenarios, we utilized the SummEval [26], QAGS-C and QAGS-X [27] benchmarks - currently the most prevalent benchmarks in the relevant academic literature. All three are concerned with detecting hallucinations in LLM-generated summaries and are human-annotated for factual consistency with respect to the grounding context. Table 1 contains some statistics pertaining to each of these datasets.

SummEval The SummEval dataset consists of human evaluations of 16 summarization model outputs on 100 articles from the CNN/DailyMail dataset [28]. Each summary is labelled on a Likert scale from 1-5 on 4 categories: consistency, coherence, fluency and relevance. We follow the TRUE benchmark [13] in taking the consistency scores and mapping a score of 5 to fully consistent, and anything lower to inconsistent.

QAGS The QAGS-C and QAGS-X datasets are built from the CNN/DailyMail and the XSum [29] datasets, respectively. The human annotators examined the summaries one sentence at a time, and determined the factual consistency of each sentence by comparing it to the original article. Three annotators assessed each sentence and the majority decision was recorded. Again, we follow the TRUE benchmark in considering a summary to be factually consistent if and only if all of its sentences are considered consistent.

7.2. NLI models in GraphEval

As mentioned in Section 4, we employ NLI models to perform the second stage of GraphEval - checking the consistency of each individual triple with respect to the context. We conduct experiments using the three most popular NLI-based hallucination detection models available on HuggingFace¹.

¹ https://huggingface.co
² https://huggingface.co/vectara/hallucination_evaluation_model

HHEM Based on the DeBERTaV3 model [30] and initially trained on NLI data, the hallucination evaluation model created by Vectara² is further fine-tuned on datasets annotated for consistency. The datasets used
for fine tuning were: FEVER [31], Vitamin C [32] and                                      SummEval     QAGS-C     QAGS-X
PAWS [33]. This model is considerably smaller than the          HHEM                         66.0        63.5       75.5
following two models, requiring only 738 MB of memory,          HHEM + GraphEval             71.5        72.2       75.2
and thus has a significantly shorter run-time.                  TRUE                         61.3        61.8       72.6
                                                                TRUE + GraphEval             72.4        71.7       73.9
                                                                TrueTeacher                  74.9        75.6       79.0
TRUE The TRUE model is based on a T5-XXL model                  TrueTeacher + GraphEval      79.2        78.1       79.6
[34] and is trained similarly to the model described in
the TRUE paper [13]. Instead of the ANLI dataset used          Table 2
in that paper, this model is trained on the same datasets      Balanced accuracy scores for hallucination detection of NLI
as HHEM, plus the following: SNLI [35], MNLI [36] and          models (HHEM, TRUE, TrueTeacher) and their GraphEval
Scitail [37]. This model requires 45.5 GB of memory.           counterparts on the SummEval, QAGS-C and QAGS-X bench-
                                                               marks.

TrueTeacher Gekhman et al. [15] leverage the ability of LLMs to evaluate hallucinations by generating synthetic data through annotating model-generated summaries. They then use this synthetic data to further fine-tune the model from [13], leading to state-of-the-art performance on the TRUE benchmark. This model is the same size as the TRUE model.

7.3. Experimental settings

In all experiments in this study requiring an LLM, we use Claude 2 3, an LLM from Anthropic, through the Amazon Bedrock API 4. We use the default settings for the LLM: temperature = 1, top_p = 1, top_k = 250. We also refer the reader to the Appendix for the prompts used in this work.

7.4. Results

7.4.1. Hallucination detection with GraphEval

We present our results of hallucination detection for the three NLI models, and their GraphEval counterparts, in Table 2. We report balanced accuracy as our evaluation metric, which corrects for the class imbalance in the SummEval benchmark. When using the NLI model directly, we classify an example as containing a hallucination if the NLI model returns a probability of more than 0.5. When combining the NLI model with GraphEval, we classify an example as containing a hallucination if at least one triple fed to the NLI model returns a probability of more than 0.5. We see that adding the GraphEval pre-processing step to each of the NLI models almost always improves the balanced accuracy score, sometimes by a considerable amount, such as the results for the SummEval and QAGS-C benchmarks in Table 2. On average (weighting by the number of samples in each dataset), adding the GraphEval pre-processing step improves the balanced accuracy by 6.2 (SE = 1.3).

We hypothesise that the negligible difference between the base NLI model and the model supplemented with GraphEval on the QAGS-X dataset is due to the average length of the generated text (only 18 words, compared with 49 and 63 for QAGS-C and SummEval respectively, see Table 1). This highlights an important aspect of where the most value can be found in our method. When the LLM output is very short, it is less likely to contain multiple facts that need to be checked for consistency (which can easily be done without the use of a KG), and the intricacies of the short sentence might even be lost in the KG construction phase. On the other hand, when the LLM output is very long, current methods struggle to test each individual fact against the context, and this is where GraphEval thrives.

It should be noted that even when the results for GraphEval are comparable to the baseline methods, the benefit of using GraphEval is the identification of the specific triple(s) that are inconsistent with the provided context.

7.4.2. Hallucination correction with GraphCorrect

Identifying the particular triple(s) likely to harbor a hallucination enables straightforward correction using GraphCorrect, as described in Section 6. For each of the evaluation frameworks proposed here (HHEM + GraphEval, TRUE + GraphEval, and TrueTeacher + GraphEval), we compared GraphCorrect to a basic prompting strategy for hallucination correction, serving as a baseline. The prompt used in this baseline approach, referred to as the Direct Prompt henceforth, is provided in Appendix D.

For each framework, we initially identify hallucinations, correct only the LLM outputs suspected of containing hallucinations using either GraphCorrect or Direct Prompt, and then reapply the evaluation framework to detect hallucinations in the corrected LLM outputs. Note that this procedure only allows us to measure what we presume to be corrected hallucinations, given the potential for errors in the evaluation frameworks utilized here.
3 https://www.anthropic.com/news/claude-2
4 https://aws.amazon.com/bedrock/claude/
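The detect-correct-re-evaluate procedure, and the "believed corrected hallucinations" percentage it yields, can be sketched as below. `detect` and `correct` are hypothetical stand-ins for an evaluation framework (e.g. TRUE + GraphEval) and a correction strategy (GraphCorrect or Direct Prompt); the sketch shows only the bookkeeping, not the models.

```python
def believed_corrected_rate(examples, detect, correct):
    """Run correction only on (context, output) pairs flagged as
    hallucinated, re-apply the detector, and return the percentage of
    flagged outputs that are no longer flagged afterwards."""
    flagged = [(ctx, out) for ctx, out in examples if detect(ctx, out)]
    if not flagged:
        return 0.0  # nothing was flagged, so nothing to correct
    still_flagged = sum(
        1 for ctx, out in flagged if detect(ctx, correct(ctx, out))
    )
    return 100.0 * (len(flagged) - still_flagged) / len(flagged)
```

As noted in the text, this only measures presumed corrections: any error in the detector propagates into the rate.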
                                                  ROUGE-1                         ROUGE-2                        ROUGE-L
 Detection                 Dataset
                                      Direct Prompt   GraphCorrect    Direct Prompt   GraphCorrect   Direct Prompt   GraphCorrect
                           SummEval       0.827             0.915         0.772             0.879        0.796             0.910
 HHEM + GraphEval          QAGS-C         0.800             0.893         0.735             0.841        0.769             0.885
                           QAGS-X         0.649             0.821         0.495             0.734        0.606             0.815
                           SummEval       0.781             0.880         0.707             0.833        0.746             0.871
 TRUE + GraphEval          QAGS-C         0.840             0.894         0.780             0.848        0.808             0.886
                           QAGS-X         0.651             0.805         0.505             0.706        0.613             0.795
                           SummEval       0.781             0.884         0.703             0.839        0.737             0.876
 TrueTeacher + GraphEval   QAGS-C         0.809             0.889         0.743             0.837        0.781             0.881
                           QAGS-X         0.643             0.797         0.486             0.694        0.598             0.784


Table 3
Average ROUGE-1, ROUGE-2 and ROUGE-L scores measuring similarity between original and corrected summaries using
Direct Prompt and GraphCorrect across different datasets and hallucination detection frameworks.

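For reference, the similarity metric behind Table 3 can be illustrated with a stripped-down ROUGE-1 F1 over unigram overlap; the reported scores use the standard ROUGE implementation, which additionally handles stemming and the bigram and longest-common-subsequence variants (ROUGE-2, ROUGE-L).

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```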


We report the percentage of believed corrected hallucinations in Table 4. A score of 0% suggests no corrected hallucinations according to the given framework, while a score of 100% indicates correction of all hallucinations as per the given framework. GraphCorrect outperforms the prompting strategy proposed here, correcting significantly more hallucinations on all tasks apart from two related to the QAGS-X dataset. As on the hallucination detection task, we hypothesise these results are correlated with the average length of the text, with GraphCorrect bringing most value in longer texts with a more complex structure to unravel and correct.

Additionally, as previously stated, GraphCorrect offers the advantage of only modifying the segments of text in the LLM outputs susceptible to hallucinations, while leaving other sections unaltered, thereby maintaining high overall similarity with the original text. This characteristic is illustrated in Table 3 by assessing the ROUGE-1, ROUGE-2, and ROUGE-L metrics between the original summaries and the corrected versions for both GraphCorrect and Direct Prompt across all experimental scenarios examined in this study. GraphCorrect systematically generates texts that are closer in similarity to the original LLM outputs compared to its counterpart.

                                       Method for Correction
 Detection & Evaluation    Dataset    Direct Prompt   GraphCorrect
                           SummEval       48.6            55.1
 HHEM + GraphEval          QAGS-C         38.5            58.7
                           QAGS-X         63.2            69.5
                           SummEval       49.6            59.5
 TRUE + GraphEval          QAGS-C         42.7            53.7
                           QAGS-X         70.8            66.7
                           SummEval       53.1            59.8
 TrueTeacher + GraphEval   QAGS-C         47.1            59.6
                           QAGS-X         71.1            69.3

Table 4
Percentage of believed corrected hallucinations using a direct prompting strategy and GraphCorrect on the SummEval, QAGS-C and QAGS-X benchmarks. The hallucinations were first detected by HHEM + GraphEval, TRUE + GraphEval and TrueTeacher + GraphEval respectively, and then corrections were evaluated by the same metric.


8. Discussion

Our work focuses on detection of hallucinations in closed-domain tasks, where we are interested only in consistency with respect to the provided context. The GraphEval framework could be extended to open-domain hallucination detection by employing agents, as in AutoKG [25], to first retrieve relevant external sources as the grounding information to check against.

We expect that in the near future, more research will be conducted on the construction of KGs from unstructured text, which will provide improvements to the first stage of our procedure and ultimately the evaluation performance. Even as LLMs alone become more powerful, this will continue to contribute to improvements in GraphEval's performance.

We observe that, in the knowledge graph construction phase of our procedure, it is possible that some information loss may occur. However, as shown by the results in Section 7.4, our method rarely leads to a reduction in balanced accuracy. Furthermore, when it is comparable to the baseline methods, we have the added explainability of identifying the specific triples where the hallucination has occurred.

We believe our hallucination correction framework (GraphCorrect) shows promise and is an interesting avenue for future work. However, the effectiveness of the approach described in this work should be assessed manually, rather than relying on the convoluted use of hallucination evaluation frameworks (which only yield measurements of believed corrected hallucinations).


9. Conclusion

We introduce GraphEval, a simple and effective pre-processing step for improving the explainability and performance of LLM hallucination detection metrics. Our method leverages LLMs' ability to extract information from unstructured text and construct knowledge graphs, whose triples can be fed into out-of-the-box hallucination detection methods.

We demonstrate that GraphEval in conjunction with state-of-the-art NLI models leads to an average improvement in balanced accuracy of 6.2 (SE = 1.3) on three popular hallucination benchmarks. Furthermore, our method indicates which triples, in the KG representation of the LLM output, are inconsistent. To the best of our knowledge, this is the first application of KGs to an LLM-based hallucination evaluation framework, and we believe the success of GraphEval will only grow as KG construction methods also improve.

Finally, we examined the issue of hallucination correction and showed that GraphCorrect can effectively address the majority of hallucinations found in LLM outputs while maintaining extremely high similarity with the original texts.


References

 [1] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
 [2] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
 [3] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
 [4] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, C. Zhu, G-eval: NLG evaluation using gpt-4 with better human alignment, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 2511–2522. URL: https://aclanthology.org/2023.emnlp-main.153. doi:10.18653/v1/2023.emnlp-main.153.
 [5] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive nlp tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
 [6] L. Luo, Y.-F. Li, G. Haffari, S. Pan, Reasoning on graphs: Faithful and interpretable large language model reasoning, arXiv preprint arXiv:2310.01061 (2023).
 [7] L. Yang, H. Chen, Z. Li, X. Ding, X. Wu, Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling, IEEE Transactions on Knowledge and Data Engineering (2024).
 [8] G. Agrawal, T. Kumarage, Z. Alghamdi, H. Liu, Can knowledge graphs reduce hallucinations in llms?: A survey, 2024. arXiv:2311.07914.
 [9] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1906–1919. URL: https://aclanthology.org/2020.acl-main.173. doi:10.18653/v1/2020.acl-main.173.
[10] O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, O. Abend, q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7856–7870. URL: https://aclanthology.org/2021.emnlp-main.619. doi:10.18653/v1/2021.emnlp-main.619.
[11] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with bert, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[13] O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, Y. Matias, TRUE: Re-evaluating factual consistency evaluation, in: S. Feng, H. Wan, C. Yuan, H. Yu (Eds.), Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 161–175. URL: https://aclanthology.org/2022.dialdoc-1.19. doi:10.18653/v1/2022.dialdoc-1.19.
[14] P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, SummaC: Re-visiting NLI-based models for inconsistency detection in summarization, Transactions of the Association for Computational Linguistics 10 (2022) 163–177. URL: https://aclanthology.org/2022.tacl-1.10. doi:10.1162/tacl_a_00453.
[15] Z. Gekhman, J. Herzig, R. Aharoni, C. Elkind, I. Szpektor, Trueteacher: Learning factual consistency evaluation with large language models, 2023. arXiv:2305.11171.
[16] P. Manakul, A. Liusie, M. J. Gales, Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint arXiv:2303.08896 (2023).
[17] N. Mündler, J. He, S. Jenko, M. Vechev, Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=EmQSOi1X2f.
[18] J. Fu, S.-K. Ng, Z. Jiang, P. Liu, Gptscore: Evaluate as you desire, 2023. arXiv:2302.04166.
[19] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, Factscore: Fine-grained atomic evaluation of factual precision in long form text generation, arXiv preprint arXiv:2305.14251 (2023).
[20] J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, et al., Long-form factuality in large language models, arXiv preprint arXiv:2403.18802 (2024).
[21] L. F. R. Ribeiro, M. Liu, I. Gurevych, M. Dreyer, M. Bansal, Factgraph: Evaluating factuality in summarization with semantic graph representations, 2022. arXiv:2204.06508.
[22] A. Kumar, A. Pandey, R. Gadia, M. Mishra, Building knowledge graph using pre-trained language model for learning entity-aware relationships, in: 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), 2020, pp. 310–315. doi:10.1109/GUCON48875.2020.9231227.
[23] I. Melnyk, P. Dognin, P. Das, Grapher: Multi-stage knowledge graph construction using pretrained language models, in: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL: https://openreview.net/forum?id=N2CFXG8-pRd.
[24] J. Han, N. Collier, W. Buntine, E. Shareghi, Pive: Prompting with iterative verification improving graph-based generative capability of llms, arXiv preprint arXiv:2305.12392 (2023).
[25] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities, arXiv preprint arXiv:2305.13168 (2023).
[26] A. R. Fabbri, W. Kryscinski, B. McCann, R. Socher, D. R. Radev, Summeval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2020) 391–409. URL: https://api.semanticscholar.org/CorpusID:220768873.
[27] A. Wang, K. Cho, M. Lewis, Asking and answering questions to evaluate the factual consistency of summaries, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5008–5020. URL: https://aclanthology.org/2020.acl-main.450. doi:10.18653/v1/2020.acl-main.450.
[28] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, Advances in Neural Information Processing Systems 28 (2015).
[29] S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1797–1807. URL: https://aclanthology.org/D18-1206. doi:10.18653/v1/D18-1206.
[30] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.
[31] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), 2018.
[32] T. Schuster, A. Fisch, R. Barzilay, Get your vitamin C! robust fact verification with contrastive evidence, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 624–643. URL: https://aclanthology.org/2021.naacl-main.52. doi:10.18653/v1/2021.naacl-main.52.
[33] Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase Adversaries from Word Scrambling, in: Proc. of NAACL, 2019.
[34] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[35] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: https://aclanthology.org/D15-1075. doi:10.18653/v1/D15-1075.
[36] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. URL: https://aclanthology.org/N18-1101. doi:10.18653/v1/N18-1101.
[37] T. Khot, A. Sabharwal, P. Clark, SciTail: A textual entailment dataset from science question answering, in: AAAI, 2018.


A. KG Construction Prompt

("system",
 """
 You are an expert at extracting information in structured formats to build a knowledge graph.
 Step 1 - Entity detection: Identify all entities in the raw text. Make sure not to miss any out. Entities should be basic and simple, they are akin to Wikipedia nodes.
 Step 2 - Coreference resolution: Find all expressions in the text that refer to the same entity. Make sure entities are not duplicated. In particular do not include entities that are more specific versions of themselves, e.g. "a detailed view of jupiter's atmosphere" and "jupiter's atmosphere", only include the most specific version of the entity.
 Step 3 - Relation extraction: Identify semantic relationships between the entities you have identified.

 Format: Return the knowledge graph as a list of triples, i.e. ["entity 1", "relation 1-2", "entity 2"], in Python code.
 """,
),
("human",
 "Use the given format to extract information from the following input: {input}. Skip the preamble and output the result as a list within <python></python> tags.",
),
("human",
 """Important Tips:
 1. Make sure all information is included in the knowledge graph.
 2. Each triple must only contain three strings! None of the strings should be empty.
 3. Do not split up related information into separate triples because this could change the meaning.
 4. Make sure all brackets and quotation marks are matched.
 5. Before adding a triple to the knowledge graph, check the concatenated triple makes sense as a sentence. If not, discard it.
 """,
),
("human",
 """Here are some example input and output pairs.

 ## Example 1.
 Input:
 "The Walt Disney Company, commonly known as Disney, is an American multinational mass media and entertainment conglomerate that is headquartered at the Walt Disney Studios complex in Burbank, California."
 Output:
 <python>
 [["The Walt Disney Company", "headquartered at", "Walt Disney Studios complex in Burbank, California"],
 ["The Walt Disney Company", "commonly known as", "Disney"],
 ["The Walt Disney Company", "instance of", "American multinational mass media and entertainment conglomerate"]]
 </python>

 ## Example 2.
 Input:
 "Amanda Jackson was born in Springfield, Ohio, USA on June 1, 1985. She was a basketball player for the U.S. women's team."

 [["Darius Van Arman", "attended", "Gonzaga College High School"],
 ["Darius Van Arman", "instance of", "human being"]]
 </python>

 ## Example 4.
 Input: "Italy had 3.6x times more cases of coronavirus than China."
 Output:
 <python>
 [["Italy", "had 3.6x times more cases of coronavirus than", "China"]]
 </python>
 """,
),


B. Hallucination correction (step 1)

 """
 You are an expert at extracting information in structured formats from text.
 The following triple contains factually incorrect information. Correct it based on the provided context.
 Important Tips:
 1. A triple is defined as ["entity 1", "relation 1-2", "
Output :                                                                     entity 2"].
                                                             2 . A t r i p l e must o n l y c o n t a i n
[ [ " Amanda J a c k s o n " , " born i n " , "                              t h r e e s t r i n g s ! None o f t h e
      S p r i n g f i e l d , Ohio , USA " ] ,                               s t r i n g s s h o u l d be empty .
[ " Amanda J a c k s o n " , " born on " , "                          3 . The c o n c a t e n a t e d t r i p l e must
      June 1 , 1 9 8 5 " ] ,                                                 make s e n s e a s a s e n t e n c e .
[ " Amanda J a c k s o n " , " o c c u p a t i o n " ,                4 . Only r e t u r n t h e c o r r e c t e d
      " basketball player "] ,                                               t r i p l e , nothing e l s e .
[ " Amanda J a c k s o n " , " p l a y e d f o r " ,
      " U . S . women ’ s b a s k e t b a l l team            < t r i p l e >{ t r i p l e } 
      " ] ]                                        { context } 

## Example 3 .                                                Remember , i t i s i m p o r t a n t t h a t you
Input :                                                           only return the c o r r e c t e d t r i p l e .
" Music e x e c u t i v e D a r i u s Van Arman               """
         was born i n P e n n s y l v a n i a . He
          a t t e n d e d Gonzaga C o l l e g e
       High S c h o o l and i s a human
       being . "                                           C. Hallucination correction (step
Output :

                                                              2)
[ [ " D a r i u s Van Arman " , "                             """
       o c c u p a t i o n " , " Music e x e c u t i v e      In the following context , r e p l a c e the
       "] ,                                                        information of the old t r i p l e
[ " D a r i u s Van Arman " , " born i n " , "                    w i t h t h e i n f o r m a t i o n o f t h e new
       Pennsylvania " ] ,                                         one .
   Do n o t make any o t h e r m o d i f i c a t i o n t o
             the context .
   Only r e t u r n t h e new c o n t e x t .
   < c o n t e x t > { summary } < / c o n t e x t >
   < o l d _ t r i p l e >{ o l d _ t r i p l e } 
   { new_triple } 
   """
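The two correction prompts above are used sequentially: step 1 asks the LLM for a corrected triple, and step 2 splices that triple back into the summary. A minimal sketch of how the templates might be filled (not from the paper's code release; `build_correction_prompts` is a hypothetical helper, and the actual chat-completion call is omitted):

```python
# Condensed stand-ins for the step-1 and step-2 templates listed above.
STEP1_TEMPLATE = (
    "The following triple contains factually incorrect information.\n"
    "Correct it based on the provided context.\n"
    "Only return the corrected triple, nothing else.\n"
    "<triple>{triple}</triple>\n"
    "<context>{context}</context>"
)

STEP2_TEMPLATE = (
    "In the following context, replace the information of the old triple "
    "with the information of the new one.\n"
    "Do not make any other modification to the context.\n"
    "Only return the new context.\n"
    "<context>{summary}</context>\n"
    "<old_triple>{old_triple}</old_triple>\n"
    "<new_triple>{new_triple}</new_triple>"
)


def build_correction_prompts(triple, context, summary, corrected_triple):
    """Fill both templates. In practice the step-2 prompt is only built
    after the LLM has returned the corrected triple from step 1."""
    step1 = STEP1_TEMPLATE.format(triple=triple, context=context)
    step2 = STEP2_TEMPLATE.format(
        summary=summary, old_triple=triple, new_triple=corrected_triple
    )
    return step1, step2
```
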



D. Hallucination correction without a KG

   """
   The following summary contains factually incorrect information.
   Correct it based on the context, but don't change other parts of the summary.
   Only return the corrected summary, nothing else.
   <summary>{summary}</summary>
   <context>{context}</context>
   Remember, do minimal changes to the original summary, don't make it longer and keep as much of it as you can exactly the same.
   """
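The KG-construction prompt at the start of this appendix instructs the model to return its triples inside `<python>` tags, and its tips require each triple to be exactly three non-empty strings. A minimal post-processing sketch under those assumptions (`parse_triples` is a hypothetical helper, not the paper's code):

```python
import ast
import re


def parse_triples(llm_output: str):
    """Extract the <python>...</python> block from the model's reply and
    keep only triples that satisfy the prompt's tips: a list of exactly
    three non-empty strings."""
    match = re.search(r"<python>(.*?)</python>", llm_output, re.DOTALL)
    if match is None:
        return []
    try:
        triples = ast.literal_eval(match.group(1).strip())
    except (ValueError, SyntaxError):
        return []
    return [
        t
        for t in triples
        if isinstance(t, list)
        and len(t) == 3
        and all(isinstance(s, str) and s.strip() for s in t)
    ]
```

Using `ast.literal_eval` rather than `eval` keeps parsing safe: it only accepts Python literals, so arbitrary code in a malformed reply cannot execute, and any unparseable block is simply discarded.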