<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hannah Sansford</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicholas Richardson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hermina Petric Maretic</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juba Nait Saada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Knowledge Graphs, Hallucination Detection, Hallucination Correction</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bristol</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge, are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions and to systematically check all pieces of information in the response, and they are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Hallucination Detection</kwd>
        <kwd>Hallucination Correction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the size and power of LLMs have drastically increased over recent years, so has the number of potential applications. Arguably, one of the biggest blockers to implementing these models in practice is their tendency to hallucinate - returning seemingly plausible, but untrue, responses. Here, we focus on the problem of detecting hallucinations with respect to the provided context that the LLM should use as its source of knowledge; detecting hallucinations that have deviated from the LLM's original training data is out of the scope of this work. In applications where certainty in a response is critical, such as medical diagnosis, the existence of hallucinations that arise from a given context is especially limiting. Therefore, it is of utmost importance to develop successful methods to detect these hallucinations and, when it is of interest to address or correct them, provide clarity on which aspect of the response is likely a hallucination.</p>
      <p>The importance of this issue is reflected in the amount of research being published on the topic - see Ji et al. [1] for a recent survey of this area.</p>
      <p>Performing evaluation on natural language is a challenging task that researchers have been interested in long before hallucinations were at the forefront of the problem. Methods have evolved a great deal from traditional N-gram based metrics, such as BLEU [2] and ROUGE [3], to much more intricate LLM-based evaluation metrics with user-defined evaluation criteria, such as G-Eval [4]. More recently, techniques to mitigate the prevalence of hallucinations in generated outputs leveraging Retrieval Augmented Generation (RAG) [5] and reasoning on knowledge graphs (KGs) [6, 7] have been proposed. The former suggested the concatenation of relevant textual data into the prompt to ground the LLM response, while the latter enforced a more robust reasoning process through providing grounding information in KG structures [8]. As successful as these approaches have been, they do not fully circumvent the need to evaluate LLM outputs.</p>
      <p>Inspired by current research harnessing KGs to provide grounded LLM responses, we propose GraphEval - a hallucination detection framework based on the representation of information in KG structures. To the best of our knowledge, we are the first to apply KGs to an LLM-based hallucination evaluation framework, and in doing so we provide a higher level of insight into where in the output a hallucination has occurred than any previous metrics. Additionally, we demonstrate how using our method in conjunction with current state-of-the-art hallucination detection methods improves their classification accuracy on various benchmarks. Finally, we consider the problem of hallucination correction and we introduce GraphCorrect, showcasing how GraphEval can effectively be extended to rectify a significant proportion of hallucinations present in LLM outputs.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Problem statement</title>
      <p>In this work we focus on the closed-domain hallucination detection problem: the situation where we have a textual output from an LLM which is generated using some grounding context included in the prompt. In this case, the goal is for the LLM to use the provided context as its only source of knowledge. The open-domain problem, which is with respect to all factual knowledge in the world, is not explored here but is briefly discussed in Section 8.</p>
      <p>We consider hallucination detection to be a binary classification problem, with 0 corresponding to the LLM output being factually consistent given the provided context, and 1 corresponding to the output containing at least one inconsistency. We can assess hallucination evaluation methods using a benchmarking dataset containing ground-truth labels (usually human-annotated) to determine whether a given context-output pair contains factual inconsistencies. Throughout the paper we use the terms factual, consistent, grounded and faithful interchangeably to mean containing no hallucinations with respect to the context.</p>
      <p>Finally, we explore the problem of hallucination correction, wherein we do not use any directly labeled dataset. Instead, we utilize hallucination detection frameworks to first identify hallucinations to correct, and subsequently repurpose them to evaluate the corrected outputs. It is important to note that our exploration of hallucination correction only serves as an extension to our evaluation framework and is not the primary focus of this study.</p>
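      <p>Concretely, any hallucination evaluation method can be viewed as a function from a context-output pair to a binary label, and scored against the human annotations of a benchmark. The sketch below illustrates this framing; the names are illustrative placeholders rather than an existing implementation.</p>
      <preformat>
# Illustrative framing of closed-domain hallucination detection as binary
# classification (names are placeholders, not an existing API).
from typing import Callable, Iterable, List, Tuple

# A detector returns 1 if it believes the output contains at least one
# inconsistency with respect to the context, and 0 otherwise.
Detector = Callable[[str, str], int]

def run_benchmark(detector: Detector,
                  examples: Iterable[Tuple[str, str, int]]) -> Tuple[List[int], List[int]]:
    """examples yields (context, llm_output, human_label), with label 1 = hallucinated.
    Returns (ground_truth_labels, predictions) for downstream scoring (Section 7.4)."""
    y_true, y_pred = [], []
    for context, llm_output, label in examples:
        y_true.append(label)
        y_pred.append(detector(context, llm_output))
    return y_true, y_pred
      </preformat>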
    </sec>
    <sec id="sec-2">
      <title>3. Related work</title>
      <p>Historically, N-gram based metrics such as BLEU [2]
and ROUGE [3] have been the most widely used metrics
for natural language evaluation. However, these
metrics have been shown to perform poorly at the task of
factual inconsistency detection [9, 10]. In more recent
years, embedding-based metrics such as BERTScore [11]
have been favoured over N-gram based metrics. These
methods measure the similarity between two pieces of
text by comparing the contextualised embedding from a
transformer model, such as BERT [12].</p>
      <p>Both N-gram and embedding-based metrics base their scores on how similar the text to be evaluated is to some reference text. This similarity objective often fails to capture the intricacies of the hallucination detection problem. Therefore, researchers have begun to develop new methods that are more acutely tuned to detecting inconsistencies between an LLM output and its grounding context.</p>
      <p>Maynez et al. [9] identified the crossover between the textual entailment score in NLI tasks and consistency prediction. This was a breakthrough at the time, producing higher correlation with faithfulness than any previous metrics, and paved the way for further research that capitalised on NLI data and models [13, 14, 15].</p>
      <p>Very recently, attention has turned to leveraging LLMs themselves to evaluate the consistency of LLM outputs. SelfCheckGPT [16] and ChatProtect [17] approach the problem by considering the self-consistency within sampled outputs. Since they require the generation of a large number of responses from the LLM, many consider these methods prohibitively computationally expensive.</p>
      <p>Other LLM-based hallucination evaluation methods, such as G-Eval [4] and GPTScore [18], employ a different LLM for evaluation than the one used to generate the LLM response that needs to be evaluated. G-Eval allows user-defined evaluation criteria and uses automated chain-of-thought prompting and form-filling to assign scores. GPTScore treats the task as conditional generation, leveraging models like GPT-3 to assign higher probabilities to high-quality outputs by prepending evaluation instructions to the LLM prompt. Unlike NLI models trained on binary classification data, these methods produce scores that are harder to interpret as probabilities and often require additional steps for inconsistency classification.</p>
      <p>Recent hallucination detection methods, such as FactScore [19] and SAFE [20], utilize large language models to break down the response into atomic or individual facts for evaluation. These approaches have enabled precise identification of where hallucinations occur within the LLM response. Each fact is automatically verified against a comprehensive knowledge source like Wikipedia or scientific literature in the case of FactScore, or through the use of a search engine in the case of SAFE.</p>
      <p>FactGraph [21] is the only factuality evaluation method we are aware of that utilises graph-like structures. The method is focused solely on the detection of inconsistencies in the summarization problem, decomposing both the summary and the supporting documents into what they call structured meaning representations (MRs). These MRs describe the core semantic concepts and relations, which the authors claim to be more suitable for factuality evaluation than the raw text.</p>
    </sec>
    <sec id="sec-3">
      <title>4. GraphEval: Our evaluation method</title>
      <p>GraphEval is based around the idea of representing information in a structured manner through KGs, and aims to address the lack of explainability of previous hallucination detection approaches, i.e. which concrete pieces of information in particular are inconsistent.</p>
      <p>Formally, a KG is a collection of triples {(e<sub>1</sub>, r, e<sub>2</sub>)} ⊆ ℰ × ℛ × ℰ, where ℰ and ℛ denote the set of entities and relationships, respectively. In the GraphEval setting, both entities and relationships are simply pieces of text. We do not make use of common extensions to this simple setting, such as entity and relationship types, or attached properties.</p>
      <p>Our GraphEval metric consists of a two-stage procedure:</p>
      <p>Stage 1 - Construct a KG from the LLM output to be evaluated.</p>
      <p>Stage 2 - Iterate through each of the triples in the KG, identifying whether they are factually consistent given the provided context.</p>
      <p>The output is considered factually inconsistent if any of the triples in stage 2 are identified as not grounded in the context. The inconsistent triple(s) may also be returned to provide explainability by highlighting where in the output the hallucination(s) has occurred. We provide a visualisation of this process in Figure 1 using a real example from one of the benchmarks described in Section 7.1.</p>
      <p>Regarding stage 1, we provide a short review of LLM-based KG construction methods in Section 5, along with results from our implementation. For stage 2, we leverage existing techniques and employ an out-of-the-box NLI model for this task. A benefit of this approach is that it gives us the opportunity to make a direct comparison between the performance of the raw NLI model and the model supplemented with our KG approach. In essence, our method is a pre-processing step, the output of which can be fed into any hallucination detection method; we choose NLI models as they are computationally cheap compared to LLM-based models, yet still achieve state-of-the-art results. By feeding each triple into an NLI model, along with the grounding context, we obtain a probability of containing a hallucination for each triple. Finally, we classify the example as inconsistent if at least one triple produces a probability greater than 0.5.</p>
      <p>Similar approaches to ours have been proposed in recent literature. SummaC [14] also uses NLI-based models to detect inconsistencies in LLM-generated summaries. However, it distinguishes itself by segmenting both the context and the summary into their respective sentences, and then by passing each context-summary pair into the NLI model. This approach presents challenges in maintaining entity references across sentences; for instance, "John Doe" may only be referred to as "he" in another sentence. Similarly, FactScore [19] faces the same limitation. Our method circumvents this issue by organising entity relationships with a KG.</p>
      <p>While FactGraph [21] also makes use of graph structures in their consistency evaluation process, the method differs from GraphEval in a few major respects. Firstly, their approach can only be applied to the summarisation problem, whereas GraphEval can easily be applied to various domains such as Summarisation, Question Answering, Common Sense Reasoning and many others. Secondly, FactGraph does not employ LLMs anywhere in their framework, missing out on recent advances in the field. Finally, their approach aims to decompose both the LLM output and the provided context into the underlying core semantic concepts and relations, before comparing each of the graph structures. GraphEval, on the other hand, only represents the LLM output as a KG and aims to preserve as much of the information contained in the raw text as possible.</p>
      <p>To summarise the advantages of GraphEval over previous methods:</p>
      <p>• We present a systematic way of checking all pieces of information contained in the LLM output.</p>
      <p>• Our method only requires one call to an LLM, in the KG construction phase, and does not require the (usually) large context documents to be input, as in all previous LLM-based metrics. This makes GraphEval less computationally expensive than other LLM-based methods.</p>
      <p>• Our method returns the specific triples that are not grounded in the context, providing explainability for the decision and identifying which section of the output should not be trusted. We leverage this feature for hallucination correction and propose a new method called GraphCorrect, described in Section 6.</p>
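      <p>The decision rule above can be summarised in a short sketch. Here construct_kg stands for the prompt-based KG construction of Section 5 and nli_inconsistency_prob for one of the NLI models of Section 7.2; both are placeholder callables rather than the exact implementation used in our experiments.</p>
      <preformat>
# Sketch of the two-stage GraphEval decision rule.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # ("entity 1", "relation 1-2", "entity 2")

def graph_eval(llm_output: str,
               context: str,
               construct_kg: Callable[[str], List[Triple]],
               nli_inconsistency_prob: Callable[[str, str], float],
               threshold: float = 0.5) -> Tuple[int, List[Triple]]:
    """Return (label, flagged_triples); label is 1 if any triple looks hallucinated."""
    triples = construct_kg(llm_output)                 # Stage 1: KG from the LLM output
    flagged = []
    for e1, rel, e2 in triples:                        # Stage 2: check each triple
        claim = f"{e1} {rel} {e2}"                     # concatenated triple read as a sentence
        if nli_inconsistency_prob(context, claim) > threshold:
            flagged.append((e1, rel, e2))
    return int(bool(flagged)), flagged
      </preformat>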
    </sec>
    <sec id="sec-4">
      <title>5. Construction of KGs using LLMs</title>
      <p>Constructing KGs from unstructured textual data involves identifying the set of entities within the text and the relationships between them, resulting in a structured representation of the information contained within the text. The process can be divided into three main stages:</p>
      <p>1. Entity detection - the process of identifying and extracting entities from text.</p>
      <p>2. Coreference resolution - the process of finding all expressions (also called mentions) in the text that refer to the same entity.</p>
      <p>3. Relation extraction - the process of identifying semantic relationships between entities.</p>
      <p>Previously, researchers addressed each stage individually, but with the increasing power of LLMs, there has been a shift towards end-to-end systems. Kumar et al. [22] suggest employing two LLM components: one for named entity recognition and another one for both relation classification and direction. Similarly, Grapher [23] utilizes a pre-trained LLM for entity extraction and relation prediction. However, these methods require users to provide possible relations. More recent methods like PiVE [24] and AutoKG [25] use LLM prompting strategies for KG construction without additional user input.</p>
      <p>The aforementioned methods do not make use of some of the emergent abilities of LLMs, such as in-context learning and the chain-of-thought prompting strategy. We decide to leverage these emergent abilities, and take a simple prompt engineering approach to our KG construction step. The techniques used can be summarised as the following:</p>
      <p>• Chain-of-thought (CoT) prompting strategy. Providing intermediate reasoning steps in the prompt to enable LLMs to solve more complex tasks.</p>
      <p>• In-context learning. A method of prompt engineering where one provides several task demonstrations within the prompt, circumventing the need for fine-tuning.</p>
      <p>The final prompt used in our experiments can be found in Appendix A. We highlight to the reader that our KG construction method is not the main contribution of our work, which is rather the application of KG construction to the hallucination detection problem. The major benefit of our KG construction approach is its ease of implementation with any LLM. Furthermore, it is less computationally intensive than methods like PiVE, which performs multiple iterations of improvements to the generated KG.</p>
      <p>Of course, users may conduct the KG construction stage of GraphEval using their method of choice; the experiments in this paper exhibit the capability of a simple prompting strategy.</p>
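      <p>A minimal sketch of this stage is shown below, assuming a generic chat-completion client call_llm (for example a thin wrapper around the API used in Section 7.3); the prompt text is abridged from Appendix A and the reply is parsed from the &lt;python&gt;&lt;/python&gt; tags that the prompt requests.</p>
      <preformat>
# Sketch of stage 1: prompt an LLM for triples and parse the reply.
# call_llm is a hypothetical client; the prompt is abridged from Appendix A.
import ast
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

SYSTEM_PROMPT = (
    "You are an expert at extracting information in structured formats to build "
    "a knowledge graph. Return the knowledge graph as a list of triples, i.e. "
    '["entity 1", "relation 1-2", "entity 2"], in Python code.'
)

def construct_kg(text: str, call_llm) -> List[Triple]:
    user_prompt = (
        "Use the given format to extract information from the following input: "
        f"&lt;input&gt;{text}&lt;/input&gt;. Skip the preamble and output the result as a "
        "list within &lt;python&gt;&lt;/python&gt; tags."
    )
    reply = call_llm(system=SYSTEM_PROMPT, user=user_prompt)
    match = re.search(r"&lt;python&gt;(.*?)&lt;/python&gt;", reply, flags=re.DOTALL)
    triples = ast.literal_eval(match.group(1).strip()) if match else []
    return [tuple(t) for t in triples]
      </preformat>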
    </sec>
    <sec id="sec-5">
      <title>7. Experiments</title>
    </sec>
    <sec id="sec-6">
      <title>6. GraphCorrect: Correction of hallucinations with GraphEval</title>
      <p>While the primary focus of this work lies in hallucination detection, GraphEval's breakdown of LLM outputs into triples easily allows for its extension to correct hallucinations within the given context. To achieve this, we first identify all triples within the KG that are likely to contain hallucinations (i.e. those with a probability greater than 0.5, if any). We then employ the following two-step procedure on each identified triple:</p>
      <p>Step 1 - Input the given triple along with the context into an LLM to correct for the potential hallucinations within the triple. This results in a newly generated corrected triple.</p>
      <p>Step 2 - Input the identified triple, its corrected counterpart and the initial LLM output. Selectively replace the information from the original (hallucination-containing) triple with the information from the new triple in the initial LLM output.</p>
      <p>We name this LLM hallucination correction method GraphCorrect. The final prompts used in our experiments for step 1 and step 2 can be found in Appendix B and Appendix C respectively. This systematic approach to hallucination correction offers several benefits. First, it tackles each identified hallucination separately, increasing the chances of all perceived hallucinations being corrected. Furthermore, it offers the advantage of exclusively altering the segments of the original text that are suspected to contain a hallucination, leaving other elements untouched and ensuring overall high similarity with the original text. Finally, breaking down the entire process into intermediate steps ensures that the original context and the initial LLM output never undergo simultaneous processing within an LLM. This guarantees safeguards against both the addition of extra information and the loss of information in the LLM output.</p>
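      <p>The sketch below shows how these two steps compose over the triples flagged by GraphEval; call_llm is again a hypothetical chat client, and the embedded instructions are abridged paraphrases of the prompts in Appendix B and Appendix C rather than the exact prompts.</p>
      <preformat>
# Sketch of GraphCorrect: correct each flagged triple (step 1) and splice the
# corrected information back into the output (step 2). call_llm is hypothetical.
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

def graph_correct(llm_output: str,
                  context: str,
                  flagged_triples: List[Triple],
                  call_llm: Callable[..., str]) -> str:
    corrected = llm_output
    for triple in flagged_triples:
        # Step 1: correct the triple against the grounding context.
        new_triple = call_llm(
            system="The following triple contains factually incorrect information. "
                   "Correct it based on the provided context. Only return the corrected triple.",
            user=f"&lt;triple&gt;{list(triple)}&lt;/triple&gt;\n&lt;context&gt;{context}&lt;/context&gt;")
        # Step 2: replace only the information coming from the old triple,
        # leaving the rest of the text untouched.
        corrected = call_llm(
            system="Replace the information of the old triple with the information of "
                   "the new one. Do not make any other modification. Only return the new text.",
            user=f"&lt;context&gt;{corrected}&lt;/context&gt;\n"
                 f"&lt;old_triple&gt;{list(triple)}&lt;/old_triple&gt;\n"
                 f"&lt;new_triple&gt;{new_triple}&lt;/new_triple&gt;")
    return corrected
      </preformat>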
    </sec>
    <sec id="sec-5">
      <title>7. Experiments</title>
      <p>We conducted two sets of experiments: one focusing on hallucination detection to highlight GraphEval's performance and another on hallucination correction to showcase the advantages of GraphCorrect. For both scenarios, we utilized the SummEval [26], QAGS-C and QAGS-X [27] benchmarks - currently the most prevalent benchmarks in relevant academic literature. All three are concerned with detecting hallucinations in LLM-generated summaries and are human-annotated for factual consistency with respect to the grounding context. Table 1 contains some statistics pertaining to each of these datasets.</p>
      <sec id="sec-5-1">
        <title>7.1. Benchmarks</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Statistics of the three hallucination benchmarks: number of examples, label ratio, and average output and context lengths (in words).</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Benchmark</th>
                <th>No. of Examples</th>
                <th>Label Ratio</th>
                <th>Avg Output len.</th>
                <th>Avg Context len.</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>SummEval</td><td>1,600</td><td>33.2%</td><td>63</td><td>359</td></tr>
              <tr><td>QAGS-C</td><td>235</td><td>48.1%</td><td>49</td><td>383</td></tr>
              <tr><td>QAGS-X</td><td>239</td><td>48.5%</td><td>18</td><td>318</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>SummEval The SummEval dataset consists of human evaluations on 16 summarization model outputs from 100 articles from the CNN/DailyMail dataset [28]. Each summary is labelled on a Likert scale from 1-5 on 4 categories: consistency, coherence, fluency and relevance. We follow the TRUE benchmark [13] in taking the consistency scores and mapping a score of 5 to being fully consistent, and anything lower to being inconsistent.</p>
        <p>QAGS The QAGS-C and QAGS-X datasets are built from the CNN/DailyMail and the XSum [29] datasets, respectively. The human annotators examined the summaries one sentence at a time, and determined the factual consistency of each sentence by comparing it to the original article. Three annotators assessed each sentence and the majority decision was recorded. Again, we follow the TRUE benchmark in considering a summary to be factually consistent if and only if all sentences are considered consistent.</p>
      </sec>
      <sec id="sec-5-2">
        <title>7.2. NLI models in GraphEval</title>
        <p>As mentioned in Section 4, we employ NLI models to perform the second stage of GraphEval - checking the consistency of each individual triple with respect to the context. We conduct experiments using the three most popular NLI-based hallucination detection models available on HuggingFace (https://huggingface.co).</p>
        <p>HHEM Based on the DeBERTaV3 model [30] and initially trained on NLI data, the hallucination evaluation model created by Vectara (https://huggingface.co/vectara/hallucination_evaluation_model) is further fine-tuned on datasets annotated for consistency. The datasets used for fine-tuning were: FEVER [31], Vitamin C [32] and PAWS [33]. This model is considerably smaller than the following two models, requiring only 738 MB of memory, and thus has a significantly shorter run-time.</p>
        <p>TRUE The TRUE model is based on a T5-XXL model [34] and is trained similarly to the model described in the TRUE paper [13]. Instead of the ANLI dataset used in that paper, this model is trained on the same datasets as HHEM, plus the following: SNLI [35], MNLI [36] and SciTail [37]. This model requires 45.5 GB of memory.</p>
        <p>TrueTeacher Gekhman et al. [15] leverage the ability of LLMs to evaluate hallucinations by generating synthetic data through annotating model-generated summaries. They then use this synthetic data to further fine-tune the model from [13], leading to state-of-the-art performance on the TRUE benchmark. This model is the same size as the TRUE model.</p>
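        <p>For illustration only, the stage-2 call pattern can be reproduced with a generic MNLI cross-encoder from the HuggingFace Hub; this is not one of the three models above, and the exact interface of each model differs, but it shows how a per-triple inconsistency probability can be obtained.</p>
        <preformat>
# Illustrative per-triple scoring with a generic MNLI cross-encoder
# (microsoft/deberta-large-mnli), NOT one of the three models evaluated above.
# Inconsistency probability is taken as 1 - P(entailment); long contexts may
# need truncation or chunking in practice.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def nli_inconsistency_prob(context: str, claim: str) -> float:
    scores = nli({"text": context, "text_pair": claim}, top_k=None)
    if scores and isinstance(scores[0], list):  # unwrap if the pipeline batches
        scores = scores[0]
    p_entail = next(s["score"] for s in scores if s["label"].upper() == "ENTAILMENT")
    return 1.0 - p_entail
        </preformat>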
      </sec>
      <sec id="sec-5-3">
        <title>7.3. Experimental settings</title>
        <p>In all experiments conducted in this study necessitating the utilization of an LLM, we use Claude 2 (https://www.anthropic.com/news/claude-2), an LLM from Anthropic, through the Amazon Bedrock API (https://aws.amazon.com/bedrock/claude/). We use the default settings for the LLM: temperature = 1, top_p = 1, top_k = 250. We also refer the reader to the Appendix for the prompts used in this work.</p>
      </sec>
      <sec id="sec-5-4">
        <title>7.4. Results</title>
        <sec id="sec-5-4-1">
          <title>7.4.1. Hallucination detection with GraphEval</title>
          <p>We present our results of hallucination detection for the three NLI models, and their GraphEval counterparts, in Table 2. We report the balanced accuracy as our evaluation metric, which corrects for the class imbalance in the SummEval benchmark. In the case of using the NLI model directly, we classify the example as containing a hallucination if the NLI model returns a probability of more than 0.5. When combining the NLI model with GraphEval, we classify the example as containing a hallucination if at least one triple fed to the NLI model returns a probability of more than 0.5. We see that adding the GraphEval pre-processing step to each of the NLI models almost always improves the balanced accuracy score, sometimes by a considerable amount, such as the results for the SummEval and QAGS-C benchmarks in Table 2. On average (weighting by the number of samples in each dataset), adding the GraphEval pre-processing step improves the balanced accuracy by 6.2 (SE=1.3).</p>
          <p>It should be noted that even when the results for GraphEval are comparable to the baseline methods, the benefit of using GraphEval is the identification of the specific triple(s) that are inconsistent with the provided context.</p>
          <p>We hypothesise that the negligible difference between the base NLI model and the model supplemented with GraphEval for the QAGS-X dataset is due to the average length of the generated text (only 18 words, compared with 49 and 63 for QAGS-C and SummEval respectively, see Table 1). This highlights an important aspect of where the most value can be found in our method. When the LLM output is very short, there are less likely to be multiple facts that need to be checked for consistency (which can easily be done without the use of a KG) and the intricacies of the short sentence might even be lost in the KG construction phase. On the other hand, when the LLM output is very long, current methods struggle to test each individual fact against the context, and this is when GraphEval thrives.</p>
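          <p>Balanced accuracy is the mean of per-class recall over the consistent and inconsistent classes. The following sketch shows one way to compute it from the binary decisions described above, using scikit-learn; it is illustrative rather than the evaluation code used for Table 2.</p>
          <preformat>
# Illustrative scoring of binary hallucination decisions with balanced accuracy
# (mean of recall over the two classes), which corrects for class imbalance.
from sklearn.metrics import balanced_accuracy_score

# y_true: human labels, 1 = contains a hallucination; y_pred: detector decisions,
# e.g. from the raw NLI rule (p > 0.5 on the whole output) or the GraphEval rule
# (any triple with p > 0.5).
y_true = [1, 0, 0, 1, 0]   # placeholder labels
y_pred = [1, 0, 1, 1, 0]   # placeholder predictions
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 * (recall_class0 + recall_class1)
          </preformat>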
        </sec>
        <sec id="sec-5-4-2">
          <title>7.4.2. Hallucination correction with GraphCorrect</title>
          <p>Identifying the particular triple(s) likely to harbor a hallucination enables straightforward correction using GraphCorrect, as described in Section 6. For each of the evaluation frameworks proposed here (HHEM + GraphEval, TRUE + GraphEval, and TrueTeacher + GraphEval), we compared GraphCorrect to a basic prompting strategy for hallucination correction, serving as a baseline. The prompt used in this baseline approach, referred to as the Direct Prompt henceforth, is provided in Appendix D.</p>
          <p>For each framework, we initially identify hallucinations, correct only the LLM outputs suspected of containing hallucinations using either GraphCorrect or Direct Prompt, and then reapply the evaluation framework to detect hallucinations in the corrected LLM outputs. Note that this procedure only allows us to measure what we presume to be corrected hallucinations, given the potential for errors in the evaluation frameworks utilized here. We report the percentage of believed corrected hallucinations in Table 4. A score of 0% suggests no corrected hallucinations according to the given framework, while a score of 100% indicates correction of all hallucinations as per the given framework. GraphCorrect outperforms the prompting strategy proposed here by correcting for significantly more hallucinations on all tasks apart from two related to the QAGS-X dataset. As on the hallucination detection task, we hypothesise these results are correlated with the average length of the text, with GraphCorrect bringing most value in longer texts with a more complex structure to unravel and correct.</p>
          <p>Additionally, as previously stated, GraphCorrect offers the advantage of only modifying the segments of text in the LLM outputs susceptible to hallucinations, while leaving other sections unaltered, thereby maintaining high overall similarity with the original text. This characteristic is illustrated in Table 3 by assessing the ROUGE-1, ROUGE-2, and ROUGE-L metrics between the original and corrected LLM outputs.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>8. Discussion</title>
      <p>We observe that, in the knowledge graph construction phase of our procedure, it is possible that some information loss may occur. However, as shown by the results in Section 7.4, our method rarely leads to a reduction in balanced accuracy. Furthermore, when it is comparable to the baseline methods, we have the added explainability of identifying the specific triples where the hallucination has occurred.</p>
      <p>We expect that in the near future, more research will be conducted on the construction of KGs from unstructured text, which will provide improvements to the first stage of our procedure and ultimately the evaluation performance. Even as LLMs alone become more powerful, this will continue to contribute to improvements in GraphEval's performance.</p>
      <p>Our work focuses on detection of hallucinations in closed-domain tasks, where we are interested only in
consistency with respect to the provided context. The
GraphEval framework could be extended to open-domain
hallucination detection by employing agents, as in AutoKG [25],
to first retrieve relevant external sources as the grounding
information to check against.</p>
    </sec>
    <sec id="sec-8">
      <title>9. Conclusion</title>
      <p>We introduce GraphEval, a simple and effective pre-processing step for improving the explainability and performance of LLM hallucination detection metrics. Our method leverages LLMs' ability to extract information from unstructured text and construct knowledge graphs, whose triples can be fed into out-of-the-box hallucination detection methods.</p>
      <p>We demonstrate that GraphEval in conjunction with state-of-the-art NLI models leads to an average improvement in balanced accuracy of 6.2 (SE=1.3) on three popular hallucination benchmarks. Furthermore, our method indicates which triples, in the KG representation of the LLM output, are inconsistent. To the best of our knowledge, this is the first application of KGs to an LLM-based hallucination evaluation framework and we believe the success of GraphEval will only grow as KG construction methods also improve.</p>
      <p>Finally, we examined the issue of hallucination correction and showed that GraphCorrect can effectively address the majority of hallucinations found in LLM outputs while maintaining extremely high similarity with the original texts.</p>
    </sec>
    <sec id="sec-8a">
      <title>A. KG Construction Prompt</title>
      <preformat>
("system",
"""
You are an expert at extracting information in structured formats to build a
knowledge graph.

Step 1 - Entity detection: Identify all entities in the raw text. Make sure
not to miss any out. Entities should be basic and simple, they are akin to
Wikipedia nodes.

Step 2 - Coreference resolution: Find all expressions in the text that refer
to the same entity. Make sure entities are not duplicated. In particular do
not include entities that are more specific versions of themselves, e.g.
"a detailed view of jupiter's atmosphere" and "jupiter's atmosphere", only
include the most specific version of the entity.

Step 3 - Relation extraction: Identify semantic relationships between the
entities you have identified.

Format: Return the knowledge graph as a list of triples, i.e.
["entity 1", "relation 1-2", "entity 2"], in Python code.
""",
),
("human",
"Use the given format to extract information from the following input:
&lt;input&gt;{input}&lt;/input&gt;. Skip the preamble and output the result as a list
within &lt;python&gt; &lt;/python&gt; tags.",
),
("human",
"""Important Tips:
1. Make sure all information is included in the knowledge graph.
2. Each triple must only contain three strings! None of the strings should
   be empty.
3. Do not split up related information into separate triples because this
   could change the meaning.
4. Make sure all brackets and quotation marks are matched.
5. Before adding a triple to the knowledge graph, check the concatenated
   triple makes sense as a sentence. If not, discard it.
""",
),
("human",
"""Here are some example input and output pairs.

## Example 1.
Input:
"The Walt Disney Company, commonly known as Disney, is an American
multinational mass media and entertainment conglomerate that is
headquartered at the Walt Disney Studios complex in Burbank, California."
Output:
&lt;python&gt;
[["The Walt Disney Company", "headquartered at", "Walt Disney Studios complex
in Burbank, California"],
["The Walt Disney Company", "commonly known as", "Disney"],
["The Walt Disney Company", "instance of", "American multinational mass media
and entertainment conglomerate"]]
&lt;/python&gt;

## Example 2.
Input:
"Amanda Jackson was born in Springfield, Ohio, USA on June 1, 1985. She was
a basketball player for the U.S. women's team."
Output:
&lt;python&gt;
[["Amanda Jackson", "born in", "Springfield, Ohio, USA"],
["Amanda Jackson", "born on", "June 1, 1985"],
["Amanda Jackson", "occupation", "basketball player"],
["Amanda Jackson", "played for", "U.S. women's basketball team"]]
&lt;/python&gt;

## Example 3.
Input:
"Music executive Darius Van Arman was born in Pennsylvania. He attended
Gonzaga College High School and is a human being."
Output:
&lt;python&gt;
[["Darius Van Arman", "occupation", "Music executive"],
["Darius Van Arman", "born in", "Pennsylvania"],
["Darius Van Arman", "attended", "Gonzaga College High School"],
["Darius Van Arman", "instance of", "human being"]]
&lt;/python&gt;

## Example 4.
Input:
"Italy had 3.6x times more cases of coronavirus than China."
Output:
&lt;python&gt;
[["Italy", "had 3.6x times more cases of coronavirus than", "China"]]
&lt;/python&gt;
""",
),
      </preformat>
    </sec>
    <sec id="sec-9">
      <title>B. Hallucination correction (step 1)</title>
      <p>" " "
You a r e an e x p e r t a t e x t r a c t i n g
i n f o r m a t i o n i n s t r u c t u r e d f o r m a t s
from t e x t .</p>
      <p>The f o l l o w i n g t r i p l e c o n t a i n s
f a c t u a l l y i n c o r r e c t i n f o r m a t i o n .
C o r r e c t i t b a s e d on t h e p r o v i d e d
c o n t e x t ,
I m p o r t a n t T i p s :
1 . A t r i p l e i s d e f i n e d a s [ "
e n t i t y 1 " , " r e l a t i o n 1 − 2 " , "
e n t i t y 2 " ] .
2 . A t r i p l e must only c o n t a i n
t h r e e s t r i n g s ! None o f t h e
s t r i n g s s h o u l d be empty .
3 . The c o n c a t e n a t e d t r i p l e must
make s e n s e a s a s e n t e n c e .
4 . Only r e t u r n t h e c o r r e c t e d
t r i p l e , n o t h i n g e l s e .
&lt; t r i p l e &gt; { t r i p l e } &lt; / t r i p l e &gt;
&lt; c o n t e x t &gt; { c o n t e x t } &lt; / c o n t e x t &gt;
Remember , i t i s i m p o r t a n t t h a t you
only r e t u r n t h e c o r r e c t e d t r i p l e .
" " "</p>
    </sec>
    <sec id="sec-10">
      <title>C. Hallucination correction (step 2)</title>
      <p>" " "
I n t h e f o l l o w i n g c o n t e x t , r e p l a c e t h e
i n f o r m a t i o n o f t h e o l d t r i p l e
with t h e i n f o r m a t i o n o f t h e new
one .</p>
      <p>Do not make any o t h e r m o d i f i c a t i o n t o
t h e c o n t e x t .</p>
      <p>Only r e t u r n t h e new c o n t e x t .
&lt; c o n t e x t &gt; { summary } &lt; / c o n t e x t &gt;
&lt; o l d _ t r i p l e &gt; { o l d _ t r i p l e } &lt; / o l d _ t r i p l e &gt;
&lt; n e w _ t r i p l e &gt; { n e w _ t r i p l e } &lt; / n e w _ t r i p l e &gt;
" " "</p>
    </sec>
    <sec id="sec-11">
      <title>D. Hallucination correction without a KG</title>
      <p>" " "
The f o l l o w i n g summary c o n t a i n s
f a c t u a l l y i n c o r r e c t i n f o r m a t i o n .
C o r r e c t i t b a s e d on t h e c o n t e x t , but
don ’ t change o t h e r p a r t s o f t h e
summary .</p>
      <p>Only r e t u r n t h e c o r r e c t e d summary ,
n o t h i n g e l s e .
&lt;summary &gt; { summary } &lt; / summary&gt;
&lt; c o n t e x t &gt; { c o n t e x t } &lt; / c o n t e x t &gt;
Remember , do minimal changes t o t h e
o r i g i n a l summary , don ’ t make i t
l o n g e r and keep a s much o f i t a s
you can e x a c t l y t h e same .</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref12"><label>[12]</label><mixed-citation>J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>O. Honovich, R. Aharoni, J. Herzig, H. Taitelbaum, D. Kukliansy, V. Cohen, T. Scialom, I. Szpektor, A. Hassidim, Y. Matias, TRUE: Re-evaluating factual consistency evaluation, in: S. Feng, H. Wan, C. Yuan, H. Yu (Eds.), Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 161–175. URL: https://aclanthology.org/2022.dialdoc-1.19. doi:10.18653/v1/2022.dialdoc-1.19.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>P. Laban, T. Schnabel, P. N. Bennett, M. A. Hearst, SummaC: Re-visiting NLI-based models for inconsistency detection in summarization, Transactions of the Association for Computational Linguistics 10 (2022) 163–177. URL: https://aclanthology.org/2022.tacl-1.10. doi:10.1162/tacl_a_00453.</mixed-citation></ref>
      <ref id="ref15"><label>[15]</label><mixed-citation>Z. Gekhman, J. Herzig, R. Aharoni, C. Elkind, I. Szpektor, TrueTeacher: Learning factual consistency evaluation with large language models, 2023. arXiv:2305.11171.</mixed-citation></ref>
      <ref id="ref16"><label>[16]</label><mixed-citation>P. Manakul, A. Liusie, M. J. Gales, SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models, arXiv preprint arXiv:2303.08896 (2023).</mixed-citation></ref>
      <ref id="ref17"><label>[17]</label><mixed-citation>N. Mündler, J. He, S. Jenko, M. Vechev, Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation, in: The Twelfth International Conference on Learning Representations, 2024. URL: https://openreview.net/forum?id=EmQSOi1X2f.</mixed-citation></ref>
      <ref id="ref18"><label>[18]</label><mixed-citation>J. Fu, S.-K. Ng, Z. Jiang, P. Liu, GPTScore: Evaluate as you desire, 2023. arXiv:2302.04166.</mixed-citation></ref>
      <ref id="ref19"><label>[19]</label><mixed-citation>S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi, FactScore: Fine-grained atomic evaluation of factual precision in long form text generation, arXiv preprint arXiv:2305.14251 (2023).</mixed-citation></ref>
      <ref id="ref20"><label>[20]</label><mixed-citation>J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, et al., Long-form factuality in large language models, arXiv preprint arXiv:2403.18802 (2024).</mixed-citation></ref>
      <ref id="ref21"><label>[21]</label><mixed-citation>L. F. R. Ribeiro, M. Liu, I. Gurevych, M. Dreyer, M. Bansal, FactGraph: Evaluating factuality in summarization with semantic graph representations, 2022. arXiv:2204.06508.</mixed-citation></ref>
      <ref id="ref22"><label>[22]</label><mixed-citation>A. Kumar, A. Pandey, R. Gadia, M. Mishra, Building knowledge graph using pre-trained language model for learning entity-aware relationships, in: 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), 2020, pp. 310–315. doi:10.1109/GUCON48875.2020.9231227.</mixed-citation></ref>
      <ref id="ref23"><label>[23]</label><mixed-citation>I. Melnyk, P. Dognin, P. Das, Grapher: Multi-stage knowledge graph construction using pretrained language models, in: NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. URL: https://openreview.net/forum?id=N2CFXG8-pRd.</mixed-citation></ref>
      <ref id="ref24"><label>[24]</label><mixed-citation>J. Han, N. Collier, W. Buntine, E. Shareghi, PiVE: Prompting with iterative verification improving graph-based generative capability of LLMs, arXiv preprint arXiv:2305.12392 (2023).</mixed-citation></ref>
      <ref id="ref25"><label>[25]</label><mixed-citation>Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, N. Zhang, LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities, arXiv preprint arXiv:2305.13168 (2023).</mixed-citation></ref>
      <ref id="ref26"><label>[26]</label><mixed-citation>A. R. Fabbri, W. Kryscinski, B. McCann, R. Socher, D. R. Radev, SummEval: Re-evaluating summarization evaluation, Transactions of the Association for Computational Linguistics 9 (2020) 391–409. URL: https://api.semanticscholar.org/CorpusID:220768873.</mixed-citation></ref>
      <ref id="ref27"><label>[27]</label><mixed-citation>A. Wang, K. Cho, M. Lewis, Asking and answering questions to evaluate the factual consistency of summaries, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 5008–5020. URL: https://aclanthology.org/2020.acl-main.450. doi:10.18653/v1/2020.acl-main.450.</mixed-citation></ref>
      <ref id="ref28"><label>[28]</label><mixed-citation>K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, Advances in Neural Information Processing Systems 28 (2015).</mixed-citation></ref>
      <ref id="ref29"><label>[29]</label><mixed-citation>S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1797–1807. URL: https://aclanthology.org/D18-1206. doi:10.18653/v1/D18-1206.</mixed-citation></ref>
      <ref id="ref30"><label>[30]</label><mixed-citation>P. He, X. Liu, J. Gao, W. Chen, DeBERTa: Decoding-enhanced BERT with disentangled attention, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=XPZIaotutsD.</mixed-citation></ref>
      <ref id="ref31"><label>[31]</label><mixed-citation>J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The FEVER2.0 shared task, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), 2018.</mixed-citation></ref>
      <ref id="ref32"><label>[32]</label><mixed-citation>T. Schuster, A. Fisch, R. Barzilay, Get your vitamin C! Robust fact verification with contrastive evidence, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 624–643. URL: https://aclanthology.org/2021.naacl-main.52. doi:10.18653/v1/2021.naacl-main.52.</mixed-citation></ref>
      <ref id="ref33"><label>[33]</label><mixed-citation>Y. Zhang, J. Baldridge, L. He, PAWS: Paraphrase Adversaries from Word Scrambling, in: Proc. of NAACL, 2019.</mixed-citation></ref>
      <ref id="ref34"><label>[34]</label><mixed-citation>C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.</mixed-citation></ref>
      <ref id="ref35"><label>[35]</label><mixed-citation>S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, in: L. Màrquez, C. Callison-Burch, J. Su (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 632–642. URL: https://aclanthology.org/D15-1075. doi:10.18653/v1/D15-1075.</mixed-citation></ref>
      <ref id="ref36"><label>[36]</label><mixed-citation>A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence understanding through inference, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 1112–1122. URL: https://aclanthology.org/N18-1101. doi:10.18653/v1/N18-1101.</mixed-citation></ref>
      <ref id="ref37"><label>[37]</label><mixed-citation>T. Khot, A. Sabharwal, P. Clark, SciTail: A textual entailment dataset from science question answering, in: AAAI, 2018.</mixed-citation></ref>
    </ref-list>
  </back>
</article>