<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hannah</forename><surname>Sansford</surname></persName>
							<email>hannah.sansford@bristol.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Bristol</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicholas</forename><surname>Richardson</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Amazon Science</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hermina</forename><forename type="middle">Petric</forename><surname>Maretic</surname></persName>
							<email>maretich@amazon.co.uk</email>
							<affiliation key="aff1">
								<orgName type="institution">Amazon Science</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Juba</forename><forename type="middle">Nait</forename><surname>Saada</surname></persName>
							<email>jubans@amazon.co.uk</email>
							<affiliation key="aff1">
								<orgName type="institution">Amazon Science</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">GraphEval: A Knowledge-Graph Based LLM Hallucination Evaluation Framework</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">0493B68747775A0C6079A4C691695B7B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Knowledge Graphs</term>
					<term>Hallucination Detection</term>
					<term>Hallucination Correction</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Methods to evaluate Large Language Model (LLM) responses and detect inconsistencies, also known as hallucinations, with respect to the provided knowledge are becoming increasingly important for LLM applications. Current metrics fall short in their ability to provide explainable decisions and to systematically check all pieces of information in the response, and they are often too computationally expensive to be used in practice. We present GraphEval: a hallucination evaluation framework based on representing information in Knowledge Graph (KG) structures. Our method identifies the specific triples in the KG that are prone to hallucinations and hence provides more insight into where in the response a hallucination has occurred, if at all, than previous methods. Furthermore, using our approach in conjunction with state-of-the-art natural language inference (NLI) models leads to an improvement in balanced accuracy on various hallucination benchmarks, compared to using the raw NLI models. Lastly, we explore the use of GraphEval for hallucination correction by leveraging the structure of the KG, a method we name GraphCorrect, and demonstrate that the majority of hallucinations can indeed be rectified.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As the size and power of LLMs have drastically increased over recent years, so has the number of potential applications. Arguably, one of the biggest blockers to implementing these models in practice is their tendency to hallucinate: returning seemingly plausible, but untrue, responses. Here, we focus on the problem of detecting hallucinations with respect to the provided context that the LLM should use as its source of knowledge; detecting hallucinations that have deviated from the LLM's original training data is out of the scope of this work. In applications where certainty in a response is critical, such as medical diagnosis, the existence of hallucinations that arise from a given context is especially limiting. Therefore, it is of utmost importance to develop successful methods to detect these hallucinations and, when it is of interest to address or correct them, provide clarity on which aspect of the response is likely a hallucination. The importance of this issue is reflected in the amount of research being published on the topic; see Ji et al. <ref type="bibr" target="#b0">[1]</ref> for a recent survey of this area. Evaluation of generated text has a long history, beginning before hallucinations were at the forefront of the problem. Methods have evolved a great deal from traditional N-gram based metrics, such as BLEU <ref type="bibr" target="#b1">[2]</ref> and ROUGE <ref type="bibr" target="#b2">[3]</ref>, to much more intricate LLM-based evaluation metrics with user-defined evaluation criteria, such as G-Eval <ref type="bibr" target="#b3">[4]</ref>. More recently, techniques to mitigate the prevalence of hallucinations in generated outputs leveraging Retrieval Augmented Generation (RAG) <ref type="bibr" target="#b4">[5]</ref> and reasoning on knowledge graphs (KGs) <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref> have been proposed. 
The former suggested the concatenation of relevant contextual data into the prompt to ground the LLM response, while the latter enforced a more robust reasoning process by providing grounding information in KG structures <ref type="bibr" target="#b7">[8]</ref>. As successful as these approaches have been, they do not fully circumvent the need to evaluate LLM outputs.</p><p>Inspired by current research harnessing KGs to provide grounded LLM responses, we propose GraphEval, a hallucination detection framework based on the representation of information in KG structures. To the best of our knowledge, we are the first to apply KGs to an LLM-based hallucination evaluation framework, and in doing so we provide a higher level of insight into where in the output a hallucination has occurred than any previous metrics. Additionally, we demonstrate how using our method in conjunction with current state-of-the-art hallucination detection methods improves their classification accuracy on various benchmarks. Finally, we consider the problem of hallucination correction and introduce GraphCorrect, showcasing how GraphEval can effectively be extended to rectify a significant proportion of hallucinations present in LLM outputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Problem statement</head><p>In this work we focus on the closed-domain hallucination detection problem: the situation where we have a textual output from an LLM which is generated using some grounding context included in the prompt. In this case, the goal is for the LLM to use the provided context as its only source of knowledge. The open-domain problem, which concerns consistency with respect to all factual knowledge in the world, is not explored here but is briefly discussed in Section 8.</p><p>We consider hallucination detection to be a binary classification problem, with 0 corresponding to the LLM output being factually consistent given the provided context, and 1 corresponding to the output containing at least one inconsistency. We can assess hallucination evaluation methods using a benchmarking dataset containing ground-truth labels (usually human-annotated) to determine whether a given context-output pair contains factual inconsistencies. Throughout the paper we use the terms factual, consistent, grounded and faithful interchangeably to mean containing no hallucinations with respect to the context. Finally, we explore the problem of hallucination correction, wherein we do not use any directly labeled dataset. Instead, we utilize hallucination detection frameworks to first identify hallucinations to correct, and subsequently repurpose them to evaluate the corrected outputs. It is important to note that our exploration of hallucination correction only serves as an extension to our evaluation framework and is not the primary focus of this study.</p></div>
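The binary-classification framing above can be sketched in a few lines. This is an illustrative sketch only: `evaluate_detector` and the `detector` callable are hypothetical names standing in for any hallucination evaluation method scored against a labelled benchmark, not an interface defined in this paper.

```python
# Closed-domain hallucination detection as binary classification over
# labelled (context, output) pairs. `detector` is any method returning
# 1 for "contains an inconsistency" and 0 for "factually consistent".

def evaluate_detector(detector, benchmark):
    """benchmark: iterable of (context, llm_output, label), label in {0, 1}."""
    predictions = [(detector(context, output), label)
                   for context, output, label in benchmark]
    # Plain accuracy; balanced accuracy (Section 7.4) additionally
    # corrects for class imbalance.
    correct = sum(pred == label for pred, label in predictions)
    return correct / len(predictions)
```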
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related work</head><p>Historically, N-gram based metrics such as BLEU <ref type="bibr" target="#b1">[2]</ref> and ROUGE <ref type="bibr" target="#b2">[3]</ref> have been the most widely used metrics for natural language evaluation. However, these metrics have been shown to perform poorly at the task of factual inconsistency detection <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>. In more recent years, embedding-based metrics such as BERTScore <ref type="bibr" target="#b10">[11]</ref> have been favoured over N-gram based metrics. These methods measure the similarity between two pieces of text by comparing the contextualised embedding from a transformer model, such as BERT <ref type="bibr" target="#b11">[12]</ref>.</p><p>Both N-gram and embedding-based metrics base their scores on how similar the text to be evaluated is to some reference text. This similarity objective often fails to capture the intricacies of the hallucination detection problem.</p><p>Therefore, researchers have begun to develop new methods that are more acutely tuned to detecting inconsistencies between an LLM output and its grounding context. Maynez et al. <ref type="bibr" target="#b8">[9]</ref> identified the crossover between the textual entailment score in NLI tasks and consistency prediction. This was a breakthrough at the time, producing higher correlation with faithfulness than any previous metrics, and paved the way for further research that capitalised on NLI data and models <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>Very recently, attention has turned to leveraging LLMs themselves to evaluate the consistency of LLM outputs. SelfCheckGPT <ref type="bibr" target="#b15">[16]</ref> and ChatProtect <ref type="bibr" target="#b16">[17]</ref> approach the problem by considering the self-consistency within sampled outputs. 
Since they require the generation of a large number of responses from the LLM, many consider these methods prohibitively computationally expensive.</p><p>Other LLM-based hallucination evaluation methods, such as G-Eval <ref type="bibr" target="#b3">[4]</ref> and GPTScore <ref type="bibr" target="#b17">[18]</ref>, employ a different LLM for evaluation than the one used to generate the LLM response that needs to be evaluated. G-Eval allows user-defined evaluation criteria and uses automated chain-of-thought prompting and form-filling to assign scores. GPTScore treats the task as conditional generation, leveraging models like GPT-3 to assign higher probabilities to high-quality outputs by prepending evaluation instructions to the LLM prompt. Unlike NLI models trained on binary classification data, these methods produce scores that are harder to interpret as probabilities and often require additional steps for inconsistency classification.</p><p>Recent hallucination detection methods, such as FactScore <ref type="bibr" target="#b18">[19]</ref> and SAFE <ref type="bibr" target="#b19">[20]</ref>, utilize large language models to break down the response into atomic or individual facts for evaluation. These approaches have enabled precise identification of where hallucinations occur within the LLM response. Each fact is automatically verified against a comprehensive knowledge source like Wikipedia or scientific literature in the case of FactScore, or through the use of a search engine in the case of SAFE.</p><p>FactGraph <ref type="bibr" target="#b20">[21]</ref> is the only factuality evaluation method we are aware of that utilises graph-like structures. The method is focused solely on the detection of inconsistencies in the summarization problem, decomposing both the summary and the supporting documents into what they call structured meaning representations (MRs). 
These MRs describe the core semantic concepts and relations, which the authors claim to be more suitable for factuality evaluation than the raw text. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">GraphEval: Our evaluation method</head><p>GraphEval is based around the idea of representing information in a structured manner through KGs, and aims to address the lack of explainability of previous hallucination detection approaches, i.e. their inability to indicate which concrete pieces of information in particular are inconsistent.</p><p>Formally, a KG is a collection of triples 𝒦𝒢 = {(𝑒₁, 𝑟, 𝑒₂)} ⊆ ℰ × ℛ × ℰ, where ℰ and ℛ denote the set of entities and relationships, respectively. In the GraphEval setting, both entities and relationships are simply pieces of text. We do not make use of common extensions to this simple setting, such as entity and relationship types, or attached properties.</p><p>Our GraphEval metric consists of a two-stage procedure:</p><p>Stage 1: Construct a KG from the LLM output to be evaluated. Stage 2: Iterate through each of the triples in the KG, identifying whether they are factually consistent given the provided context.</p><p>The output is considered factually inconsistent if any of the triples in stage 2 are identified as not grounded in the context. The inconsistent triple(s) may also be returned to provide explainability by highlighting where in the output the hallucination(s) has occurred. We provide a visualisation of this process in Figure <ref type="figure" target="#fig_0">1</ref> using a real example from one of the benchmarks described in Section 7.1.</p><p>Regarding stage 1, we provide a short review of LLM-based KG construction methods in Section 5, along with results from our implementation. For stage 2, we leverage existing techniques and employ an out-of-the-box NLI model for this task. A benefit of this approach is that it gives us the opportunity to make a direct comparison between the performance of the raw NLI model and the model supplemented with our KG approach. 
In essence, our method is a pre-processing step, the output of which can be fed into any hallucination detection method; we choose NLI models as they are computationally cheap compared to LLM-based models, yet still achieve state-of-the-art results. By feeding each triple into an NLI model, along with the grounding context, we obtain a probability of containing a hallucination for each triple. Finally, we classify the example as inconsistent if at least one triple produces a probability greater than 0.5.</p><p>Similar approaches to ours have been proposed in recent literature. SummaC <ref type="bibr" target="#b13">[14]</ref> also uses NLI-based models to detect inconsistencies in LLM-generated summaries. However, it distinguishes itself by segmenting both the context and the summary into their respective sentences, and then by passing each context-summary pair into the NLI model. This approach presents challenges in maintaining entity references across sentences; for instance, "John Doe" may only be referred to as "he" in another sentence. Similarly, FactScore <ref type="bibr" target="#b18">[19]</ref> faces the same limitation. Our method circumvents this issue by organising entity relationships with a KG.</p><p>While FactGraph <ref type="bibr" target="#b20">[21]</ref> also makes use of graph structures in their consistency evaluation process, the method differs from GraphEval in a few major respects. Firstly, their approach can only be applied to the summarisation problem, whereas GraphEval can easily be applied to various domains such as summarisation, question answering, common sense reasoning and many others. Secondly, FactGraph does not employ LLMs anywhere in their framework, missing out on recent advances in the field. Finally, their approach aims to decompose both the LLM output and the provided context into the underlying core semantic concepts and relations, before comparing each of the graph structures. 
GraphEval, on the other hand, only represents the LLM output as a KG and aims to preserve as much of the information contained in the raw text as possible.</p><p>To summarise the advantages of GraphEval over previous methods:</p><p>• We present a systematic way of checking all pieces of information contained in the LLM output. • Our method only requires one call to an LLM, in the KG construction phase, and does not require the (usually) large context documents to be input, as in all previous LLM-based metrics. This makes GraphEval less computationally expensive than other LLM-based methods. • Our method returns the specific triples that are not grounded in the context, providing explainability for the decision and identifying which section of the output should not be trusted. We leverage this feature for hallucination correction and propose a new method called GraphCorrect, described in Section 6.</p></div>
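The two-stage procedure can be sketched as follows. This is a minimal illustration, assuming two hypothetical callables not named in the paper: `extract_triples` (Stage 1, e.g. an LLM-based KG constructor) and `nli_hallucination_prob` (Stage 2, e.g. an NLI model such as HHEM returning a hallucination probability per triple).

```python
# Sketch of GraphEval's two-stage procedure.

def graph_eval(output_text, context, extract_triples, nli_hallucination_prob,
               threshold=0.5):
    """Return (is_hallucinated, offending_triples) for an LLM output."""
    # Stage 1: build a KG from the LLM output as (entity, relation, entity) triples.
    triples = extract_triples(output_text)

    # Stage 2: check each triple against the grounding context with an NLI model.
    offending = [t for t in triples
                 if nli_hallucination_prob(context, t) > threshold]

    # The output is inconsistent if any triple is not grounded in the context;
    # the offending triples provide the explainability.
    return len(offending) > 0, offending
```

Note that the context is passed only to the (cheap) NLI model, never to the LLM, which is what keeps the approach comparatively inexpensive.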
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Construction of KGs using LLMs</head><p>Constructing KGs from unstructured textual data involves identifying the set of entities within the text and the relationships between them, resulting in a structured representation of the information contained within the text. The process can be divided into three main stages:</p><p>1. Entity detection: the process of identifying and extracting entities from text. 2. Coreference resolution: the process of finding all expressions (also called mentions) in the text that refer to the same entity. 3. Relation extraction: the process of identifying semantic relationships between entities.</p><p>Previously, researchers addressed each stage individually, but with the increasing power of LLMs, there has been a shift towards end-to-end systems. Kumar et al. <ref type="bibr" target="#b21">[22]</ref> suggest employing two LLM components: one for named entity recognition and another for both relation classification and direction. Similarly, Grapher <ref type="bibr" target="#b22">[23]</ref> utilizes a pre-trained LLM for entity extraction and relation prediction. However, these methods require users to provide possible relations. More recent methods like PiVE <ref type="bibr" target="#b23">[24]</ref> and AutoKG <ref type="bibr" target="#b24">[25]</ref> use LLM prompting strategies for KG construction without additional user input.</p><p>The aforementioned methods do not make use of some of the emergent abilities of LLMs, such as in-context learning and the chain-of-thought prompting strategy. We decide to leverage these emergent abilities, and take a simple prompt engineering approach to our KG construction step. The techniques used can be summarised as the following:</p><p>• Chain-of-thought (CoT) prompting strategy. Providing intermediate reasoning steps in the prompt to enable LLMs to solve more complex tasks. • In-context learning. 
A method of prompt engineering where one provides several task demonstrations within the prompt, circumventing the need for fine-tuning.</p><p>The final prompt used in our experiments can be found in the Appendix. We highlight to the reader that our KG construction method is not the main contribution of our work, which is rather the application of KG construction to the hallucination detection problem. The major benefit of our KG construction approach is its ease of implementation with any LLM. Furthermore, it is less computationally intensive than methods like PiVE, which performs multiple iterations of improvements to the generated KG.</p><p>Of course, users may conduct the KG construction stage of GraphEval using their method of choice; the experiments in this paper exhibit the capability of a simple prompting strategy.</p></div>
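The prompting approach can be sketched as below. The actual prompt used in the experiments is given in the paper's Appendix; the instruction text, the few-shot example, and the `call_llm` wrapper here are all illustrative assumptions, not the paper's prompt.

```python
# Sketch of Stage 1 KG construction via prompt engineering: few-shot
# demonstrations (in-context learning) plus an instruction to reason in
# intermediate steps (chain-of-thought).

FEW_SHOT = [
    ("John Doe, a physician, lives in London.",
     [("John Doe", "profession", "physician"),
      ("John Doe", "lives in", "London")]),
]

def build_kg_prompt(text):
    lines = [
        "Extract (entity, relation, entity) triples from the text.",
        "Think step by step: first list the entities, then resolve mentions "
        "that refer to the same entity, then extract the relations.",
        "",
    ]
    # In-context learning: show the model worked demonstrations.
    for example_text, triples in FEW_SHOT:
        lines.append(f"Text: {example_text}")
        lines.append(f"Triples: {triples}")
    lines.append(f"Text: {text}")
    lines.append("Triples:")
    return "\n".join(lines)

def construct_kg(text, call_llm):
    # `call_llm` is any chat/completion wrapper; its raw string output is
    # parsed into triples downstream.
    return call_llm(build_kg_prompt(text))
```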
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">GraphCorrect: Correction of hallucinations with GraphEval</head><p>While the primary focus of this work lies in hallucination detection, GraphEval's breakdown of LLM outputs into triples easily allows for its extension to correct hallucinations within the given context. To achieve this, we first identify all triples within the KG that are likely to contain hallucinations (i.e. those with a probability greater than 0.5, if any). We then employ the following two-step procedure on each identified triple:</p><p>Step 1: Input the given triple along with the context into an LLM to correct for the potential hallucinations within the triple. This results in a newly generated corrected triple.</p><p>Step 2: Input the identified triple, its corrected counterpart and the initial LLM output. Selectively replace the information from the original (hallucination-containing) triple with the information from the new triple in the initial LLM output.</p><p>We name this hallucination correction method GraphCorrect. The final prompts used in our experiments for steps 1 and 2 can be found in Appendices B and C respectively. This systematic approach to hallucination correction offers several benefits. First, it tackles each identified hallucination separately, increasing the chances of all perceived hallucinations being corrected. Furthermore, it offers the advantage of exclusively altering the segments of the original text that are suspected to contain a hallucination, leaving other elements untouched and ensuring overall high similarity with the original text. Finally, breaking down the entire process into intermediate steps ensures that the original context and the initial LLM output never undergo simultaneous processing within an LLM. This safeguards against both the addition of extra information and the loss of information in the LLM output.</p></div>
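The two-step procedure can be sketched as follows. The real prompts are in Appendices B and C of the paper; the prompt strings and the `call_llm` wrapper below are illustrative placeholders only.

```python
# Sketch of GraphCorrect: per-triple correction (Step 1) followed by
# splicing the corrected triple back into the output (Step 2).

def graph_correct(output_text, context, flagged_triples, call_llm):
    corrected = output_text
    for triple in flagged_triples:
        # Step 1: correct the triple against the context. Only the context
        # and the triple are shown to the LLM here.
        fixed_triple = call_llm(
            f"Context: {context}\nTriple: {triple}\n"
            "Rewrite the triple so it is consistent with the context.")
        # Step 2: replace the original triple's information in the output.
        # Only the output and the two triples are shown, so the context and
        # the full output are never processed simultaneously.
        corrected = call_llm(
            f"Text: {corrected}\nOriginal triple: {triple}\n"
            f"Corrected triple: {fixed_triple}\n"
            "Replace only the information from the original triple with the "
            "corrected triple, leaving the rest of the text unchanged.")
    return corrected
```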
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Benchmarks</head><p>We conducted two sets of experiments: one focusing on hallucination detection to highlight GraphEval's performance and another on hallucination correction to showcase the advantages of GraphCorrect. For both scenarios, we utilized the SummEval <ref type="bibr" target="#b25">[26]</ref>, QAGS-C and QAGS-X <ref type="bibr" target="#b26">[27]</ref> benchmarks, currently the most prevalent benchmarks in relevant academic literature. All three are concerned with detecting hallucinations in LLM-generated summaries and are human-annotated for factual consistency with respect to the grounding context. Table <ref type="table" target="#tab_1">1</ref> contains some statistics pertaining to each of these datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SummEval</head><p>The SummEval dataset consists of human evaluations on 16 summarization model outputs from 100 articles from the CNN/DailyMail dataset <ref type="bibr" target="#b27">[28]</ref>. Each summary is labelled on a Likert scale from 1 to 5 across four categories: consistency, coherence, fluency and relevance. We follow the TRUE benchmark <ref type="bibr" target="#b12">[13]</ref> in taking the consistency scores and mapping a score of 5 to being fully consistent, and anything lower to being inconsistent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QAGS</head><p>The QAGS-C and QAGS-X datasets are built from the CNN/DailyMail and the XSum <ref type="bibr" target="#b28">[29]</ref> datasets, respectively. The human annotators examined the summaries one sentence at a time, and determined the factual consistency of each sentence by comparing it to the original article. Three annotators assessed each sentence and the majority decision was recorded. Again, we follow the TRUE benchmark in considering a summary to be factually consistent if and only if all sentences are considered consistent.</p></div>
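The QAGS labelling scheme just described reduces to two small aggregation rules, sketched here for clarity (function names are ours, not from the benchmark):

```python
# QAGS labelling: three annotators vote per sentence (1 = consistent),
# the majority decision is recorded, and a summary is factually consistent
# if and only if every sentence is.

def sentence_label(votes):
    """votes: e.g. [1, 1, 0] from three annotators."""
    return sum(votes) >= 2  # majority of three

def summary_is_consistent(per_sentence_votes):
    return all(sentence_label(v) for v in per_sentence_votes)
```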
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">NLI models in GraphEval</head><p>As mentioned in Section 4, we employ NLI models to perform the second stage of GraphEval: checking the consistency of each individual triple with respect to the context. We conduct experiments using the three most popular NLI-based hallucination detection models available on HuggingFace<ref type="foot" target="#foot_0">1</ref>.</p><p>HHEM Based on the DeBERTaV3 model <ref type="bibr" target="#b29">[30]</ref> and initially trained on NLI data, the hallucination evaluation model created by Vectara is further fine-tuned on datasets annotated for consistency. The datasets used for fine-tuning were: FEVER <ref type="bibr" target="#b30">[31]</ref>, Vitamin C <ref type="bibr" target="#b31">[32]</ref> and PAWS <ref type="bibr" target="#b32">[33]</ref>. This model is considerably smaller than the following two models, requiring only 738 MB of memory, and thus has a significantly shorter run-time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TRUE</head><p>The TRUE model is based on a T5-XXL model <ref type="bibr" target="#b33">[34]</ref> and is trained similarly to the model described in the TRUE paper <ref type="bibr" target="#b12">[13]</ref>. Instead of the ANLI dataset used in that paper, this model is trained on the same datasets as HHEM, plus the following: SNLI <ref type="bibr" target="#b34">[35]</ref>, MNLI <ref type="bibr" target="#b35">[36]</ref> and Scitail <ref type="bibr" target="#b36">[37]</ref>. This model requires 45.5 GB of memory.</p><p>TrueTeacher Gekhman et al. <ref type="bibr" target="#b14">[15]</ref> leverage the ability of LLMs to evaluate hallucinations by generating synthetic data through annotating model-generated summaries. They then use this synthetic data to further fine-tune the model from <ref type="bibr" target="#b12">[13]</ref>, leading to state-of-the-art performance on the TRUE benchmark. This model is the same size as the TRUE model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Experimental settings</head><p>All experiments in this study that require an LLM use Claude 2 <ref type="foot" target="#foot_2">3</ref>, an LLM from Anthropic, accessed through the Amazon Bedrock API <ref type="foot" target="#foot_3">4</ref>. We use the default settings for the LLM: temperature = 1, top_p = 1, top_k = 250. We also refer the reader to the Appendix for the prompts used in this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.1.">Hallucination detection with GraphEval</head><p>We present our results of hallucination detection for the three NLI models, and their GraphEval counterparts, in Table <ref type="table">2</ref>. We report the balanced accuracy as our evaluation metric, which corrects for the class imbalance in the SummEval benchmark. In the case of using the NLI model directly, we classify the example as containing a hallucination if the NLI model returns a probability of more than 0.5. When combining the NLI model with GraphEval, we classify the example as containing a hallucination if at least one triple fed to the NLI model returns a probability of more than 0.5. We see that adding the GraphEval pre-processing step to each of the NLI models almost always improves the balanced accuracy score, sometimes by a considerable amount, such as the results for the SummEval and QAGS-C benchmarks in Table <ref type="table">2</ref>. On average (weighting by the number of samples in each dataset), adding the GraphEval pre-processing step improves the balanced accuracy by 6.2 (SE=1.3). </p></div>
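Balanced accuracy, as reported above, is the mean of the per-class recalls. A minimal reference implementation (equivalent to, e.g., scikit-learn's `balanced_accuracy_score`) is:

```python
# Balanced accuracy: average the recall obtained on each class, which
# corrects for class imbalance such as that in the SummEval benchmark.

def balanced_accuracy(y_true, y_pred):
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```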
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Balanced accuracy scores for hallucination detection of NLI models (HHEM, TRUE, TrueTeacher) and their GraphEval counterparts on the SummEval, QAGS-C and QAGS-X benchmarks.</p><p>We hypothesise that the negligible difference between the base NLI model and the model supplemented with GraphEval for the QAGS-X dataset is due to the average length of the generated text (only 18 words, compared with 49 and 63 for QAGS-C and SummEval respectively, see Table <ref type="table" target="#tab_1">1</ref>). This highlights an important aspect of where the most value can be found in our method. When the LLM output is very short, it is less likely to contain multiple facts that need to be checked for consistency (which can easily be done without the use of a KG), and the intricacies of the short sentence might even be lost in the KG construction phase. On the other hand, when the LLM output is very long, current methods struggle to test each individual fact against the context, and this is when GraphEval thrives.</p><p>It should be noted that even when the results for GraphEval are comparable to the baseline methods, the benefit of using GraphEval is the identification of the specific triple(s) that are inconsistent with the provided context.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.2.">Hallucination correction with GraphCorrect</head><p>Identifying the particular triple(s) likely to harbor a hallucination enables straightforward correction using GraphCorrect, as described in Section 6. For each of the evaluation frameworks proposed here (HHEM + GraphEval, TRUE + GraphEval, and TrueTeacher + GraphEval), we compared GraphCorrect to a basic prompting strategy for hallucination correction, serving as a baseline. The prompt used in this baseline approach, referred to as the Direct Prompt henceforth, is provided in Appendix D.</p><p>For each framework, we initially identify hallucinations, correct only the LLM outputs suspected of containing hallucinations using either GraphCorrect or Direct Prompt, and then reapply the evaluation framework to detect hallucinations in the corrected LLM outputs. Note that this procedure only allows us to measure what we presume to be corrected hallucinations, given the potential for errors in the evaluation frameworks utilized here. We report the </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Average ROUGE-1, ROUGE-2 and ROUGE-L scores measuring similarity between original and corrected summaries using Direct Prompt and GraphCorrect across different datasets and hallucination detection frameworks.</p><p>percentage of believed corrected hallucinations in Table <ref type="table">4</ref>. A score of 0% suggests no corrected hallucinations according to the given framework, while a score of 100% indicates correction of all hallucinations as per the given framework. GraphCorrect outperforms the prompting strategy proposed here by correcting significantly more hallucinations on all tasks apart from two related to the QAGS-X dataset. As in the hallucination detection task, we hypothesise these results are correlated with the average length of the text, with GraphCorrect bringing most value in longer texts with a more complex structure to unravel and correct.</p><p>Additionally, as previously stated, GraphCorrect offers the advantage of only modifying the segments of text in the LLM outputs susceptible to hallucinations, while leaving other sections unaltered, thereby maintaining high overall similarity with the original text. This characteristic is illustrated in Table <ref type="table">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Percentage of believed corrected hallucinations using a direct prompting strategy and GraphCorrect on the SummEval, QAGS-C and QAGS-X benchmarks. The hallucinations were first detected by HHEM + GraphEval, TRUE + GraphEval and TrueTeacher + GraphEval respectively, and then corrections were evaluated by the same metric.</p><p>Table <ref type="table">3</ref> compares the original summaries and the corrected versions for both GraphCorrect and Direct Prompt across all experimental scenarios examined in this study. GraphCorrect systematically generates texts that are closer in similarity to the original LLM outputs compared to its counterpart.</p></div>
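The ROUGE similarity used in Table 3 can be illustrated with a simplified, set-based ROUGE-1 F1 between an original and a corrected summary. This is a sketch for intuition only (standard ROUGE uses clipped n-gram counts, and the paper's scores would come from a standard implementation such as the `rouge-score` package):

```python
# Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall,
# computed over unique tokens (a simplification of count-based ROUGE).

def rouge1_f1(reference, candidate):
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```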
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Discussion</head><p>Our work focuses on the detection of hallucinations in closed-domain tasks, where we are interested only in consistency with respect to the provided context. The GraphEval framework could be extended to open-domain hallucination detection by employing agents, as in AutoKG <ref type="bibr" target="#b24">[25]</ref>, to first retrieve relevant external sources as the grounding information to check against.</p><p>We expect that in the near future, more research will be conducted on the construction of KGs from unstructured text, which will improve the first stage of our procedure and ultimately the evaluation performance. Even as LLMs alone become more powerful, such advances will continue to improve GraphEval's performance.</p><p>We observe that some information loss may occur in the knowledge graph construction phase of our procedure. However, as shown by the results in Section 7.4, our method rarely leads to a reduction in balanced accuracy. Furthermore, even when it is merely comparable to the baseline methods, it adds the explainability of identifying the specific triples where the hallucination has occurred.</p><p>We believe our hallucination correction framework, GraphCorrect, shows promise and presents an interesting avenue for future work. However, the effectiveness of the approach described in this work should be assessed manually, rather than relying on the convoluted use of hallucination evaluation frameworks (which only yield measurements of believed corrected hallucinations).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Conclusion</head><p>We introduce GraphEval, a simple and effective preprocessing step that improves the explainability and performance of LLM hallucination detection metrics. Our method leverages LLMs' ability to extract information from unstructured text and construct knowledge graphs, whose triples can be fed into out-of-the-box hallucination detection methods.</p><p>We demonstrate that GraphEval in conjunction with state-of-the-art NLI models leads to an average improvement in balanced accuracy of 6.2 (SE=1.3) on three popular hallucination benchmarks. Furthermore, our method indicates which triples in the KG representation of the LLM output are inconsistent. To the best of our knowledge, this is the first application of KGs to an LLM-based hallucination evaluation framework, and we believe the success of GraphEval will only grow as KG construction methods improve.</p><p>Finally, we examined the issue of hallucination correction and showed that GraphCorrect can effectively address the majority of hallucinations found in LLM outputs while maintaining extremely high similarity with the original texts.</p></div>
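Balanced accuracy, the metric quoted above, is the mean of per-class recall, which makes it robust to the uneven label ratios of the benchmarks (Table 1). A minimal sketch of the computation for binary consistency labels:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (intended to match the usual definition,
    e.g. scikit-learn's balanced_accuracy_score, for binary labels where
    1 = factually consistent and 0 = hallucinated)."""
    recalls = []
    for cls in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        # Recall for this class: correctly predicted members / all members.
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```

On a benchmark where only a third of examples are consistent (e.g. SummEval's 33.2% label ratio), plain accuracy would reward always predicting "hallucinated"; balanced accuracy does not.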
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. KG Construction Prompt</head><p>("system", """You are an expert at extracting information in structured formats to build a knowledge graph.
Step 1 - Entity detection: Identify all entities in the raw text. Make sure not to miss any out. Entities should be basic and simple, they are akin to Wikipedia nodes.
Step 2 - Coreference resolution: Find all expressions in the text that refer to the same entity. Make sure entities are not duplicated. In particular do not include entities that are more specific versions of themselves, e.g. "a detailed view of jupiter's atmosphere" and "jupiter's atmosphere", only include the most specific version of the entity.
Step 3 - Relation extraction: Identify semantic relationships between the entities you have identified.</p><p>Format: Return the knowledge graph as a list of triples, i.e. ["entity 1", "relation 1-2", "entity 2"].
[...] ["Darius Van Arman", "occupation", "Music executive"], ["Darius Van Arman", "born in", "Pennsylvania"], ["Darius Van Arman", "attended", "Gonzaga College High School"], ["Darius Van Arman", "instance of", "human being"]] &lt;/python&gt;
## Example 4. Input: "Italy had 3.6x times more cases of coronavirus than China." Output: &lt;python&gt; [["Italy", "had 3.6x times more cases of coronavirus than", "China"]] &lt;/python&gt; """,),</p></div>
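The prompt above asks the model to return the triple list wrapped in &lt;python&gt; tags. A small parser for such responses might look like the following; it is an illustrative assumption, not code from the paper.

```python
import ast
import re

def parse_triples(llm_response: str):
    """Extract the [entity1, relation, entity2] triples that the
    KG-construction prompt asks the model to wrap in <python> tags.

    Raises ValueError if no tagged block is present; silently drops
    malformed triples (anything that is not three non-empty strings).
    """
    match = re.search(r"<python>(.*?)</python>", llm_response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <python>...</python> block in response")
    triples = ast.literal_eval(match.group(1).strip())
    return [t for t in triples
            if len(t) == 3 and all(isinstance(s, str) and s for s in t)]
```

Using `ast.literal_eval` rather than `eval` keeps the parse safe even if the model returns unexpected content inside the tags.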
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Hallucination correction (step 1)</head><p>"""You are an expert at extracting information in structured formats from text. The following triple contains factually incorrect information. Correct it based on the provided context.</p><p>Important Tips:
1. A triple is defined as ["entity 1", "relation 1-2", "entity 2"].
2. A triple must only contain three strings! None of the strings should be empty.
3. The concatenated triple must make sense as a sentence.
4. Only return the corrected triple, nothing else.
&lt;triple&gt; {triple} &lt;/triple&gt; &lt;context&gt; {context} &lt;/context&gt;
Remember, it is important that you only return the corrected triple."""</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Hallucination correction (step 2)</head><p>"""In the following context, replace the information of the old triple with the information of the new one. Do not make any other modification to the context. Only return the new context.
&lt;context&gt; {summary} &lt;/context&gt; &lt;old_triple&gt; {old_triple} &lt;/old_triple&gt; &lt;new_triple&gt; {new_triple} &lt;/new_triple&gt;"""</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Hallucination correction without a KG</head><p>"""The following summary contains factually incorrect information. Correct it based on the context, but don't change other parts of the summary. Only return the corrected summary, nothing else.
&lt;summary&gt; {summary} &lt;/summary&gt; &lt;context&gt; {context} &lt;/context&gt;
Remember, do minimal changes to the original summary, don't make it longer and keep as much of it as you can exactly the same."""</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: A visualisation of the GraphEval approach. First, the LLM output is fed into the KG construction prompt to produce the KG depicted on the right. Next, each individual triple in the KG is fed into an out-of-the-box hallucination detection method, such as an NLI model, and compared to the provided context for inconsistencies. Finally, any triples that are flagged as inconsistent are returned to the user, along with the overall hallucination decision.</figDesc><graphic coords="3,89.29,84.19,416.68,209.04" type="bitmap" /></figure>
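The end-to-end flow described in the Figure 1 caption can be sketched as below. Here `extract_triples` and `nli_consistent` are placeholder callables (an LLM-driven KG constructor and an NLI-style consistency checker), not a fixed interface from the paper.

```python
def grapheval(llm_output, context, extract_triples, nli_consistent):
    """Sketch of the GraphEval flow: build KG triples from the LLM output,
    check each triple against the context with an NLI-style model, and
    return the overall decision plus the inconsistent triples."""
    triples = extract_triples(llm_output)  # KG construction via an LLM prompt
    # Flag any triple the checker cannot ground in the provided context.
    flagged = [t for t in triples
               if not nli_consistent(context, " ".join(t))]
    return {"hallucination": bool(flagged), "inconsistent_triples": flagged}
```

Returning the flagged triples, not just a binary verdict, is what gives the framework its triple-level explainability.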
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>Input: "Amanda Jackson was born in Springfield, Ohio, USA on June 1, 1985. She was a basketball player for the U.S. women's team." Output: &lt;python&gt; [["Amanda Jackson", "born in", "Springfield, Ohio, USA"], ["Amanda Jackson", "born on", "June 1, 1985"], ["Amanda Jackson", "occupation", "basketball player"], ["Amanda Jackson", "played for", "U.S. women's basketball team"]] &lt;/python&gt; ## Example 3. Input: "Music executive Darius Van Arman was born in Pennsylvania. He attended Gonzaga College High School and is a human being." Output: &lt;python&gt; [["Darius Van Arman", "</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Benchmark No. of Examples Label Ratio Avg Output len. Avg Context len.</head><label></label><figDesc></figDesc><table><row><cell>SummEval</cell><cell>1,600</cell><cell>33.2%</cell><cell>63</cell><cell>359</cell></row><row><cell>QAGS-C</cell><cell>235</cell><cell>48.1%</cell><cell>49</cell><cell>383</cell></row><row><cell>QAGS-X</cell><cell>239</cell><cell>48.5%</cell><cell>18</cell><cell>318</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Statistics relating to the evaluation benchmarks used. The label ratio is the ratio of factually consistent examples to inconsistent examples. The average output and context length are the average number of words in each.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://huggingface.co</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://huggingface.co/vectara/hallucination_evaluation_model</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://www.anthropic.com/news/claude-2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://aws.amazon.com/bedrock/claude/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
		<ptr target="https://doi.org/10.1145/3571730" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Charniak</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</editor>
		<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">G-eval: NLG evaluation using gpt-4 with better human alignment</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Iter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.emnlp-main.153</idno>
		<ptr target="https://aclanthology.org/2023.emnlp-main.153" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2511" to="2522" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Retrieval-augmented generation for knowledge-intensive nlp tasks</title>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="9459" to="9474" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-F</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Haffari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Pan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.01061</idno>
		<title level="m">Reasoning on graphs: Faithful and interpretable large language model reasoning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Give us the facts: Enhancing large language models with knowledge graphs for fact-aware language modeling</title>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kumarage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alghamdi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.07914</idno>
		<title level="m">Can knowledge graphs reduce hallucinations in llms? : A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">On faithfulness and factuality in abstractive summarization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bohnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Mcdonald</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.173</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.173" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1906" to="1919" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering</title>
		<author>
			<persName><forename type="first">O</forename><surname>Honovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Choshen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Neeman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szpektor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Abend</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.619</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.619" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7856" to="7870" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Bertscore: Evaluating text generation with bert</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=SkeHuCVFDr" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">TRUE: Re-evaluating factual consistency evaluation</title>
		<author>
			<persName><forename type="first">O</forename><surname>Honovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Taitelbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kukliansy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szpektor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hassidim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matias</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.dialdoc-1.19</idno>
		<ptr target="https://aclanthology.org/2022.dialdoc-1.19" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Feng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Wan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Yuan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</editor>
		<meeting>the Second DialDoc Workshop on Documentgrounded Dialogue and Conversational Question Answering, Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="161" to="175" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SummaC: Re-visiting NLI-based models for inconsistency detection in summarization</title>
		<author>
			<persName><forename type="first">P</forename><surname>Laban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schnabel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">N</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hearst</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00453</idno>
		<ptr target="https://aclanthology.org/2022.tacl-1.10" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="163" to="177" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Gekhman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Herzig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Elkind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Szpektor</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.11171</idno>
		<title level="m">Trueteacher: Learning factual consistency evaluation with large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Manakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liusie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08896</idno>
		<title level="m">Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Selfcontradictory hallucinations of large language models: Evaluation, detection and mitigation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mündler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vechev</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=EmQSOi1X2f" />
	</analytic>
	<monogr>
		<title level="m">The Twelfth International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-K</forename><surname>Ng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.04166</idno>
		<title level="m">GPTScore: Evaluate as you desire</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-T</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Koh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.14251</idno>
		<title level="m">FActScore: Fine-grained atomic evaluation of factual precision in long-form text generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Du</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.18802</idno>
		<title level="m">Long-form factuality in large language models</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F R</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dreyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2204.06508</idno>
		<title level="m">FactGraph: Evaluating factuality in summarization with semantic graph representations</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Building knowledge graph using pre-trained language model for learning entity-aware relationships</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pandey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gadia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mishra</surname></persName>
		</author>
		<idno type="DOI">10.1109/GUCON48875.2020.9231227</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Computing, Power and Communication Technologies (GUCON)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="310" to="315" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Grapher: Multi-stage knowledge graph construction using pretrained language models</title>
		<author>
			<persName><forename type="first">I</forename><surname>Melnyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dognin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Das</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=N2CFXG8-pRd" />
	</analytic>
	<monogr>
		<title level="m">NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Collier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Buntine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shareghi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.12392</idno>
		<title level="m">PiVe: Prompting with iterative verification improving graph-based generative capability of LLMs</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.13168</idno>
		<title level="m">LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">SummEval: Re-evaluating summarization evaluation</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Fabbri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kryscinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mccann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Radev</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:220768873" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="391" to="409" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Asking and answering questions to evaluate the factual consistency of summaries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.450</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.450" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5008" to="5020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Teaching machines to read and comprehend</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Hermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kocisky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espeholt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suleyman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blunsom</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Don&apos;t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D18-1206</idno>
		<ptr target="https://aclanthology.org/D18-1206" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Riloff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Chiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Hockenmaier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tsujii</surname></persName>
		</editor>
		<meeting>the 2018 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1797" to="1807" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">DeBERTa: Decoding-enhanced BERT with disentangled attention</title>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=XPZIaotutsD" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">The FEVER2.0 shared task</title>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Cocarascu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Christodoulopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mittal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER)</title>
				<meeting>the Second Workshop on Fact Extraction and VERification (FEVER)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Get your vitamin C! Robust fact verification with contrastive evidence</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Barzilay</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.52</idno>
		<ptr target="https://aclanthology.org/2021.naacl-main.52" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="624" to="643" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">PAWS: Paraphrase Adversaries from Word Scrambling</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Baldridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of NAACL</title>
				<meeting>of NAACL</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v21/20-074.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page" from="1" to="67" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">A large annotated corpus for learning natural language inference</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Angeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D15-1075</idno>
		<ptr target="https://aclanthology.org/D15-1075" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Su</surname></persName>
		</editor>
		<meeting>the 2015 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="632" to="642" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">A broad-coverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1101</idno>
		<ptr target="https://aclanthology.org/N18-1101" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">M</forename><surname>Walker</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Ji</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Stent</surname></persName>
		</editor>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<title level="m" type="main">SciTail: A textual entailment dataset from science question answering</title>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>AAAI</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
