<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv.2309.06794</article-id>
      <title-group>
        <article-title>Cognitive Mirage: A Review of Hallucinations in Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hongbin Ye</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tong Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aijia Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wei Hua</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiqiang Jia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Zhejiang Lab</institution>
          ,
          <addr-line>No. 1 Kechuang Avenue, Yuhang District, Hangzhou City, Zhejiang Province</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <issue>2021</issue>
      <fpage>25</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>As large language models continue to develop in the field of AI, text generation systems are susceptible to a worrisome phenomenon known as hallucination. In this study, we summarize recent compelling insights into hallucinations in LLMs. We present a novel taxonomy of hallucinations from various text generation tasks, thus providing theoretical insights, detection methods and improvement approaches. Based on this, future research directions are proposed. Our contributions are threefold: (1) We provide a complete taxonomy for hallucinations appearing in text generation tasks; (2) We provide theoretical analyses of hallucinations in LLMs and provide existing detection and improvement methods; (3) We propose several research directions that can be developed in the future. Our literature library is available at https://github.com/hongbinye/Cognitive-Mirage-Hallucinations-in-LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Taxonomy of Hallucination</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Hallucination Detection</kwd>
        <kwd>Hallucination Correction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the ever-evolving realm of large language models (LLMs), a constellation of innovative creations
has emerged, such as GPT-3 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], InstructGPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], FLAN [3], PaLM [4], LLaMA [5] and other notable contributors [6, 7, 8, 9]. These models implicitly encode global knowledge within their parameters during the pre-training phase [10, 11], offering valuable insights as knowledge repositories for downstream tasks [12, 13, 14]. Nevertheless, the generalization of knowledge can result in memory distortion, an inherent limitation that may give rise to potential inaccuracies [15]. Moreover, their ability to represent knowledge is constrained by model scale, and they face challenges in addressing long-tailed knowledge problems [16, 17]. The privacy and timeliness of real-world data [18, 19] further exacerbate this problem, making it difficult for models to maintain a comprehensive and up-to-date understanding of the facts. These challenges present a serious obstacle to the reliability of LLMs, which we refer to as hallucination [20]. A prominent example of this drawback is that models typically generate statements that appear reasonable but are either cognitively irrelevant or factually incorrect. In light of this observation, hallucinations remain a critical challenge in medical [21, 22], financial [23] and other knowledge-intensive fields due to their exacting accuracy requirements. In particular, applications for legal case drafting showcase plausible interpretation as an aggregation of diverse subjective perspectives [24].
      </p>
      <p>Definition of Hallucination. As depicted in Figure 1, hallucination refers to the generation of texts or responses that exhibit grammatical correctness, fluency, and authenticity, but deviate from the provided source inputs (faithfulness) or do not align with factual accuracy (factualness) [25]. In traditional NLP tasks [26], hallucinations are often synonymous with faithfulness: conflicting information leads to Intrinsic Hallucination, i.e., LMs conflict with the input information when generating a response; conversely, generating ambiguous supplementary information may lead to Extrinsic Hallucination, i.e., LMs produce personal names, historical events, or technical documents that are challenging to verify. LLMs-oriented hallucinations instead prioritize factualness, focusing on whether the result can be evidenced or negated by reference to external facts in the real world. Uncritical trust in LLMs can give rise to a phenomenon we call Cognitive Mirage, contributing to misguided decision-making and a cascade of unintended consequences [27].</p>
      <sec id="sec-1-2">
        <p>[Figure 1: An illustrative example of hallucination. Given passages stating that Daniel Vacek (born 1 April 1971) is a former tennis player from Czechoslovakia and the Czech Republic who turned professional in 1990, and that Hana Mandlíková (born 19 February 1962) is a former professional tennis player from Czechoslovakia who later obtained Australian citizenship, the question "In which sport did the Czech stars Daniel Vacek and Hana Mandlíková gain professional status?" receives a hallucinated answer, "cricket", versus the factual answer, "tennis".]</p>
        <p>Present work. To effectively control the risk of hallucinations, we summarize recent progress in hallucination theories and solutions in this paper. We organize the relevant work into a comprehensive survey (Figure 2):
• Theoretical insight and mechanism analysis. We provide in-depth theoretical and mechanism analysis from three typical perspectives: data collection, knowledge gap and optimization process, reviewing the recent and relevant theories related to hallucinations (§2).
• Taxonomy of hallucination in LLMs. We conduct a comprehensive review of hallucination in LLMs along a task axis. We review the task-specific benchmarks with a comprehensive comparison and summary (§3).
• Wide coverage of emerging hallucination detection and correction methods. We conduct a comprehensive investigation into the proactive detection (§4) and mitigation (§5) of hallucinations in the era of LLMs. This is critical for studying the most popular techniques and inspiring future research directions (§6).</p>
        <p>Related work. As this topic is relatively nascent, only a few surveys exist. Closest to our work, [25] analyzes hallucinatory content in task-specific research progress, focusing on early works in the natural language generation field. Currently there are significant efforts to address hallucination in LLMs. [28] covers methods for effectively collecting high-quality instructions for LLM alignment, including the use of NLP benchmarks, human annotations, and leveraging strong LLMs. [29] discusses self-correcting methods where the LLM itself is prompted or guided to correct the hallucinations from its own outputs. Although some benchmarks [30, 31, 32] have been constructed to evaluate whether LLMs are able to generate factual responses, these works, scattered among various tasks, have not been systematically reviewed and analyzed. Different from those surveys, in this paper we conduct a literature review on hallucinations in LLMs, hoping to systematically understand the methodologies, compare different methods and inspire new ideas.</p>
        <p>[Figure 2: Overview of this survey on hallucination in LLMs: Definition and Related work; Mechanism Analysis (§2), covering data collection, knowledge gap and optimization process; Taxonomy of Hallucination (§3), covering question and answer, dialog system, summarization system, knowledge graph with LLMs, and cross-modal system; Hallucination Detection (§4), covering inference classifier, uncertainty metric, self-evaluation and evidence retrieval; Hallucination Correction (§5); and Future Directions (§6), covering data construction management, downstream task alignment, reasoning mechanism exploitation, and multi-modal hallucination.]</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Mechanism Analysis</title>
      <p>For the sake of clean exposition, this section provides theoretical insight into mechanism analysis for hallucinations in LLMs. For a regular LLM, the generative objective is modeled by a parameterized probabilistic model with parameters θ, which is sampled to predict the next token in the sentence, thus generating the entire sentence:
P(y_t) = ℱ_θ(ℐ, 𝒟, 𝒱, y_&lt;t)
(1)
where y_t represents the probable tokens at each step that can be selected by beam search from vocabulary 𝒱. Note that the instructions ℐ utilize a variety of predefined templates according to different tasks [33]. Multifarious and high-quality in-context demonstrations 𝒟 are aimed at providing analogy samples to reduce the cost of adapting models to new tasks [34]. Parameters θ implicitly memorize corpus knowledge through diverse architectures ℱ such as decoder-only, encoder-only, or encoder-decoder LLMs. As LLM-based systems can exhibit a variety of hallucinations, we summarise three primary mechanism types for theoretical analysis, each correlated with a distinct training factor.</p>
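As a concrete toy illustration of this autoregressive objective (1), the sketch below greedily selects the most probable next token from a hand-written conditional distribution standing in for ℱ_θ; the vocabulary, probabilities and function names are illustrative assumptions, not from any real model:

```python
# Toy autoregressive "LM": conditional next-token distributions P(y_t | y_<t).
# In a real LLM these would come from the parameterized model, conditioned on
# the instruction, the in-context demonstrations, and the generated prefix.
COND = {
    (): {"The": 0.6, "A": 0.4},
    ("The",): {"cat": 0.5, "dog": 0.3, "[eos]": 0.2},
    ("The", "cat"): {"sat": 0.7, "[eos]": 0.3},
    ("The", "cat", "sat"): {"[eos]": 1.0},
}

def greedy_decode(max_len=10):
    """Pick the argmax token at each step until the end-of-sequence token."""
    out = []
    for _ in range(max_len):
        dist = COND.get(tuple(out), {"[eos]": 1.0})
        tok = max(dist, key=dist.get)
        if tok == "[eos]":
            break
        out.append(tok)
    return out

print(greedy_decode())  # ['The', 'cat', 'sat']
```

Beam search generalizes this by keeping the k highest-probability prefixes at each step instead of only the single argmax.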
      <p>Data Collection. The parameters are implicitly stored within the model as a priori knowledge acquired during the pre-training process. Given the varying quality and range of knowledge within the pre-trained corpus, the information incorporated into the LLMs may be incomplete or outdated. In cases where pertinent memories are unavailable, the LLM's performance may deteriorate, resorting to rudimentary corpus-based heuristics that rely on term frequencies to render judgements [35]. Another bias stems from the capacity for contextual learning [36] when a few demonstrations are introduced as input to the prefix context. Previous research [37, 38] has demonstrated that the acquisition of knowledge through model learning demonstrations depends on disparities in label categories and the order of demonstration samples. Likewise, multilingual LLMs encounter challenges related to hallucinations, particularly in handling language pairs with limited resources or non-English translations [39]. Furthermore, cutting-edge Large Vision-Language Models (LVLMs) exhibit instances of hallucinating common objects within visual instructional datasets and are prone to objects that frequently co-occur in the same image [40, 41].</p>
      <p>
        [Figure 3 (partial): Taxonomy of hallucination detection methods. Inference Classifier: FIB [
        <xref ref-type="bibr" rid="ref21">62</xref>
        ], ExHalder [
        <xref ref-type="bibr" rid="ref23">64</xref>
        ], HaluEval [31], GAVIE [
        <xref ref-type="bibr" rid="ref26">67</xref>
        ], Fact-checking [
        <xref ref-type="bibr" rid="ref27">68</xref>
        ], CoNLI [
        <xref ref-type="bibr" rid="ref28">69</xref>
        ]; Uncertainty Metric: BARTScore [
        <xref ref-type="bibr" rid="ref29">70</xref>
        ], KoK [
        <xref ref-type="bibr" rid="ref30">71</xref>
        ], SLAG [
        <xref ref-type="bibr" rid="ref31">72</xref>
        ], KLD [
        <xref ref-type="bibr" rid="ref32">73</xref>
        ], POLAR [
        <xref ref-type="bibr" rid="ref33">74</xref>
        ], ASTSN [
        <xref ref-type="bibr" rid="ref34">75</xref>
        ].]
      </p>
      <p>
        Knowledge Gap. Knowledge gaps are typically attributed to differences in input format between the pre-training and fine-tuning stages [42]. Even when considering the automatic updating of textual knowledge bases, the output can deviate from the expected corrections [43]. For example, questions often do not align effectively with stored knowledge, and the available information remains unknown until the questions are presented. This knowledge gap poses thorny challenges in balancing memory with retrieved evidence, which is construed as a passive defense mechanism against the misuse of retrieval [
        <xref ref-type="bibr" rid="ref3">44</xref>
        ]. To delve into this issue, [
        <xref ref-type="bibr" rid="ref4">45</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">46</xref>
        ] propose that disregarding retrieved evidence introduces biased model knowledge, while mis-covering and over-thinking disrupt model behavior. Furthermore, in scenarios where a cache component is utilized to offer historical memory during training [
        <xref ref-type="bibr" rid="ref6">47</xref>
        ], the model also experiences inconsistency between the present hidden state and the hidden state stored in the cache.
      </p>
      <p>
        Optimization Process. Maximum likelihood estimation and teacher-forcing training have the potential to result in a phenomenon known as stochastic parroting [
        <xref ref-type="bibr" rid="ref7">48</xref>
        ], wherein the model is prompted to imitate the training data without comprehension [
        <xref ref-type="bibr" rid="ref8">49</xref>
        ]. Specifically, exposure bias between the training and testing stages has been demonstrated to lead to hallucinations within LLMs, particularly when generating lengthy responses [
        <xref ref-type="bibr" rid="ref9">50</xref>
        ]. Besides, sampling techniques characterized by high uncertainty [
        <xref ref-type="bibr" rid="ref10">51</xref>
        ], such as top-p and top-k, exacerbate the issue of hallucination. Furthermore, [27] observes that LLMs tend to produce snowballing hallucinations to maintain coherence with earlier hallucinations, and even when directed with prompts such as "Let’s think step by step", they still generate ineffectual chains of reasoning [13].
      </p>
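To make the role of top-k and top-p (nucleus) truncation concrete, here is a minimal sketch over a toy next-token distribution; the distribution, its values and the function names are illustrative assumptions:

```python
def top_k_filter(dist, k):
    """Keep only the k most probable tokens and renormalize."""
    top = sorted(dist.items(), key=lambda kv: -kv[1])[:k]
    z = sum(p for _, p in top)
    return {t: p / z for t, p in top}

def top_p_filter(dist, p):
    """Nucleus sampling: keep the smallest prefix of tokens whose
    cumulative probability reaches p, then renormalize."""
    items = sorted(dist.items(), key=lambda kv: -kv[1])
    kept, mass = [], 0.0
    for tok, pr in items:
        kept.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    z = sum(pr for _, pr in kept)
    return {t: pr / z for t, pr in kept}

dist = {"tennis": 0.5, "cricket": 0.25, "golf": 0.15, "chess": 0.1}
print(top_k_filter(dist, 2))   # tennis and cricket survive, renormalized to 2/3 and 1/3
print(top_p_filter(dist, 0.6))  # same two tokens survive for this distribution
```

With a flatter (higher-uncertainty) distribution, both filters retain more low-probability tokens, which is one way such sampling settings can surface hallucinated continuations.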
    </sec>
    <sec id="sec-3">
      <title>3. Taxonomy of Hallucination</title>
      <p>
        In this paper, we mainly consider representative hallucinations that are widely observed in various downstream tasks, i.e., Machine Translation, Question and Answer, Dialog System, Summarization System, Knowledge Graph with LLMs, and Visual Question Answer. As shown in Table 1, these hallucinations form a complex taxonomy across numerous mainstream tasks associated with LLMs. In the following sections, we introduce the representative types of hallucinations to be resolved.
∙ Machine Translation. Since perturbations (e.g., spelling or capitalization errors) can reliably induce hallucinations, traditional machine translation models tend to validate instances memorised by the model when subjected to perturbations [
        <xref ref-type="bibr" rid="ref46 ref47">87, 88</xref>
        ]. It is worth noting that hallucinations generated by LLMs are mainly off-target translation, over-generation, or failed translation attempts [39]. In low-resource language settings, most models exhibit subpar performance due to the lack of annotated data [
        <xref ref-type="bibr" rid="ref13">54</xref>
        ]. In contrast, they are vulnerable to an increased number of pre-trained languages in multilingual settings [
        <xref ref-type="bibr" rid="ref48">89</xref>
        ]. Subsequently, familial LLMs trained on different scales of monolingual data are shown to be viscous [39], as the source of oscillatory hallucination pathology.
      </p>
      <p>[Table 1: Comparison of hallucination studies across tasks, listing for each work its architecture (encoder-decoder, decoder-only, encoder-only, or multiple adapters), its benchmark datasets (e.g., IWSLT-2014, WMT2018, FLORES-200, Jigsaw, Wikipedia, XQuAD, TyDi, XNLI, XL-Sum, MASSIVE, TruthfulQA, HotpotQA, BoolQ, MEDMCQA, Headqa, MILE, Pubmed, US Medqa, Med-HALT, WoW, DOG, BEGIN, MENT, NHNet, Encyclopedic, TekGen, WebNLG, Text2KGBench, NQ, TopiOCQA), the hallucination types studied (e.g., under-perturbation and natural hallucination; oscillatory and largely fluent hallucination; full, partial and word-level hallucination; source language hallucination; imitative falsehoods; reasoning and memory-based hallucination; intrinsic and extrinsic hallucination; factual and non-factual hallucination; news headline hallucination; knowledge hallucination; subject, relation and object hallucination; caption hallucination), and the evaluation setting (e.g., source perturbation, manual analysis of responses, retrieval-augmented QA evaluation, knowledge-grounded interaction benchmarks, cross-lingual transfer, ontology-driven KGC benchmarks, caption hallucination assessment).]</p>
      <p>
        ∙ Question and Answer. Imperfect responses suffer from flawed external knowledge, knowledge recall cues and reasoning instruction [42]. For example, LLMs are mostly unable to avoid answering when provided with no relevant information, instead providing incomplete yet plausible answers [
        <xref ref-type="bibr" rid="ref15">56</xref>
        ]. In addition to external knowledge, memorized information without an accurate, reliable and accessible source also contributes to different types of hallucinations [22]. Though scaling laws suggest that perplexity on the training distribution is positively correlated with parameter size, [30] further discovers that scaling up models can increase the rate of imitative falsehoods.
∙ Dialog System. Some studies view dialogue models as unobtrusive imitators, which simulate the distributional properties of data instead of generating faithful output. For example, uncooperative responses [
        <xref ref-type="bibr" rid="ref16">57</xref>
        ] originating from discourse phenomena incline the model to output an exact copy of the entire evidence. [
        <xref ref-type="bibr" rid="ref17">58</xref>
        ] reports more nuanced hallucinations in KG-grounded dialogue systems as analyzed through human feedback. Similarly, FaithDial [
        <xref ref-type="bibr" rid="ref18">59</xref>
        ], BEGIN [
        <xref ref-type="bibr" rid="ref19">60</xref>
        ] and MixCL [
        <xref ref-type="bibr" rid="ref20">61</xref>
        ] all implement experiments on the WoW dataset to conduct a meta-evaluation of hallucination in knowledge-grounded dialogue.
∙ Summarization System. Automatically generated abstracts based on LLMs may be fluent, but they still typically lack faithfulness to the source document. Following the human evaluation of traditional summarization models [26], the hallucinations in summaries generated by LLMs can be categorized into two major types: intrinsic hallucinations, which distort the information present in the document, and extrinsic hallucinations, which provide additional information that cannot be directly attributed to the document [
        <xref ref-type="bibr" rid="ref24">65</xref>
        ]. Note that extrinsic hallucination, as a metric of factually consistent continuation of inputs in LLMs, is given more attention in summarisation systems [
        <xref ref-type="bibr" rid="ref21 ref23">62, 64</xref>
        ]. Furthermore, [
        <xref ref-type="bibr" rid="ref22">63</xref>
        ] subdivides extrinsic hallucinations into factual and non-factual hallucinations; the former provides additional world knowledge, which may benefit comprehensive understanding.
∙ Knowledge Graph with LLMs. Despite the promising progress in knowledge-based text generation, it encounters intrinsic hallucinations inherent to the process, where the generated text not only covers the input information but also incorporates redundant details derived from the model's internal memorized knowledge [
        <xref ref-type="bibr" rid="ref49">90</xref>
        ]. To address this, [
        <xref ref-type="bibr" rid="ref25">66</xref>
        ] establishes a distinction between correctly generated knowledge and knowledge hallucinations in terms of knowledge creation. Notably, Virtual Knowledge Extraction [
        <xref ref-type="bibr" rid="ref50">91</xref>
        ] underscores the potential generalization capabilities of LLMs in constructing and inferring from knowledge graphs. [32] further empowers LLMs to produce interpretable fact-checks through a neural symbolic approach; based on their fidelity to the source, hallucinations are defined as subject hallucination, relation hallucination, and object hallucination.
∙ Cross-modal System. Augmented by the superior language capabilities of LLMs, the performance of cross-modal tasks achieves promising progress [
        <xref ref-type="bibr" rid="ref51">92, 40</xref>
        ]. However, despite replacing the original language encoder with LLMs, Large Visual Language Models (LVLMs) [
        <xref ref-type="bibr" rid="ref52">93</xref>
        ] still generate object descriptions that are not present in the target image, denoted as object hallucinations [41]. In particular, various failure cases can typically be found in Visual Question Answering [
        <xref ref-type="bibr" rid="ref26">41, 67</xref>
        ], Image Captioning [
        <xref ref-type="bibr" rid="ref53 ref54 ref55">94, 95, 96</xref>
        ], Report Generation [
        <xref ref-type="bibr" rid="ref27">68</xref>
        ], etc.
      </p>
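The intrinsic/extrinsic (and factual/non-factual) distinctions recurring across these tasks can be summarized in a small sketch; the boolean signals are assumed to come from an upstream grounding checker, which this sketch does not implement:

```python
def classify_summary_fact(in_document, contradicts_document, in_world_knowledge):
    """Toy classification following the taxonomy above: intrinsic
    hallucinations distort the document; extrinsic ones add information
    absent from it (factual if supported by world knowledge)."""
    if contradicts_document:
        return "intrinsic hallucination"
    if in_document:
        return "faithful"
    return "factual hallucination" if in_world_knowledge else "non-factual hallucination"

print(classify_summary_fact(True, False, True))    # faithful
print(classify_summary_fact(False, True, False))   # intrinsic hallucination
print(classify_summary_fact(False, False, True))   # factual hallucination
print(classify_summary_fact(False, False, False))  # non-factual hallucination
```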
    </sec>
    <sec id="sec-4">
      <title>4. Hallucination Detection</title>
      <p>
        Conventional hallucination detection mainly depends on task-specific metrics, such as ROUGE and BLEU, which evaluate the information overlap between source and target texts in summarization tasks [
        <xref ref-type="bibr" rid="ref56">97</xref>
        ], or knowledge F1, which estimates the knowledge-aware ability of response generation [
        <xref ref-type="bibr" rid="ref57">98</xref>
        ]. These metrics focus on measuring faithfulness to references and fail to provide an assessment of factualness. Although some reference-free works have been proposed, plugin-based methods [
        <xref ref-type="bibr" rid="ref58">99</xref>
        ] suffer from world-knowledge limitations, QA-based matching metrics [
        <xref ref-type="bibr" rid="ref59">100</xref>
        ] lack knowledge completeness of the source information, and NLI-based methods [
        <xref ref-type="bibr" rid="ref19">60</xref>
        ] are unable to support finer-grained hallucination checking because they operate at the sentence level; besides, entailment and hallucination are not equivalent problems. Given the paradigm shift in hallucination detection arising from the rapid development of LLMs, we present a novel taxonomy in Fig 3 and introduce each category in the following sections.
∙ Inference Classifier. The most straightforward strategy involves adopting classifiers to assess the likelihood of hallucinations. Concretely, given a question q and an answer a, an inferential classifier ℱ can be asked to determine whether the answer contains hallucinatory content ℋ by computing P(ℋ) = ℱ(q, a). Accordingly, [
        <xref ref-type="bibr" rid="ref23">64</xref>
        ] employs state-of-the-art LLMs to perform end-to-end text generation of detection results. Other studies [31] find that adding chains of thought indiscriminately may interfere with the final judgement, whereas retrieving knowledge properly results in gains. Furthering this concept, the hinted classifier and explainer [
        <xref ref-type="bibr" rid="ref23">64</xref>
        ], used to generate intermediate process labels and high-quality natural language explanations, are demonstrated to enhance the final predicted class from a variety of perspectives. Subsequently, [
        <xref ref-type="bibr" rid="ref21">62</xref>
        ] suggests adopting a classifier model different from the generation model, contributing to easier judgement of factual consistency. For radiology report generation, binary classifiers [
        <xref ref-type="bibr" rid="ref27">68</xref>
        ] can be leveraged to measure reliability by combining image and text embeddings. Unlike previous work that employs complex human-crafted rules to evaluate object hallucinations, GAVIE [
        <xref ref-type="bibr" rid="ref26">67</xref>
        ] scores responses against image content based on both accuracy and relevance criteria, evaluating LMM outputs in an open-ended manner.
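In this inference-classifier setting, P(ℋ) = ℱ(q, a) can be approximated by prompting an LLM judge. The sketch below is illustrative only: the prompt wording, the hypothetical `llm` callable and the trivial stub are all assumptions, not the interface of any work cited above:

```python
def detect_hallucination(question, answer, llm):
    """Inference-classifier detection: ask an LLM judge whether the answer
    contains hallucinated content (a rough estimate of P(H) given q and a).
    `llm` is any callable mapping a prompt string to a response string."""
    prompt = (
        "Does the answer contain hallucinated (unsupported or false) content?\n"
        "Question: " + question + "\nAnswer: " + answer + "\nReply YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")

# Trivial stub standing in for a real judge model, for illustration only.
def stub_llm(prompt):
    return "YES" if "cricket" in prompt else "NO"

print(detect_hallucination(
    "In which sport did Daniel Vacek turn professional?",
    "He turned professional in cricket.",
    stub_llm))  # True
```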
∙ Uncertainty Metric. It is important to examine the correlation between the hallucination metric and the quality of output from a variety of perspectives. One intuitive approach is to employ the probabilistic output of the model itself: ASTSN [
        <xref ref-type="bibr" rid="ref34">75</xref>
        ] calculates the model's uncertainty about the identified concepts by utilising the logit output values. Similarly, BARTSCORE [
        <xref ref-type="bibr" rid="ref29">70</xref>
        ] employs the universal notion that models trained to convert generated text to reference output or source text will score higher when the generated text is superior; it is an unsupervised metric that supports the addition of appropriate prompts to improve the measure design, without requiring human judgements for training. Furthermore, KoK [
        <xref ref-type="bibr" rid="ref30">71</xref>
        ], building on the work of [
        <xref ref-type="bibr" rid="ref60">101</xref>
        ], evaluates answer uncertainty along three categories, i.e., subjectivity, hedges and text uncertainty, while SLAG [
        <xref ref-type="bibr" rid="ref31">72</xref>
        ] measures consistency of factual beliefs in terms of paraphrase, logic, and entailment. In addition, KLD [
        <xref ref-type="bibr" rid="ref32">73</xref>
        ] combines information-theoretic metrics (e.g., entropy and KL-divergence) to capture knowledge uncertainty. Beyond expert-stipulated programmatic supervision, POLAR [
        <xref ref-type="bibr" rid="ref33">74</xref>
        ] introduces a Pareto optimal learning assessed risk score for estimating the confidence level of a response.
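As a minimal illustration of the information-theoretic side of these metrics (in the spirit of the entropy signal used by KLD-style approaches, though not any paper's exact formulation), higher next-token entropy indicates greater model uncertainty:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a next-token distribution; peaked
    distributions score low, uncertain ones score high."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

confident = {"tennis": 0.97, "cricket": 0.03}
uncertain = {"tennis": 0.40, "cricket": 0.35, "golf": 0.25}
print(entropy(uncertain) > entropy(confident))  # True
```

A detector built on this signal would flag spans whose token-level entropy exceeds some calibrated threshold as hallucination-prone.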
∙ Self-Evaluation. Self-evaluation is challenging since the model might be overconfident that its generated samples are correct. The motivating idea of SelfCheckGPT [
        <xref ref-type="bibr" rid="ref36">77</xref>
        ] is to use the ability of the LLMs themselves to sample multiple responses and identify fictitious statements by measuring the consistency of information among responses. [
        <xref ref-type="bibr" rid="ref35">76</xref>
        ] further illustrates that both an increase in model size and the demonstration of assessment can improve self-assessment. Beyond repeated direct queries, [
        <xref ref-type="bibr" rid="ref37">78</xref>
        ] uses open-ended indirect queries and compares their answers to each other for an agreed-upon score outcome. SelfCk [
        <xref ref-type="bibr" rid="ref40">81</xref>
        ] imposes appropriate constraints on the same LLM to generate pairs of sentences triggering self-contradictions, which prompts the detection. In contrast, polling-based querying [41] reduces the complexity of judgement by randomly sampling query objects. Besides, Self-Checker [
        <xref ref-type="bibr" rid="ref38">79</xref>
        ] decomposes complex statements into multiple simple statements, fact-checking them one by one, while [
        <xref ref-type="bibr" rid="ref39">80</xref>
        ] introduces two LLMs to drive the complex fact-checking reasoning process by cross-checking.
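The sampling-consistency idea behind SelfCheckGPT can be caricatured in a few lines: sample several answers and measure agreement with the majority. This is an illustrative simplification; the actual method compares statements with measures such as NLI or n-gram overlap rather than exact string match:

```python
from collections import Counter

def majority_agreement(responses):
    """Return the majority answer and the fraction of samples agreeing
    with it; low agreement suggests a hallucinated, unstable claim."""
    answer, freq = Counter(responses).most_common(1)[0]
    return answer, freq / len(responses)

samples = ["tennis", "tennis", "cricket", "tennis", "tennis"]
print(majority_agreement(samples))  # ('tennis', 0.8)
```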
∙ Evidence Retrieval. Evidence retrieval accomplishes factual detection by retrieving supporting evidence related to hallucinations. To this end, designing a claim-centric pipeline allows a question-retrieve-summarize chain to effectively collect original evidence [
        <xref ref-type="bibr" rid="ref43 ref44">84, 85</xref>
        ]. Consequently, FActScore [
        <xref ref-type="bibr" rid="ref42">83</xref>
        ] calculates the percentage of atomic facts supported by the given knowledge source. To adapt to tasks where users interact with generative models, FacTool [
        <xref ref-type="bibr" rid="ref45">86</xref>
        ] proposes integrating a variety of tools into a task-agnostic and domain-agnostic detection framework, in order to assemble evidence about the authenticity of the generated content.
      </p>
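The headline number in a FActScore-style evaluation (the fraction of atomic facts supported by a knowledge source) reduces to a short computation; the `supports` predicate here is a hypothetical stand-in for the retrieval-and-verification step, and the facts are illustrative:

```python
def fact_score(atomic_facts, supports):
    """Fraction of atomic facts judged supported by the knowledge source."""
    if not atomic_facts:
        return 1.0
    return sum(1 for f in atomic_facts if supports(f)) / len(atomic_facts)

knowledge = {"Vacek is a tennis player", "Vacek was born in 1971"}
facts = ["Vacek is a tennis player", "Vacek was born in 1971", "Vacek is a cricketer"]
print(fact_score(facts, lambda f: f in knowledge))  # 2 of 3 supported
```

The hard part in practice is upstream of this arithmetic: splitting generations into atomic facts and verifying each against retrieved evidence.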
    </sec>
    <sec id="sec-5">
      <title>5. Hallucination Correction</title>
      <p>In this section, we delve into methods to correct hallucination in terms of different aspects. As shown in Figure 4, these hallucination correction paradigms have demonstrated strong dominance in many mainstream NLP tasks. Note that these methods are not entirely orthogonal but can complement each other as required by the tasks in practical applications. In the following sections, we introduce each method as shown in Figure 5.</p>
      <p>
        [Figure 5 (partial): Taxonomy of hallucination correction methods, including, for Question and Answer and Summarization System: LSHF [131], TLM [132], BRIO [133], LM-know [
        <xref ref-type="bibr" rid="ref35">76</xref>
        ], Chain-of-Hindsight [134], ZEROFEC [43], CRITIC [135], VIVID [
        <xref ref-type="bibr" rid="ref55">96</xref>
        ], LMH-Snowball [27], MixAlign [
        <xref ref-type="bibr" rid="ref4">45</xref>
        ], REFEED [15], PaD [136], ALCE [137], Do-LLM-Know [
        <xref ref-type="bibr" rid="ref37">78</xref>
        ], CRL [138], SR [139]; and multi-agent methods: HLMTM [39], Multiagent-Debate [140], MAD [141], FORD [142], LM-vs-LM [
        <xref ref-type="bibr" rid="ref39">80</xref>
        ], PRD [143], SPP [144].]
      </p>
      <p>
∙ Parameter Adaptation. Parameters in LLMs store biases learned in pre-training, are often unaligned
with user intent. A cutting-edge strategy is to guide efective knowledge through parameter conditioning,
editing, and optimisation. For example, CLR [
        <xref ref-type="bibr" rid="ref20">61</xref>
] optimises to reduce the generation probability of
negative samples at the span level using contrastive learning. When introduced contextual
knowledge contradicts the model’s intrinsic prior knowledge, TYE [105] effectively
reduces the weight of prior knowledge through a context-aware decoding method. Besides, PURR [104]
injects noise into the text, fine-tunes compact editors, and denoises by merging relevant evidence. After
introducing an additional cache component, HISTALIGN [
        <xref ref-type="bibr" rid="ref6">47</xref>
] discovers that the cache’s hidden state is not aligned
with the current hidden state, and proposes contrastive learning over sequence information to improve
the reliability of memory parameters. Further, Edit-TA [
        <xref ref-type="bibr" rid="ref61">102</xref>
] mitigates the biases learnt in
pre-training from a task-arithmetic perspective. The intuition behind it is that parameter variations learnt
through negative-example tasks can be perceived through weight differences. However, as this fails to
take the importance of different negative examples into account, EWR [103] proposes Fisher
information matrices to measure the uncertainty of the estimates, which is applied in dialogue
systems to execute a parameter interpolation and remove hallucination. EasyEdit [109] summarises
methods for parameter editing while minimising the influence on irrelevant parameters.
      </p>
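The Fisher-weighted interpolation idea behind EWR can be sketched coordinate-wise: parameters with higher estimated Fisher information (lower uncertainty) dominate the merge. This is an illustrative toy over plain lists, not the authors' implementation:

```python
# Minimal sketch (not the EWR implementation) of Fisher-weighted parameter
# interpolation: each coordinate is merged in proportion to its estimated
# Fisher information, so uncertain (hallucination-prone) weights are damped.

def fisher_interpolate(theta_a, theta_b, fisher_a, fisher_b):
    """Merge two parameter vectors, weighting each coordinate by Fisher info."""
    merged = []
    for ta, tb, fa, fb in zip(theta_a, theta_b, fisher_a, fisher_b):
        total = fa + fb
        if total > 0:
            merged.append((fa * ta + fb * tb) / total)
        else:
            merged.append(0.5 * (ta + tb))  # no information: plain average
    return merged

# First coordinate leans toward model A, whose Fisher mass is larger there.
theta = fisher_interpolate([1.0, 0.0], [0.0, 1.0], [3.0, 1.0], [1.0, 1.0])
```

Here `theta` comes out as `[0.75, 0.5]`: the first coordinate keeps three quarters of model A's value.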
      <p>
An efficient alternative is to identify task-specific parameters and exploit them. For example,
ALLM [106] aligns the parameter module with task-specific knowledge, and then generates the relevant
knowledge as additional context in background-augmented prompts. Similarly, mmT5 [
        <xref ref-type="bibr" rid="ref14">55</xref>
] utilises
language-specific modules during pre-training to separate language-specific information from
language-independent information, demonstrating that adding language-specific modules can alleviate the curse
of multilinguality. In contrast, TRAC [107] combines conformal prediction and global testing to augment
retrieval-based QA. Its conservative strategy ensures that an answer semantically equivalent
to the truthful answer is included in the prediction set.
      </p>
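The conservative prediction-set construction that TRAC builds on can be illustrated with standard split conformal prediction: keep every candidate answer whose nonconformity score falls below a quantile calibrated on held-out data. The scores and candidates below are purely illustrative, not TRAC's actual pipeline:

```python
import math

# Sketch of split conformal prediction: include every candidate whose
# nonconformity score is at most the (1 - alpha) finite-sample quantile of
# the calibration scores, so the truthful answer is covered with
# probability >= 1 - alpha under exchangeability.

def conformal_set(candidates, scores, calibration_scores, alpha=0.1):
    """Keep candidates whose score <= the calibrated threshold."""
    n = len(calibration_scores)
    # conservative finite-sample quantile index
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    threshold = sorted(calibration_scores)[k]
    return [c for c, s in zip(candidates, scores) if s <= threshold]

calib = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
answers = conformal_set(["a", "b", "c"], [0.5, 0.85, 0.3], calib, alpha=0.2)
```

With this calibration the threshold is 0.8, so candidates "a" and "c" survive while "b" is excluded; smaller `alpha` yields larger, more conservative sets.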
      <p>
        Another parameter adaptation idea focuses on flexible sampling consistent with user requirements.
For instance, [
        <xref ref-type="bibr" rid="ref10">51</xref>
] observes that the randomness of sampling is more detrimental to factuality when
generating the latter part of a sentence. The factual-nucleus sampling algorithm is introduced to keep the
faithfulness of the generation while ensuring quality and diversity. Besides, Inference-Time [108]
first identifies a set of attention heads with high linear probing accuracy, and then shifts activations
during inference along the direction associated with factual knowledge.
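The decaying-nucleus schedule in factual-nucleus sampling can be sketched directly from its description: the nucleus threshold shrinks as the sentence progresses (where randomness hurts factuality most), bounded below by a floor and reset at each sentence boundary. Parameter values here are illustrative defaults, not the paper's tuned settings:

```python
# Sketch of the factual-nucleus threshold schedule: p decays geometrically
# with the token's position within the current sentence, never dropping
# below the floor omega. Resetting `step` to 0 at each new sentence restores
# diversity at sentence starts.

def factual_nucleus_p(step, p=0.9, decay=0.9, omega=0.3):
    """Nucleus threshold for the `step`-th token within a sentence."""
    return max(omega, p * decay ** step)
```

The first token of a sentence samples from the full top-p nucleus (0.9 here), while late tokens are clamped at the floor of 0.3, making them close to greedy.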
∙ Post-hoc Attribution and Edit Technology. One source of hallucination is that LLMs may leverage
patterns observed in the pre-training data for inference in a novel form. Recently, ORCA [112] reveals
problematic patterns in model behaviour by probing supporting data evidence from pre-training
data. Similarly, TRAK [114] and Data-Portraits [115] analyse whether models plagiarise or reference
existing resources by means of data attribution. QUIP [118] further demonstrates that providing text
that has been observed in the pre-training phase can improve the ability of LLMs to generate more
factual information. Furthermore, motivated by the gap between LLMs and human modes of thinking,
one intuition is to align the two modes of reasoning. Thus CoT [14] elicits faithful reasoning via a kind
of Chain-of-Thought (CoT) [13] prompt. Similarly, RR [113] retrieves relevant external knowledge
based on decomposed reasoning steps obtained from a CoT prompt. Since LLMs do not produce their
best output on the first attempt, Self-Refine [116] implements self-refinement algorithms through
iterative feedback and improvement. Reflexion [117] also employs verbal reinforcement to generate
reflective feedback by learning from prior failings. Verify-and-Edit [119] proposes a CoT-prompted
verify-and-edit framework, which improves the fidelity of predictions by post-editing the inference
chain based on externally retrieved knowledge. CoVe [120] emphasises the importance of independent
self-verification to prevent being influenced by other responses. Another source of hallucinations is
describing factual content with incorrect retrievals. To rectify this, NP-Hunter [111] follows a
generate-then-refine strategy whereby a generated response is amended using the KG, so that the dialogue system
is able to correct potential hallucinations by querying the KG.
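The generate-feedback-refine loop shared by Self-Refine and related methods can be sketched as a short control flow, with plain callables standing in for the LLM generator, critic, and refiner:

```python
# Toy sketch of an iterative self-refinement loop (in the spirit of
# Self-Refine): generate an output, ask a critic for feedback, and refine
# until the critic finds nothing to fix or the budget runs out.
# `generate`, `feedback`, and `refine` are stand-ins for LLM calls.

def self_refine(task, generate, feedback, refine, max_iters=3):
    output = generate(task)
    for _ in range(max_iters):
        fb = feedback(task, output)
        if fb is None:  # critic found no remaining issues
            break
        output = refine(task, output, fb)
    return output

# Stand-in "model" that first answers wrongly, then accepts the correction.
result = self_refine(
    "2 + 2",
    generate=lambda task: "5",
    feedback=lambda task, out: "incorrect sum" if out != "4" else None,
    refine=lambda task, out, fb: "4",
)
```

After one refinement step the critic is satisfied and the loop stops early with the corrected answer.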
∙ Leverage External Knowledge. As an attempt to extend language models for hallucination
mitigation, one suggestion is to retrieve relevant documents from large textual databases. RETRO [121]
splits the input sequence into chunks and retrieves similar documents, while In-Context RALM [
        <xref ref-type="bibr" rid="ref21">62</xref>
]
places the selected document before the input text to improve the prediction. Furthermore, IRCoT [122]
interweaves CoT generation and document retrieval steps to guide LLMs. LLM-AUGMENTER [123] also
grounds the responses of LLMs in integrated external knowledge and automated feedback to improve
the truthfulness score of the answers. Another work, FLARE [127], iteratively anticipates the content of
upcoming sentences, and then applies them as queries to retrieve relevant documents in order
to re-generate sentences when they contain low-confidence tokens. Similarly, RETA-LLM [129] creates
a complete pipeline to assist users in building their own domain-based LLM retrieval systems. Note that
in addition to document retrieval, diverse external knowledge queries could be assembled into
retrieval-augmented LLM systems. For example, CoK [126] leverages structured knowledge bases to support
complex queries and provide more straightforward factual statements. Further, KnowledGPT [130]
adopts program-of-thoughts (PoT) prompting, which generates code to interact with knowledge bases,
while cTBL [125] proposes to enhance LLMs with tabular data in conversational settings. Besides,
GeneGPT [124] demonstrates that expertise can be accessed more easily and accurately by detecting and
executing API calls through in-context learning and augmented decoding algorithms. To support
potentially millions of ever-changing APIs, Gorilla [128] explores self-instruct fine-tuning and retrieval
for efficient API exploitation.
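The prepend-the-document strategy of In-Context RALM is simple enough to sketch end-to-end. The word-overlap retriever below is a deliberately naive stand-in for the real dense or BM25 retriever:

```python
# Sketch of the In-Context RALM idea: retrieve the document most similar to
# the query and simply prepend it to the prompt, so a frozen LM conditions
# on retrieved evidence. Similarity here is naive word overlap, purely for
# illustration of the pipeline shape.

def retrieve(query, docs):
    """Return the document with the largest word overlap with the query."""
    query_words = set(query.lower().split())
    return max(docs, key=lambda doc: len(query_words & set(doc.lower().split())))

def ralm_prompt(query, docs):
    """Place the selected document before the input text."""
    return retrieve(query, docs) + "\n\n" + query

docs = ["The Eiffel Tower is in Paris.", "Mount Fuji is in Japan."]
prompt = ralm_prompt("Where is the Eiffel Tower located", docs)
```

The resulting prompt leads with the Eiffel Tower document, letting the language model ground its answer in the retrieved evidence without any parameter updates.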
∙ Assessment Feedback. As language models become more sophisticated, evaluation feedback can
significantly improve the quality of generated text and reduce the appearance of hallucinations. To
realise this concept, LSHF [131], TLM [132] and Chain-of-Hindsight [134] predict human preferences
through reinforcement learning and employ these as the reward function. In addition to enabling the
model to learn directly from the feedback of factual metrics in a sample-efficient manner [138], it
is also important to build a self-evaluation function into the model to filter candidate generated
texts. For example, BRIO [133] empowers summarization model assessment, estimating probability
distributions over candidate outputs to rate the quality of candidate summaries. Meanwhile, LM-know [
        <xref ref-type="bibr" rid="ref35">76</xref>
]
is devoted to investigating whether LLMs can evaluate the validity of their own claims by estimating
the probability that they know the answer to a question. In turn, Do-LLM-Know [
        <xref ref-type="bibr" rid="ref37">78</xref>
] queries
exclusively with black-box LLMs, repeatedly generating responses to the same query and
comparing them against each other as a consistency check. As missing citation-quality evaluation affects
the final performance, ALCE [137] employs a natural language inference model to measure citation
quality and extends the integrated retrieval system. Similarly, CRITIC [135] proposes to interact with
appropriate tools to assess certain aspects of the text, and then to modify the output based on the
feedback obtained during the verification process. Note that automated error checking can also utilise
LLMs to generate text that conforms to tool interfaces. PaD [136] distills LLMs with a synthetic
inference procedure, and the resulting synthetic program can be automatically compiled and executed
by an interpreter. Further, iterative refinement processes are validated to effectively identify appropriate
details [
        <xref ref-type="bibr" rid="ref4 ref55">96, 45, 15</xref>
], and can stop invalid reasoning chains early, beneficially reducing the phenomenon
of hallucination snowballing [27].
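The repeated-query consistency check used by black-box methods in this family can be sketched as majority agreement over sampled answers; low agreement flags a likely hallucination. The sampler below is a deterministic stand-in for repeated LLM calls:

```python
from collections import Counter

# Sketch of a black-box consistency check: sample the same model several
# times and treat agreement among answers as a confidence signal. This is
# an illustration of the general idea, not any paper's exact procedure.

def consistency_check(sample_answer, n=5, threshold=0.6):
    """Return (majority answer, agreement rate, passes-threshold flag)."""
    answers = [sample_answer() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return best, agreement, agreement >= threshold

# Deterministic stand-in for five stochastic samples from a black-box LLM.
samples = iter(["Paris", "Paris", "Lyon", "Paris", "Paris"])
answer, agreement, consistent = consistency_check(lambda: next(samples))
```

Four of five samples agree, so the answer passes the 0.6 agreement threshold; an inconsistent question would fall below it and be flagged.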
∙ Mindset Society. Human intelligence thrives on the concept of cognitive synergy, where collaboration
between different cognitive processes produces better results than isolated individual cognitive processes.
The "society of minds" [145] is believed to have the potential to significantly improve the performance of
LLMs and pave the way for consistency in language production and comprehension. To
address hallucinations in large-scale multilingual models across different translation scenarios,
HLMTM [39] proposes a hybrid setting in which other translation systems can be requested to act as a
back-up when the original system is hallucinating. Meanwhile, Multiagent-Debate [140]
employs multiple LLMs over several rounds to propose and debate their individual responses and reasoning
processes in order to reach a consensus final answer. As a result of this process, the models are encouraged
to construct answers that are consistent with both internal criticisms and responses from other
agents. Before a final answer is presented, the resulting community of models can hold and maintain
multiple reasoning chains and possible answers simultaneously. Based on this idea, MAD [141] adds a
judge-managed debate process, demonstrating that adaptive interruptions of debate and controlled
"tit-for-tat" states help to complete factual debates. Furthermore, FORD [142] proposes roundtable debates
that include more than two LLMs and emphasises that competent judges are essential to dominate the
debate. LM-vs-LM [
        <xref ref-type="bibr" rid="ref39">80</xref>
] also proposes multi-round interactions between one LM and another to check
the factuality of original statements. Besides, PRD [143] proposes a peer-rank and discussion-based
evaluation framework to arrive at a well-recognised assessment result with which all peers are in
agreement. In an effort to maintain strong reasoning, SPP [144] utilises LLMs to assign several fine-grained
roles, which effectively stimulates knowledge acquisition and reduces hallucinations.
      </p>
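The debate-then-vote skeleton shared by these multi-agent methods can be sketched with agents as plain callables: each agent revises its answer after seeing the pooled answers, and the consensus is a majority vote. This is a toy illustration of the protocol shape, not any paper's implementation:

```python
from collections import Counter

# Toy sketch of multi-agent debate: agents answer, then repeatedly revise
# after seeing all agents' current answers; the final answer is the majority.

def debate(agents, question, rounds=2):
    answers = [agent(question, []) for agent in agents]  # initial proposals
    for _ in range(rounds):
        answers = [agent(question, answers) for agent in agents]
    return Counter(answers).most_common(1)[0][0]

def stubborn(question, pooled):
    return "4"  # always confident in its answer

def conformist(question, pooled):
    # yields to the majority of the pooled answers once it has seen them
    return Counter(pooled).most_common(1)[0][0] if pooled else "5"

final = debate([stubborn, stubborn, conformist], "What is 2 + 2?")
```

The conformist agent starts with a wrong answer but converges to the majority within one round, so the consensus settles on the correct one.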
    </sec>
    <sec id="sec-6">
      <title>6. Future Directions</title>
      <p>
        Though numerous technical solutions have been proposed in the survey for hallucinations in LLMs,
there exist some potential directions:
∙ Data Construction Management. As previously discussed, the style, and knowledge of LLMs is
basically learned during model pre-training. High quality data present promising opportunities for
facilitating the reduction of hallucinations in LLMs [146]. Inspired by the basic rule of machine learning
models "Garbage input, garbage output", [147] proposes that data quality and diversity outweigh
the importance of fine-tuning large-scale instructions [ 148, 3, 149] and RLHF [
        <xref ref-type="bibr" rid="ref2">6, 2</xref>
]. To perform
efficiently in knowledge-intensive verticals, we argue that the construction of entity-centred fine-tuning
instructions [150, 151, 152] is a promising direction, as it can enhance the factuality of generated
entity information. Another feasible proposal is to incorporate a self-curation phase [153] into the
instruction construction process to rate the quality of candidate pairs. During the iteration process,
quality evaluation [154] based on manual or automated rule constraints could provide self-correction
capacity.
∙ Reasoning Mechanism Exploitation. The emerging CoT technique [14] stimulates the emergent
reasoning ability of LLMs by imitating the intrinsic stream of thought. A primary recent improvement
is ToT [155], which introduces a tree structure into the thought process and provides a novel
backtracking function. However, the actual thinking process forms a complex network of ideas; for
example, people may explore a particular chain of reasoning, backtrack, or start a new chain. GoT [156]
extends the dependencies between thoughts by constructing vertices with multiple incoming edges to
aggregate arbitrary thoughts. Since previous methods have no storage for intermediate results, CR [156]
uses cumulative and iterative manners to simulate human thought processes and decomposes the task
into smaller components. In addition to self-heuristic methods, PAL [157] and PoT [158] introduce
programming logic into the language space [159], expanding the ability to invoke external interpreters.
In summary, research based on human cognition helps to provide brilliant insights into the analysis
of hallucinations, such as Dual Process Theory [160], the three-layer mental model [161], the
Computational Theory of Mind [162], and Connectionism [163].
∙ Multi-modal Hallucination Survey. It has become a community consensus to build powerful
Multimodal Large Language Models (MLLMs) [164, 165, 166] by taking advantage of the excellent
comprehension and reasoning capabilities of LLMs. [41] confirms the severity of hallucinations in MLLMs
via object detection and polling-based querying. The results indicate that MLLMs are highly susceptible to
object hallucination, with generated descriptions that do not match the target image. Besides, [167] finds that
MLLMs have limited multimodal reasoning ability as well as a dependence on spurious cues. Though
a current study [168] provides a broad overview of MLLMs, the causes of hallucinations have not been
comprehensively investigated. In the future, as more sophisticated multi-modal applications emerge,
in-depth analysis of the biased distributions resulting from misalignment among modalities is a promising
research direction for providing faithful modal interactions.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Vision</title>
<p>In this paper, we provide an overview of hallucinations in LLMs with a new taxonomy, theoretical
insights, detection methods, correction methods and several future research directions. Note that it is
crucial to utilize LLMs in a responsible and beneficial manner. Furthermore, with sophisticated and
efficient detection methods proposed for various aspects, LLMs will provide humans with reliable and
secure information in broad application scenarios.
[14] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le,
D. Zhou, Chain-of-thought prompting elicits reasoning in large language
models, in: NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/
9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
[15] W. Yu, Z. Zhang, Z. Liang, M. Jiang, A. Sabharwal, Improving language models via plug-and-play
retrieval feedback, CoRR abs/2305.14002 (2023). URL: https://doi.org/10.48550/arXiv.2305.14002.
doi:10.48550/arXiv.2305.14002. arXiv:2305.14002.
[16] N. Kandpal, H. Deng, A. Roberts, E. Wallace, C. Raffel, Large language models struggle to learn
long-tail knowledge, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett
(Eds.), ICML 2023, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp.
15696–15707. URL: https://proceedings.mlr.press/v202/kandpal23a.html.
[17] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language
models: Investigating effectiveness of parametric and non-parametric memories, in: A. Rogers,
J. L. Boyd-Graber, N. Okazaki (Eds.), ACL 2023, ACL, 2023, pp. 9802–9822. URL: https://doi.org/10.
18653/v1/2023.acl-long.546. doi:10.18653/v1/2023.acl-long.546.
[18] A. Lazaridou, E. Gribovskaya, W. Stokowiec, N. Grigorev, Internet-augmented language
models through few-shot prompting for open-domain question answering, CoRR abs/2203.05115
(2022). URL: https://doi.org/10.48550/arXiv.2203.05115. doi:10.48550/arXiv.2203.05115.
arXiv:2203.05115.
[19] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, W. Yih, REPLUG:
retrieval-augmented black-box language models, CoRR abs/2301.12652 (2023). URL: https://doi.
org/10.48550/arXiv.2301.12652. doi:10.48550/arXiv.2301.12652. arXiv:2301.12652.
[20] W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, M. Jiang, A survey of knowledge-enhanced text
generation, ACM Comput. Surv. 54 (2022) 227:1–227:38. URL: https://doi.org/10.1145/3512467.
doi:10.1145/3512467.
[21] D. Dash, R. Thapa, J. M. Banda, A. Swaminathan, M. Cheatham, M. Kashyap, N. Kotecha, J. H.</p>
      <p>Chen, S. Gombar, L. Downing, R. Pedreira, E. Goh, A. Arnaout, G. K. Morris, H. Magon, M. P.
Lungren, E. Horvitz, N. H. Shah, Evaluation of GPT-3.5 and GPT-4 for supporting real-world
information needs in healthcare delivery, CoRR abs/2304.13714 (2023). URL: https://doi.org/10.
48550/arXiv.2304.13714. doi:10.48550/arXiv.2304.13714. arXiv:2304.13714.
[22] L. K. Umapathi, A. Pal, M. Sankarasubbu, Med-halt: Medical domain hallucination test for large
language models, CoRR abs/2307.15343 (2023). URL: https://doi.org/10.48550/arXiv.2307.15343.
doi:10.48550/arXiv.2307.15343. arXiv:2307.15343.
[23] S. S. Gill, M. Xu, P. Patros, H. Wu, R. Kaur, K. Kaur, S. Fuller, M. Singh, P. Arora, A. K. Parlikad,
V. Stankovski, A. Abraham, S. K. Ghosh, H. Lutfiyya, S. S. Kanhere, R. Bahsoon, O. F. Rana,
S. Dustdar, R. Sakellariou, S. Uhlig, R. Buyya, Transformative effects of chatgpt on modern
education: Emerging era of AI chatbots, CoRR abs/2306.03823 (2023). URL: https://doi.org/10.
48550/arXiv.2306.03823. doi:10.48550/arXiv.2306.03823. arXiv:2306.03823.
[24] S. Curran, S. Lansley, O. Bethell, Hallucination is the last thing you need, CoRR abs/2306.11520
(2023). URL: https://doi.org/10.48550/arXiv.2306.11520. doi:10.48550/arXiv.2306.11520.
arXiv:2306.11520.
[25] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of
hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 248:1–248:38. URL:
https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[26] J. Maynez, S. Narayan, B. Bohnet, R. T. McDonald, On faithfulness and factuality in abstractive
summarization, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July
5-10, 2020, ACL, 2020, pp. 1906–1919. URL: https://doi.org/10.18653/v1/2020.acl-main.173. doi:10.
18653/v1/2020.acl-main.173.
[27] M. Zhang, O. Press, W. Merrill, A. Liu, N. A. Smith, How language model hallucinations can
snowball, CoRR abs/2305.13534 (2023). URL: https://doi.org/10.48550/arXiv.2305.13534. doi:10.
48550/arXiv.2305.13534. arXiv:2305.13534.
[28] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, Q. Liu, Aligning large
language models with human: A survey, CoRR abs/2307.12966 (2023). URL: https://doi.org/10.
48550/arXiv.2307.12966. doi:10.48550/arXiv.2307.12966. arXiv:2307.12966.
[29] L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, W. Y. Wang, Automatically correcting large
language models: Surveying the landscape of diverse self-correction strategies, CoRR abs/2308.03188
(2023). URL: https://doi.org/10.48550/arXiv.2308.03188. doi:10.48550/arXiv.2308.03188.
arXiv:2308.03188.
[30] S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic human falsehoods, in:
ACL 2022, ACL, 2022, pp. 3214–3252. URL: https://doi.org/10.18653/v1/2022.acl-long.229. doi:10.
18653/v1/2022.acl-long.229.
[31] J. Li, X. Cheng, W. X. Zhao, J. Nie, J. Wen, Halueval: A large-scale hallucination evaluation
benchmark for large language models, CoRR abs/2305.11747 (2023). URL: https://doi.org/10.
48550/arXiv.2305.11747. doi:10.48550/arXiv.2305.11747. arXiv:2305.11747.
[32] N. Mihindukulasooriya, S. Tiwari, C. F. Enguix, K. Lata, Text2kgbench: A benchmark for ontology-driven
knowledge graph generation from text, CoRR abs/2308.02357 (2023). URL: https://doi.org/
10.48550/arXiv.2308.02357. doi:10.48550/arXiv.2308.02357. arXiv:2308.02357.
[33] F. Yin, J. Vig, P. Laban, S. Joty, C. Xiong, C. Wu, Did you read the instructions? rethinking the
effectiveness of task definitions in instruction learning, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki
(Eds.), ACL 2023, ACL, 2023, pp. 3063–3079. URL: https://doi.org/10.18653/v1/2023.acl-long.172.
doi:10.18653/v1/2023.acl-long.172.
[34] M. Chen, J. Du, R. Pasunuru, T. Mihaylov, S. Iyer, V. Stoyanov, Z. Kozareva, Improving in-context
few-shot learning via self-supervised training, in: M. Carpuat, M. de Marneffe, I. V. M. Ruíz (Eds.),
NAACL 2022, ACL, 2022, pp. 3558–3573. URL: https://doi.org/10.18653/v1/2022.naacl-main.260.
doi:10.18653/v1/2022.naacl-main.260.
[35] N. McKenna, T. Li, L. Cheng, M. J. Hosseini, M. Johnson, M. Steedman, Sources of hallucination
by large language models on inference tasks, CoRR abs/2305.14552 (2023). URL: https://doi.org/
10.48550/arXiv.2305.14552. doi:10.48550/arXiv.2305.14552. arXiv:2305.14552.
[36] S. Chan, A. Santoro, A. K. Lampinen, J. Wang, A. Singh, P. H. Richemond, J. L.
McClelland, F. Hill, Data distributional properties drive emergent in-context learning in
transformers, in: NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/
77c6ccacfd9962e2307fc64680fc5ace-Abstract-Conference.html.
[37] S. Wang, K. Wei, H. Zhang, Y. Li, W. Wu, Let me check the examples: Enhancing demonstration
learning via explicit imitation, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), ACL 2023,
ACL, 2023, pp. 1080–1088. URL: https://doi.org/10.18653/v1/2023.acl-short.93. doi:10.18653/
v1/2023.acl-short.93.
[38] Y. Lu, M. Bartolo, A. Moore, S. Riedel, P. Stenetorp, Fantastically ordered prompts and where to find
them: Overcoming few-shot prompt order sensitivity, in: S. Muresan, P. Nakov, A. Villavicencio
(Eds.), ACL 2022, ACL, 2022, pp. 8086–8098. URL: https://doi.org/10.18653/v1/2022.acl-long.556.
doi:10.18653/v1/2022.acl-long.556.
[39] N. M. Guerreiro, D. M. Alves, J. Waldendorf, B. Haddow, A. Birch, P. Colombo, A. F. T. Martins,
Hallucinations in large multilingual translation models, CoRR abs/2303.16104 (2023). URL: https://
doi.org/10.48550/arXiv.2303.16104. doi:10.48550/arXiv.2303.16104. arXiv:2303.16104.
[40] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, CoRR abs/2304.08485 (2023). URL: https://
doi.org/10.48550/arXiv.2304.08485. doi:10.48550/arXiv.2304.08485. arXiv:2304.08485.
[41] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, J. Wen, Evaluating object hallucination in large vision-language
models, CoRR abs/2305.10355 (2023). URL: https://doi.org/10.48550/arXiv.2305.10355.
doi:10.48550/arXiv.2305.10355. arXiv:2305.10355.
[42] S. Zheng, J. Huang, K. C. Chang, Why does chatgpt fall short in answering questions faithfully?,
CoRR abs/2304.10513 (2023). URL: https://doi.org/10.48550/arXiv.2304.10513. doi:10.48550/
arXiv.2304.10513. arXiv:2304.10513.
[43] K. Huang, H. P. Chan, H. Ji, Zero-shot faithful factual error correction, in: ACL 2023, ACL,
2023, pp. 5660–5676. URL: https://doi.org/10.18653/v1/2023.acl-long.311. doi:10.18653/v1/2023.acl-long.311.
[103] N. Daheim, N. Dziri, M. Sachan, I. Gurevych, E. M. Ponti, Elastic weight removal for faithful
and abstractive dialogue generation, CoRR abs/2303.17574 (2023). URL: https://doi.org/10.48550/
arXiv.2303.17574. doi:10.48550/arXiv.2303.17574. arXiv:2303.17574.
[104] A. Chen, P. Pasupat, S. Singh, H. Lee, K. Guu, PURR: efficiently editing language model
hallucinations by denoising language model corruptions, CoRR abs/2305.14908 (2023). URL: https://doi.
org/10.48550/arXiv.2305.14908. doi:10.48550/arXiv.2305.14908. arXiv:2305.14908.
[105] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer, S. W. Yih, Trusting your evidence:
Hallucinate less with context-aware decoding, CoRR abs/2305.14739 (2023). URL: https://doi.org/
10.48550/arXiv.2305.14739. doi:10.48550/arXiv.2305.14739. arXiv:2305.14739.
[106] Z. Luo, C. Xu, P. Zhao, X. Geng, C. Tao, J. Ma, Q. Lin, D. Jiang, Augmented large language models
with parametric knowledge guiding, CoRR abs/2305.04757 (2023). URL: https://doi.org/10.48550/
arXiv.2305.04757. doi:10.48550/arXiv.2305.04757. arXiv:2305.04757.
[107] S. Li, S. Park, I. Lee, O. Bastani, TRAC: trustworthy retrieval augmented chatbot, CoRR
abs/2307.04642 (2023). URL: https://doi.org/10.48550/arXiv.2307.04642. doi:10.48550/arXiv.
2307.04642. arXiv:2307.04642.
[108] K. Li, O. Patel, F. B. Viégas, H. Pfister, M. Wattenberg, Inference-time intervention: Eliciting
truthful answers from a language model, CoRR abs/2306.03341 (2023). URL: https://doi.org/10.
48550/arXiv.2306.03341. doi:10.48550/arXiv.2306.03341. arXiv:2306.03341.
[109] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian, M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, H. Chen,
Easyedit: An easy-to-use knowledge editing framework for large language models, CoRR
abs/2308.07269 (2023). URL: https://doi.org/10.48550/arXiv.2308.07269. doi:10.48550/arXiv.
2308.07269. arXiv:2308.07269.
[110] Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. Glass, P. He, Dola: Decoding by contrasting layers improves
factuality in large language models, arXiv preprint arXiv:2309.03883 (2023).
[111] N. Dziri, A. Madotto, O. Zaïane, A. J. Bose, Neural path hunter: Reducing hallucination in
dialogue systems via path grounding, in: M. Moens, X. Huang, L. Specia, S. W. Yih (Eds.),
EMNLP 2021, ACL, 2021, pp. 2197–2214. URL: https://doi.org/10.18653/v1/2021.emnlp-main.168.
doi:10.18653/v1/2021.emnlp-main.168.
[112] X. Han, Y. Tsvetkov, ORCA: interpreting prompted language models via locating supporting data
evidence in the ocean of pretraining data, CoRR abs/2205.12600 (2022). URL: https://doi.org/10.
48550/arXiv.2205.12600. doi:10.48550/arXiv.2205.12600. arXiv:2205.12600.
[113] H. He, H. Zhang, D. Roth, Rethinking with retrieval: Faithful large language model inference,
CoRR abs/2301.00303 (2023). URL: https://doi.org/10.48550/arXiv.2301.00303. doi:10.48550/
arXiv.2301.00303. arXiv:2301.00303.
[114] S. M. Park, K. Georgiev, A. Ilyas, G. Leclerc, A. Madry, TRAK: attributing model behavior at
scale, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), ICML 2023,
volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 27074–27113. URL:
https://proceedings.mlr.press/v202/park23c.html.
[115] M. Marone, B. V. Durme, Data portraits: Recording foundation model training data, CoRR
abs/2303.03919 (2023). URL: https://doi.org/10.48550/arXiv.2303.03919. doi:10.48550/arXiv.
2303.03919. arXiv:2303.03919.
[116] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,
Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yazdanbakhsh, P. Clark, Self-refine: Iterative
refinement with self-feedback, CoRR abs/2303.17651 (2023). URL: https://doi.org/10.48550/arXiv.
2303.17651. doi:10.48550/arXiv.2303.17651. arXiv:2303.17651.
[117] N. Shinn, B. Labash, A. Gopinath, Reflexion: an autonomous agent with dynamic memory
and self-reflection, CoRR abs/2303.11366 (2023). URL: https://doi.org/10.48550/arXiv.2303.11366.
doi:10.48550/arXiv.2303.11366. arXiv:2303.11366.
[118] O. Weller, M. Marone, N. Weir, D. J. Lawrie, D. Khashabi, B. V. Durme, "according to ..."
prompting language models improves quoting from pre-training data, CoRR abs/2305.13252
(2023). URL: https://doi.org/10.48550/arXiv.2305.13252. doi:10.48550/arXiv.2305.13252.
arXiv:2305.13252.
[119] R. Zhao, X. Li, S. Joty, C. Qin, L. Bing, Verify-and-edit: A knowledge-enhanced chain-of-thought
framework, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), ACL 2023, ACL, 2023, pp. 5823–
5840. URL: https://doi.org/10.18653/v1/2023.acl-long.320. doi:10.18653/v1/2023.acl-long.
320.
[120] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, J. Weston, Chain-of-verification
reduces hallucination in large language models, CoRR abs/2309.11495 (2023). URL: https://doi.
org/10.48550/arXiv.2309.11495. doi:10.48550/arXiv.2309.11495. arXiv:2309.11495.
[121] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche,
J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang,
L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero,
K. Simonyan, J. W. Rae, E. Elsen, L. Sifre, Improving language models by retrieving from trillions
of tokens, in: K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, S. Sabato (Eds.), ICML
2022, volume 162 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 2206–2240. URL:
https://proceedings.mlr.press/v162/borgeaud22a.html.
[122] H. Trivedi, N. Balasubramanian, T. Khot, A. Sabharwal, Interleaving retrieval with chain-of-thought
reasoning for knowledge-intensive multi-step questions, in: A. Rogers, J. L. Boyd-Graber,
N. Okazaki (Eds.), ACL 2023, ACL, 2023, pp. 10014–10037. URL: https://doi.org/10.18653/v1/2023.
acl-long.557. doi:10.18653/v1/2023.acl-long.557.
[123] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, J. Gao,
Check your facts and try again: Improving large language models with external knowledge and
automated feedback, CoRR abs/2302.12813 (2023). URL: https://doi.org/10.48550/arXiv.2302.12813.
doi:10.48550/arXiv.2302.12813. arXiv:2302.12813.
[124] Q. Jin, Y. Yang, Q. Chen, Z. Lu, Genegpt: Augmenting large language models with domain tools
for improved access to biomedical information, CoRR abs/2304.09667 (2023). URL: https://doi.
org/10.48550/arXiv.2304.09667. doi:10.48550/arXiv.2304.09667. arXiv:2304.09667.
[125] Z. Ding, A. Srinivasan, S. MacNeil, J. Chan, Fluid transformers and creative analogies: Exploring
large language models’ capacity for augmenting cross-domain analogical creativity, in: Creativity
and Cognition, C&amp;C 2023, Virtual Event, USA, June 19-21, 2023, ACM, 2023, pp. 489–505. URL:
https://doi.org/10.1145/3591196.3593516. doi:10.1145/3591196.3593516.
[126] X. Li, R. Zhao, Y. K. Chia, B. Ding, L. Bing, S. R. Joty, S. Poria, Chain of knowledge: A framework
for grounding large language models with structured knowledge bases, CoRR abs/2305.13269
(2023). URL: https://doi.org/10.48550/arXiv.2305.13269. doi:10.48550/arXiv.2305.13269.
arXiv:2305.13269.
[127] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, G. Neubig, Active
retrieval augmented generation, CoRR abs/2305.06983 (2023). URL: https://doi.org/10.48550/arXiv.
2305.06983. doi:10.48550/arXiv.2305.06983. arXiv:2305.06983.
[128] S. G. Patil, T. Zhang, X. Wang, J. E. Gonzalez, Gorilla: Large language model connected with
massive apis, CoRR abs/2305.15334 (2023). URL: https://doi.org/10.48550/arXiv.2305.15334. doi:10.
48550/arXiv.2305.15334. arXiv:2305.15334.
[129] J. Liu, J. Jin, Z. Wang, J. Cheng, Z. Dou, J. Wen, RETA-LLM: A retrieval-augmented large language
model toolkit, CoRR abs/2306.05212 (2023). URL: https://doi.org/10.48550/arXiv.2306.05212.
doi:10.48550/arXiv.2306.05212. arXiv:2306.05212.
[130] X. Wang, Q. Yang, Y. Qiu, J. Liang, Q. He, Z. Gu, Y. Xiao, W. Wang, Knowledgpt: Enhancing large
language models with retrieval and storage access on knowledge bases, CoRR abs/2308.11761
(2023). URL: https://doi.org/10.48550/arXiv.2308.11761. doi:10.48550/arXiv.2308.11761.
arXiv:2308.11761.
[131] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, P. F.</p>
      <p>Christiano, Learning to summarize with human feedback, in: H. Larochelle, M. Ranzato, R. Hadsell,
M. Balcan, H. Lin (Eds.), NeurIPS 2020, 2020. URL: https://proceedings.neurips.cc/paper/2020/
hash/1f89885d556929e98d3ef9b86448f951-Abstract.html.
[132] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, H. F. Song, M. J. Chadwick, M. Glaese, S. Young,
L. Campbell-Gillingham, G. Irving, N. McAleese, Teaching language models to support answers
with verified quotes, CoRR abs/2203.11147 (2022). URL: https://doi.org/10.48550/arXiv.2203.11147.
doi:10.48550/arXiv.2203.11147. arXiv:2203.11147.
[133] Y. Liu, P. Liu, D. R. Radev, G. Neubig, BRIO: bringing order to abstractive summarization,
in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), ACL 2022, ACL, 2022, pp. 2890–2903. URL:
https://doi.org/10.18653/v1/2022.acl-long.207. doi:10.18653/v1/2022.acl-long.207.
[134] H. Liu, C. Sferrazza, P. Abbeel, Chain of hindsight aligns language models with feedback, CoRR
abs/2302.02676 (2023). URL: https://doi.org/10.48550/arXiv.2302.02676. doi:10.48550/arXiv.
2302.02676. arXiv:2302.02676.
[135] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, W. Chen, CRITIC: large language models
can self-correct with tool-interactive critiquing, CoRR abs/2305.11738 (2023). URL: https://doi.
org/10.48550/arXiv.2305.11738. doi:10.48550/arXiv.2305.11738. arXiv:2305.11738.
[136] X. Zhu, B. Qi, K. Zhang, X. Long, B. Zhou, Pad: Program-aided distillation specializes large
models in reasoning, CoRR abs/2305.13888 (2023). URL: https://doi.org/10.48550/arXiv.2305.13888.
doi:10.48550/arXiv.2305.13888. arXiv:2305.13888.
[137] T. Gao, H. Yen, J. Yu, D. Chen, Enabling large language models to generate text with citations,
CoRR abs/2305.14627 (2023). URL: https://doi.org/10.48550/arXiv.2305.14627. doi:10.48550/
arXiv.2305.14627. arXiv:2305.14627.
[138] T. Dixit, F. Wang, M. Chen, Improving factuality of abstractive summarization without sacrificing
summary quality, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), ACL 2023, ACL, 2023, pp. 902–
913. URL: https://doi.org/10.18653/v1/2023.acl-short.78. doi:10.18653/v1/2023.acl-short.
78.
[139] Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, P. Fung, Towards mitigating hallucination in large language
models via self-reflection (2023). arXiv:2310.06271.
[140] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, I. Mordatch, Improving factuality and reasoning in
language models through multiagent debate, CoRR abs/2305.14325 (2023). URL: https://doi.org/
10.48550/arXiv.2305.14325. doi:10.48550/arXiv.2305.14325. arXiv:2305.14325.
[141] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, Z. Tu, S. Shi, Encouraging
divergent thinking in large language models through multi-agent debate, CoRR abs/2305.19118
(2023). URL: https://doi.org/10.48550/arXiv.2305.19118. doi:10.48550/arXiv.2305.19118.
arXiv:2305.19118.
[142] K. Xiong, X. Ding, Y. Cao, T. Liu, B. Qin, Examining the inter-consistency of large language
models: An in-depth analysis via debate, CoRR abs/2305.11595 (2023). URL: https://doi.org/10.
48550/arXiv.2305.11595. doi:10.48550/arXiv.2305.11595. arXiv:2305.11595.
[143] R. Li, T. Patel, X. Du, PRD: peer rank and discussion improve large language model based
evaluations, CoRR abs/2307.02762 (2023). URL: https://doi.org/10.48550/arXiv.2307.02762. doi:10.
48550/arXiv.2307.02762. arXiv:2307.02762.
[144] Z. Wang, S. Mao, W. Wu, T. Ge, F. Wei, H. Ji, Unleashing cognitive synergy in large language
models: A task-solving agent through multi-persona self-collaboration, CoRR abs/2307.05300
(2023). URL: https://doi.org/10.48550/arXiv.2307.05300. doi:10.48550/arXiv.2307.05300.
arXiv:2307.05300.
[145] M. Minsky, Society of mind, Simon and Schuster, 1988.
[146] Y. Kirstain, P. S. H. Lewis, S. Riedel, O. Levy, A few more examples may be worth billions of
parameters, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Findings of EMNLP 2022, ACL, 2022, pp.
1017–1029. URL: https://doi.org/10.18653/v1/2022.findings-emnlp.72. doi:10.18653/v1/2022.
findings-emnlp.72.
[147] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh,
M. Lewis, L. Zettlemoyer, O. Levy, LIMA: less is more for alignment, CoRR abs/2305.11206
(2023). URL: https://doi.org/10.48550/arXiv.2305.11206. doi:10.48550/arXiv.2305.11206.
arXiv:2305.11206.
[148] S. Mishra, D. Khashabi, C. Baral, H. Hajishirzi, Natural instructions: Benchmarking generalization
to new tasks from natural language instructions, CoRR abs/2104.08773 (2021). URL: https:
//arxiv.org/abs/2104.08773. arXiv:2104.08773.
[149] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chafin, A. Stiegler, A. Raja,
M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V.
Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey,
R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, T. L.
Scao, S. Biderman, L. Gao, T. Wolf, A. M. Rush, Multitask prompted training enables zero-shot
task generalization, in: The Tenth International Conference on Learning Representations, ICLR
2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022. URL: https://openreview.net/forum?
id=9Vrb9D0WI4.
[150] Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong, J. Peng, X. Huang, Z. Wei, DISC-MedLLM: Bridging
general large language models and real-world medical consultation, 2023. arXiv:2308.14346.
[151] H. Gui, J. Zhang, H. Ye, N. Zhang, InstructIE: A Chinese instruction-based information extraction
dataset, CoRR abs/2305.11527 (2023). URL: https://doi.org/10.48550/arXiv.2305.11527. doi:10.
48550/arXiv.2305.11527. arXiv:2305.11527.
[152] W. Y. Wei Zhu, X. Wang, ShenNong-TCM: A traditional Chinese medicine large language model,
https://github.com/michael-wzhu/ShenNong-TCM-LLM, 2023.
[153] X. Li, P. Yu, C. Zhou, T. Schick, L. Zettlemoyer, O. Levy, J. Weston, M. Lewis, Self-alignment with
instruction backtranslation, CoRR abs/2308.06259 (2023). URL: https://doi.org/10.48550/arXiv.
2308.06259. doi:10.48550/arXiv.2308.06259. arXiv:2308.06259.
[154] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, Z. Tang, V. Srinivasan, T. Zhou,
H. Huang, H. Jin, AlpaGasus: Training a better Alpaca with fewer data, CoRR abs/2307.08701
(2023). URL: https://doi.org/10.48550/arXiv.2307.08701. doi:10.48550/arXiv.2307.08701.
arXiv:2307.08701.
[155] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate
problem solving with large language models, CoRR abs/2305.10601 (2023). URL: https://doi.org/
10.48550/arXiv.2305.10601. doi:10.48550/arXiv.2305.10601. arXiv:2305.10601.
[156] Y. Zhang, J. Yang, Y. Yuan, A. C. Yao, Cumulative reasoning with large language models, CoRR
abs/2308.04371 (2023). URL: https://doi.org/10.48550/arXiv.2308.04371. doi:10.48550/arXiv.
2308.04371. arXiv:2308.04371.
[157] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, G. Neubig, PAL: program-aided
language models, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.),
ICML 2023, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 10764–10799.
URL: https://proceedings.mlr.press/v202/gao23f.html.
[158] W. Chen, X. Ma, X. Wang, W. W. Cohen, Program of thoughts prompting: Disentangling
computation from reasoning for numerical reasoning tasks, CoRR abs/2211.12588 (2022). URL: https://
doi.org/10.48550/arXiv.2211.12588. doi:10.48550/arXiv.2211.12588. arXiv:2211.12588.
[159] Z. Bi, N. Zhang, Y. Jiang, S. Deng, G. Zheng, H. Chen, When do program-of-thoughts work for
reasoning?, 2023. arXiv:2308.15452.
[160] K. Frankish, Dual-process and dual-system theories of reasoning, Philosophy Compass 5 (2010)
914–926.
[161] K. Stanovich, Rationality and the reflective mind, Oxford University Press, USA, 2011.
[162] G. Piccinini, The first computational theory of mind and brain: a close look at McCulloch and
Pitts's “logical calculus of ideas immanent in nervous activity”, Synthese 141 (2004) 175–215.
[163] E. L. Thorndike, Animal intelligence, Nature 58 (1898) 390–390.
[164] J. Li, D. Li, S. Savarese, S. C. H. Hoi, BLIP-2: bootstrapping language-image pre-training with frozen
image encoders and large language models, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt,
S. Sabato, J. Scarlett (Eds.), ICML 2023, volume 202 of Proceedings of Machine Learning Research,
PMLR, 2023, pp. 19730–19742. URL: https://proceedings.mlr.press/v202/li23q.html.
[165] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, S. C. H. Hoi, Instructblip:
Towards general-purpose vision-language models with instruction tuning, CoRR abs/2305.06500
(2023). URL: https://doi.org/10.48550/arXiv.2305.06500. doi:10.48550/arXiv.2305.06500.
arXiv:2305.06500.
[166] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen,
J. Tian, Q. Qi, J. Zhang, F. Huang, mPLUG-Owl: Modularization empowers large language models
with multimodality, CoRR abs/2304.14178 (2023). URL: https://doi.org/10.48550/arXiv.2304.14178.
doi:10.48550/arXiv.2304.14178. arXiv:2304.14178.
[167] W. Shao, Y. Hu, P. Gao, M. Lei, K. Zhang, F. Meng, P. Xu, S. Huang, H. Li, Y. Qiao, P. Luo, Tiny
LVLM-eHub: Early multimodal experiments with Bard, CoRR abs/2308.03729 (2023). URL: https://
doi.org/10.48550/arXiv.2308.03729. doi:10.48550/arXiv.2308.03729. arXiv:2308.03729.
[168] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on multimodal large language
models, CoRR abs/2306.13549 (2023). URL: https://doi.org/10.48550/arXiv.2306.13549. doi:10.
48550/arXiv.2306.13549. arXiv:2306.13549.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , in: H.
          <string-name>
            <surname>Larochelle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Balcan</surname>
          </string-name>
          , H. Lin (Eds.),
          <source>NeurIPS</source>
          <year>2020</year>
          ,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          , in: NeurIPS 2022, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasupat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Chaganty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Juan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <article-title>RARR: researching and revising what language models say, using language models</article-title>
          ,
          <source>in: ACL</source>
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>16477</fpage>
          -
          <lpage>16508</lpage>
          . URL: https://doi.org/10.18653/v1/2023.acl-long.910. doi:10.18653/v1/2023.acl-long.910.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Pan,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Mitigating language model hallucination with interactive question-knowledge alignment</article-title>
          ,
          <source>CoRR abs/2305.13669</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.13669. doi:10.48550/arXiv.2305.13669. arXiv:2305.13669.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>D.</given-names>
            <surname>Halawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Denain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Overthinking the truth: Understanding how language models process false demonstrations</article-title>
          ,
          <source>CoRR abs/2307.09476</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.09476. doi:10.48550/arXiv.2307.09476. arXiv:2307.09476.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Bansal,
          <article-title>HistAlign: Improving context dependency in language generation by aligning with history</article-title>
          ,
          <source>CoRR abs/2305.04782</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.04782. doi:10.48550/arXiv.2305.04782. arXiv:2305.04782.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chiesurin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimakopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A. S.</given-names>
            <surname>Cabezudo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eshghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Papaioannou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rieser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Konstas</surname>
          </string-name>
          ,
          <article-title>The dangers of trusting stochastic parrots: Faithfulness and trust in open-domain conversational question answering</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Findings of ACL</source>
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>947</fpage>
          -
          <lpage>959</lpage>
          . URL: https://doi.org/10.18653/v1/2023.findings-acl.60. doi:10.18653/v1/2023.findings-acl.60.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          , T. Hashimoto,
          <article-title>Improved natural language generation via loss truncation</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          <string-name>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July</source>
          <volume>5</volume>
          -
          <issue>10</issue>
          ,
          <year>2020</year>
          , ACL,
          <year>2020</year>
          , pp.
          <fpage>718</fpage>
          -
          <lpage>731</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.66. doi:10.18653/v1/2020.acl-main.66.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <article-title>On exposure bias, hallucination and domain shift in neural machine translation</article-title>
          , in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          <string-name>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July</source>
          <volume>5</volume>
          -
          <issue>10</issue>
          ,
          <year>2020</year>
          , ACL,
          <year>2020</year>
          , pp.
          <fpage>3544</fpage>
          -
          <lpage>3552</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.326. doi:10.18653/v1/2020.acl-main.326.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ping</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shoeybi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          ,
          <article-title>Factuality enhanced language models for open-ended text generation</article-title>
          , in: NeurIPS,
          <year>2022</year>
          . URL: http://papers.nips.cc/paper_files/paper/2022/hash/df438caa36714f69277daa92d608dd63-Abstract-Conference.html.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>V.</given-names>
            <surname>Raunak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menezes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Junczys-Dowmunt</surname>
          </string-name>
          ,
          <article-title>The curious case of hallucinations in neural machine translation</article-title>
          ,
          <source>in: NAACL</source>
          <year>2021</year>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>1172</fpage>
          -
          <lpage>1183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Voita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation</article-title>
          ,
          <source>in: EACL</source>
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>1059</fpage>
          -
          <lpage>1075</lpage>
          . URL: https://aclanthology.org/2023.eacl-main.75.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Voita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hansanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ropers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kalbassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
<string-name>
<given-names>M. R.</given-names>
<surname>Costa-jussà</surname>
</string-name>
,
<article-title>Halomi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation</article-title>
          ,
<source>CoRR abs/2305.11746</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2305.11746. doi:10.48550/arXiv.2305.11746. arXiv:2305.11746.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>J.</given-names>
<surname>Pfeiffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccinno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nicosia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
,
<article-title>mmT5: Modular multilingual pre-training solves source language hallucinations</article-title>
          ,
<source>CoRR abs/2305.14224</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2305.14224. doi:10.48550/arXiv.2305.14224. arXiv:2305.14224.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>V.</given-names>
            <surname>Adlakha</surname>
          </string-name>
,
<string-name><given-names>P.</given-names> <surname>BehnamGhader</surname></string-name>
,
          <string-name>
            <given-names>X. H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Meade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Evaluating correctness and faithfulness of instruction-following models for question answering</article-title>
          ,
<source>CoRR abs/2307.16877</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2307.16877. doi:10.48550/arXiv.2307.16877. arXiv:2307.16877.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Milton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. R.</given-names>
            <surname>Zaïane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>On the origin of hallucinations in conversational models: Is it the datasets or the models?</article-title>
          ,
          <source>in: NAACL</source>
          <year>2022</year>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>5271</fpage>
          -
          <lpage>5285</lpage>
. URL: https://doi.org/10.18653/v1/2022.naacl-main.387. doi:10.18653/v1/2022.naacl-main.387.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srihari</surname>
          </string-name>
          ,
          <article-title>Diving deep into modes of fact hallucinations in dialogue systems</article-title>
          ,
          <source>in: Findings of EMNLP</source>
          <year>2022</year>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>684</fpage>
          -
          <lpage>699</lpage>
. URL: https://doi.org/10.18653/v1/2022.findings-emnlp.48. doi:10.18653/v1/2022.findings-emnlp.48.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamalloo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Milton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. R.</given-names>
            <surname>Zaïane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Faithdial: A faithful benchmark for information-seeking dialogue</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>1473</fpage>
          -
          <lpage>1490</lpage>
          . URL: https://transacl.org/ojs/index.php/tacl/article/view/4113.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Linzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reitter</surname>
          </string-name>
          ,
          <article-title>Evaluating attribution in dialogue systems: The BEGIN benchmark</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>1066</fpage>
          -
          <lpage>1083</lpage>
. URL: https://transacl.org/ojs/index.php/tacl/article/view/3977.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
<string-name><given-names>M.</given-names> <surname>de Rijke</surname></string-name>
,
<string-name><given-names>Z.</given-names> <surname>Ren</surname></string-name>
          ,
          <article-title>Contrastive learning reduces hallucination in conversations</article-title>
          ,
          <source>in: AAAI</source>
          <year>2023</year>
          , AAAI Press,
          <year>2023</year>
          , pp.
          <fpage>13618</fpage>
          -
          <lpage>13626</lpage>
. URL: https://ojs.aaai.org/index.php/AAAI/article/view/26596.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mascarenhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name><given-names>S.</given-names> <surname>Kwan</surname></string-name>
,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
<string-name>
<given-names>C.</given-names>
<surname>Raffel</surname>
</string-name>
,
<article-title>Evaluating the factual consistency of large language models through news summarization</article-title>
          ,
          <source>in: Findings of ACL</source>
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>5220</fpage>
          -
          <lpage>5255</lpage>
. URL: https://doi.org/10.18653/v1/2023.findings-acl.322. doi:10.18653/v1/2023.findings-acl.322.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. K.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <article-title>Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization</article-title>
          ,
          <source>in: ACL</source>
          <year>2022</year>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>3340</fpage>
          -
          <lpage>3354</lpage>
. URL: https://doi.org/10.18653/v1/2022.acl-long.236. doi:10.18653/v1/2022.acl-long.236.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Finnie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rahmati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <article-title>"why is this misleading?": Detecting news headline hallucinations with explanations</article-title>
          ,
          <source>in: WWW</source>
          <year>2023</year>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>1662</fpage>
          -
          <lpage>1672</lpage>
. URL: https://doi.org/10.1145/3543507.3583375. doi:10.1145/3543507.3583375.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ziser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Ponti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Detecting and mitigating hallucinations in multilingual summarisation</article-title>
          ,
<source>CoRR abs/2305.13632</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2305.13632. doi:10.48550/arXiv.2305.13632. arXiv:2305.13632.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
<string-name><given-names>D.</given-names> <surname>Zhang-li</surname></string-name>
,
<string-name><given-names>X.</given-names> <surname>Lv</surname></string-name>
,
<string-name><given-names>H.</given-names> <surname>Peng</surname></string-name>
,
<string-name><given-names>Z.</given-names> <surname>Yao</surname></string-name>
,
<string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>
,
<string-name><given-names>H.</given-names> <surname>Li</surname></string-name>
,
<string-name><given-names>C.</given-names> <surname>Li</surname></string-name>
,
<string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Bai</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Xin</surname></string-name>
,
<string-name><given-names>N.</given-names> <surname>Lin</surname></string-name>
,
<string-name><given-names>K.</given-names> <surname>Yun</surname></string-name>
,
<string-name><given-names>L.</given-names> <surname>Gong</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>
,
<string-name><given-names>Z.</given-names> <surname>Wu</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Qi</surname></string-name>
,
<string-name><given-names>W.</given-names> <surname>Li</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Guan</surname></string-name>
,
<string-name><given-names>K.</given-names> <surname>Zeng</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Qi</surname></string-name>
,
<string-name><given-names>H.</given-names> <surname>Jin</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Liu</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Gu</surname></string-name>
,
<string-name><given-names>Y.</given-names> <surname>Yao</surname></string-name>
,
<string-name><given-names>N.</given-names> <surname>Ding</surname></string-name>
,
<string-name><given-names>L.</given-names> <surname>Hou</surname></string-name>
,
<string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>
,
<string-name><given-names>B.</given-names> <surname>Xu</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Tang</surname></string-name>
,
<string-name><given-names>J.</given-names> <surname>Li</surname></string-name>
          ,
<article-title>KoLA: Carefully benchmarking world knowledge of large language models</article-title>
          ,
<source>CoRR abs/2306.09296</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2306.09296. doi:10.48550/arXiv.2306.09296. arXiv:2306.09296.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [67]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yacoob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Aligning large multi-modal model with robust instruction tuning</article-title>
          ,
<source>CoRR abs/2306.14565</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2306.14565. doi:10.48550/arXiv.2306.14565. arXiv:2306.14565.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mahmood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Kalra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <article-title>Fact-checking of ai-generated reports</article-title>
          ,
<source>CoRR abs/2307.14634</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2307.14634. doi:10.48550/arXiv.2307.14634. arXiv:2307.14634.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [69]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ching</surname>
          </string-name>
,
<string-name><given-names>E.</given-names> <surname>Kamal</surname></string-name>
,
<article-title>Chain of natural language inference for reducing large language model ungrounded hallucinations</article-title>
(
<year>2023</year>
). arXiv:2310.03951.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
,
<string-name><given-names>G.</given-names> <surname>Neubig</surname></string-name>
,
<string-name><given-names>P.</given-names> <surname>Liu</surname></string-name>
,
<article-title>BARTScore: Evaluating generated text as text generation</article-title>
, in:
<string-name><given-names>M.</given-names> <surname>Ranzato</surname></string-name>
,
<string-name><given-names>A.</given-names> <surname>Beygelzimer</surname></string-name>
,
<string-name><given-names>Y. N.</given-names> <surname>Dauphin</surname></string-name>
,
<string-name><given-names>P.</given-names> <surname>Liang</surname></string-name>
,
<string-name><given-names>J. W.</given-names> <surname>Vaughan</surname></string-name>
          (Eds.),
          <source>NeurIPS</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>27263</fpage>
          -
          <lpage>27277</lpage>
. URL: https://proceedings.neurips.cc/paper/2021/hash/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [71]
          <string-name>
            <given-names>A.</given-names>
            <surname>Amayuelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Knowledge of knowledge: Exploring knownunknowns uncertainty with large language models</article-title>
          ,
<source>CoRR abs/2305.13712</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2305.13712. doi:10.48550/arXiv.2305.13712. arXiv:2305.13712.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [72]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kozareva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <article-title>Methods for measuring, updating, and visualizing factual beliefs in language models</article-title>
, in:
<string-name><given-names>A.</given-names> <surname>Vlachos</surname></string-name>
,
<string-name><given-names>I.</given-names> <surname>Augenstein</surname></string-name>
(Eds.),
          <source>EACL</source>
          <year>2023</year>
          , ACL,
          <year>2023</year>
          , pp.
          <fpage>2706</fpage>
          -
          <lpage>2723</lpage>
. URL: https://aclanthology.org/2023.eacl-main.199.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [73]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pezeshkpour</surname>
          </string-name>
          ,
          <article-title>Measuring and modifying factual knowledge in large language models</article-title>
          ,
<source>CoRR abs/2306.06264</source>
(
<year>2023</year>
). URL: https://doi.org/10.48550/arXiv.2306.06264. doi:10.48550/arXiv.2306.06264. arXiv:2306.06264.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [74]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
<article-title>LLM calibration and automatic hallucination detection via Pareto optimal self-supervision</article-title>
          ,
          <year>2023</year>
. arXiv:2306.16564.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [75]
          <string-name>
            <given-names>N.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation</article-title>
          ,
          <source>CoRR abs/2307</source>
          .03987 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.03987. doi:
          <volume>10</volume>
          .48550/arXiv.2307.03987. arXiv:
          <volume>2307</volume>
          .
          <fpage>03987</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [76]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadavath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Conerly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schiefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
<surname>Hatfield-Dodds</surname>
          </string-name>
          , N. DasSarma, E. Tran-Johnson, S. Johnston,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Showk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hume</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kernion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kravec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lovitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ndousse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          , T. Brown, J. Clark,
          <string-name>
            <given-names>N.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <article-title>Language models (mostly) know what they know</article-title>
          ,
<source>CoRR abs/2207.05221</source>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2207.05221. doi:10.48550/arXiv.2207.05221. arXiv:2207.05221.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [77]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <article-title>Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models</article-title>
          ,
<source>CoRR abs/2303.08896</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.08896. doi:10.48550/arXiv.2303.08896. arXiv:2303.08896.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [78]
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mackey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Kalai</surname>
          </string-name>
          ,
          <article-title>Do language models know when they're hallucinating references?</article-title>
          ,
<source>CoRR abs/2305.18248</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.18248. doi:10.48550/arXiv.2305.18248. arXiv:2305.18248.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [79]
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Self-checker: Plug-and-play modules for fact-checking with large language models</article-title>
          ,
<source>CoRR abs/2305.14623</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.14623. doi:10.48550/arXiv.2305.14623. arXiv:2305.14623.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [80]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hamri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
<string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <article-title>LM vs LM: detecting factual errors via cross examination</article-title>
          ,
<source>CoRR abs/2305.13281</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.13281. doi:10.48550/arXiv.2305.13281. arXiv:2305.13281.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [81]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mündler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jenko</surname>
          </string-name>
          , M. T. Vechev,
          <article-title>Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation</article-title>
          ,
<source>CoRR abs/2305.15852</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.15852. doi:10.48550/arXiv.2305.15852. arXiv:2305.15852.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [82]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
<article-title>A new benchmark and reverse validation method for passage-level hallucination detection</article-title>
          (
          <year>2023</year>
          ). arXiv:2310.06498.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [83]
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Factscore: Fine-grained atomic evaluation of factual precision in long form text generation</article-title>
          ,
<source>CoRR abs/2305.14251</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.14251. doi:10.48550/arXiv.2305.14251. arXiv:2305.14251.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [84]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sriram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Durrett</surname>
          </string-name>
          , E. Choi,
          <article-title>Complex claim verification with evidence retrieved in the wild</article-title>
          ,
<source>CoRR abs/2305.11859</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.11859. doi:10.48550/arXiv.2305.11859. arXiv:2305.11859.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [85]
          <string-name>
            <given-names>S.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Arabzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
<article-title>Retrieving supporting evidence for LLMs generated answers</article-title>
          ,
<source>CoRR abs/2306.13781</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2306.13781. doi:10.48550/arXiv.2306.13781. arXiv:2306.13781.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [86]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chern</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          , G. Neubig, P. Liu,
          <article-title>Factool: Factuality detection in generative AI - A tool augmented framework for multi-task and multi-domain scenarios</article-title>
          ,
<source>CoRR abs/2307.13528</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.13528. doi:10.48550/arXiv.2307.13528. arXiv:2307.13528.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [87]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bawden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yvon</surname>
          </string-name>
          ,
          <article-title>Investigating the translation performance of a large multilingual language model: the case of BLOOM</article-title>
          ,
<source>CoRR abs/2303.01911</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.01911. doi:10.48550/arXiv.2303.01911. arXiv:2303.01911.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [88]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hendy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdelrehim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raunak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Matsushita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Afify</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Awadalla</surname>
          </string-name>
          ,
          <article-title>How good are GPT models at machine translation? A comprehensive evaluation</article-title>
          ,
<source>CoRR abs/2302.09210</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2302.09210. doi:10.48550/arXiv.2302.09210. arXiv:2302.09210.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [89]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
<source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>
          , ACL,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [90]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Färber</surname>
          </string-name>
          ,
          <article-title>Evaluating generative models for graph-to-text generation</article-title>
          ,
<source>CoRR abs/2307.14712</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.14712. doi:10.48550/arXiv.2307.14712. arXiv:2307.14712.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [91]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
<string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
<article-title>LLMs for knowledge graph construction and reasoning: Recent capabilities and future opportunities</article-title>
          ,
<source>CoRR abs/2305.13168</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.13168. doi:10.48550/arXiv.2305.13168. arXiv:2305.13168.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [92]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <article-title>Minigpt-4: Enhancing vision-language understanding with advanced large language models</article-title>
          ,
<source>CoRR abs/2304.10592</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2304.10592. doi:10.48550/arXiv.2304.10592. arXiv:2304.10592.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [93]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework</article-title>
          , in: K. Chaudhuri,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvári</surname>
          </string-name>
          , G. Niu, S. Sabato (Eds.),
          <source>ICML 2022</source>
          , volume
          <volume>162</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR,
          <year>2022</year>
          , pp.
          <fpage>23318</fpage>
          -
          <lpage>23340</lpage>
          . URL: https://proceedings.mlr.press/v162/wang22al.html.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [94]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Biten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <article-title>Let there be a clock on the beach: Reducing object hallucination in image captioning</article-title>
          , in:
          <source>IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>2473</fpage>
          -
          <lpage>2482</lpage>
          . URL: https://doi.org/10.1109/WACV51458.2022.00253. doi:10.1109/WACV51458.2022.00253.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [95]
          <string-name>
            <given-names>S.</given-names>
            <surname>Petryk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <article-title>Simple token-level confidence improves caption correctness</article-title>
          ,
<source>CoRR abs/2305.07021</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.07021. doi:10.48550/arXiv.2305.07021. arXiv:2305.07021.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [96]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          , L. Yuan,
          <article-title>Album storytelling with iterative story-aware captioning and large language models</article-title>
          ,
<source>CoRR abs/2305.12943</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.12943. doi:10.48550/arXiv.2305.12943. arXiv:2305.12943.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [97]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Balachandran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <article-title>Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics</article-title>
          ,
          <source>in: NAACL 2021</source>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>4812</fpage>
          -
          <lpage>4829</lpage>
          . URL: https://doi.org/10.18653/v1/2021.naacl-main.383. doi:10.18653/v1/2021.naacl-main.383.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [98]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Knowledge-grounded dialogue generation with a unified knowledge representation</article-title>
          ,
          <source>in: NAACL 2022</source>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>206</fpage>
          -
          <lpage>218</lpage>
          . URL: https://doi.org/10.18653/v1/2022.naacl-main.15. doi:10.18653/v1/2022.naacl-main.15.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [99]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wieting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Verga</surname>
          </string-name>
          ,
          <article-title>Faithful to the document or to the world? mitigating hallucinations via entity-linked knowledge in abstractive summarization</article-title>
          ,
          <source>in: Findings of EMNLP 2022</source>
          , ACL,
          <year>2022</year>
          , pp.
          <fpage>1067</fpage>
          -
          <lpage>1082</lpage>
          . URL: https://doi.org/10.18653/v1/2022.findings-emnlp.76. doi:10.18653/v1/2022.findings-emnlp.76.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [100]
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Diab</surname>
          </string-name>
          ,
          <article-title>FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>
          , ACL,
          <year>2020</year>
          , pp.
          <fpage>5055</fpage>
          -
          <lpage>5070</lpage>
          . URL: https://doi.org/10.18653/v1/2020.acl-main.454. doi:10.18653/v1/2020.acl-main.454.
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [101]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurgens</surname>
          </string-name>
          ,
          <article-title>Measuring sentence-level and aspect-level (un)certainty in science communications</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Yih</surname>
          </string-name>
          (Eds.),
          <source>EMNLP 2021</source>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>9959</fpage>
          -
          <lpage>10011</lpage>
          . URL: https://doi.org/10.18653/v1/2021.emnlp-main.784. doi:10.18653/v1/2021.emnlp-main.784.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [102]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <article-title>Editing models with task arithmetic</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023</source>
          , OpenReview.net,
          <year>2023</year>
          . URL: https://openreview.net/pdf?id=6t0Kwf8-jrj.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>