<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Cognitive Mirage: A Review of Hallucinations in Large Language Models ⋆</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hongbin</forename><surname>Ye</surname></persName>
							<email>yehongbin@zhejianglab.com</email>
							<affiliation key="aff0">
								<orgName type="department">Zhejiang Lab</orgName>
								<address>
									<addrLine>No. 1 Kechuang Avenue, Yuhang District</addrLine>
									<settlement>Hangzhou City, Zhejiang Province</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tong</forename><surname>Liu</surname></persName>
							<email>liutong@zhejianglab.com</email>
							<affiliation key="aff0">
								<orgName type="department">Zhejiang Lab</orgName>
								<address>
									<addrLine>No. 1 Kechuang Avenue, Yuhang District</addrLine>
									<settlement>Hangzhou City, Zhejiang Province</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aijia</forename><surname>Zhang</surname></persName>
							<email>zhangaijia@zhejianglab.com</email>
							<affiliation key="aff0">
								<orgName type="department">Zhejiang Lab</orgName>
								<address>
									<addrLine>No. 1 Kechuang Avenue, Yuhang District</addrLine>
									<settlement>Hangzhou City, Zhejiang Province</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wei</forename><surname>Hua</surname></persName>
							<email>huawei@zhejianglab.com</email>
							<affiliation key="aff0">
								<orgName type="department">Zhejiang Lab</orgName>
								<address>
									<addrLine>No. 1 Kechuang Avenue, Yuhang District</addrLine>
									<settlement>Hangzhou City, Zhejiang Province</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Weiqiang</forename><surname>Jia</surname></persName>
							<email>jiaweiqiang@zhejianglab.com</email>
							<affiliation key="aff0">
								<orgName type="department">Zhejiang Lab</orgName>
								<address>
									<addrLine>No. 1 Kechuang Avenue, Yuhang District</addrLine>
									<settlement>Hangzhou City, Zhejiang Province</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Cognitive Mirage: A Review of Hallucinations in Large Language Models ⋆</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">201125FA1AC698B6DDF803613F0888C0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Taxonomy of Hallucination</term>
					<term>Large Language Models</term>
					<term>Hallucination Detection</term>
					<term>Hallucination Correction</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>As large language models continue to develop in the field of AI, text generation systems are susceptible to a worrisome phenomenon known as hallucination. In this study, we summarize recent compelling insights into hallucinations in LLMs. We present a novel taxonomy of hallucinations from various text generation tasks, thus providing theoretical insights, detection methods and improvement approaches. Based on this, future research directions are proposed. Our contributions are threefold: (1) We provide a complete taxonomy for hallucinations appearing in text generation tasks; (2) We provide theoretical analyses of hallucinations in LLMs and provide existing detection and improvement methods; (3) We propose several research directions that can be developed in the future. Our literature library is available at https://github.com/hongbinye/Cognitive-Mirage-Hallucinations-in-LLMs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the ever-evolving realm of large language models (LLMs), a constellation of innovative creations has emerged, such as GPT-3 <ref type="bibr" target="#b0">[1]</ref>, InstructGPT <ref type="bibr" target="#b1">[2]</ref>, FLAN <ref type="bibr" target="#b2">[3]</ref>, PaLM <ref type="bibr" target="#b3">[4]</ref>, LLaMA <ref type="bibr" target="#b4">[5]</ref> and other notable contributors <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. These models implicitly encode global knowledge within their parameters during the pre-training phase <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>, offering valuable insights as knowledge repositories for downstream tasks <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b13">14]</ref>. Nevertheless, the generalization of knowledge can result in memory distortion, an inherent limitation that may give rise to potential inaccuracies <ref type="bibr" target="#b14">[15]</ref>. Moreover, their ability to represent knowledge is constrained by model scale and faces challenges in addressing long-tailed knowledge problems <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref>. The privacy and timeliness of data in the real world <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> further exacerbate this problem, making it difficult for models to maintain a comprehensive and up-to-date understanding of the facts. These challenges present a serious obstacle to the reliability of LLMs, a phenomenon we refer to as hallucination <ref type="bibr" target="#b19">[20]</ref>. 
A prominent example of this drawback is that models typically generate statements that appear reasonable but are either cognitively irrelevant or factually incorrect. In light of this observation, hallucinations remain a critical challenge in medical <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref>, financial <ref type="bibr" target="#b22">[23]</ref> and other knowledge-intensive fields due to the exacting accuracy requirements. Particularly, the applications for legal case drafting showcase plausible interpretation as an aggregation of diverse subjective perspectives <ref type="bibr" target="#b23">[24]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition of Hallucination.</head><p>As depicted in Figure <ref type="figure" target="#fig_0">1</ref>, hallucination refers to the generation of texts or responses that exhibit grammatical correctness, fluency, and authenticity, but deviate from the provided source inputs (faithfulness) or do not align with factual accuracy (factualness) <ref type="bibr" target="#b24">[25]</ref>. In traditional NLP tasks <ref type="bibr" target="#b25">[26]</ref>, hallucinations are often synonymous with faithfulness: conflicting information leads to Intrinsic Hallucination, i.e., LMs conflict with the input information when generating a response; conversely, generating ambiguous supplementary information may lead to Extrinsic Hallucination, i.e., LMs produce personal names, historical events, or technical documents that are challenging to verify. LLMs-oriented hallucinations instead prioritize factualness, focusing on whether the result can be evidenced or negated by reference to external facts in the real world. Uncritical trust in LLMs can give rise to a phenomenon known as Cognitive Mirage, contributing to misguided decision-making and a cascade of unintended consequences <ref type="bibr" target="#b26">[27]</ref>.</p></div>
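The faithfulness/factualness distinction above can be made concrete with a toy check (a minimal sketch; the heuristic and all names are hypothetical, not a method from the surveyed works): content words a model introduces that the source cannot support are candidates for extrinsic hallucination and would require external fact verification.

```python
# Toy illustration of the intrinsic/extrinsic split (hypothetical heuristic):
# content words absent from the source input flag *extrinsic* candidates,
# which would then need checking against external world knowledge.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "in", "on", "of",
             "to", "and", "for", "with", "by", "at", "it", "that", "this"}

def content_words(text: str) -> set[str]:
    """Lowercased alphabetic tokens minus a small stopword list."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def extrinsic_candidates(source: str, generated: str) -> set[str]:
    """Words the model introduced that the source cannot support."""
    return content_words(generated) - content_words(source)

source = "Marie Curie won the Nobel Prize in Physics in 1903."
generated = "Marie Curie won the Nobel Prize in Physics and later moved to Berlin."
print(sorted(extrinsic_candidates(source, generated)))  # ['berlin', 'later', 'moved']
```

Real detectors replace this word-overlap heuristic with entailment models or retrieval, as discussed in Section 4, but the asymmetry is the same: unsupported additions are extrinsic, contradictions of the source are intrinsic.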
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Present work</head><p>To effectively control the risk of hallucinations, we summarize recent progress in hallucination theories and solutions in this paper. We propose to organize relevant work by a comprehensive survey (Figure <ref type="figure">2</ref>):</p><p>• Theoretical insight and mechanism analysis. We provide in-depth theoretical and mechanism analysis from three typical perspectives: data collection, knowledge gap and optimization process, reviewing the recent and relevant theories related to hallucinations ( §2). • Taxonomy of hallucination in LLMs. We conduct a comprehensive review of hallucination in LLMs together with a task axis. We review the task-specific benchmarks with a comprehensive comparison and summary ( §3). • Wide coverage on emerging hallucination detection and correction methods. We propose a comprehensive investigation into the proactive detection ( §4) and mitigation ( §5) of hallucinations.</p><p>Data Collection. Hallucinations can arise from the training data itself, for example when handling language pairs with limited resources or non-English translations <ref type="bibr" target="#b38">[39]</ref>. Furthermore, cutting-edge Large Vision-Language Models (LVLMs) exhibit instances of hallucinating common objects within visual instructional datasets and are prone to hallucinating objects that frequently co-occur in the same image <ref type="bibr" target="#b39">[40,</ref><ref type="bibr" target="#b40">41]</ref>.</p><p>Knowledge Gap. Knowledge gaps are typically attributed to differences in input format between the pre-training and fine-tuning stages <ref type="bibr" target="#b41">[42]</ref>. Even when considering the automatic updating of textual knowledge bases, the output can deviate from the expected corrections <ref type="bibr" target="#b42">[43]</ref>. For example, questions often do not align effectively with stored knowledge, and the available information remains unknown until the questions are presented. 
This knowledge gap poses thorny challenges in balancing memory with retrieved evidence, which is construed as a passive defense mechanism against the misuse of retrieval <ref type="bibr" target="#b43">[44]</ref>. To delve into this issue, <ref type="bibr" target="#b44">[45]</ref> and <ref type="bibr" target="#b45">[46]</ref> propose that disregarding retrieved evidence introduces biased model knowledge, while mis-covering and over-thinking disrupt model behavior. Furthermore, in scenarios where a cache component is utilized to offer historical memory during training <ref type="bibr" target="#b46">[47]</ref>, the model also experiences inconsistency between the present hidden state and the hidden state stored in the cache.</p></div>
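The memory-versus-evidence trade-off described above can be sketched as a toy decision rule (the names and threshold are hypothetical; real systems weigh calibrated scores rather than a fixed cut-off):

```python
# Hypothetical sketch (not from the cited works) of balancing parametric
# memory against retrieved evidence: trust evidence only when the retriever
# is confident enough, otherwise fall back on the model's own answer.
from dataclasses import dataclass

@dataclass
class Candidate:
    answer: str
    score: float  # confidence in [0, 1]

def resolve(parametric: Candidate, retrieved: Candidate,
            evidence_threshold: float = 0.6) -> str:
    """Always ignoring evidence biases the model toward memorized (possibly
    stale) knowledge; always taking it risks the mis-covering failure mode."""
    if retrieved.score >= evidence_threshold:
        return retrieved.answer
    return parametric.answer

print(resolve(Candidate("2019", 0.9), Candidate("2022", 0.8)))  # strong evidence wins
print(resolve(Candidate("2019", 0.9), Candidate("2022", 0.3)))  # weak evidence ignored
```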
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimization Process</head><p>Maximum likelihood estimation and teacher-forcing training have the potential to result in a phenomenon known as stochastic parroting <ref type="bibr" target="#b47">[48]</ref>, wherein the model is prompted to imitate the training data without comprehension <ref type="bibr" target="#b48">[49]</ref>. Specifically, exposure bias between the training and testing stages has been demonstrated to lead to hallucinations within LLMs, particularly when generating lengthy responses <ref type="bibr" target="#b49">[50]</ref>. Besides, sampling techniques characterized by high uncertainty <ref type="bibr" target="#b50">[51]</ref>, such as top-p and top-k, exacerbate the issue of hallucination. Furthermore, <ref type="bibr" target="#b26">[27]</ref> observes that LLMs tend to produce snowballing hallucinations to maintain coherence with earlier hallucinations, and even when directed with prompts such as "Let's think step by step", they still generate ineffectual chains of reasoning <ref type="bibr" target="#b12">[13]</ref>.</p></div>
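To make the sampling discussion concrete, here is a minimal reference implementation of top-k and top-p (nucleus) filtering over a toy next-token distribution (illustrative only; production decoders operate on logits over full vocabularies). Larger k or p keep more of the low-probability tail, which is the high-uncertainty regime linked above to increased hallucination.

```python
# Minimal top-k and top-p (nucleus) filtering over a token->probability dict.
def top_k(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

def top_p(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest prefix of the sorted distribution whose mass >= p."""
    kept, mass = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        mass += pr
        if mass >= p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}

probs = {"Paris": 0.70, "Lyon": 0.15, "Berlin": 0.10, "Mars": 0.05}
print(sorted(top_k(probs, 2)))      # ['Lyon', 'Paris']
print(sorted(top_p(probs, 0.80)))   # ['Lyon', 'Paris'] -- the tail is cut
```

With p = 0.99 the implausible tail token "Mars" survives filtering and can be sampled, which is precisely why high-uncertainty sampling settings exacerbate hallucination.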
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Taxonomy of Hallucination</head><p>In this paper, we mainly consider representative hallucinations, which are widely observed in various downstream tasks, i.e. Machine Translation, Question and Answer, Dialog System, Summarization System, Knowledge Graph with LLMs, and Visual Question Answer. As shown in Table <ref type="table" target="#tab_1">1</ref>, these hallucinations form a complex taxonomy across numerous mainstream tasks associated with LLMs. In the following sections, we will introduce representative types of hallucinations to be resolved.</p><p>• Machine Translation. Since perturbations (e.g., spelling or capitalization errors) can induce hallucinations reliably, traditional machine translation models tend to fall back on instances memorised by the model when subjected to perturbations <ref type="bibr" target="#b86">[87,</ref><ref type="bibr" target="#b87">88]</ref>. It is worth noting that hallucinations generated by LLMs are mainly off-target translations, over-generation, or failed translation attempts <ref type="bibr" target="#b38">[39]</ref>. In low-resource language settings, most models exhibit subpar performance due to the lack of annotated data <ref type="bibr" target="#b53">[54]</ref>; in contrast, they are vulnerable to an increased number of pre-trained languages in multilingual settings <ref type="bibr" target="#b88">[89]</ref>. Subsequently, families of LLMs trained on different scales of monolingual data have been shown to be a source of oscillatory hallucination pathology <ref type="bibr" target="#b38">[39]</ref>. • Question and Answer. Imperfect responses suffer from flawed external knowledge, knowledge recall cues and reasoning instruction <ref type="bibr" target="#b41">[42]</ref>. For example, LLMs are mostly unable to avoid answering when provided with no relevant information, instead providing incomplete yet plausible answers <ref type="bibr" target="#b55">[56]</ref>. 
In addition to external knowledge, memorized information without an accurate, reliable and accessible source also contributes to different types of hallucinations <ref type="bibr" target="#b21">[22]</ref>. Though scaling laws suggest that perplexity on the training distribution improves with parameter size, <ref type="bibr" target="#b29">[30]</ref> further discovers that scaling up models can increase the rate of imitative falsehoods.</p><p>• Dialog System. Some studies view dialogue models as unobtrusive imitators, which simulate the distributional properties of data instead of generating faithful output. For example, uncooperative responses <ref type="bibr" target="#b56">[57]</ref> originating from discourse phenomena incline the model to output an exact copy of the entire evidence. <ref type="bibr" target="#b57">[58]</ref> reports more nuanced hallucinations in KG-grounded dialogue systems as analyzed through human feedback. Similarly, FaithDial <ref type="bibr" target="#b58">[59]</ref>, BEGIN <ref type="bibr" target="#b59">[60]</ref> and MixCL <ref type="bibr" target="#b60">[61]</ref> all implement experiments on the WoW dataset to conduct a meta-evaluation of hallucination in knowledge-grounded dialogue.</p><p>• Summarization System. Automatically generated abstracts based on LLMs may be fluent, but they still typically lack faithfulness to the source document. Compared to the human evaluation of traditional summarization models <ref type="bibr" target="#b25">[26]</ref>, the summarizations generated by LLMs can be categorized into two major types: intrinsic hallucinations that distort the information present in the document; extrinsic hallucinations that provide additional information that cannot be directly attributed to the document <ref type="bibr" target="#b64">[65]</ref>. 
Note that extrinsic hallucination, as a measure of factually consistent continuation of inputs in LLMs, is given more attention in summarisation systems <ref type="bibr" target="#b61">[62,</ref><ref type="bibr" target="#b63">64]</ref>. Furthermore, <ref type="bibr" target="#b62">[63]</ref> subdivides extrinsic hallucinations into factual and non-factual hallucinations. The former provides additional world knowledge, which may benefit comprehensive understanding.</p><p>• Knowledge Graph with LLMs. Despite the promising progress in knowledge-based text generation, it encounters intrinsic hallucinations inherent to the process, where the generated text not only covers the input information but also incorporates redundant details derived from its internal memorized knowledge <ref type="bibr" target="#b89">[90]</ref>. To address this, <ref type="bibr" target="#b65">[66]</ref> establishes a distinction between correctly generated knowledge and knowledge hallucinations in terms of knowledge creation. Notably, Virtual Knowledge Extraction <ref type="bibr" target="#b90">[91]</ref> underscores the potential generalization capabilities of LLMs in the realms of constructing and inferring from knowledge graphs. <ref type="bibr" target="#b31">[32]</ref> further empowers LLMs to produce interpretable fact-checks through a neural symbolic approach. Based on their fidelity to the source, hallucinations are defined as subject hallucination, relation hallucination, and object hallucination.</p><p>• Cross-modal System. Augmented by the superior language capabilities of LLMs, the performance of cross-modal tasks achieves promising progress <ref type="bibr" target="#b91">[92,</ref><ref type="bibr" target="#b39">40]</ref>. 
However, despite replacing the original language encoder with LLMs, Large Visual Language Models (LVLMs) <ref type="bibr" target="#b92">[93]</ref> still generate object descriptions that are not present in the target image, denoted as object hallucinations <ref type="bibr" target="#b40">[41]</ref>. In particular, such failure cases are typically found in Visual Question Answering <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b66">67]</ref>, Image Captioning <ref type="bibr" target="#b93">[94,</ref><ref type="bibr" target="#b94">95,</ref><ref type="bibr" target="#b95">96]</ref>, Report Generation <ref type="bibr" target="#b67">[68]</ref>, etc.</p></div>
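A toy version of the object-hallucination checks discussed above can be written in a few lines (a hypothetical helper, loosely in the spirit of CHAIR-style caption metrics rather than any cited system): any object term from a known vocabulary that the caption mentions but the image annotations lack is counted as hallucinated.

```python
# Toy object-hallucination check for image captioning (hypothetical helper):
# object terms mentioned in the caption but absent from the image's annotated
# object set are counted as hallucinated.
def hallucinated_objects(caption: str, annotated: set[str],
                         vocabulary: set[str]) -> set[str]:
    """Objects from a known vocabulary that the caption mentions
    but the image annotations do not contain."""
    mentioned = {w.strip(".,").lower() for w in caption.split()}
    return (mentioned & vocabulary) - annotated

vocab = {"dog", "cat", "frisbee", "car", "person"}
annotated = {"dog", "person"}
caption = "A person throws a frisbee to a dog near a car."
print(sorted(hallucinated_objects(caption, annotated, vocab)))  # ['car', 'frisbee']
```

Note how co-occurrence priors explain such errors: "frisbee" frequently appears alongside "dog" and "person" in training data, so the model mentions it even when the image lacks one.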
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Hallucination Detection</head><p>Conventional hallucination detection mainly depends on task-specific metrics, such as ROUGE and BLEU to evaluate the information overlap between source and target texts in summarization tasks <ref type="bibr" target="#b96">[97]</ref>, and knowledge F1 to estimate the knowledge-aware ability of response generation <ref type="bibr" target="#b97">[98]</ref>. These metrics focus on measuring faithfulness to references and fail to provide an assessment of factualness. Although some reference-free works have been proposed, plugin-based methods <ref type="bibr" target="#b98">[99]</ref> suffer from world knowledge limitations, QA-based matching metrics <ref type="bibr" target="#b99">[100]</ref> lack knowledge completeness of source information, and NLI-based methods <ref type="bibr" target="#b59">[60]</ref> are unable to support finer-grained hallucination checking as they are sentence-level; besides, entailment and hallucination are not equivalent problems. Given the paradigm shift in hallucination detection arising from the rapid development of LLMs, we present a novel taxonomy in Fig <ref type="figure" target="#fig_1">3</ref> and introduce each category in the following sections.</p><p>• Inference Classifier. The most straightforward strategy involves adopting classifiers to assess the likelihood of hallucinations. Concretely, given a question 𝒬 and an answer 𝒜, an inferential classifier 𝒞 can be asked to determine whether the answer contains hallucinatory content ℋ via computing 𝑝(ℋ) = ℱ𝒞(𝒬, 𝒜). Therefore, <ref type="bibr" target="#b63">[64]</ref> employs state-of-the-art LLMs for end-to-end generation of detection results. Other studies <ref type="bibr" target="#b30">[31]</ref> find that adding chains of thought indiscriminately may intervene in the final judgement, whereas retrieving knowledge properly results in gains. 
Furthering this concept, the hinted classifier and explainer <ref type="bibr" target="#b63">[64]</ref>, used to generate intermediate process labels and high-quality natural language explanations, are demonstrated to enhance the final predicted class from a variety of perspectives. Subsequently, <ref type="bibr" target="#b61">[62]</ref> suggests adopting a classifier model different from the generation model, contributing to easier judgement of factual consistency. For radiology report generation, binary classifiers <ref type="bibr" target="#b67">[68]</ref> can be leveraged to measure reliability by combining image and text embeddings. Unlike previous work that employs complex human-crafted rules to evaluate object hallucinations, GAVIE <ref type="bibr" target="#b66">[67]</ref> scores responses towards image content based on both accuracy and relevance criteria, evaluating the LMMs' output in an open-ended manner.</p><p>• Uncertainty Metric. It is important to examine the correlation between the hallucination metric and the quality of output from a variety of perspectives. One intuitive approach is to employ the probabilistic output of the model itself: ASTSN <ref type="bibr" target="#b74">[75]</ref> calculates the model's uncertainty about the identified concepts by utilising the logit output values. Similarly, BARTSCORE <ref type="bibr" target="#b69">[70]</ref> employs the universal notion that models trained to convert generated text to reference output or source text will score higher when the generated text is superior. It is an unsupervised metric that supports the addition of appropriate prompts to improve the measure design, without requiring human judgements for training. Furthermore, KoK <ref type="bibr" target="#b70">[71]</ref>, based on the work of <ref type="bibr" target="#b100">[101]</ref>, evaluates answer uncertainty from three categories, i.e., subjectivity, hedges and text uncertainty. 
Meanwhile, SLAG <ref type="bibr" target="#b71">[72]</ref> measures consistent factual beliefs in terms of paraphrase, logic, and entailment. In addition, KLD <ref type="bibr" target="#b72">[73]</ref> combines information theory-based metrics (e.g., entropy and KL-divergence) to capture knowledge uncertainty. Besides expert-stipulated programmatic supervision, POLAR <ref type="bibr" target="#b73">[74]</ref> introduces a Pareto optimal learning assessed risk score for estimating the confidence level of a response.</p><p>• Self-Evaluation. To self-evaluate is challenging since the model might be overconfident about its generated samples being correct. The motivating idea of SelfCheckGPT <ref type="bibr" target="#b76">[77]</ref> is to use the ability of the LLMs themselves to sample multiple responses and identify fictitious statements by measuring the consistency of information among responses. <ref type="bibr" target="#b75">[76]</ref> further illustrates that both the increase in size and the demonstration of assessment can improve self-assessment. Beyond repetitive multiple direct queries, <ref type="bibr" target="#b77">[78]</ref> uses open-ended indirect queries and compares their answers to each other for an agreed-upon score outcome. SelfCk <ref type="bibr" target="#b80">[81]</ref> imposes appropriate constraints on the same LLM to generate pairs of sentences triggering self-contradictions, which enables detection. In contrast, polling-based querying <ref type="bibr" target="#b40">[41]</ref> reduces the complexity of judgement by randomly sampling query objects. Besides, Self-Checker <ref type="bibr" target="#b78">[79]</ref> decomposes complex statements into multiple simple statements, fact-checking them one by one, while <ref type="bibr" target="#b79">[80]</ref> introduces two LLMs that drive the complex fact-checking reasoning process through cross-examination.</p><p>• Evidence Retrieval. 
Evidence retrieval accomplishes factual detection by retrieving supporting evidence related to hallucinations. To this end, designing a claim-centric pipeline allows a question-retrieve-summarize chain to effectively collect original evidence <ref type="bibr" target="#b83">[84,</ref><ref type="bibr" target="#b84">85]</ref>. Consequently, FActScore <ref type="bibr" target="#b82">[83]</ref> calculates the percentage of atomic facts supported by the given knowledge source. To adapt to tasks in which users interact with generative models, FacTool <ref type="bibr" target="#b85">[86]</ref> proposes to integrate a variety of tools into a task-agnostic and domain-agnostic detection framework, in order to assemble evidence about the authenticity of the generated content.</p></div>
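The self-evaluation idea described above can be sketched with a crude unigram-overlap proxy standing in for the cited works' actual NLI/QA scorers (the proxy is our simplification, not SelfCheckGPT's method): a claim the model cannot reproduce across stochastic samples is likely fabricated.

```python
# Sketch of consistency-based self-evaluation: score a claim by its average
# lexical agreement with independently sampled responses. Low agreement
# suggests the claim is not grounded in what the model reliably "knows".
def unigram_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency(claim: str, samples: list[str]) -> float:
    """Average agreement of `claim` with stochastically sampled responses."""
    return sum(unigram_overlap(claim, s) for s in samples) / len(samples)

samples = ["turing was born in london in 1912",
           "alan turing was born in london",
           "turing was born in 1912 in london"]
supported = consistency("turing was born in london", samples)
fabricated = consistency("turing was born in paris in 1935", samples)
print(round(supported, 2), round(fabricated, 2))
assert supported > fabricated  # low consistency flags the fabricated claim
```

Replacing `unigram_overlap` with an entailment model or QA-based scorer recovers the stronger variants used in practice.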
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Hallucination Correction</head><p>In this section, we delve into methods to correct hallucination from different aspects. As shown in Figure <ref type="figure" target="#fig_2">4</ref>, these hallucination correction paradigms have proven effective in many mainstream NLP tasks. Note that these methods are not entirely orthogonal but can complement each other as required by the tasks in practical applications. In the following sections, we will introduce each method as shown in Figure <ref type="figure">5</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Hallucination Correction</head><p>Figure <ref type="figure">5</ref>: Taxonomy of Hallucination Correction.</p><p>• Parameter Adaptation. Parameters in LLMs store biases learned in pre-training, which are often unaligned with user intent. A cutting-edge strategy is to guide effective knowledge through parameter conditioning, editing, and optimisation. For example, CLR <ref type="bibr" target="#b60">[61]</ref> optimises to reduce the generation probability of negative samples at the span level utilising contrastive learning. 
When introducing contextual background knowledge that contradicts the model's intrinsic prior knowledge, TYE <ref type="bibr" target="#b104">[105]</ref> effectively reduces the weight of prior knowledge through a context-aware decoding method. Besides, PURR <ref type="bibr" target="#b103">[104]</ref> corrupts text with noise, fine-tunes compact editors, and denoises by merging relevant evidence. Introducing an additional cache component, HISTALIGN <ref type="bibr" target="#b46">[47]</ref> discovers that its hidden state is not aligned with the current hidden state, and proposes sequence-information contrastive learning to improve the reliability of memory parameters. Consequently, Edit-TA <ref type="bibr" target="#b101">[102]</ref> mitigates the biases learnt in pre-training from a task arithmetic perspective. The intuition behind it is that parameter variations learnt through negative example tasks can be perceived through weight variances. However, as this fails to take the importance of different negative examples into account, EWR <ref type="bibr" target="#b102">[103]</ref> proposes Fisher information matrices to measure the uncertainty of their estimation, which is applied in dialogue systems to execute parameter interpolation and remove hallucination. EasyEdit <ref type="bibr" target="#b108">[109]</ref> summarises methods for parameter editing while minimising the influence on irrelevant parameters. An efficient alternative is to identify task-specific parameters and exploit them. For example, ALLM <ref type="bibr" target="#b105">[106]</ref> aligns the parameter module with task-specific knowledge, and then generates the relevant knowledge as additional context in background-augmented prompts. 
Similarly, mmT5 <ref type="bibr" target="#b54">[55]</ref> utilises language-specific modules during pre-training to separate language-specific information from language-independent information, demonstrating that adding language-specific modules can alleviate the curse of multilinguality. In contrast, TRAC <ref type="bibr" target="#b106">[107]</ref> combines conformal prediction and global testing to augment retrieval-based QA. Its conservative strategy ensures that an answer semantically equivalent to the truthful answer is included in the prediction set.</p><p>Another parameter adaptation idea focuses on flexible sampling consistent with user requirements. For instance, <ref type="bibr" target="#b50">[51]</ref> observes that the randomness of sampling is more detrimental to factuality when generating the latter part of a sentence. The factual-nucleus sampling algorithm is introduced to preserve the faithfulness of the generation while maintaining quality and diversity. Besides, Inference-Time <ref type="bibr" target="#b107">[108]</ref> first identifies a set of attention heads with high linear-probing accuracy, and then shifts activations during inference along the direction associated with factual knowledge.</p><p>• Post-hoc Attribution and Edit Technology. One source of hallucination is that LLMs may apply patterns observed in the pre-training data to inference in a novel form. Recently, ORCA <ref type="bibr" target="#b111">[112]</ref> reveals problematic patterns in model behaviour by probing supporting evidence in the pre-training data. Similarly, TRAK <ref type="bibr" target="#b113">[114]</ref> and Data-Portraits <ref type="bibr" target="#b114">[115]</ref> analyse whether models plagiarise or reference existing resources by means of data attribution. 
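The membership-testing idea behind such data-attribution tools can be sketched as a hashed n-gram index over the corpus, queried with generated text to estimate verbatim overlap. The class and method names here are illustrative, not the actual Data-Portraits API:

```python
import hashlib

def ngrams(text, n):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

class DataPortrait:
    """Hashed n-gram index over a corpus: membership queries test whether
    generated text overlaps verbatim with the indexed (pre-)training data."""
    def __init__(self, corpus, n=3):
        self.n = n
        self.index = {hashlib.sha1(g.encode()).hexdigest()[:16]
                      for doc in corpus for g in ngrams(doc, n)}

    def overlap_ratio(self, text):
        grams = ngrams(text, self.n)
        if not grams:
            return 0.0
        hits = sum(hashlib.sha1(g.encode()).hexdigest()[:16] in self.index
                   for g in grams)
        return hits / len(grams)

portrait = DataPortrait(["the quick brown fox jumps over the lazy dog"], n=3)
assert portrait.overlap_ratio("the quick brown fox") == 1.0  # seen verbatim
assert portrait.overlap_ratio("a completely novel sentence") == 0.0
```

Hashing the n-grams keeps the index compact while still supporting exact-match membership queries over very large corpora.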
QUIP <ref type="bibr" target="#b117">[118]</ref> further demonstrates that providing text observed during the pre-training phase improves the ability of LLMs to generate factual information. Furthermore, motivated by the gap between LLM and human modes of thinking, one intuition is to align the two modes of reasoning. Thus, <ref type="bibr" target="#b13">[14]</ref> elicits faithful reasoning via Chain-of-Thought (CoT) <ref type="bibr" target="#b12">[13]</ref> prompting. Similarly, RR <ref type="bibr" target="#b112">[113]</ref> retrieves relevant external knowledge based on the decomposed reasoning steps obtained from a CoT prompt. Since LLMs often do not produce their best output on the first attempt, Self-Refine <ref type="bibr" target="#b115">[116]</ref> implements self-refinement through iterative feedback and improvement. Reflexion <ref type="bibr" target="#b116">[117]</ref> likewise employs verbal reinforcement to generate reflective feedback by learning from prior failings. Verify-and-Edit <ref type="bibr" target="#b118">[119]</ref> proposes a CoT-prompted verify-and-edit framework, which improves the fidelity of predictions by post-editing the reasoning chain based on externally retrieved knowledge. CoVe <ref type="bibr" target="#b119">[120]</ref> emphasises the importance of independent self-verification to avoid being influenced by other responses. Another source of hallucination is describing factual content with incorrect retrievals. To rectify this, NP-Hunter <ref type="bibr" target="#b110">[111]</ref> follows a generate-then-refine strategy whereby a generated response is amended using the KG, so that the dialogue system can correct potential hallucinations by querying the KG.</p><p>• Leverage External Knowledge. As an attempt to extend language models for hallucination mitigation, one suggestion is to retrieve relevant documents from large textual databases. 
RETRO <ref type="bibr" target="#b120">[121]</ref> splits the input sequence into chunks and retrieves similar documents, while In-Context RALM <ref type="bibr" target="#b61">[62]</ref> places the selected document before the input text to improve prediction. Furthermore, IRCoT <ref type="bibr" target="#b121">[122]</ref> interweaves CoT generation and document retrieval steps to guide LLMs. LLM-AUGMENTER <ref type="bibr" target="#b122">[123]</ref> also grounds the responses of LLMs in integrated external knowledge and automated feedback to improve the truthfulness of answers. Another work, CoK <ref type="bibr" target="#b125">[126]</ref>, iteratively anticipates the content of upcoming sentences and applies it as a query to retrieve relevant documents, re-generating sentences that contain low-confidence tokens. Similarly, RETA-LLM <ref type="bibr" target="#b128">[129]</ref> creates a complete pipeline to assist users in building their own domain-based LLM retrieval systems. Note that, in addition to document retrieval, diverse external knowledge sources could be assembled into retrieval-augmented LLM systems. For example, FLARE <ref type="bibr" target="#b126">[127]</ref> leverages structured knowledge bases to support complex queries and provide more straightforward factual statements. Further, KnowledGPT <ref type="bibr" target="#b129">[130]</ref> adopts program-of-thoughts (PoT) prompting, which generates code to interact with knowledge bases, while cTBL <ref type="bibr" target="#b124">[125]</ref> proposes enhancing LLMs with tabular data in conversational settings. Besides, GeneGPT <ref type="bibr" target="#b123">[124]</ref> demonstrates that domain expertise can be accessed more easily and accurately by detecting and executing API calls through in-context learning and augmented decoding algorithms. 
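The retrieve-then-prepend pattern shared by several of these systems can be sketched as follows; the lexical-overlap scorer is a toy stand-in for a real dense or sparse retriever:

```python
def score(query, doc):
    """Toy lexical relevance: word overlap between query and document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def retrieve(query, corpus, k=1):
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query, corpus):
    """Prepend the top retrieved document(s) to the user query, so the
    model conditions on external evidence rather than parametric memory."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Daniel Vacek is a former tennis player who turned professional in 1990.",
    "The Czech Republic is a landlocked country in Central Europe.",
]
prompt = build_prompt("In which sport did Daniel Vacek turn professional?", corpus)
assert "tennis" in prompt
```

In practice the generation step would pass `prompt` to an LLM; the key point is that the evidence is placed before the question, as in In-Context RALM.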
To support potentially millions of ever-changing APIs, Gorilla <ref type="bibr" target="#b127">[128]</ref> explores self-instruct fine-tuning and retrieval for efficient API exploitation.</p><p>• Assessment Feedback. As language models become more sophisticated, evaluation feedback can significantly improve the quality of generated text and reduce the occurrence of hallucinations. To realise this concept, LSHF <ref type="bibr" target="#b130">[131]</ref>, TLM <ref type="bibr" target="#b131">[132]</ref> and Chain-of-Hindsight <ref type="bibr" target="#b133">[134]</ref> predict human preferences through reinforcement learning and employ them as the reward function. In addition to letting the model learn directly from the feedback of factuality metrics in a sample-efficient manner <ref type="bibr" target="#b137">[138]</ref>, it is also important to build a self-evaluation function into the model to filter candidate generations. For example, BRIO <ref type="bibr" target="#b132">[133]</ref> empowers summarisation model assessment, estimating probability distributions over candidate outputs to rate the quality of candidate summaries, while LM-know <ref type="bibr" target="#b75">[76]</ref> investigates whether LLMs can evaluate the validity of their own claims by estimating the probability that they know the answer to a question. Do-LLM-Know <ref type="bibr" target="#b77">[78]</ref> queries black-box LLMs exclusively, generating answers to the same query multiple times and comparing the results with each other as a consistency check. Since missing citation-quality evaluation affects final performance, ALCE <ref type="bibr" target="#b136">[137]</ref> employs a natural language inference model to measure citation quality and extends the integrated retrieval system. 
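The repeated-query consistency check just described can be sketched as sampling the same question several times and measuring agreement; `sample_fn` is a hypothetical stand-in for one stochastic black-box model call:

```python
from collections import Counter
import itertools

def consistency_check(sample_fn, query, n=5, threshold=0.6):
    """Query a black-box model n times and measure agreement among the
    sampled answers; low agreement flags a likely hallucination."""
    answers = [sample_fn(query) for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return top, agreement, agreement >= threshold

# Simulated model: replies cycle deterministically for illustration.
replies = itertools.cycle(["tennis", "tennis", "tennis", "cricket", "tennis"])
answer, agreement, consistent = consistency_check(lambda q: next(replies),
                                                  "Vacek's sport?")
assert answer == "tennis" and consistent  # 4/5 agreement passes the check
```

The threshold trades off precision against recall of the hallucination flag; a real system would also normalise answers before comparing them.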
Similarly, CRITIC <ref type="bibr" target="#b134">[135]</ref> proposes interacting with appropriate tools to assess certain aspects of the text, and then modifying the output based on the feedback obtained during verification. Note that automated error checking can also use LLMs to generate text that conforms to tool interfaces. PaD <ref type="bibr" target="#b135">[136]</ref> distils LLMs with synthetic reasoning programs, which can be automatically compiled and executed by an interpreter. Further, iterative refinement processes have been validated to effectively identify appropriate details <ref type="bibr" target="#b95">[96,</ref><ref type="bibr" target="#b44">45,</ref><ref type="bibr" target="#b14">15]</ref>, and can stop invalid reasoning chains early, beneficially reducing the phenomenon of hallucination snowballing <ref type="bibr" target="#b26">[27]</ref>.</p><p>• Mindset Society. Human intelligence thrives on cognitive synergy, where collaboration between different cognitive processes produces better results than isolated individual processes. The "society of minds" <ref type="bibr" target="#b144">[145]</ref> is believed to have the potential to significantly improve the performance of LLMs and pave the way for consistency in language production and comprehension. To address hallucinations of large-scale multilingual models across different translation scenarios, HLMTM <ref type="bibr" target="#b38">[39]</ref> proposes a hybrid setting in which other translation systems can be requested to act as back-ups when the original system is hallucinating. Multiagent-Debate <ref type="bibr" target="#b139">[140]</ref> employs multiple LLMs over several rounds to propose and debate their individual responses and reasoning processes in order to reach a consensus final answer. 
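Such a debate loop can be sketched as below, with toy callables standing in for LLM agents; each agent revises its answer after seeing its peers' previous answers, and the majority position is returned:

```python
from collections import Counter

def debate(agents, question, rounds=2):
    """Each agent answers, then revises after seeing the other agents'
    answers; the final answer is the majority position. `agents` are
    stand-ins: callables (question, peer_answers) -> answer."""
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        answers = [agent(question, [a for j, a in enumerate(answers) if j != i])
                   for i, agent in enumerate(agents)]
    return Counter(answers).most_common(1)[0][0]

# Toy agents: one confident correct agent, two that initially err but
# defer to the first answer they observe among their peers.
def confident(question, peers):
    return "tennis"

def conformist(question, peers):
    return Counter(peers).most_common(1)[0][0] if peers else "cricket"

winner = debate([confident, conformist, conformist], "Vacek's sport?")
assert winner == "tennis"
```

With real LLM agents, `peer_answers` would be injected into the next-round prompt, and the judge or majority vote applied to the final round's outputs.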
Through this process, the models are encouraged to construct answers that are consistent both with their internal criticism and with the responses of other agents. Before a final answer is presented, the resulting community of models can hold multiple reasoning chains and possible answers simultaneously. Building on this idea, MAD <ref type="bibr" target="#b140">[141]</ref> adds a judge-managed debate process, demonstrating that adaptive interruption of debates and a controlled "tit-for-tat" state help to complete factual debates. Furthermore, FORD <ref type="bibr" target="#b141">[142]</ref> proposes roundtable debates that include more than two LLMs and emphasises that competent judges are essential to steer the debate. LM-vs-LM <ref type="bibr" target="#b79">[80]</ref> also proposes multi-round interactions in which one LM cross-examines another to check the factuality of the original statements. Besides, PRD <ref type="bibr" target="#b142">[143]</ref> proposes a peer-rank-and-discussion-based evaluation framework to arrive at an assessment result that all peers agree with. To maintain strong reasoning, SPP <ref type="bibr" target="#b143">[144]</ref> utilises LLMs to assign several fine-grained roles, which effectively stimulates knowledge acquisition and reduces hallucinations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Future Directions</head><p>Though numerous technical solutions for hallucinations in LLMs have been surveyed above, several potential directions remain:</p><p>• Data Construction Management. As previously discussed, the style and knowledge of LLMs are largely learned during pre-training. High-quality data present promising opportunities for reducing hallucinations in LLMs <ref type="bibr" target="#b145">[146]</ref>. Inspired by the basic rule of machine learning, "garbage in, garbage out", <ref type="bibr" target="#b146">[147]</ref> proposes that data quality and diversity matter more than large-scale instruction fine-tuning <ref type="bibr" target="#b147">[148,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b148">149]</ref> and RLHF <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b1">2]</ref>. To perform efficiently in knowledge-intensive verticals, we argue that constructing entity-centred fine-tuning instructions <ref type="bibr" target="#b149">[150,</ref><ref type="bibr" target="#b150">151,</ref><ref type="bibr" target="#b151">152]</ref> is a promising direction, as it can enhance the factuality of generated entity information. Another feasible proposal is to incorporate a self-curation phase <ref type="bibr" target="#b152">[153]</ref> into the instruction construction process to rate the quality of candidate pairs. During iteration, quality evaluation <ref type="bibr" target="#b153">[154]</ref> based on manual or automated rule constraints could provide self-correction capacity.</p><p>• Reasoning Mechanism Exploitation. The emerging CoT technique <ref type="bibr" target="#b13">[14]</ref> stimulates the emergent reasoning ability of LLMs by imitating the intrinsic stream of thought. 
A primary improvement, ToT <ref type="bibr" target="#b154">[155]</ref>, introduces tree structure into the thought process and provides a novel backtracking capability. However, actual thinking forms a complex network of ideas; for example, a person may explore a particular chain of reasoning, backtrack, or start a new chain. GoT <ref type="bibr" target="#b155">[156]</ref> extends the dependencies between thoughts by constructing vertices with multiple incoming edges to aggregate arbitrary thoughts. Since previous methods lack storage for intermediate results, CR <ref type="bibr" target="#b155">[156]</ref> works in a cumulative, iterative manner to simulate human thought processes and decomposes the task into smaller components. In addition to such self-heuristic methods, PAL <ref type="bibr" target="#b156">[157]</ref> and PoT <ref type="bibr" target="#b157">[158]</ref> introduce programming logic into the language space <ref type="bibr" target="#b158">[159]</ref>, adding the ability to invoke external interpreters. In summary, research grounded in human cognition helps to provide insight into the analysis of hallucinations, for example Dual Process Theory <ref type="bibr" target="#b159">[160]</ref>, the three-layer mental model <ref type="bibr" target="#b160">[161]</ref>, the Computational Theory of Mind <ref type="bibr" target="#b161">[162]</ref>, and Connectionism <ref type="bibr" target="#b162">[163]</ref>.</p><p>• Multi-modal Hallucination Survey. It has become a community consensus to build powerful Multimodal Large Language Models (MLLMs) <ref type="bibr" target="#b163">[164,</ref><ref type="bibr" target="#b164">165,</ref><ref type="bibr" target="#b165">166]</ref> by taking advantage of the excellent comprehension and reasoning capabilities of LLMs. <ref type="bibr" target="#b40">[41]</ref> confirms the severity of hallucination in MLLMs through object detection and polling-based querying. 
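Polling-based querying of this kind can be sketched as balanced yes/no probes over objects that are and are not present in the image; `ask_model` below is a hypothetical stand-in for one MLLM call:

```python
def polling_queries(present_objects, absent_objects):
    """Build balanced yes/no probes: the model should answer 'yes' for
    objects detected in the image and 'no' for sampled absent objects."""
    probes = [(f"Is there a {o} in the image?", "yes") for o in present_objects]
    probes += [(f"Is there a {o} in the image?", "no") for o in absent_objects]
    return probes

def hallucination_rate(ask_model, probes):
    """Fraction of absent-object probes the model wrongly affirms."""
    wrong = sum(1 for q, gold in probes
                if gold == "no" and ask_model(q) == "yes")
    total = sum(1 for _, gold in probes if gold == "no")
    return wrong / total if total else 0.0

probes = polling_queries(["dog", "frisbee"], ["car", "boat"])
# A model that affirms everything hallucinates every absent object.
assert hallucination_rate(lambda q: "yes", probes) == 1.0
```

Sampling the absent objects from frequently co-occurring categories makes the probe set adversarial rather than random.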
The results indicate that MLLMs are highly susceptible to object hallucination: the generated description often does not match the target image. Besides, <ref type="bibr" target="#b166">[167]</ref> finds that MLLMs have limited multimodal reasoning ability and depend on spurious cues. Though a current study <ref type="bibr" target="#b167">[168]</ref> provides a broad overview of MLLMs, the causes of hallucination have not been comprehensively investigated. In the future, as more sophisticated multi-modal applications emerge, in-depth analysis of the biased distributions resulting from misalignment among modalities is a promising research direction for providing faithful modal interactions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion and Vision</head><p>In this paper, we provide an overview of hallucination in LLMs, covering a new taxonomy, theoretical insights, detection methods, correction methods, and several future research directions. Note that it is crucial to utilise LLMs in a responsible and beneficial manner. Furthermore, with sophisticated and efficient detection methods proposed for various aspects, LLMs will provide humans with reliable and secure information in broad application scenarios.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Illustration of Hallucination in LLMs. While the initial response appears fluent, it fails to align with the world knowledge retrieved from the external knowledge base.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Taxonomy of Hallucination Detection.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Sankey diagram of hallucination correction methods with different mainstream NLP tasks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>In which sport did the Czech stars Daniel Vacek and Hana Mandlíková gain professional status? Daniel Vacek and Hana Mandlíková both gained professional status in cricket. ◆Daniel Vacek (born 1 April 1971) is a former tennis player from Czechoslovakia and the Czech Republic who turned professional in 1990.</figDesc><table><row><cell>◆Hana Mandlíková (born 19 February 1962) is a</cell></row><row><cell>former professional tennis player from</cell></row><row><cell>Czechoslovakia who later obtained Australian</cell></row><row><cell>citizenship.</cell></row><row><cell>Daniel Vacek and Hana Mandlíková both gained</cell></row><row><cell>professional status in tennis.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>List of Representative Hallucination</figDesc><table><row><cell>Paper</cell><cell>Task</cell><cell cols="3">Architecture Resources</cell><cell cols="2">Hallucination Types</cell><cell cols="2">Research Method</cell></row><row><cell>Raunak et al. [52]</cell><cell>Machine Transla-</cell><cell>Enc-Dec</cell><cell cols="2">IWSLT-2014</cell><cell cols="2">Under perturbation, Natural hal-</cell><cell cols="2">Source perturbation</cell></row><row><cell></cell><cell>tion</cell><cell></cell><cell></cell><cell></cell><cell>lucination</cell><cell></cell><cell></cell></row><row><cell cols="2">Guerreiro et al. [53] Machine Transla-</cell><cell>Enc-Dec</cell><cell>WMT2018</cell><cell></cell><cell cols="2">Oscillatory hallucination, Largely</cell><cell cols="2">Consider a natural sce-</cell></row><row><cell></cell><cell>tion</cell><cell></cell><cell></cell><cell></cell><cell cols="2">fluent hallucination</cell><cell>nario</cell></row><row><cell>Dale et al. [54]</cell><cell>Machine Transla-</cell><cell>Enc-Dec</cell><cell cols="2">FLORES-200, Jig-</cell><cell cols="2">Full hallucination, Partial halluci-</cell><cell cols="2">Introduce pathology</cell></row><row><cell></cell><cell>tion</cell><cell></cell><cell cols="2">saw, Wikipedia</cell><cell cols="2">nation, Word-level hallucination</cell><cell>detection</cell></row><row><cell>Pfeiffer et al. [55]</cell><cell>Multilingual</cell><cell>Enc-Dec</cell><cell cols="2">XQuAD, TyDi,</cell><cell cols="2">Source language hallucination</cell><cell cols="2">Evaluate source lan-</cell></row><row><cell></cell><cell>Seq2seq</cell><cell></cell><cell cols="2">XNLI, XL-Sum,</cell><cell></cell><cell></cell><cell cols="2">guage hallucination</cell></row><row><cell></cell><cell></cell><cell></cell><cell>MASSIVE</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Lin et al. 
[30]</cell><cell>Question and An-</cell><cell>Enc-Dec,</cell><cell cols="2">TruthfulQA</cell><cell cols="2">Imitative falsehoods</cell><cell cols="2">Cause imitative false-</cell></row><row><cell></cell><cell>swer</cell><cell>Only-Dec</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>hoods</cell></row><row><cell>Zheng et al. [42]</cell><cell>Question and An-</cell><cell>Only-Dec</cell><cell cols="2">HotpotQA,</cell><cell>Comprehension,</cell><cell>Factualness,</cell><cell cols="2">Manual analysis of re-</cell></row><row><cell></cell><cell>swer</cell><cell></cell><cell>BoolQ</cell><cell></cell><cell cols="2">Specificity, Inference Hallucina-</cell><cell>sponses</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>tion</cell><cell></cell><cell></cell></row><row><cell>Adlakha et al. [56]</cell><cell>Question and An-</cell><cell>Enc-Dec,</cell><cell cols="2">NQ, HotpotQA,</cell><cell cols="2">Semantic equivalence, Symbolic</cell><cell cols="2">Evaluate retrieval aug-</cell></row><row><cell></cell><cell>swer</cell><cell>Only-Dec</cell><cell cols="2">TopiOCQA</cell><cell cols="2">equivalence, Intrinsic ambiguity,</cell><cell cols="2">mented QA</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Granularity discrepancies, Incom-</cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">plete, Enumeration, Satisfactory</cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>Subset</cell><cell></cell><cell></cell></row><row><cell cols="2">Umapathi et al. 
[22] Question and An-</cell><cell>Only-Dec</cell><cell cols="2">MEDMCQA,</cell><cell>Reasoning</cell><cell>hallucination,</cell><cell cols="2">Medical benchmark</cell></row><row><cell></cell><cell>swer</cell><cell></cell><cell>Headqa,</cell><cell>US-</cell><cell cols="2">Memory-based hallucination</cell><cell cols="2">Med-HALT</cell></row><row><cell></cell><cell></cell><cell></cell><cell cols="2">MILE, Medqa,</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell></cell><cell>Pubmed</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>Dziri et al. [57]</cell><cell>Dialog System</cell><cell>Enc-Dec,</cell><cell>WoW,</cell><cell>CMU-</cell><cell cols="2">Hallucination, Partial hallucina-</cell><cell>Infer</cell><cell>exclusively</cell></row><row><cell></cell><cell></cell><cell>Only-Dec</cell><cell>DOG,</cell><cell>Topi-</cell><cell cols="2">tion, Generic, Uncooperative</cell><cell cols="2">from the knowledge-</cell></row><row><cell></cell><cell></cell><cell></cell><cell>calChat</cell><cell></cell><cell></cell><cell></cell><cell>snippet</cell></row><row><cell>Das et al. [58]</cell><cell>Dialog System</cell><cell>Only-Dec</cell><cell cols="2">OpenDialKG</cell><cell cols="2">Extrinsic-Soft/Hard/ Grouped,</cell><cell cols="2">Analyze entity-level</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Intrinsic-Soft/ Hard/Repetitive,</cell><cell cols="2">fact hallucination</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>History Corrupted</cell><cell></cell><cell></cell></row><row><cell>Dziri et al. 
[59]</cell><cell>Dialog System</cell><cell>Enc-Dec,</cell><cell>WoW</cell><cell></cell><cell cols="2">Hallucination, Generic, Uncooper-</cell><cell cols="2">Hallucination-free</cell></row><row><cell></cell><cell></cell><cell>Only-Dec</cell><cell></cell><cell></cell><cell>ativeness</cell><cell></cell><cell cols="2">benchmark FaithDial</cell></row><row><cell>Dziri et al. [60]</cell><cell>Dialog System</cell><cell>Enc-Dec,</cell><cell>WoW,</cell><cell>CMU-</cell><cell cols="2">Fully attributable, Not at-</cell><cell cols="2">Knowledge-grounded</cell></row><row><cell></cell><cell></cell><cell>Only-Enc,</cell><cell>DOG,</cell><cell>Topi-</cell><cell>tributable, Generic</cell><cell></cell><cell cols="2">interaction</cell><cell>bench-</cell></row><row><cell></cell><cell></cell><cell>Only-Dec</cell><cell>calChat</cell><cell></cell><cell></cell><cell></cell><cell cols="2">mark Begin</cell></row><row><cell>Sun et al. [61]</cell><cell>Dialog System</cell><cell>Enc-Dec,</cell><cell>WoW</cell><cell></cell><cell cols="2">Intrinsic hallucination, Extrinsic</cell><cell cols="2">Sample responses for</cell></row><row><cell></cell><cell></cell><cell>Only-Dec</cell><cell></cell><cell></cell><cell>hallucination</cell><cell></cell><cell cols="2">conversation</cell></row><row><cell>Tam et al. [62]</cell><cell>Summarization</cell><cell>Enc-Dec,</cell><cell cols="6">CNN/DM, XSum Factually inconsistent summaries Generate summaries</cell></row><row><cell></cell><cell>System</cell><cell>Only-Dec</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">from given models</cell></row><row><cell>Cao et al. 
[63]</cell><cell>Summarization</cell><cell>Enc-Dec,</cell><cell>MENT</cell><cell></cell><cell cols="2">Non-hallucinated, Factual halluci-</cell><cell cols="2">Label factual entities</cell></row><row><cell></cell><cell>System</cell><cell>Only-Dec</cell><cell></cell><cell></cell><cell cols="2">nation, Non-factual hallucination,</cell><cell cols="2">from summarizations</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Intrinsic hallucination</cell><cell></cell></row><row><cell>Shen et al. [64]</cell><cell>Summarization</cell><cell>Enc-Dec,</cell><cell>NHNet</cell><cell></cell><cell cols="2">News headline hallucination</cell><cell cols="2">Majority vote of jour-</cell></row><row><cell></cell><cell>System</cell><cell>Only-Enc</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">nalism degree holders</cell></row><row><cell>Qiu et al. [65]</cell><cell>Summarization</cell><cell>Multiple</cell><cell>XL-Sum</cell><cell></cell><cell cols="2">Intrinsic hallucination, Extrinsic</cell><cell cols="2">In a cross-lingual</cell></row><row><cell></cell><cell>System</cell><cell>ADapters</cell><cell></cell><cell></cell><cell>hallucination</cell><cell></cell><cell cols="2">transfer setting</cell></row><row><cell>Yu et al. 
[66]</cell><cell>Knowledge-based</cell><cell>Enc-Dec,</cell><cell cols="2">Encyclopedic,</cell><cell cols="2">Knowledge hallucination</cell><cell cols="2">Evaluate knowledge</cell></row><row><cell></cell><cell>text generation</cell><cell>Only-Dec</cell><cell>ETC</cell><cell></cell><cell></cell><cell></cell><cell cols="2">creating ability given</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">known facts</cell></row><row><cell>Mihindukulasooriya</cell><cell>Knowledge graph</cell><cell>Only-Dec</cell><cell>TekGen,</cell><cell></cell><cell cols="2">Subject hallucination, relation hal-</cell><cell>Ontology</cell><cell>driven</cell></row><row><cell>et al. [32]</cell><cell>generation</cell><cell></cell><cell>WebNLG</cell><cell></cell><cell cols="2">lucination, object hallucination</cell><cell>KGC</cell><cell>benchmark</cell></row><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Text2KGBench</cell></row><row><cell>Li et al. [41]</cell><cell>Visual Question</cell><cell>Enc-Dec</cell><cell>MSCOCO</cell><cell></cell><cell cols="2">Object hallucination</cell><cell cols="2">Caption hallucination</cell></row><row><cell></cell><cell>Answer</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">assessment</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">NeurIPS 2020</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Training language models to follow instructions with human feedback</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Wainwright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Slama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kelton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Welinder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Christiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leike</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lowe</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html" />
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Finetuned language models are zero-shot learners</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=gEZrGCozdqR" />
	</analytic>
	<monogr>
		<title level="m">The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event</title>
				<meeting><address><addrLine>OpenReview</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">April 25-29, 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Chowdhery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sutton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Schuh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tsvyashchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barnes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Prabhakaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Reif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hutchinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pope</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Austin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Isard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gur-Ari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Duke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Levskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Michalewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ippolito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Spiridonov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sepassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Omernick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">S</forename><surname>Pillai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pellat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lewkowycz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Moreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Polozov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Saeta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Catasta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Meier-Hellstern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Petrov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fiedel</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2204.02311</idno>
		<idno type="arXiv">arXiv:2204.02311</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2204.02311" />
		<title level="m">PaLM: Scaling language modeling with pathways</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2302.13971</idno>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.13971" />
		<title level="m">LLaMA: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Training a helpful and harmless assistant with reinforcement learning from human feedback</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ndousse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dassarma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Drain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Joseph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kernion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Conerly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Showk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Elhage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hatfield-Dodds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hume</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Johnston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kravec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lovitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nanda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2204.05862</idno>
		<idno type="arXiv">arXiv:2204.05862</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2204.05862" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dewan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">V</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sridhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2205.01068</idno>
		<idno type="arXiv">arXiv:2205.01068</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2205.01068" />
		<title level="m">OPT: Open pre-trained transformer language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">GLM-130B: an open bilingual pre-trained model</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">L</forename><surname>Tam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<ptr target="https://openreview.net/pdf?id=-Aw0rrrPUF" />
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations, ICLR 2023</title>
				<meeting><address><addrLine>Kigali, Rwanda</addrLine></address></meeting>
		<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2023">May 1-5, 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">WizardLM: Empowering large language models to follow complex instructions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.12244</idno>
		<idno type="arXiv">arXiv:2304.12244</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.12244" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Pre-trained models: Past, present and future</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.aiopen.2021.08.002</idno>
		<ptr target="https://doi.org/10.1016/j.aiopen.2021.08.002" />
	</analytic>
	<monogr>
		<title level="j">AI Open</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="225" to="250" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Towards reasoning in large language models: A survey</title>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Chang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.67</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.findings-acl.67" />
	</analytic>
	<monogr>
		<title level="m">Findings of ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1049" to="1065" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">ChatGPT vs human-authored text: Insights into controllable text summarization and sentence style transfer</title>
		<author>
			<persName><forename type="first">D</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Demberg</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-srw.1</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-srw.1" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">V</forename><surname>Padmakumar</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Vallejo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Fu</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Large language models are zero-shot reasoners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kojima</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Matsuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Iwasawa</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html" />
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ichter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">H</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html" />
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Improving language models via plug-and-play retrieval feedback</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14002</idno>
		<idno type="arXiv">arXiv:2305.14002</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14002" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Large language models struggle to learn long-tail knowledge</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kandpal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wallace</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/kandpal23a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="15696" to="15707" />
		</imprint>
	</monogr>
	<note>ICML 2023</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">When not to trust language models: Investigating effectiveness of parametric and non-parametric memories</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mallen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Asai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khashabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.546</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.546" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="9802" to="9822" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Internet-augmented language models through few-shot prompting for open-domain question answering</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lazaridou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gribovskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Stokowiec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Grigorev</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2203.05115</idno>
		<idno type="arXiv">arXiv:2203.05115</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2203.05115" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">REPLUG: retrieval-augmented black-box language models</title>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>James</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2301.12652</idno>
		<idno type="arXiv">arXiv:2301.12652</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2301.12652" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">A survey of knowledge-enhanced text generation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.1145/3512467</idno>
		<ptr target="https://doi.org/10.1145/3512467" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">54</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Dash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Thapa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Banda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swaminathan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cheatham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kashyap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kotecha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gombar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Downing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pedreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arnaout</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">K</forename><surname>Morris</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Magon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Lungren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Horvitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">H</forename><surname>Shah</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.13714</idno>
		<idno type="arXiv">arXiv:2304.13714</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.13714" />
		<title level="m">Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Med-HALT: Medical domain hallucination test for large language models</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">K</forename><surname>Umapathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sankarasubbu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.15343</idno>
		<idno type="arXiv">arXiv:2307.15343</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.15343" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Transformative effects of ChatGPT on modern education: Emerging era of AI chatbots</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Gill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Patros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kaur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Parlikad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stankovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Abraham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lutfiyya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Kanhere</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bahsoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">F</forename><surname>Rana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dustdar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sakellariou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Uhlig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Buyya</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.03823</idno>
		<idno type="arXiv">arXiv:2306.03823</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.03823" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Hallucination is the last thing you need</title>
		<author>
			<persName><forename type="first">S</forename><surname>Curran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lansley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bethell</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.11520</idno>
		<idno type="arXiv">arXiv:2306.11520</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.11520" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
		<ptr target="https://doi.org/10.1145/3571730" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">On faithfulness and factuality in abstractive summarization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bohnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>McDonald</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.173</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.173" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2020">July 5-10, 2020</date>
			<biblScope unit="page" from="1906" to="1919" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">How language model hallucinations can snowball</title>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Press</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Merrill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13534</idno>
		<idno type="arXiv">arXiv:2305.13534</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13534" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Mi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.12966</idno>
		<idno type="arXiv">arXiv:2307.12966</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.12966" />
		<title level="m">Aligning large language models with human: A survey</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies</title>
		<author>
			<persName><forename type="first">L</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saxon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Nathani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.03188</idno>
		<idno type="arXiv">arXiv:2308.03188</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.03188" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">TruthfulQA: Measuring how models mimic human falsehoods</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hilton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Evans</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.229</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.acl-long.229" />
	</analytic>
	<monogr>
		<title level="m">ACL 2022, ACL</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3214" to="3252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">HaluEval: A large-scale hallucination evaluation benchmark for large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.11747</idno>
		<idno type="arXiv">arXiv:2305.11747</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11747" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Text2KGBench: A benchmark for ontology-driven knowledge graph generation from text</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mihindukulasooriya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">F</forename><surname>Enguix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lata</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.02357</idno>
		<idno type="arXiv">arXiv:2308.02357</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.02357" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Laban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.172</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.172" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="3063" to="3079" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Improving in-context few-shot learning via self-supervised training</title>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pasunuru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Iyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.260</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.naacl-main.260" />
	</analytic>
	<monogr>
		<title level="m">NAACL 2022, ACL</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Carpuat</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>De Marneffe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><forename type="middle">V M</forename><surname>Ruíz</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3558" to="3573" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<title level="m" type="main">Sources of hallucination by large language models on inference tasks</title>
		<author>
			<persName><forename type="first">N</forename><surname>McKenna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Steedman</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14552</idno>
		<idno type="arXiv">arXiv:2305.14552</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14552" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Data distributional properties drive emergent in-context learning in transformers</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Santoro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Lampinen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H</forename><surname>Richemond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>McClelland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hill</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2022/hash/77c6ccacfd9962e2307fc64680fc5ace-Abstract-Conference.html" />
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Let me check the examples: Enhancing demonstration learning via explicit imitation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-short.93</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-short.93" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1080" to="1088" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bartolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stenetorp</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.556</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.acl-long.556" />
	</analytic>
	<monogr>
		<title level="m">ACL 2022, ACL</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="8086" to="8098" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<title level="m" type="main">Hallucinations in large multilingual translation models</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Guerreiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Alves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Waldendorf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Haddow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Birch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Colombo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.16104</idno>
		<idno type="arXiv">arXiv:2303.16104</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.16104" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m" type="main">Visual instruction tuning</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.08485</idno>
		<idno type="arXiv">arXiv:2304.08485</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.08485" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<title level="m" type="main">Evaluating object hallucination in large vision-language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.10355</idno>
		<idno type="arXiv">arXiv:2305.10355</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.10355" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<monogr>
		<title level="m" type="main">Why does ChatGPT fall short in answering questions faithfully?</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">C</forename><surname>Chang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.10513</idno>
		<idno type="arXiv">arXiv:2304.10513</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.10513" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Zero-shot faithful factual error correction</title>
		<author>
			<persName><forename type="first">K</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ji</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.311</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.311" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="5660" to="5676" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b43">
	<analytic>
		<title level="a" type="main">RARR: researching and revising what language models say, using language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pasupat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Chaganty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Juan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.910</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.910" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="16477" to="16508" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b44">
	<monogr>
		<title level="m" type="main">Mitigating language model hallucination with interactive question-knowledge alignment</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13669</idno>
		<idno type="arXiv">arXiv:2305.13669</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13669" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Halawi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Denain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.09476</idno>
		<idno type="arXiv">arXiv:2307.09476</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.09476" />
		<title level="m">Overthinking the truth: Understanding how language models process false demonstrations</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Histalign: Improving context dependency in language generation by aligning with history</title>
		<author>
			<persName><forename type="first">D</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.04782</idno>
		<idno type="arXiv">arXiv:2305.04782</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.04782" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b47">
	<analytic>
		<title level="a" type="main">The dangers of trusting stochastic parrots: Faithfulness and trust in open-domain conversational question answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chiesurin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dimakopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A S</forename><surname>Cabezudo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Eshghi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Papaioannou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Konstas</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.60</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.findings-acl.60" />
	</analytic>
	<monogr>
		<title level="m">Findings of ACL 2023, ACL</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="947" to="959" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Improved natural language generation via loss truncation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hashimoto</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.66</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.66" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2020">July 5-10, 2020</date>
			<biblScope unit="page" from="718" to="731" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">On exposure bias, hallucination and domain shift in neural machine translation</title>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sennrich</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.326</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.326" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</title>
		<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2020">July 5-10, 2020</date>
			<biblScope unit="page" from="3544" to="3552" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<monogr>
		<title level="m" type="main">Factuality enhanced language models for open-ended text generation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ping</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Patwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shoeybi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2022/hash/df438caa36714f69277daa92d608dd63-Abstract-Conference.html" />
		<imprint>
			<date type="published" when="2022">2022</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<analytic>
		<title level="a" type="main">The curious case of hallucinations in neural machine translation</title>
		<author>
			<persName><forename type="first">V</forename><surname>Raunak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Menezes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Junczys-Dowmunt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL 2021, ACL</title>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1172" to="1183" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<analytic>
		<title level="a" type="main">Looking for a needle in a haystack: A comprehensive study of hallucinations in neural machine translation</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">M</forename><surname>Guerreiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Voita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F T</forename><surname>Martins</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.eacl-main.75" />
	</analytic>
	<monogr>
		<title level="m">EACL 2023, ACL</title>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1059" to="1075" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b53">
	<monogr>
		<title level="m" type="main">Halomi: A manually annotated benchmark for multilingual hallucination and omission detection in machine translation</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Voita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hansanti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ropers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kalbassi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Barrault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Costa-Jussà</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.11746</idno>
		<idno type="arXiv">arXiv:2305.11746</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11746" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b54">
	<monogr>
		<title level="m" type="main">mmT5: Modular multilingual pre-training solves source language hallucinations</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pfeiffer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Piccinno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nicosia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ruder</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14224</idno>
		<idno type="arXiv">arXiv:2305.14224</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14224" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b55">
	<monogr>
		<title level="m" type="main">Evaluating correctness and faithfulness of instruction-following models for question answering</title>
		<author>
			<persName><forename type="first">V</forename><surname>Adlakha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Behnamghader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Meade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reddy</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.16877</idno>
		<idno type="arXiv">arXiv:2307.16877</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.16877" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b56">
	<analytic>
		<title level="a" type="main">On the origin of hallucinations in conversational models: Is it the datasets or the models?</title>
		<author>
			<persName><forename type="first">N</forename><surname>Dziri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Milton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">R</forename><surname>Zaïane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reddy</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.387</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.naacl-main.387" />
	</analytic>
	<monogr>
		<title level="m">NAACL 2022, ACL</title>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="5271" to="5285" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">Diving deep into modes of fact hallucinations in dialogue systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K</forename><surname>Srihari</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.findings-emnlp.48</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.findings-emnlp.48" />
	</analytic>
	<monogr>
		<title level="m">Findings of EMNLP 2022, ACL</title>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="684" to="699" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">Faithdial: A faithful benchmark for information-seeking dialogue</title>
		<author>
			<persName><forename type="first">N</forename><surname>Dziri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamalloo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Milton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">R</forename><surname>Zaïane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Ponti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Reddy</surname></persName>
		</author>
		<ptr target="https://transacl.org/ojs/index.php/tacl/article/view/4113" />
	</analytic>
	<monogr>
		<title level="j">Trans. Assoc. Comput. Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1473" to="1490" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">Evaluating attribution in dialogue systems: The BEGIN benchmark</title>
		<author>
			<persName><forename type="first">N</forename><surname>Dziri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Rashkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Linzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Reitter</surname></persName>
		</author>
		<ptr target="https://transacl.org/ojs/index.php/tacl/article/view/3977" />
	</analytic>
	<monogr>
		<title level="j">Trans. Assoc. Comput. Linguistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1066" to="1083" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">Contrastive learning reduces hallucination in conversations</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Rijke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/26596" />
	</analytic>
	<monogr>
		<title level="m">AAAI 2023</title>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="13618" to="13626" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">Evaluating the factual consistency of large language models through news summarization</title>
		<author>
			<persName><forename type="first">D</forename><surname>Tam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mascarenhas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kwan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-acl.322</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.findings-acl.322" />
	</analytic>
	<monogr>
		<title level="m">Findings of ACL 2023, ACL</title>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="5220" to="5255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b62">
	<analytic>
		<title level="a" type="main">Hallucinated but factual! inspecting the factuality of hallucinations in abstractive summarization</title>
		<author>
			<persName><forename type="first">M</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C K</forename><surname>Cheung</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.236</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.acl-long.236" />
	</analytic>
	<monogr>
		<title level="m">ACL 2022, ACL</title>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3340" to="3354" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b63">
	<analytic>
		<title level="a" type="main">&quot;Why is this misleading?&quot;: Detecting news headline hallucinations with explanations</title>
		<author>
			<persName><forename type="first">J</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Finnie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rahmati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bendersky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Najork</surname></persName>
		</author>
		<idno type="DOI">10.1145/3543507.3583375</idno>
		<ptr target="https://doi.org/10.1145/3543507.3583375" />
	</analytic>
	<monogr>
		<title level="m">WWW 2023</title>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1662" to="1672" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b64">
	<monogr>
		<title level="m" type="main">Detecting and mitigating hallucinations in multilingual summarisation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ziser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Ponti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">B</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13632</idno>
		<idno type="arXiv">arXiv:2305.13632</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13632" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang-Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Guan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.09296</idno>
		<idno type="arXiv">arXiv:2306.09296</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.09296" />
		<title level="m">Kola: Carefully benchmarking world knowledge of large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<monogr>
		<title level="m" type="main">Aligning large multi-modal model with robust instruction tuning</title>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yacoob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.14565</idno>
		<idno type="arXiv">arXiv:2306.14565</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.14565" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b67">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Mahmood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Kalra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yan</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.14634</idno>
		<idno type="arXiv">arXiv:2307.14634</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.14634" />
		<title level="m">Fact-checking of AI-generated reports</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b68">
	<monogr>
		<title level="m" type="main">Chain of natural language inference for reducing large language model ungrounded hallucinations</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ching</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamal</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.03951</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b69">
	<analytic>
		<title level="a" type="main">Bartscore: Evaluating generated text as text generation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2021/hash/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Abstract.html" />
	</analytic>
	<monogr>
		<title level="j">NeurIPS</title>
		<editor>M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, J. W. Vaughan</editor>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="27263" to="27277" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b70">
	<monogr>
		<title level="m" type="main">Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Amayuelas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13712</idno>
		<idno type="arXiv">arXiv:2305.13712</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13712" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b71">
	<analytic>
		<title level="a" type="main">Methods for measuring, updating, and visualizing factual beliefs in language models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Hase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bansal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Iyer</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2023.eacl-main.199" />
	</analytic>
	<monogr>
		<title level="m">EACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2706" to="2723" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b72">
	<monogr>
		<title level="m" type="main">Measuring and modifying factual knowledge in large language models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Pezeshkpour</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.06264</idno>
		<idno type="arXiv">arXiv:2306.06264</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.06264" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b73">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Preston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Poon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2306.16564</idno>
		<title level="m">Llm calibration and automatic hallucination detection via pareto optimal self-supervision</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b74">
	<monogr>
		<title level="m" type="main">A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Varshney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.03987</idno>
		<idno type="arXiv">arXiv:2307.03987</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.03987" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b75">
	<monogr>
		<title level="m" type="main">Language models (mostly) know what they know</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Conerly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Drain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schiefer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Hatfield-Dodds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dassarma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tran-Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Johnston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Showk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Elhage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hume</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Fort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jacobson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kernion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kravec</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lovitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ndousse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Joseph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2207.05221</idno>
		<idno type="arXiv">arXiv:2207.05221</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2207.05221" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b76">
	<monogr>
		<title level="m" type="main">Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Manakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liusie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J F</forename><surname>Gales</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.08896</idno>
		<idno type="arXiv">arXiv:2303.08896</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.08896" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b77">
	<monogr>
		<title level="m" type="main">Do language models know when they&apos;re hallucinating references?</title>
		<author>
			<persName><forename type="first">A</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mackey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Kalai</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.18248</idno>
		<idno type="arXiv">arXiv:2305.18248</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.18248" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b78">
	<monogr>
		<title level="m" type="main">Self-checker: Plug-and-play modules for fact-checking with large language models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14623</idno>
		<idno type="arXiv">arXiv:2305.14623</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14623" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b79">
	<monogr>
		<title level="m" type="main">LM vs LM: Detecting factual errors via cross examination</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hamri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Geva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13281</idno>
		<idno type="arXiv">arXiv:2305.13281</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13281" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b80">
	<monogr>
		<title level="m" type="main">Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Mündler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Vechev</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.15852</idno>
		<idno type="arXiv">arXiv:2305.15852</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.15852" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b81">
	<monogr>
		<title level="m" type="main">A new benchmark and reverse validation method for passage-level hallucination detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06498</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b82">
	<monogr>
		<title level="m" type="main">Factscore: Fine-grained atomic evaluation of factual precision in long form text generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">W</forename><surname>Koh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14251</idno>
		<idno type="arXiv">arXiv:2305.14251</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14251" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b83">
	<monogr>
		<title level="m" type="main">Complex claim verification with evidence retrieved in the wild</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sriram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Durrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2305.11859</idno>
		<idno type="arXiv">arXiv:2305.11859</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11859" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b84">
	<monogr>
		<title level="m" type="main">Retrieving supporting evidence for llms generated answers</title>
		<author>
			<persName><forename type="first">S</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Arabzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2306.13781</idno>
		<idno type="arXiv">arXiv:2306.13781</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.13781" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b85">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Chern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.13528</idno>
		<idno type="arXiv">arXiv:2307.13528</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.13528" />
		<title level="m">Factool: Factuality detection in generative AI - a tool augmented framework for multi-task and multi-domain scenarios</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b86">
	<monogr>
		<title level="m" type="main">Investigating the translation performance of a large multilingual language model: the case of BLOOM</title>
		<author>
			<persName><forename type="first">R</forename><surname>Bawden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Yvon</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.01911</idno>
		<idno type="arXiv">arXiv:2303.01911</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.01911" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b87">
	<monogr>
		<title level="m" type="main">How good are GPT models at machine translation? A comprehensive evaluation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hendy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Abdelrehim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharaf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raunak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gabr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Matsushita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Afify</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Awadalla</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2302.09210</idno>
		<idno type="arXiv">arXiv:2302.09210</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.09210" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b88">
	<analytic>
		<title level="a" type="main">Unsupervised cross-lingual representation learning at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.747</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.747" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2020">July 5-10, 2020</date>
			<biblScope unit="page" from="8440" to="8451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b89">
	<monogr>
		<title level="m" type="main">Evaluating generative models for graph-to-text generation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Färber</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.14712</idno>
		<idno type="arXiv">arXiv:2307.14712</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.14712" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b90">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13168</idno>
		<idno type="arXiv">arXiv:2305.13168</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13168" />
		<title level="m">Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b91">
	<monogr>
		<title level="m" type="main">Minigpt-4: Enhancing vision-language understanding with advanced large language models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elhoseiny</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.10592</idno>
		<idno type="arXiv">arXiv:2304.10592</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.10592" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b92">
	<analytic>
		<title level="a" type="main">OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Men</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yang</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v162/wang22al.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Jegelka</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Song</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Szepesvári</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Niu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">162</biblScope>
			<biblScope unit="page" from="23318" to="23340" />
		</imprint>
	</monogr>
	<note>ICML 2022</note>
</biblStruct>

<biblStruct xml:id="b93">
	<analytic>
		<title level="a" type="main">Let there be a clock on the beach: Reducing object hallucination in image captioning</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Biten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Karatzas</surname></persName>
		</author>
		<idno type="DOI">10.1109/WACV51458.2022.00253</idno>
		<ptr target="https://doi.org/10.1109/WACV51458.2022.00253" />
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022</title>
				<meeting><address><addrLine>Waikoloa, HI, USA</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2022">January 3-8, 2022</date>
			<biblScope unit="page" from="2473" to="2482" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b94">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Petryk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Whitehead</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rohrbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rohrbach</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.07021</idno>
		<idno type="arXiv">arXiv:2305.07021</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.07021" />
		<title level="m">Simple token-level confidence improves caption correctness</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b95">
	<monogr>
		<title level="m" type="main">Album storytelling with iterative story-aware captioning and large language models</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yuan</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.12943</idno>
		<idno type="arXiv">arXiv:2305.12943</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.12943" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b96">
	<analytic>
		<title level="a" type="main">Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Pagnoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Balachandran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tsvetkov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.383</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.naacl-main.383" />
	</analytic>
	<monogr>
		<title level="m">NAACL 2021, ACL</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4812" to="4829" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b97">
	<analytic>
		<title level="a" type="main">Knowledge-grounded dialogue generation with a unified knowledge representation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.naacl-main.15</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.naacl-main.15" />
	</analytic>
	<monogr>
		<title level="m">NAACL 2022, ACL</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="206" to="218" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b98">
	<analytic>
		<title level="a" type="main">Faithful to the document or to the world? mitigating hallucinations via entity-linked knowledge in abstractive summarization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wieting</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Verga</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.findings-emnlp.76</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.findings-emnlp.76" />
	</analytic>
	<monogr>
		<title level="m">Findings of EMNLP 2022, ACL</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1067" to="1082" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b99">
	<analytic>
		<title level="a" type="main">FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization</title>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Diab</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.454</idno>
		<ptr target="https://doi.org/10.18653/v1/2020.acl-main.454" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020</meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2020">July 5-10, 2020</date>
			<biblScope unit="page" from="5055" to="5070" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b100">
	<analytic>
		<title level="a" type="main">Measuring sentence-level and aspect-level (un)certainty in science communications</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurgens</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.784</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.emnlp-main.784" />
	</analytic>
	<monogr>
		<title level="m">EMNLP 2021, ACL</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Yih</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="9959" to="10011" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b101">
	<analytic>
		<title level="a" type="main">Editing models with task arithmetic</title>
		<author>
			<persName><forename type="first">G</forename><surname>Ilharco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<ptr target="https://openreview.net/pdf?id=6t0Kwf8-jrj" />
	</analytic>
	<monogr>
		<title level="m">The Eleventh International Conference on Learning Representations, ICLR 2023</title>
				<meeting><address><addrLine>Kigali, Rwanda</addrLine></address></meeting>
		<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2023">May 1-5, 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b102">
	<monogr>
		<title level="m" type="main">Elastic weight removal for faithful and abstractive dialogue generation</title>
		<author>
			<persName><forename type="first">N</forename><surname>Daheim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dziri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sachan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Ponti</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.17574</idno>
		<idno type="arXiv">arXiv:2303.17574</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.17574" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b103">
	<monogr>
		<title level="m" type="main">PURR: efficiently editing language model hallucinations by denoising language model corruptions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pasupat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14908</idno>
		<idno type="arXiv">arXiv:2305.14908</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14908" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b104">
	<monogr>
		<title level="m" type="main">Trusting your evidence: Hallucinate less with context-aware decoding</title>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tsvetkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Yih</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14739</idno>
		<idno type="arXiv">arXiv:2305.14739</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14739" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b105">
	<monogr>
		<title level="m" type="main">Augmented large language models with parametric knowledge guiding</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Geng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Tao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jiang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.04757</idno>
		<idno type="arXiv">arXiv:2305.04757</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.04757" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b106">
	<monogr>
		<title level="m" type="main">TRAC: trustworthy retrieval augmented chatbot</title>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bastani</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.04642</idno>
		<idno type="arXiv">arXiv:2307.04642</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.04642" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b107">
	<monogr>
		<title level="m" type="main">Inference-time intervention: Eliciting truthful answers from a language model</title>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">B</forename><surname>Viégas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pfister</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wattenberg</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.03341</idno>
		<idno type="arXiv">arXiv:2306.03341</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.03341" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b108">
	<monogr>
		<title level="m" type="main">Easyedit: An easy-to-use knowledge editing framework for large language models</title>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.07269</idno>
		<idno type="arXiv">arXiv:2308.07269</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.07269" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b109">
	<monogr>
		<author>
			<persName><forename type="first">Y.-S</forename><surname>Chuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.03883</idno>
		<title level="m">Dola: Decoding by contrasting layers improves factuality in large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b110">
	<analytic>
		<title level="a" type="main">Neural path hunter: Reducing hallucination in dialogue systems via path grounding</title>
		<author>
			<persName><forename type="first">N</forename><surname>Dziri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Zaïane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">J</forename><surname>Bose</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.168</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.emnlp-main.168" />
	</analytic>
	<monogr>
		<title level="m">EMNLP 2021, ACL</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W</forename><surname>Yih</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2197" to="2214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b111">
	<monogr>
		<title level="m" type="main">ORCA: interpreting prompted language models via locating supporting data evidence in the ocean of pretraining data</title>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tsvetkov</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2205.12600</idno>
		<idno type="arXiv">arXiv:2205.12600</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2205.12600" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b112">
	<monogr>
		<title level="m" type="main">Rethinking with retrieval: Faithful large language model inference</title>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2301.00303</idno>
		<idno type="arXiv">arXiv:2301.00303</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2301.00303" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b113">
	<analytic>
		<title level="a" type="main">TRAK: attributing model behavior at scale</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Georgiev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ilyas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Leclerc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madry</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/park23c.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="27074" to="27113" />
		</imprint>
	</monogr>
	<note>ICML 2023</note>
</biblStruct>

<biblStruct xml:id="b114">
	<monogr>
		<title level="m" type="main">Data portraits: Recording foundation model training data</title>
		<author>
			<persName><forename type="first">M</forename><surname>Marone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">V</forename><surname>Durme</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.03919</idno>
		<idno type="arXiv">arXiv:2303.03919</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.03919" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b115">
	<monogr>
		<title level="m" type="main">Self-refine: Iterative refinement with self-feedback</title>
		<author>
			<persName><forename type="first">A</forename><surname>Madaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tandon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hallinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wiegreffe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Alon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dziri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Prabhumoye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">P</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yazdanbakhsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.17651</idno>
		<idno type="arXiv">arXiv:2303.17651</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.17651" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b116">
	<monogr>
		<title level="m" type="main">Reflexion: an autonomous agent with dynamic memory and self-reflection</title>
		<author>
			<persName><forename type="first">N</forename><surname>Shinn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Labash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gopinath</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.11366</idno>
		<idno type="arXiv">arXiv:2303.11366</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.11366" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b117">
	<monogr>
		<title level="m" type="main">"According to ...": Prompting language models improves quoting from pre-training data</title>
		<author>
			<persName><forename type="first">O</forename><surname>Weller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Marone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Weir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Lawrie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khashabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">V</forename><surname>Durme</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13252</idno>
		<idno type="arXiv">arXiv:2305.13252</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13252" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b118">
	<analytic>
		<title level="a" type="main">Verify-and-edit: A knowledge-enhanced chain-of-thought framework</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bing</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.320</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.320" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="5823" to="5840" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b119">
	<monogr>
		<title level="m" type="main">Chain-of-verification reduces hallucination in large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Dhuliawala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Komeili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Raileanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2309.11495</idno>
		<idno type="arXiv">arXiv:2309.11495</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2309.11495" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b120">
	<analytic>
		<title level="a" type="main">Improving language models by retrieving from trillions of tokens</title>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mensch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Rutherford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Millican</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Van Den Driessche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lespiau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Damoc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>De Las Casas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Guy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Menick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hennigan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Maggiore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cassirer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Brock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Paganini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Irving</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Osindero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Simonyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Rae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Elsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sifre</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v162/borgeaud22a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Chaudhuri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Jegelka</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Song</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Szepesvári</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Niu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">162</biblScope>
			<biblScope unit="page" from="2206" to="2240" />
		</imprint>
	</monogr>
	<note>ICML</note>
</biblStruct>

<biblStruct xml:id="b121">
	<analytic>
		<title level="a" type="main">Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions</title>
		<author>
			<persName><forename type="first">H</forename><surname>Trivedi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Balasubramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.557</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.557" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="10014" to="10037" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b122">
	<monogr>
		<title level="m" type="main">Check your facts and try again: Improving large language models with external knowledge and automated feedback</title>
		<author>
			<persName><forename type="first">B</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Galley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2302.12813</idno>
		<idno type="arXiv">arXiv:2302.12813</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.12813" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b123">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.09667</idno>
		<idno type="arXiv">arXiv:2304.09667</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.09667" />
		<title level="m">Genegpt: Augmenting large language models with domain tools for improved access to biomedical information</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b124">
	<analytic>
		<title level="a" type="main">Fluid transformers and creative analogies: Exploring large language models&apos; capacity for augmenting cross-domain analogical creativity</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srinivasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Macneil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3591196.3593516</idno>
		<ptr target="https://doi.org/10.1145/3591196.3593516" />
	</analytic>
	<monogr>
		<title level="m">Creativity and Cognition, C&amp;C 2023, Virtual Event</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">June 19-21, 2023</date>
			<biblScope unit="page" from="489" to="505" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b125">
	<monogr>
		<title level="m" type="main">Chain of knowledge: A framework for grounding large language models with structured knowledge bases</title>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Chia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">R</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13269</idno>
		<idno type="arXiv">arXiv:2305.13269</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13269" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b126">
	<monogr>
		<title level="m" type="main">Active retrieval augmented generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dwivedi-Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.06983</idno>
		<idno type="arXiv">arXiv:2305.06983</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.06983" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b127">
	<monogr>
		<title level="m" type="main">Gorilla: Large language model connected with massive apis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Patil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.15334</idno>
		<idno type="arXiv">arXiv:2305.15334</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.15334" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b128">
	<monogr>
		<title level="m" type="main">RETA-LLM: A retrieval-augmented large language model toolkit</title>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.05212</idno>
		<idno type="arXiv">arXiv:2306.05212</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.05212" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b129">
	<monogr>
		<title level="m" type="main">KnowledGPT: Enhancing large language models with retrieval and storage access on knowledge bases</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.11761</idno>
		<idno type="arXiv">arXiv:2308.11761</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.11761" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b130">
	<analytic>
		<title level="a" type="main">Learning to summarize with human feedback</title>
		<author>
			<persName><forename type="first">N</forename><surname>Stiennon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lowe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Christiano</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2020/hash/1f89885d556929e98d3ef9b86448f951-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">NeurIPS 2020</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b131">
	<monogr>
		<title level="m" type="main">Teaching language models to support answers with verified quotes</title>
		<author>
			<persName><forename type="first">J</forename><surname>Menick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Trebacz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Mikulik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Aslanides</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">F</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Chadwick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Glaese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Campbell-Gillingham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Irving</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mcaleese</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2203.11147</idno>
		<idno type="arXiv">arXiv:2203.11147</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2203.11147" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b132">
	<analytic>
		<title level="a" type="main">BRIO: bringing order to abstractive summarization</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Radev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.207</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.acl-long.207" />
	</analytic>
	<monogr>
		<title level="m">ACL 2022, ACL</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2890" to="2903" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b133">
	<monogr>
		<title level="m" type="main">Chain of hindsight aligns language models with feedback</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sferrazza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Abbeel</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2302.02676</idno>
		<idno type="arXiv">arXiv:2302.02676</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.02676" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b134">
	<monogr>
		<title level="m" type="main">CRITIC: large language models can self-correct with tool-interactive critiquing</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Gou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.11738</idno>
		<idno type="arXiv">arXiv:2305.11738</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11738" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b135">
	<monogr>
		<title level="m" type="main">Pad: Program-aided distillation specializes large models in reasoning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Long</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhou</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.13888</idno>
		<idno type="arXiv">arXiv:2305.13888</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.13888" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b136">
	<monogr>
		<title level="m" type="main">Enabling large language models to generate text with citations</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14627</idno>
		<idno type="arXiv">arXiv:2305.14627</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14627" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b137">
	<analytic>
		<title level="a" type="main">Improving factuality of abstractive summarization without sacrificing summary quality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dixit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-short.78</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-short.78" />
	</analytic>
	<monogr>
		<title level="m">ACL 2023, ACL</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="902" to="913" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b138">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.06271</idno>
		<title level="m">Towards mitigating hallucination in large language models via self-reflection</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b139">
	<monogr>
		<title level="m" type="main">Improving factuality and reasoning in language models through multiagent debate</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Torralba</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Tenenbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mordatch</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.14325</idno>
		<idno type="arXiv">arXiv:2305.14325</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.14325" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b140">
	<monogr>
		<title level="m" type="main">Encouraging divergent thinking in large language models through multi-agent debate</title>
		<author>
			<persName><forename type="first">T</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Jiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.19118</idno>
		<idno type="arXiv">arXiv:2305.19118</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.19118" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b141">
	<monogr>
		<title level="m" type="main">Examining the inter-consistency of large language models: An in-depth analysis via debate</title>
		<author>
			<persName><forename type="first">K</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.11595</idno>
		<idno type="arXiv">arXiv:2305.11595</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11595" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b142">
	<monogr>
		<title level="m" type="main">PRD: peer rank and discussion improve large language model based evaluations</title>
		<author>
			<persName><forename type="first">R</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Patel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Du</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.02762</idno>
		<idno type="arXiv">arXiv:2307.02762</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.02762" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b143">
	<monogr>
		<title level="m" type="main">Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ji</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.05300</idno>
		<idno type="arXiv">arXiv:2307.05300</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.05300" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b144">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Minsky</surname></persName>
		</author>
		<title level="m">Society of mind</title>
				<imprint>
			<publisher>Simon and Schuster</publisher>
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b145">
	<analytic>
		<title level="a" type="main">A few more examples may be worth billions of parameters</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Kirstain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S H</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.findings-emnlp.72</idno>
		<ptr target="https://doi.org/10.18653/v1/2022.findings-emnlp.72" />
	</analytic>
	<monogr>
		<title level="m">Findings of EMNLP 2022, ACL</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="1017" to="1029" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b146">
	<monogr>
		<title level="m" type="main">LIMA: less is more for alignment</title>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Iyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Efrat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.11206</idno>
		<idno type="arXiv">arXiv:2305.11206</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11206" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b147">
	<monogr>
		<title level="m" type="main">Natural instructions: Benchmarking generalization to new tasks from natural language instructions</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khashabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Baral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno>CoRR abs/2104.08773</idno>
		<ptr target="https://arxiv.org/abs/2104.08773" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b148">
	<analytic>
		<title level="a" type="main">Multitask prompted training enables zero-shot task generalization</title>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Webson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">H</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sutawika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alyafeai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chaffin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stiegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Bari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Thakker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Szczechla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chhablani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">V</forename><surname>Nayak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Manica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">X</forename><surname>Yong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pandey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bawden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Neeraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rozen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Santilli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Févry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Fries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Teehan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=9Vrb9D0WI4" />
	</analytic>
	<monogr>
		<title level="m">The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event</title>
				<meeting><address><addrLine>OpenReview</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">April 25-29, 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b149">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.14346</idno>
		<title level="m">DISC-MedLLM: Bridging general large language models and real-world medical consultation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b150">
	<monogr>
		<title level="m" type="main">InstructIE: A Chinese instruction-based information extraction dataset</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.11527</idno>
		<idno type="arXiv">arXiv:2305.11527</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.11527" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b151">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<ptr target="https://github.com/michael-wzhu/ShenNong-TCM-LLM" />
		<title level="m">ShenNong-TCM: A traditional Chinese medicine large language model</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b152">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.06259</idno>
		<idno type="arXiv">arXiv:2308.06259</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.06259" />
		<title level="m">Self-alignment with instruction backtranslation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b153">
	<monogr>
		<title level="m" type="main">AlpaGasus: Training a better Alpaca with fewer data</title>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gunaratna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Yadav</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Srinivasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jin</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2307.08701</idno>
		<idno type="arXiv">arXiv:2307.08701</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.08701" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b154">
	<monogr>
		<title level="m" type="main">Tree of thoughts: Deliberate problem solving with large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Shafran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Griffiths</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Narasimhan</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.10601</idno>
		<idno type="arXiv">arXiv:2305.10601</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.10601" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b155">
	<monogr>
		<title level="m" type="main">Cumulative reasoning with large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Yao</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.04371</idno>
		<idno type="arXiv">arXiv:2308.04371</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.04371" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b156">
	<analytic>
		<title level="a" type="main">PAL: program-aided language models</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Alon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Callan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/gao23f.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="10764" to="10799" />
		</imprint>
	</monogr>
	<note>ICML 2023</note>
</biblStruct>

<biblStruct xml:id="b157">
	<monogr>
		<title level="m" type="main">Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Cohen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2211.12588</idno>
		<idno type="arXiv">arXiv:2211.12588</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2211.12588" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b158">
	<monogr>
		<title level="m" type="main">When do program-of-thoughts work for reasoning?</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.15452</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b159">
	<analytic>
		<title level="a" type="main">Dual-process and dual-system theories of reasoning</title>
		<author>
			<persName><forename type="first">K</forename><surname>Frankish</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Philosophy Compass</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="914" to="926" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b160">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Stanovich</surname></persName>
		</author>
		<title level="m">Rationality and the reflective mind</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>Oxford University Press</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b161">
	<analytic>
		<title level="a" type="main">The first computational theory of mind and brain: a close look at McCulloch and Pitts&apos;s &quot;logical calculus of ideas immanent in nervous activity&quot;</title>
		<author>
			<persName><forename type="first">G</forename><surname>Piccinini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthese</title>
		<imprint>
			<biblScope unit="volume">141</biblScope>
			<biblScope unit="page" from="175" to="215" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b162">
	<analytic>
		<title/>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">L</forename><surname>Thorndike</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Animal intelligence</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="390" to="390" />
			<date type="published" when="1898">1898</date>
		</imprint>
	</monogr>
	<note>Nature</note>
</biblStruct>

<biblStruct xml:id="b163">
	<analytic>
		<title level="a" type="main">BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Savarese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C H</forename><surname>Hoi</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/li23q.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="19730" to="19742" />
		</imprint>
	</monogr>
	<note>ICML 2023</note>
</biblStruct>

<biblStruct xml:id="b164">
	<monogr>
		<title level="m" type="main">InstructBLIP: Towards general-purpose vision-language models with instruction tuning</title>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M H</forename><surname>Tiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C H</forename><surname>Hoi</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2305.06500</idno>
		<idno type="arXiv">arXiv:2305.06500</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.06500" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b165">
	<monogr>
		<author>
			<persName><forename type="first">Q</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2304.14178</idno>
		<idno type="arXiv">arXiv:2304.14178</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2304.14178" />
		<title level="m">mPLUG-Owl: Modularization empowers large language models with multimodality</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b166">
	<monogr>
		<title level="m" type="main">Tiny LVLM-eHub: Early multimodal experiments with Bard</title>
		<author>
			<persName><forename type="first">W</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Luo</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2308.03729</idno>
		<idno type="arXiv">arXiv:2308.03729</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2308.03729" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b167">
	<monogr>
		<title level="m" type="main">A survey on multimodal large language models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2306.13549</idno>
		<idno type="arXiv">arXiv:2306.13549</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2306.13549" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
