<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Layer Attention Probing for Fine-Grained Hallucination Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Malavika Suresh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahaf Aljundi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikechukwu Nkisi-Orji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nirmalie Wiratunga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Robert Gordon University</institution>
          ,
          <addr-line>Aberdeen</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Toyota Motor Europe</institution>
          ,
          <addr-line>Brussels</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.</p>
      </abstract>
      <kwd-group>
        <kwd>hallucination detection</kwd>
        <kwd>activation probing</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Hallucination detection is not possible without both positive and negative examples. First, across five
LLMs and three tasks (two factual question-answering tasks and one chain-of-thought reasoning task),
we show that our method improves over uncertainty baselines and activation probing methods that
consider only individual layers.</p>
      <p>
        Next, we build on the observation that different responses sampled for a given prompt can vary,
with some being hallucinated and others not. We leverage the responses in the sampled space to
augment the training data. We find that our proposed method, by learning to attend to different layers,
can leverage this fine-grained supervision signal better to provide improved fine-grained detection
performance compared to baselines. We further explore the integration of our method with hallucination
mitigation pipelines, such as DoLa [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Noting that mitigation methods can adversely affect originally
non-hallucinated samples, we combine CLAP with DoLa. Our results demonstrate that this combination
significantly reduces the hallucination rate.
      </p>
      <p>
        To support our approach of attending to activations across layers, we rigorously evaluate our method
against various strategies for selecting probes at different layers. We conduct tests using cases from
domains different from the training data to assess the generalisation of layer-based probes compared to
our proposed solution. Our results show that CLAP provides significant gains over probing at different
layers when prompts fall outside the domains of the training samples. Notably, CLAP also improves
over Semantic Entropy Probes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which have been shown to generalise well.
      </p>
      <p>In summary, this paper makes the following contributions:
1. A novel probing technique, called Cross-Layer Attention Probing (CLAP), which consists of an
attention mechanism operating on the LLM residual stream, is proposed for improving hallucination
detection.
2. CLAP improves fine-grained detection of hallucinations among different responses sampled for
the same prompt, helping reduce model hallucinations.
3. On an out-of-distribution study, CLAP improves over probes constructed at individual layers.</p>
      <p>The rest of the paper is structured as follows. Section 2 describes methods used in prior work for
hallucination detection and mitigation. Section 3 describes our proposed approach, CLAP, and the
methodology for fine-grained detection and mitigation. Section 4 evaluates CLAP against baselines.
Section 5 provides an analysis of out-of-distribution generalisation. Finally, we perform an ablation
study of the design in section 6 before concluding with a discussion on future work in section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. LLM-based detection and mitigation (black box)</title>
        <p>
          Black-box methods assume no access to model internals and therefore rely on additional LLM-prompting.
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] proposed the use of a consistency check among different responses sampled for a given prompt
as a measure of hallucination. Here the assumption is that when an LLM hallucinates, the sampled
responses would be inconsistent with each other. This evaluates whether for a given prompt, any
given response from the LLM can be trusted, and is therefore not suited to identify non-hallucinating
responses within the sampled space for a prompt as well as in cases where multiple different answers
are valid. Other works [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ] have shown that LLMs can be prompted to detect hallucinations in
outputs by the same or different LLM. This relies on the LLM having a good reasoning ability and is
therefore often restricted to large models, introducing additional cost and latency.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Uncertainty estimation (grey-box)</title>
        <p>
          Uncertainty estimation methods [
          <xref ref-type="bibr" rid="ref10 ref13 ref14">10, 13, 14</xref>
          ] use the probabilities of the generated output tokens to
measure the confidence or uncertainty in the generation, using a threshold to classify low confidence
outputs as hallucinations. However, identifying an appropriate threshold is often challenging, especially
for long output sequences. Instead of considering the uncertainty in a single LLM generated response,
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] propose to measure the semantic uncertainty in a set of responses sampled from the LLM for a given
prompt. The authors show that a high semantic uncertainty (i.e. high conceptual variety) in sampled
responses is a good indication of hallucination. The confidence estimate in this case is similar to the
consistency measure and has the same pitfalls mentioned above. Overall, current uncertainty estimates,
by relying purely on the output probabilities, remain naive approaches to hallucination detection.
        </p>
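        <p>As an illustration of this family of methods, the confidence score used by the predictive-entropy baseline in our later experiments can be approximated from the log-probabilities of the generated tokens; the sketch below (a length-normalised negative log-likelihood, assuming PyTorch) is our own illustrative formulation rather than the exact estimator of the cited works.</p>
        <preformat>
import torch

def predictive_entropy(token_logprobs: torch.Tensor) -&gt; float:
    """Length-normalised negative log-likelihood of one generated sequence.

    token_logprobs holds the log-probability of each token that was actually
    generated. Higher scores mean lower model confidence, so thresholding the
    score flags likely hallucinations (illustrative formulation only).
    """
    return (-token_logprobs).mean().item()

# Example with two hypothetical generations
confident = torch.log(torch.tensor([0.9, 0.8, 0.95]))
uncertain = torch.log(torch.tensor([0.4, 0.3, 0.2]))
assert predictive_entropy(uncertain) &gt; predictive_entropy(confident)
        </preformat>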
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Activation probing (open-box)</title>
        <p>
          Recent works [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref31 ref5">3, 2, 5, 1</xref>
          ] have focussed on building hallucination detectors or probes using the LLM
activations at generation time. While ITI [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] constructs the probes using activations at the output
of attention heads, CCS [
          <xref ref-type="bibr" rid="ref3 ref31">3</xref>
          ], SAPLMA [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], Semantic Entropy probing [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and HaloScope [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] use the
activations at the output of the transformer decoder block (layer activations). Unlike ITI and SAPLMA,
where probes are trained in a supervised manner using a dataset of labelled responses, HaloScope and
CCS train the probes in an unsupervised manner. Some works provide interpretability - [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] show that
hallucinations with respect to input context are caused by the LLM attending to generated tokens rather
than context tokens, while [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] show that lack of attention to entity tokens is indicative of lack of entity
knowledge. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] show that there exist latent directions in the layer activation space that correspond to
notions of "I know this entity" and "I don’t know this entity". Unlike these prior works that focus on
activations at specific points/layers during decoding, in this paper we propose an approach to improve
detection by extracting a signature of hallucination across the entire residual stream.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Activation editing (open-box)</title>
        <p>
          Several studies aim to mitigate hallucinations during the decoding process by manipulating model
activations [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], adjusting token output probabilities [
          <xref ref-type="bibr" rid="ref19 ref4">19, 4</xref>
          ] or modifying output logits [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In ITI [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
model activations are shifted towards a direction associated with ‘truthfulness’. CAD [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] contrasts
output token probabilities generated with and without the input context to obtain new token probabilities
that are expected to be more aligned with the input context. DoLa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] builds on the early exit strategy
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and contrasts the output probabilities of the final layer with those of the intermediate layers. In
Opera [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the final layer logits are modified with a penalty term that discourages the model from
attending to summary tokens in long-form generation tasks.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        In this section, we propose a novel probing technique that incorporates a learning mechanism over
the activations at different layers. Several works have indicated that due to the residual connections
in transformers, the outputs of individual LLM layers can be considered to be in the same embedding
space [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Building on this stream of work, here we explore how the activation pattern across LLM
layers can be exploited to better detect hallucinations as compared to looking at activation patterns of
individual layers. Unlike [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where the authors propose to contrast the output of the final LLM layer
against that of intermediate layers as an alternative decoding strategy, here we leverage the residual
stream for improving hallucination detection. To this end, we propose cross-layer attention probing
(CLAP), which takes as input the activations across all LLM layers when generating a given token.
Notations Consider a dataset D = {X, Y} of prompts X and corresponding LLM responses Y. For
a given prompt x ∈ X passing through an LLM with L layers, let a_l ∈ R^d represent the activation
vector at layer l of the LLM, where d is the LLM activation dimension. Following prior work
[
        <xref ref-type="bibr" rid="ref2 ref3 ref31 ref5">5, 2, 3</xref>
        ], we probe the activations when generating the last token of the LLM output response (EOS
token). Let y represent the binary label of hallucination/non-hallucination for the corresponding LLM
response r ∈ Y. We assume that ground truth correct answers to prompts are available and compare
the model generated response to ground truth to obtain this label.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Cross-Layer Attention Probing (CLAP)</title>
        <p>Figure 1 depicts our proposed probing method. First, we consider the set of all layer level activations
{a_l} as forming a set of input tokens. The tokens are arranged in the same order as the LLM layers (i.e.
the residual stream) in order to be processed jointly as a sequence input. Depending on the dimensions
of the LLM being probed, the sequence input can get very large, increasing computational costs. In
order to allow scaling the method to larger LLMs, the activations are passed through a learnable
down-projection layer at the start to produce a′_l ∈ R^d_proj.</p>
        <p>The down-projected sequence input is then fed through a transformer encoder block, with n_enc
encoder layers (we experiment with n_enc ∈ {1, 2}), each consisting of a self-attention module and a
feed-forward network. The role of this encoder block is to learn to extract a pattern of hallucination
across the residual stream by attending differently to activations of different layers and thus learn an
embedding vector that better separates hallucinating and non-hallucinating responses. To extract this
information, we employ a learnable CLS token at the start of the sequence input. This transforms
the setting into a supervised classification problem, and the transformer embedding output at the
CLS position is then fed to a linear classifier layer and trained with binary cross-entropy using the
supervision signal y.</p>
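        <p>To make the architecture concrete, the following is a minimal PyTorch sketch of a CLAP probe as described above; the layer names, head count and tensor shapes are illustrative assumptions rather than the reference implementation.</p>
        <preformat>
import torch
import torch.nn as nn

class CLAPProbe(nn.Module):
    """Cross-Layer Attention Probe (illustrative sketch).

    Input: activations of shape (batch, n_llm_layers, d_model), i.e. the
    residual-stream activation at every LLM layer for the probed (EOS) token.
    """

    def __init__(self, d_model: int, d_proj: int = 128, n_enc: int = 2, n_heads: int = 4):
        super().__init__()
        self.down_proj = nn.Linear(d_model, d_proj)          # learnable down-projection
        self.cls = nn.Parameter(torch.zeros(1, 1, d_proj))   # learnable CLS token
        enc_layer = nn.TransformerEncoderLayer(d_model=d_proj, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_enc)
        self.classifier = nn.Linear(d_proj, 1)               # binary hallucination logit

    def forward(self, layer_acts: torch.Tensor) -&gt; torch.Tensor:
        x = self.down_proj(layer_acts)                        # (B, L, d_proj)
        cls = self.cls.expand(x.size(0), -1, -1)              # prepend one CLS per sample
        x = self.encoder(torch.cat([cls, x], dim=1))          # (B, L + 1, d_proj)
        return self.classifier(x[:, 0])                       # logit read at the CLS position

# Training step sketch: binary cross-entropy on hallucination labels y
probe = CLAPProbe(d_model=4096)
acts = torch.randn(8, 32, 4096)                               # e.g. a 32-layer, 4096-dim LLM
y = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCEWithLogitsLoss()(probe(acts), y)
loss.backward()
        </preformat>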
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Leveraging Hallucinations in the Sampled Space for Fine-Grained Detection</title>
        <p>
          The sampled response space for a given prompt can contain both hallucinations and non-hallucinations,
indicating that correct entity/information can in fact exist in the residual stream even when the most
confident generation is incorrect [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Given that our proposed probing mechanism attends to the
activations across the entire residual stream, we hypothesise that it can also be applied for a fine-grained
detection of hallucination among responses sampled for the same prompt. In order to guide the probe
training for fine-grained detection, we sample a set of k additional responses to each prompt at high
temperature, alongside the greedy decoded response. Each response is then labelled independently as
hallucination/non-hallucination. When including the sampled responses during training, all responses
generated for a given prompt are always arranged in the same batch - we ablate this choice against
random sampling in appendix B.2. We use CLAP trained on the set of all greedy and sampled responses
to prompts as the method for detecting hallucinations at the sample level, making it compatible with
different strategies of decoding/sampling responses.
        </p>
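        <p>As an illustration, the additional high-temperature samples could be drawn with the Hugging Face transformers API as sketched below; the model name, maximum generation length and labelling step are assumptions made for the example.</p>
        <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # any of the probed LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate_responses(prompt: str, k: int = 5):
    """Return the greedy response plus k high-temperature samples for one prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64)
    sampled = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.95,
                             num_return_sequences=k, max_new_tokens=64)
    new_tokens = lambda ids: tok.decode(ids[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return new_tokens(greedy[0]), [new_tokens(s) for s in sampled]

# Each returned response is then labelled independently against the gold answer,
# and all responses of the same prompt are placed in the same training batch.
        </preformat>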
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Hallucination Mitigation</title>
        <p>
          Strategies that aim to mitigate hallucinations by directly modifying activations or output token
probabilities during decoding can negatively impact the quality of original, non-hallucinated responses, as we
shall demonstrate in our experiments in section 4.2. A natural approach to address this issue is to couple
the hallucination mitigation strategy with hallucination detection. In this section, we discuss how
CLAP can be employed for this purpose. Given a fine-grained CLAP hallucination detector trained for a
given LLM, we use the macro-F1 score on an in-distribution validation set to determine a classification
threshold for binary hallucination label prediction. Then at test time, we generate responses with CLAP
as follows:
1. Generate greedy decoded response.
2. Classify whether the response is hallucinated using CLAP.
3. When classified as hallucination, generate an alternative response using either DoLa decoding
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] or random sampling.
4. Classify whether the alternate response is hallucinated using CLAP.
5. Abstain when both the greedy response and alternate response are classified as hallucination.
        </p>
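        <p>A minimal sketch of this five-step procedure is given below; generate_greedy, generate_dola and clap_score are hypothetical helpers standing in for the LLM decoding calls and the trained probe, and the threshold is the macro-F1-optimal value from the validation split.</p>
        <preformat>
ABSTAIN = "[abstain]"

def respond(prompt, generate_greedy, generate_dola, clap_score, threshold):
    """Detect-then-mitigate with CLAP (sketch of the five steps above).

    clap_score(prompt, response) returns a hallucination probability in [0, 1];
    threshold is chosen by maximising macro-F1 on an in-distribution validation set.
    """
    # 1. Greedy decoded response
    response = generate_greedy(prompt)
    # 2. Classify the greedy response with CLAP
    if clap_score(prompt, response) &lt; threshold:
        return response
    # 3. Flagged as hallucination: generate an alternative (DoLa decoding or sampling)
    alternative = generate_dola(prompt)
    # 4. Classify the alternative response with CLAP
    if clap_score(prompt, alternative) &lt; threshold:
        return alternative
    # 5. Abstain when both responses are classified as hallucinations
    return ABSTAIN
        </preformat>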
        <p>In summary, we combine default decoding with an alternate response on a per need basis to improve
hallucination mitigation, without the negative effects of directly applying mitigation strategies such as
DoLa. When the mitigation strategy is signalled to fail by CLAP, we abstain from responding, leading to
safer use of LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental setup</title>
        <p>
          Section 4.1 describes the setup used for the main experiments. Section 4.2 presents the results.
Data Experiments are conducted on two open-domain question answering (QA) tasks - Natural
Questions (NQ) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and Trivia QA (TQA) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] - and one chain-of-thought (COT) reasoning task
Strategy QA (STR) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The LLMs are evaluated in a closed-book setting for each of the tasks. Prompt
formats used are shown in appendix A.1. For each prompt, greedy decoding is used to generate the
response. When generating additional sampled responses per prompt, sampling temperature and top_p
parameter are set to 1 and 0.95, respectively. See appendix A.2 for notes on data labelling and dataset
statistics. In appendix A.3, we ablate the rate of true hallucinations versus query refusals.
Models We use Llama-7B [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], Alpaca-7B [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], Vicuna-7B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], Gemma-2B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and
Llama3.1-Instruct8B [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] in our experiments.
        </p>
        <p>
          Implementation Details For CLAP, we set the linear projection dimension d_proj = 128 and
use a held-out validation set to select the number of encoder layers n_enc ∈ {1, 2}, keeping the memory
footprint low. We report results of varying d_proj in section 6. Further details are in appendix A.4.
Baselines Our main focus is in comparing the accuracy of probes that consider only the final layer
activations to that of probing techniques that consider multiple layers. Therefore, the main baselines are
(1) a linear probe LP and (2) a non-linear probe NLP [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on the last layer activations. Additional baselines
are (3) Self-Check SC [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] (best result between NLI and Prompt versions using k ∈ {3, 5, 7, 10}), (4) a
classifier based on the predictive entropy PE [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] of the generated text and (5) a linear probe on the
attention head activations AH [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (best performing head identified using a held-out validation set).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
      </sec>
      <sec id="sec-4-3">
        <title>CLAP for fine-grained hallucination detection</title>
        <p>Table 1 compares the hallucination detection performance of CLAP against baselines. Dataset-wise expanded results are provided in appendix B.1
and comparison of inference cost is provided in appendix D. When testing on greedy responses, CLAP
trained on greedy responses (CLAP-g) generally improves over the baselines (SC, PE, AH-g, LP-g,
NLP-g), while including sampled responses at train-time can often provide further gains for CLAP
(CLAP-s). AH performs slightly better than CLAP on Gemma-2B and PE performs slightly better than
CLAP on Llama3.1-Instruct-8B. However, these baselines are inferior to CLAP when coupled with other
LLMs. When testing on sampled responses, we find that CLAP can leverage the sampled responses at
train-time (CLAP-s) better than the baselines (AH-s, LP-s, NLP-s) to improve fine-grained detection
consistently, providing gains of up to 1.5% (on TQA with Alpaca-7B and Gemma-2B). Though AH-s
performs slightly better than CLAP on average with Gemma-2B, CLAP couples more robustly with all
the LLMs, illustrating that it is agnostic to the LLM and widely applicable.
Improving hallucination mitigation with CLAP In this section, we show how fine-grained
detection using CLAP can help improve hallucination mitigation. Table 2 compares the percentage of
non-hallucinated responses using our approach of combining CLAP with mitigation (denoted +CLAP-II),
as described in section 3.3, alongside four baseline strategies, described below:
• Default (Def) Always use the greedy decoding strategy.
• Def+Abstain 1. Generate greedy decoded response. 2. Classify whether hallucinated using CLAP.
3. Abstain when classified as hallucination.
• Alternate (Alt) Always use an alternate, non-greedy decoded response. Here we use DoLa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
• +CLAP-I 1. Generate greedy decoded response. 2. Classify whether hallucinated using CLAP.
3. When classified as hallucination, generate an alternate response.</p>
        <p>First, we see that with the +CLAP-I strategy, non-hallucination rate is generally improved over the
Default and Alternate strategies, with an overall average gain of 11.7% over Default and 4.7% over Alt.
Next, with the +CLAP-II strategy, we additionally detect hallucinations in the alternate response and
abstain if hallucinated. We see that +CLAP-II reduces the abstention rate significantly (by 24.5% on
average) compared to the Def+Abs strategy while consistently maintaining high non-hallucination rates.</p>
        <p>In figure 2a, we show the percentage of hallucinated greedy decoded responses that are replaced with
non-hallucinated responses and vice versa when using the DoLa mitigation approach. We find that
DoLa applied directly often negatively affects a significant percentage of the original non-hallucinated
responses (orange bars). In figure 2b, we show the ratio of the replacement rate when using CLAP-II
against the replacement rate when using DoLa directly. We see that CLAP-II significantly reduces the
NH-&gt;H replacements (orange bars) while generally maintaining a good H-&gt;NH replacement rate (blue
bars), thereby maximising the gains from DoLa.</p>
        <p>In appendix B.4, we show that mitigation using CLAP outperforms mitigation using baseline probes.</p>
        <p>Figure 2: (a) % H-&gt;NH (or NH-&gt;H) transitions using Alt. (b) Ratio of % H-&gt;NH (or NH-&gt;H) transitions using CLAP-II to % H-&gt;NH (or NH-&gt;H) transitions using Alt.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Attending to layers benefits generalisability</title>
      <p>
        In this section, we compare the out-of-distribution performance of CLAP to independent probes
constructed at each LLM layer when transferring from one domain to another. In addition to TQA
and NQ, we use three categories from wikidata [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] (city-country, player-date-birth, movie-cast). We
construct twenty train-test pairs using these five datasets, which allows us to capture a wide array of
generalisation scenarios. At train time, for an LLM with L layers, we construct L independent probes,
where each probe is a binary logistic regression classifier trained on the activations at one LLM layer
{a_l}, to predict a hallucination (H)/non-hallucination (NH) label y. At test time, to classify an LLM
response, we experiment with four strategies for selecting among the L probe predictions, as follows (a
code sketch of the MC and MV strategies is given after this list).
• Last layer Uses the probe trained on the last layer activations.
• Most Accurate Layer (MA) Uses the in-distribution validation split to select one out of the L
probes that performs best for the domain trained on.
• Most Confident Layer (MC) Instead of pre-selecting a probe at train-time as above, this strategy
measures the entropy of the predicted labels at each probe to then identify the probe with the
most confident prediction (i.e., least entropy) for a given sample at test-time.
• Majority Voting Across Layers (MV) Uses an ensemble setup where the final label for a sample
is given by the majority vote across all probes.
      </p>
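      <p>The sketch below illustrates the per-layer probes and the MC and MV selection strategies, assuming scikit-learn; array shapes and hyper-parameters are illustrative.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_layer_probes(acts, labels):
    """acts: (n_samples, n_layers, d) layer activations; labels: (n_samples,) H/NH."""
    return [LogisticRegression(max_iter=1000).fit(acts[:, l], labels)
            for l in range(acts.shape[1])]

def predict_most_confident(probes, acts):
    """MC: per sample, use the probe with the lowest predictive entropy."""
    probs = np.stack([p.predict_proba(acts[:, l])[:, 1]
                      for l, p in enumerate(probes)], axis=1)          # (n_samples, n_layers)
    entropy = -(probs * np.log(probs + 1e-12) + (1 - probs) * np.log(1 - probs + 1e-12))
    chosen = entropy.argmin(axis=1)
    return (probs[np.arange(len(probs)), chosen] &gt; 0.5).astype(int)

def predict_majority_vote(probes, acts):
    """MV: majority vote over all per-layer probe predictions."""
    votes = np.stack([p.predict(acts[:, l]) for l, p in enumerate(probes)], axis=1)
    return (votes.mean(axis=1) &gt; 0.5).astype(int)
      </preformat>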
      <p>
        In table 3, we show the % gain-over-baseline (AUC) achieved by CLAP over the probe selection
strategies as well as semantic entropy probes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which have been shown to generalise well. CLAP
not only outperforms other hallucination detection strategies on in-distribution samples but also
demonstrates generalisability to samples from domains not covered in the training set. This is a crucial
property - if hallucination detection deteriorates out-of-domain, the LLM is left with no guard.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Ablating design choices for CLAP</title>
      <p>First, in table 4, we assess the sensitivity of CLAP to the number of encoder layers used and input
dimensionality reduction. For TQA and NQ, increasing the projection dimensionality (d_proj) has
negligible effect while adding another encoder layer (n_enc = 2) can result in a slight gain. For STR,
performance is sometimes improved with higher projection dimensionality. We note that directly
using raw activations or projecting to high dimensions becomes prohibitively expensive for larger
LLMs. In this regard, we interpret our results as indicating that discriminative information for detecting
hallucinations is retained at lower dimensions, making the method viable for larger LLMs. We note that
CLAP with d_proj = 128 and n_enc = 2 has only 15K parameters for an LLM of 2B parameters.</p>
      <p>
        Next, the design of CLAP is ablated in table 5 by comparing to two alternative probes that also take
activations from all LLM layers but without any cross-layer attention mechanism. Maxpool denotes
element-wise max-pooling of all activations before training a linear classifier layer. Project + Concat
denotes use of a learnable down-projection layer on layer-wise activations followed by concatenation
before training a linear classifier layer. We see that Maxpool, though memory and compute-wise more
efficient, performs much worse than Project + Concat. This indicates the benefit of modelling layer-wise
activations jointly. As we increase the projection dimensions, the performance of Project + Concat
sometimes improves but memory/compute cost increases significantly. The benefit of performing
cross-layer attention is evident in the out-of-distribution tests, where CLAP (n_enc = 2) provides significant
gains (1) at comparable costs over Project + Concat (d_proj = 256) and (2) by trading computation for
memory efficiency over Project + Concat (d_proj = 4096*). In appendix C.1, CLAP is compared to
token-wise attention-pooling [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], showing again the advantage of CLAP in out-of-distribution testing.
      </p>
    </sec>
    <sec id="sec-conclusion">
      <title>7. Conclusion</title>
      <p>This work proposed a novel probing technique for detecting hallucinations in LLMs, called
Cross-Layer Attention Probing (CLAP), that takes the entire LLM residual stream as a sequence of input
tokens, with an attention mechanism operating over the layer-wise activations. CLAP outperforms
uncertainty baselines and probes that consider only individual layers. Further, leveraging responses in
the sampled space at train time helps CLAP achieve fine-grained detection between hallucinated and
non-hallucinated responses to the same prompt at test time. This allowed us to apply CLAP as a
fine-grained detector to reduce the LLM hallucination rate by sampling alternative responses to a given prompt
and distinguishing hallucinated outputs from non-hallucinated ones. Finally, an out-of-distribution
study revealed that attending to different layers enables CLAP to generalise more effectively.
      </p>
      <p>We focus on small LLMs of 2B-8B where hallucination is more prominent, making detection crucial.
Our ablation study indicates that hallucinations can still be detected after projecting to lower dimensions,
providing evidence for scaling CLAP to larger LLMs - we leave this to future work. While CLAP takes
input from all layers, we leave the investigation of the role of each layer within CLAP to future work.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Experiment Setup</title>
      <sec id="sec-8-1">
        <title>A.1. Prompt Formats</title>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Data Labelling and Dataset Statistics</title>
        <p>
          For Trivia QA and Natural Questions, each LLM response is labelled as hallucinated/non-hallucinated
using a rouge-1 cut-off of 0.3, following prior work [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ], where the rouge labels are validated against
human-annotated labels, finding an accuracy of 0.96. For StrategyQA, each LLM response is labelled by
matching the final answer produced after the COT against the gold reference of YES/NO, following
prior work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Table 6 shows the number of prompts used, number of additional responses sampled per
prompt at high temperature and the hallucination rate among greedy and sampled responses. We note
that since we use LLMs of at most 8B parameters, given the wide range of facts queried and with no
access to external information, such high hallucination rates are expected. For NQ with Llama3.1-I-8B,
we find a very high hallucination rate of &gt;95% among greedy responses and exclude this from the
analysis. We note that high hallucination rates for NQ are in line with observations in prior work [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
and are generally attributed to the difference between typical LLM pre-training data and the data used for
creating NQ (Google search queries).
        </p>
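        <p>A minimal sketch of this labelling step is shown below, assuming the rouge_score package; the handling of multiple gold aliases is an assumption made for the example.</p>
        <preformat>
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

def label_response(response: str, gold_answers: list, cutoff: float = 0.3) -&gt; int:
    """Return 1 for hallucination, 0 for non-hallucination.

    A response counts as correct if its best rouge-1 F1 against any gold
    answer reaches the 0.3 cut-off used in the paper (sketch only).
    """
    best = max(scorer.score(gold, response)["rouge1"].fmeasure for gold in gold_answers)
    return 0 if best &gt;= cutoff else 1

print(label_response("paris", ["Paris"]))   # -&gt; 0 (non-hallucination)
        </preformat>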
      </sec>
      <sec id="sec-8-3">
        <title>A.3. Response Refusal Rate</title>
        <p>Depending on the LLM, and particularly with instruction fine-tuned models, the LLM may sometimes
refuse to respond to queries, providing an "I don’t know" type response instead. In our experiments, we
are concerned with surfacing a factually correct response, when one exists, and therefore model the
problem as a binary classification task of non-hallucination-vs-all, treating both true hallucinations
as well as refusal responses under the same label. In order to validate that the hallucination label
category is not dominated by refusal responses, in table 7, we analyse the % responses containing any
of the following common refusal phrases - ["don’t know", "do not know", "don’t have", "do not have",
"can’t", "cannot", "unable"]. We find that this is only a small proportion of responses and further manual
inspection in fact indicates that the numbers reported are slight over-estimations since the phrases are
also used in non-refusal responses such as "Q: Who is featured on Puff Daddy's Can't Hold Me Down?
A: jimmy page is featured on puff daddy's can't hold me down." For Llama-3.1-Instruct 8B, being much
more capable at answering STR (see % Hallucinations in table 6), the over-estimation is higher since
the chain-of-thought reasoning often contains these phrases, e.g. "Q: Do you have to pass through circle
of lust to find Saladin in Dante’s Inferno? A: dante’s inferno is in three main circles: lust, gluttony, and
the rest. saladin is mentioned in limbo. limbo is not one of the main circles. so the answer is no. in fact, it
seems you have to pass through lust to get away from saladin. however, this is not an explicitly clear path
in dante’s inferno. it seems more likely that you simply cannot pass through lust to find saladin." .</p>
      </sec>
      <sec id="sec-8-4">
        <title>A.4. Implementation Details</title>
        <p>All probes are trained with a batch size of 128, using AdamW optimiser with linear warm-up for 5
epochs and cosine annealing for a maximum of 50 epochs. For each dataset and method, learning rate
is selected from a coarse grid search ∈ [0.5, 0.05, 0.005, 0.0005, 0.00005] using a held-out validation set.</p>
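        <p>A sketch of this training schedule in PyTorch is shown below; the exact composition of the warm-up and annealing schedulers is an assumption.</p>
        <preformat>
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer(probe, lr, steps_per_epoch, warmup_epochs=5, max_epochs=50):
    """AdamW with linear warm-up for 5 epochs, then cosine annealing to epoch 50."""
    opt = AdamW(probe.parameters(), lr=lr)
    warmup = LinearLR(opt, start_factor=0.01, total_iters=warmup_epochs * steps_per_epoch)
    cosine = CosineAnnealingLR(opt, T_max=(max_epochs - warmup_epochs) * steps_per_epoch)
    sched = SequentialLR(opt, schedulers=[warmup, cosine],
                         milestones=[warmup_epochs * steps_per_epoch])
    return opt, sched

# Learning rate selected per dataset/method from the coarse grid on a validation set
lr_grid = [0.5, 0.05, 0.005, 0.0005, 0.00005]
        </preformat>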
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Additional Results</title>
      <sec id="sec-9-1">
        <title>B.1. Hallucination Detection: Expanded Results for Table 1</title>
      </sec>
      <sec id="sec-9-2">
        <title>B.2. Analysis of Train-time Batching Strategy of Sampled Responses</title>
        <p>In table 10, we compare the effect of arranging sampled responses of the same prompt to be in the
same training batch, denoted as prompt-wise sampling (pw), against the strategy of randomly sampling
each batch from the set of all sampled responses across prompts, denoted as random sampling (rs). On
greedy responses, we observe minor improvements using the prompt-wise sampling strategy for each
method. On sampled responses, we observe significant gains using the prompt-wise sampling strategy.
For the experiments reported in the main-text we use the prompt-wise sampling strategy when training
on sampled responses.</p>
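        <p>The prompt-wise batching strategy can be implemented by shuffling prompts rather than individual responses, as in the sketch below; the data layout (one list of (features, label) pairs per prompt) is an assumption.</p>
        <preformat>
import random

def prompt_wise_batches(responses_per_prompt, batch_size=128, seed=0):
    """Yield training batches in which all responses to a prompt stay together.

    responses_per_prompt: list with one inner list of (features, label) pairs
    per prompt, covering the greedy response and the sampled responses.
    """
    rng = random.Random(seed)
    order = list(range(len(responses_per_prompt)))
    rng.shuffle(order)                       # shuffle prompts, not individual responses
    batch = []
    for idx in order:
        group = responses_per_prompt[idx]
        if batch and len(batch) + len(group) &gt; batch_size:
            yield batch
            batch = []
        batch.extend(group)
    if batch:
        yield batch
        </preformat>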
      </sec>
      <sec id="sec-9-3">
        <title>B.3. Hallucination Mitigation: CLAP with Random Sampling</title>
      </sec>
      <sec id="sec-9-4">
        <title>B.4. Hallucination Mitigation: CLAP versus Last Layer Probing</title>
        <p>Table 12 compares mitigation using +CLAP-I against using baseline probes. We see that +CLAP-I results
in better overall non-hallucination rates compared to the two baselines and that this stems from the
higher H-&gt;NH replacements using +CLAP-I. Tables 13 and 14 compare mitigation using +CLAP-II
against using baseline probes. We see that +CLAP-II results in better overall non-hallucination rates,
while maintaining comparable abstention rates, and that again the improvement stems from the higher
H-&gt;NH replacements using +CLAP-II.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>C. Design Ablations</title>
      <sec id="sec-10-1">
        <title>C.1. Comparing CLAP with Token-wise Attention-pooling</title>
        <p>
          In table 15 we compare CLAP with attention pooling [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which implements a learnable query vector
followed by softmax pooling to aggregate token-wise activations at each layer before training a logistic
regression probe on the pooled activation vector. Following the original work, we train 2L
attention-pooling probes, where L denotes the number of LLM decoder layers and probes are trained at both
layer output as well as attention output (after residual connection) positions. After training the 2L
probes, the individual probe weights are frozen and an ensemble logistic regression probe is trained
on the output of the individual probes. Att-Pool (MA) denotes the best individual probe out of 2L
probes (chosen using in-distribution validation data), while Att-Pool-Ens denotes the ensemble probe.
We implement attention pooling with 20 tokens, taking either the last 20 or padding to 20 with zero
vectors, as required. (We train all probes, including CLAP, on 2000 samples instead of the 5000 samples
used for the main experiments, due to the GPU memory constraint of loading token-wise activations for
all layers when training.) We find that while token-wise attention pooling slightly outperforms CLAP on
in-distribution testing, CLAP significantly outperforms it in the out-of-distribution setting, demonstrating
its superiority.
        </p>
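        <p>For reference, a minimal sketch of the token-wise attention-pooling operator described above (learnable query followed by softmax pooling over the token activations at one layer) is given below; the dimensions are illustrative.</p>
        <preformat>
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learnable-query softmax pooling over token-wise activations (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))

    def forward(self, token_acts: torch.Tensor) -&gt; torch.Tensor:
        # token_acts: (batch, n_tokens, d_model) activations at one layer,
        # e.g. the last 20 tokens, zero-padded when the response is shorter.
        weights = torch.softmax(token_acts @ self.query, dim=1)    # (batch, n_tokens)
        return (weights.unsqueeze(-1) * token_acts).sum(dim=1)     # pooled (batch, d_model)

pooled = AttentionPool(4096)(torch.randn(8, 20, 4096))  # then fed to a logistic-regression probe
        </preformat>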
      </sec>
      <sec id="sec-10-2">
        <title>C.2. Analysis of Hyper-parameter Choices for CLAP</title>
        <p>Table 16 reports the effect of varying the two architectural hyper-parameters d_proj and n_enc on the
validation data for the Alpaca 7B, Vicuna 7B, Gemma 2B and Llama3.1-Instruct 8B models.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>D. Inference cost</title>
      <p>Table 17 shows the memory and computation cost at inference time for the compared hallucination
detection methods, measured in terms of the number of parameters and the number of floating point
operations (flops), respectively. For the black-box methods that involve additional response sampling,
flops for generating one output token is estimated using the standard formula for transformers - 2 x N,
where N denotes the number of parameters of the LLM. The total cost of detection then involves the
cost of generating k additional samples of T tokens each and the cost of NLI-based/prompt-based
comparison of the greedy response against each of the k sampled responses. For Self-Check NLI,
the recommended DeBERTa-v3-large-mnli model is assumed. For Self-Check Prompt, a single-token
YES/NO response is assumed. Unsurprisingly, the probing-based methods are significantly more compute
efficient than the black-box methods. Amongst the probing-based methods, while CLAP increases the
compute cost, this is still negligible compared to performing black-box detection.</p>
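      <p>The 2N-flops-per-token estimate translates into the following back-of-the-envelope calculation; the sample counts, token lengths and the treatment of the NLI checker are illustrative assumptions.</p>
      <preformat>
def blackbox_detection_flops(n_llm_params, k_samples, tokens_per_sample,
                             n_checker_params=0.0, checker_tokens=1):
    """Rough inference cost of sampling-based detection, counting ~2*N flops
    per token processed by a transformer with N parameters (illustrative)."""
    sampling = 2 * n_llm_params * k_samples * tokens_per_sample    # draw k extra samples
    checking = 2 * n_checker_params * k_samples * checker_tokens   # compare greedy vs. each sample
    return sampling + checking

# e.g. a 7B LLM, 5 samples of 32 tokens, checked with a 304M-parameter NLI model
print(f"{blackbox_detection_flops(7e9, 5, 32, 304e6, 64):.2e} flops")
      </preformat>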
      <p>Table 17 (# Params): AH: 128; LP/SEP/Most Accurate: 4K; NLP: 1.1M; Most Confident/Majority Voting: 135K; CLAP (n_enc = 1, d_proj = 128): 826K; CLAP (n_enc = 2, d_proj = 128): 1.1M; Self-Check NLI: 7B + 304M; Self-Check Prompt: 7B.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kossen</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>M.</given-names>
            <surname>Razzak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <article-title>Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024</article-title>
          . URL: https://arxiv.org/abs/2406.15927. arXiv:
          <volume>2406</volume>
          .
          <fpage>15927</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azaria</surname>
          </string-name>
          , T. Mitchell,
          <article-title>The internal state of an LLM knows when it's lying</article-title>
          ,
          <source>in: The 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=y2V6YgLaW7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Discovering latent knowledge in language models without supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03827</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Dola: Decoding by contrasting layers improves factuality in large language models</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=
          <fpage>Th6NyL07na</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <article-title>Inference-time intervention: Eliciting truthful answers from a language model</article-title>
          ,
          <source>in: Thirty-seventh Conference on Neural Information Processing Systems</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=aLLuYpn83y.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          wen Dong,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation</article-title>
          ,
          <source>ArXiv abs/2311</source>
          .17911 (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:265498818.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <article-title>Confident adaptive language modeling</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>17456</fpage>
          -
          <lpage>17472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caciularu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>45</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .emnlp-main.3/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .emnlp-main.
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Montasser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sous</surname>
          </string-name>
          , G. Velegkas,
          <article-title>(im)possibility of automated hallucination detection in large language models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2504.17004. arXiv:
          <volume>2504</volume>
          .
          <fpage>17004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          , Selfcheckgpt:
          <article-title>Zero-resource black-box hallucination detection for generative large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08896</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mündler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation</article-title>
          ,
          <source>arXiv preprint arXiv:2305.15852</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhuliawala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Komeili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Chain-of-verification reduces hallucination in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2309.11495</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Farquhar</surname>
          </string-name>
          ,
          <article-title>Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation</article-title>
          ,
          <source>ArXiv abs/2302</source>
          .09664 (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:257039062.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          , H. Cheng,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zavalny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kailkhura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Shifting attention to relevance: Towards the uncertainty estimation of large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2307</volume>
          .
          <fpage>01379</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Haloscope: Harnessing unlabeled LLM generations for hallucination detection</article-title>
          ,
          <source>in: The Thirty-eighth Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=nfK0ZXFFSn.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , C.-Y. Hsieh,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.07071. arXiv:2407.07071.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yuksekgonul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nushi</surname>
          </string-name>
          ,
          <article-title>Attention satisfies: A constraint-satisfaction lens on factual errors of language models</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=gfFVATfPd.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferrando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. B.</given-names>
            <surname>Obeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajamanoharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nanda</surname>
          </string-name>
          ,
          <article-title>Do i know this entity? knowledge awareness and hallucinations in language models</article-title>
          ,
          <source>in: The Thirteenth International Conference on Learning Representations</source>
          ,
          <year>2025</year>
          . URL: https://openreview.net/forum?id=WCRQFlji2q.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>Trusting your evidence: Hallucinate less with context-aware decoding</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.14739.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Latent retrieval for weakly supervised open domain question answering</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>6086</fpage>
          -
          <lpage>6096</lpage>
          . URL: https://www.aclweb.org/anthology/P19-1612. doi:10.18653/v1/P19-1612.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension</article-title>
          ,
          <source>arXiv e-prints</source>
          (
          <year>2017</year>
          ). arXiv:1705.03551.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Segal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics (TACL)</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gulrajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dubois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Stanford Alpaca: An instruction-following LLaMA model</article-title>
          , https://github.com/tatsu-lab/stanford_alpaca,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality</article-title>
          ,
          <year>2023</year>
          . URL: https://lmsys.org/blog/2023-03-30-vicuna/.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hardin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dadashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhupatiraju</surname>
          </string-name>
          , et al.,
          <source>Gemma: Open models based on gemini research and technology</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.08295. arXiv:2403.08295.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . URL: https://doi.org/10.1145/2629489. doi:10.1145/2629489.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>CH-Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Durme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kedzie</surname>
          </string-name>
          ,
          <article-title>Do androids know they're only dreaming of electric sheep?</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2024</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>4401</fpage>
          -
          <lpage>4420</lpage>
          . URL: https://aclanthology.org/2024.findings-acl.260/. doi:10.18653/v1/2024.findings-acl.260.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Evaluating open-QA evaluation</article-title>
          ,
          <source>in: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=UErNpveP6R.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          Llama3.
          <fpage>1</fpage>
          -Instruct 8B
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>