<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Layer Attention Probing for Fine-Grained Hallucination Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Malavika Suresh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rahaf Aljundi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ikechukwu Nkisi-Orji</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nirmalie Wiratunga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Robert Gordon University</institution>
          ,
          <addr-line>Aberdeen</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Toyota Motor Europe</institution>
          ,
          <addr-line>Brussels</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>With the large-scale adoption of Large Language Models (LLMs) in various applications, there is a growing reliability concern due to their tendency to generate inaccurate text, i.e. hallucinations. In this work, we propose Cross-Layer Attention Probing (CLAP), a novel activation probing technique for hallucination detection, which processes the LLM activations across the entire residual stream as a joint sequence. Our empirical evaluations using five LLMs and three tasks show that CLAP improves hallucination detection compared to baselines on both greedy decoded responses as well as responses sampled at higher temperatures, thus enabling fine-grained detection, i.e. the ability to disambiguate hallucinations and non-hallucinations among different sampled responses to a given prompt. This allows us to propose a detect-then-mitigate strategy using CLAP to reduce hallucinations and improve LLM reliability compared to direct mitigation approaches. Finally, we show that CLAP maintains high reliability even when applied out-of-distribution.</p>
      </abstract>
      <kwd-group>
        <kwd>hallucination detection</kwd>
        <kwd>activation probing</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Hallucination detection is not possible without both positive and negative examples. First, across five
LLMs and three tasks (two factual question-answering tasks and one chain-of-thought reasoning task),
we show that our method improves over uncertainty baselines and activation probing methods that
consider only individual layers.</p>
      <p>
        Next, we build on the observation that different responses sampled for a given prompt can vary,
with some being hallucinated and others not. We leverage the responses in the sampled space to
augment the training data. We find that our proposed method, by learning to attend to different layers,
can leverage this fine-grained supervision signal better to provide improved fine-grained detection
performance compared to baselines. We further explore the integration of our method with hallucination
mitigation pipelines, such as DoLa [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Noting that mitigation methods can adversely affect originally
non-hallucinated samples, we combine CLAP with DoLa. Our results demonstrate that this combination
significantly reduces the hallucination rate.
      </p>
      <p>
        To support our approach of attending to activations across layers, we rigorously evaluate our method
against various strategies for selecting probes at different layers. We conduct tests using cases from
domains different from the training data to assess the generalisation of layer-based probes compared to
our proposed solution. Our results show that CLAP provides significant gains over probing at different
layers when prompts fall outside the domains of the training samples. Notably, CLAP also improves
over Semantic Entropy Probes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which have been shown to generalise well.
      </p>
      <p>In summary, this paper makes the following contributions:
1. A novel probing technique, called Cross-Layer Attention Probing (CLAP), which consists of an
attention mechanism operating on the LLM residual stream, is proposed for improving hallucination
detection.
2. CLAP improves fine-grained detection of hallucinations among different responses sampled for
the same prompt, helping reduce model hallucinations.
3. On an out-of-distribution study, CLAP improves over probes constructed at individual layers.</p>
      <p>The rest of the paper is structured as follows. Section 2 describes methods used in prior work for
hallucination detection and mitigation. Section 3 describes our proposed approach, CLAP, and the
methodology for fine-grained detection and mitigation. Section 4 evaluates CLAP against baselines.
Section 5 provides an analysis of out-of-distribution generalisation. Finally, we perform an ablation
study of the design in section 6 before concluding with a discussion on future work in section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. LLM-based detection and mitigation (black box)</title>
        <p>
          Black-box methods assume no access to model internals and therefore rely on additional LLM-prompting.
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] proposed the use of a consistency check among different responses sampled for a given prompt
as a measure of hallucination. Here the assumption is that when an LLM hallucinates, the sampled
responses would be inconsistent with each other. This evaluates whether for a given prompt, any
given response from the LLM can be trusted, and is therefore not suited to identify non-hallucinating
responses within the sampled space for a prompt as well as in cases where multiple different answers
are valid. Other works [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ] have shown that LLMs can be prompted to detect hallucinations in
outputs by the same or different LLM. This relies on the LLM having a good reasoning ability and is
therefore often restricted to large models, introducing additional cost and latency.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Uncertainty estimation (grey-box)</title>
        <p>
          Uncertainty estimation methods [
          <xref ref-type="bibr" rid="ref10 ref13 ref14">10, 13, 14</xref>
          ] use the probabilities of the generated output tokens to
measure the confidence or uncertainty in the generation, using a threshold to classify low confidence
outputs as hallucinations. However, identifying an appropriate threshold is often challenging, especially
for long output sequences. Instead of considering the uncertainty in a single LLM generated response,
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] propose to measure the semantic uncertainty in a set of responses sampled from the LLM for a given
prompt. The authors show that a high semantic uncertainty (i.e. high conceptual variety) in sampled
responses is a good indication of hallucination. The confidence estimate in this case is similar to the
consistency measure and has the same pitfalls mentioned above. Overall, current uncertainty estimates,
by relying purely on the output probabilities, remain naive approaches to hallucination detection.
        </p>
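        <p>As an illustration of this family of methods, the confidence score used by the predictive-entropy baseline in our later experiments can be approximated from the log-probabilities of the generated tokens; the sketch below (a length-normalised negative log-likelihood, assuming PyTorch) is our own illustrative formulation rather than the exact estimator of the cited works.</p>
        <preformat>
import torch

def predictive_entropy(token_logprobs: torch.Tensor) -&gt; float:
    """Length-normalised negative log-likelihood of one generated sequence.

    token_logprobs holds the log-probability of each token that was actually
    generated. Higher scores mean lower model confidence, so thresholding the
    score flags likely hallucinations (illustrative formulation only).
    """
    return (-token_logprobs).mean().item()

# Example with two hypothetical generations
confident = torch.log(torch.tensor([0.9, 0.8, 0.95]))
uncertain = torch.log(torch.tensor([0.4, 0.3, 0.2]))
assert predictive_entropy(uncertain) &gt; predictive_entropy(confident)
        </preformat>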
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Activation probing (open-box)</title>
        <p>
          Recent works [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref31 ref5">3, 2, 5, 1</xref>
          ] have focussed on building hallucination detectors or probes using the LLM
activations at generation time. While ITI [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] constructs the probes using activations at the output
of attention heads, CCS [
          <xref ref-type="bibr" rid="ref3 ref31">3</xref>
          ], SAPLMA [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], Semantic Entropy probing [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and HaloScope [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] use the
activations at the output of the transformer decoder block (layer activations). Unlike ITI and SAPLMA,
where probes are trained in a supervised manner using a dataset of labelled responses, HaloScope and
CCS train the probes in an unsupervised manner. Some works provide interpretability - [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] show that
hallucinations with respect to input context are caused by the LLM attending to generated tokens rather
than context tokens, while [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] show that lack of attention to entity tokens is indicative of lack of entity
knowledge. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] show that there exist latent directions in the layer activation space that correspond to
notions of "I know this entity" and "I don’t know this entity". Unlike these prior works that focus on
activations at specific points/layers during decoding, in this paper we propose an approach to improve
detection by extracting a signature of hallucination across the entire residual stream.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Activation editing (open-box)</title>
        <p>
          Several studies aim to mitigate hallucinations during the decoding process by manipulating model
activations [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], adjusting token output probabilities [
          <xref ref-type="bibr" rid="ref19 ref4">19, 4</xref>
          ] or modifying output logits [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In ITI [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
model activations are shifted towards a direction associated with ‘truthfulness’. CAD [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] contrasts
output token probabilities generated with and without the input context to obtain new token probabilities
that are expected to be more aligned with the input context. DoLa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] builds on the early exit strategy
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and contrasts the output probabilities of the final layer with those of the intermediate layers. In
Opera [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the final layer logits are modified with a penalty term that discourages the model from
attending to summary tokens in long-form generation tasks.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>
        In this section, we propose a novel probing technique that incorporates a learning mechanism over
the activations at different layers. Several works have indicated that due to the residual connections
in transformers, the outputs of individual LLM layers can be considered to be in the same embedding
space [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Building on this stream of work, here we explore how the activation pattern across LLM
layers can be exploited to better detect hallucinations as compared to looking at activation patterns of
individual layers. Unlike [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], where the authors propose to contrast the output of the final LLM layer
against that of intermediate layers as an alternative decoding strategy, here we leverage the residual
stream for improving hallucination detection. To this end, we propose cross-layer attention probing
(CLAP), which takes as input the activations across all LLM layers when generating a given token.
Notations Consider a dataset D = {X, Y} of prompts X and corresponding LLM responses Y. For
a given prompt x ∈ X passing through an LLM with L layers, let a_l ∈ R^d represent the activation
vector at layer l of the LLM, where d is the LLM activation dimension. Following prior work
[
        <xref ref-type="bibr" rid="ref2 ref3 ref31 ref5">5, 2, 3</xref>
        ], we probe the activations when generating the last token of the LLM output response (EOS
token). Let y represent the binary label of hallucination/non-hallucination for the corresponding LLM
response r ∈ Y. We assume that ground truth correct answers to prompts are available and compare
the model generated response to ground truth to obtain this label.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Cross-Layer Attention Probing (CLAP)</title>
        <p>Figure 1 depicts our proposed probing method. First, we consider the set of all layer level activations
{a_l} as forming a set of input tokens. The tokens are arranged in the same order as the LLM layers (i.e.
the residual stream) in order to be processed jointly as a sequence input. Depending on the dimensions
of the LLM being probed, the sequence input can get very large, increasing computational costs. In
order to allow scaling the method to larger LLMs, the activations are passed through a learnable
down-projection layer at the start to produce a′_l ∈ R^d_proj.</p>
        <p>The down-projected sequence input is then fed through a transformer encoder block, with n_enc
encoder layers (we experiment with n_enc ∈ {1, 2}), each consisting of a self-attention module and a
feed-forward network. The role of this encoder block is to learn to extract a pattern of hallucination
across the residual stream by attending differently to activations of different layers and thus learn an
embedding vector that better separates hallucinating and non-hallucinating responses. To extract this
information, we employ a learnable CLS token at the start of the sequence input. This transforms
the setting into a supervised classification problem, and the transformer embedding output at the
CLS position is then fed to a linear classifier layer and trained with binary cross-entropy using the
supervision signal y.</p>
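        <p>To make the architecture concrete, the following is a minimal PyTorch sketch of a CLAP probe as described above; the layer names, head count and tensor shapes are illustrative assumptions rather than the reference implementation.</p>
        <preformat>
import torch
import torch.nn as nn

class CLAPProbe(nn.Module):
    """Cross-Layer Attention Probe (illustrative sketch).

    Input: activations of shape (batch, n_llm_layers, d_model), i.e. the
    residual-stream activation at every LLM layer for the probed (EOS) token.
    """

    def __init__(self, d_model: int, d_proj: int = 128, n_enc: int = 2, n_heads: int = 4):
        super().__init__()
        self.down_proj = nn.Linear(d_model, d_proj)          # learnable down-projection
        self.cls = nn.Parameter(torch.zeros(1, 1, d_proj))   # learnable CLS token
        enc_layer = nn.TransformerEncoderLayer(d_model=d_proj, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_enc)
        self.classifier = nn.Linear(d_proj, 1)               # binary hallucination logit

    def forward(self, layer_acts: torch.Tensor) -&gt; torch.Tensor:
        x = self.down_proj(layer_acts)                        # (B, L, d_proj)
        cls = self.cls.expand(x.size(0), -1, -1)              # prepend one CLS per sample
        x = self.encoder(torch.cat([cls, x], dim=1))          # (B, L + 1, d_proj)
        return self.classifier(x[:, 0])                       # logit read at the CLS position

# Training step sketch: binary cross-entropy on hallucination labels y
probe = CLAPProbe(d_model=4096)
acts = torch.randn(8, 32, 4096)                               # e.g. a 32-layer, 4096-dim LLM
y = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCEWithLogitsLoss()(probe(acts), y)
loss.backward()
        </preformat>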
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Leveraging Hallucinations in the Sampled Space for Fine-Grained Detection</title>
        <p>
          The sampled response space for a given prompt can contain both hallucinations and non-hallucinations,
indicating that correct entity/information can in fact exist in the residual stream even when the most
confident generation is incorrect [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Given that our proposed probing mechanism attends to the
activations across the entire residual stream, we hypothesise that it can also be applied for a fine-grained
detection of hallucination among responses sampled for the same prompt. In order to guide the probe
training for fine-grained detection, we sample a set of k additional responses to each prompt at high
temperature, alongside the greedy decoded response. Each response is then labelled independently as
hallucination/non-hallucination. When including the sampled responses during training, all responses
generated for a given prompt are always arranged in the same batch - we ablate this choice against
random sampling in appendix B.2. We use CLAP trained on the set of all greedy and sampled responses
to prompts as the method for detecting hallucinations at the sample level, making it compatible with
different strategies of decoding/sampling responses.
        </p>
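        <p>As an illustration, the additional high-temperature samples could be drawn with the Hugging Face transformers API as sketched below; the model name, maximum generation length and labelling step are assumptions made for the example.</p>
        <preformat>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # any of the probed LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate_responses(prompt: str, k: int = 5):
    """Return the greedy response plus k high-temperature samples for one prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    greedy = model.generate(**inputs, do_sample=False, max_new_tokens=64)
    sampled = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=0.95,
                             num_return_sequences=k, max_new_tokens=64)
    new_tokens = lambda ids: tok.decode(ids[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return new_tokens(greedy[0]), [new_tokens(s) for s in sampled]

# Each returned response is then labelled independently against the gold answer,
# and all responses of the same prompt are placed in the same training batch.
        </preformat>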
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Hallucination Mitigation</title>
        <p>
          Strategies that aim to mitigate hallucinations by directly modifying activations or output token
probabilities during decoding can negatively impact the quality of original, non-hallucinated responses, as we
shall demonstrate in our experiments in section 4.2. A natural approach to address this issue is to couple
the hallucination mitigation strategy with hallucination detection. In this section, we discuss how
CLAP can be employed for this purpose. Given a fine-grained CLAP hallucination detector trained for a
given LLM, we use the macro-F1 score on an in-distribution validation set to determine a classification
threshold for binary hallucination label prediction. Then at test time, we generate responses with CLAP
as follows:
1. Generate greedy decoded response.
2. Classify whether the response is hallucinated using CLAP.
3. When classified as hallucination, generate an alternative response using either DoLa decoding
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] or random sampling.
4. Classify whether the alternate response is hallucinated using CLAP.
5. Abstain when both the greedy response and alternate response are classified as hallucination.
        </p>
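        <p>A minimal sketch of this five-step procedure is given below; generate_greedy, generate_dola and clap_score are hypothetical helpers standing in for the LLM decoding calls and the trained probe, and the threshold is the macro-F1-optimal value from the validation split.</p>
        <preformat>
ABSTAIN = "[abstain]"

def respond(prompt, generate_greedy, generate_dola, clap_score, threshold):
    """Detect-then-mitigate with CLAP (sketch of the five steps above).

    clap_score(prompt, response) returns a hallucination probability in [0, 1];
    threshold is chosen by maximising macro-F1 on an in-distribution validation set.
    """
    # 1. Greedy decoded response
    response = generate_greedy(prompt)
    # 2. Classify the greedy response with CLAP
    if clap_score(prompt, response) &lt; threshold:
        return response
    # 3. Flagged as hallucination: generate an alternative (DoLa decoding or sampling)
    alternative = generate_dola(prompt)
    # 4. Classify the alternative response with CLAP
    if clap_score(prompt, alternative) &lt; threshold:
        return alternative
    # 5. Abstain when both responses are classified as hallucinations
    return ABSTAIN
        </preformat>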
        <p>In summary, we combine default decoding with an alternate response on a per need basis to improve
hallucination mitigation, without the negative effects of directly applying mitigation strategies such as
DoLa. When the mitigation strategy is signalled to fail by CLAP, we abstain from responding, leading to
safer use of LLMs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental setup</title>
        <p>
          Section 4.1 describes the setup used for the main experiments. Section 4.2 presents the results.
Data Experiments are conducted on two open-domain question answering (QA) tasks - Natural
Questions (NQ) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and Trivia QA (TQA) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] - and one chain-of-thought (COT) reasoning task
Strategy QA (STR) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The LLMs are evaluated in a closed-book setting for each of the tasks. Prompt
formats used are shown in appendix A.1. For each prompt, greedy decoding is used to generate the
response. When generating additional sampled responses per prompt, sampling temperature and top_p
parameter are set to 1 and 0.95, respectively. See appendix A.2 for notes on data labelling and dataset
statistics. In appendix A.3, we ablate the rate of true hallucinations versus query refusals.
Models We use Llama-7B [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], Alpaca-7B [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], Vicuna-7B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], Gemma-2B [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] and
Llama3.1-Instruct8B [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] in our experiments.
        </p>
        <p>
          Implementation Details For CLAP, we set the linear projection dimension d_proj = 128 and
use a held-out validation set to select the number of encoder layers n_enc ∈ {1, 2}, keeping the memory
footprint low. We report results of varying d_proj in section 6. Further details are in appendix A.4.
Baselines Our main focus is in comparing the accuracy of probes that consider only the final layer
activations to that of probing techniques that consider multiple layers. Therefore, the main baselines are
(1) a linear probe LP and (2) a non-linear probe NLP [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] on the last layer activations. Additional baselines
are (3) Self-Check SC [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] (best result between NLI and Prompt versions using k ∈ {3, 5, 7, 10}), (4) a
classifier based on the predictive entropy PE [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] of the generated text and (5) a linear probe on the
attention head activations AH [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (best performing head identified using a held-out validation set).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
      </sec>
      <sec id="sec-4-3">
        <title>CLAP for fine-grained hallucination detection</title>
        <p>Table 1 compares the hallucination detection performance of CLAP against baselines. Dataset-wise expanded results are provided in appendix B.1
and comparison of inference cost is provided in appendix D. When testing on greedy responses, CLAP
trained on greedy responses (CLAP-g) generally improves over the baselines (SC, PE, AH-g, LP-g,
NLP-g), while including sampled responses at train-time can often provide further gains for CLAP
(CLAP-s). AH performs slightly better than CLAP on Gemma-2B and PE performs slightly better than
CLAP on Llama3.1-Instruct-8B. However, these baselines are inferior to CLAP when coupled with other
LLMs. When testing on sampled responses, we find that CLAP can leverage the sampled responses at
train-time (CLAP-s) better than the baselines (AH-s, LP-s, NLP-s) to improve fine-grained detection
consistently, providing gains of up to 1.5% (on TQA with Alpaca-7B and Gemma-2B). Though AH-s
performs slightly better than CLAP on average with Gemma-2B, CLAP couples more robustly with all
the LLMs, illustrating that it is agnostic to the LLM and widely applicable.
Improving hallucination mitigation with CLAP In this section, we show how fine-grained
detection using CLAP can help improve hallucination mitigation. Table 2 compares the percentage of
non-hallucinated responses using our approach of combining CLAP with mitigation (denoted +CLAP-II),
as described in section 3.3, alongside four baseline strategies, described below:
• Default (Def) Always use the greedy decoding strategy.
• Def+Abstain 1. Generate greedy decoded response. 2. Classify whether hallucinated using CLAP.
3. Abstain when classified as hallucination.
• Alternate (Alt) Always use an alternate, non-greedy decoded response. Here we use DoLa [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
• +CLAP-I 1. Generate greedy decoded response. 2. Classify whether hallucinated using CLAP.
3. When classified as hallucination, generate an alternate response.</p>
        <p>First, we see that with the +CLAP-I strategy, non-hallucination rate is generally improved over the
Default and Alternate strategies, with an overall average gain of 11.7% over Default and 4.7% over Alt.
Next, with the +CLAP-II strategy, we additionally detect hallucinations in the alternate response and
abstain if hallucinated. We see that +CLAP-II reduces the abstention rate significantly (by 24.5% on
average) compared to the Def+Abs strategy while consistently maintaining high non-hallucination rates.</p>
        <p>In figure 2a, we show the percentage of hallucinated greedy decoded responses that are replaced with
non-hallucinated responses and vice versa when using the DoLa mitigation approach. We find that
DoLa applied directly often negatively affects a significant percentage of the original non-hallucinated
responses (orange bars). In figure 2b, we show the ratio of the replacement rate when using CLAP-II
against the replacement rate when using DoLa directly. We see that CLAP-II significantly reduces the
NH-&gt;H replacements (orange bars) while generally maintaining a good H-&gt;NH replacement rate (blue
bars), thereby maximising the gains from DoLa.</p>
        <p>In appendix B.4, we show that mitigation using CLAP outperforms mitigation using baseline probes.</p>
        <p>Figure 2: (a) % H-&gt;NH (or NH-&gt;H) transitions using Alt. (b) Ratio of % H-&gt;NH (or NH-&gt;H) transitions using CLAP-II to % H-&gt;NH (or NH-&gt;H) transitions using Alt.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Attending to layers benefits generalisability</title>
      <p>
        In this section, we compare the out-of-distribution performance of CLAP to independent probes
constructed at each LLM layer when transferring from one domain to another. In addition to TQA
and NQ, we use three categories from wikidata [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] (city-country, player-date-birth, movie-cast). We
construct twenty train-test pairs using these five datasets, which allows us to capture a wide array of
generalisation scenarios. At train time, for an LLM with L layers, we construct L independent probes,
where each probe is a binary logistic regression classifier trained on the activations at one LLM layer
{a_l}, to predict a hallucination (H)/non-hallucination (NH) label y. At test time, to classify an LLM
response, we experiment with four strategies for selecting among the L probe predictions, as follows (a
code sketch of the MC and MV strategies is given after this list).
• Last layer Uses the probe trained on the last layer activations.
• Most Accurate Layer (MA) Uses the in-distribution validation split to select one out of the L
probes that performs best for the domain trained on.
• Most Confident Layer (MC) Instead of pre-selecting a probe at train-time as above, this strategy
measures the entropy of the predicted labels at each probe to then identify the probe with the
most confident prediction (i.e., least entropy) for a given sample at test-time.
• Majority Voting Across Layers (MV) Uses an ensemble setup where the final label for a sample
is given by the majority vote across all probes.
      </p>
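      <p>The sketch below illustrates the per-layer probes and the MC and MV selection strategies, assuming scikit-learn; array shapes and hyper-parameters are illustrative.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_layer_probes(acts, labels):
    """acts: (n_samples, n_layers, d) layer activations; labels: (n_samples,) H/NH."""
    return [LogisticRegression(max_iter=1000).fit(acts[:, l], labels)
            for l in range(acts.shape[1])]

def predict_most_confident(probes, acts):
    """MC: per sample, use the probe with the lowest predictive entropy."""
    probs = np.stack([p.predict_proba(acts[:, l])[:, 1]
                      for l, p in enumerate(probes)], axis=1)          # (n_samples, n_layers)
    entropy = -(probs * np.log(probs + 1e-12) + (1 - probs) * np.log(1 - probs + 1e-12))
    chosen = entropy.argmin(axis=1)
    return (probs[np.arange(len(probs)), chosen] &gt; 0.5).astype(int)

def predict_majority_vote(probes, acts):
    """MV: majority vote over all per-layer probe predictions."""
    votes = np.stack([p.predict(acts[:, l]) for l, p in enumerate(probes)], axis=1)
    return (votes.mean(axis=1) &gt; 0.5).astype(int)
      </preformat>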
      <p>
        In table 3, we show the % gain-over-baseline (AUC) achieved by CLAP over the probe selection
strategies as well as semantic entropy probes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which have been shown to generalise well. CLAP
not only outperforms other hallucination detection strategies on in-distribution samples but also
demonstrates generalisability to samples from domains not covered in the training set. This is a crucial
property - if hallucination detection deteriorates out-of-domain, the LLM is left with no guard.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Ablating design choices for CLAP</title>
      <p>First, in table 4, we assess the sensitivity of CLAP to the number of encoder layers used and input
dimensionality reduction. For TQA and NQ, increasing the projection dimensionality (d_proj) has
negligible effect while adding another encoder layer (n_enc = 2) can result in a slight gain. For STR,
performance is sometimes improved with higher projection dimensionality. We note that directly
using raw activations or projecting to high dimensions becomes prohibitively expensive for larger
LLMs. In this regard, we interpret our results as indicating that discriminative information for detecting
hallucinations is retained at lower dimensions, making the method viable for larger LLMs. We note that
CLAP with d_proj = 128 and n_enc = 2 has only 15K parameters for an LLM of 2B parameters.</p>
      <p>
        Next, the design of CLAP is ablated in table 5 by comparing to two alternative probes that also take
activations from all LLM layers but without any cross-layer attention mechanism. Maxpool denotes
element-wise max-pooling of all activations before training a linear classifier layer. Project + Concat
denotes use of a learnable down-projection layer on layer-wise activations followed by concatenation
before training a linear classifier layer. We see that Maxpool, though memory and compute-wise more
efficient, performs much worse than Project + Concat. This indicates the benefit of modelling layer-wise
activations jointly. As we increase the projection dimensions, the performance of Project + Concat
sometimes improves but memory/compute cost increases significantly. The benefit of performing
cross-layer attention is evident in the out-of-distribution tests, where CLAP (n_enc = 2) provides significant
gains (1) at comparable costs over Project + Concat (d_proj = 256) and (2) by trading computation for
memory efficiency over Project + Concat (d_proj = 4096*). In appendix C.1, CLAP is compared to
token-wise attention-pooling [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], showing again the advantage of CLAP in out-of-distribution testing.
      </p>
    </sec>
    <sec id="sec-conclusion">
      <title>7. Conclusion</title>
      <p>This work proposed a novel probing technique for detecting hallucinations in LLMs, called
Cross-Layer Attention Probing (CLAP), that takes the entire LLM residual stream as a sequence of input
tokens, with an attention mechanism operating over the layer-wise activations. CLAP outperforms
uncertainty baselines and probes that consider only individual layers. Further, leveraging responses in
the sampled space at train time helps CLAP achieve fine-grained detection between hallucinated and
non-hallucinated responses to the same prompt at test time. This allowed us to apply CLAP as a
fine-grained detector to reduce the LLM hallucination rate by sampling alternative responses to a given prompt
and distinguishing hallucinated outputs from non-hallucinated ones. Finally, an out-of-distribution
study revealed that attending to different layers enables CLAP to generalise more effectively.
      </p>
      <p>We focus on small LLMs of 2B-8B where hallucination is more prominent, making detection crucial.
Our ablation study indicates that hallucinations can still be detected after projecting to lower dimensions,
providing evidence for scaling CLAP to larger LLMs - we leave this to future work. While CLAP takes
input from all layers, we leave the investigation of the role of each layer within CLAP to future work.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Experiment Setup</title>
      <sec id="sec-8-1">
        <title>A.1. Prompt Formats</title>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Data Labelling and Dataset Statistics</title>
        <p>
          For Trivia QA and Natural Questions, each LLM response is labelled as hallucinated/non-hallucinated
using a rouge-1 cut-off of 0.3, following prior work [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ], where the rouge labels are validated against
human-annotated labels, finding an accuracy of 0.96. For StrategyQA, each LLM response is labelled by
matching the final answer produced after the COT against the gold reference of YES/NO, following
prior work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Table 6 shows the number of prompts used, number of additional responses sampled per
prompt at high temperature and the hallucination rate among greedy and sampled responses. We note
that since we use LLMs of at most 8B parameters, given the wide range of facts queried and with no
access to external information, such high hallucination rates are expected. For NQ with Llama3.1-I-8B,
we find a very high hallucination rate of &gt;95% among greedy responses and exclude this from the
analysis. We note that high hallucination rates for NQ are in line with observations in prior work [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
and are generally attributed to the difference between typical LLM pre-training data and the data used for
creating NQ (Google search queries).
        </p>
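        <p>A minimal sketch of this labelling step is shown below, assuming the rouge_score package; the handling of multiple gold aliases is an assumption made for the example.</p>
        <preformat>
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

def label_response(response: str, gold_answers: list, cutoff: float = 0.3) -&gt; int:
    """Return 1 for hallucination, 0 for non-hallucination.

    A response counts as correct if its best rouge-1 F1 against any gold
    answer reaches the 0.3 cut-off used in the paper (sketch only).
    """
    best = max(scorer.score(gold, response)["rouge1"].fmeasure for gold in gold_answers)
    return 0 if best &gt;= cutoff else 1

print(label_response("paris", ["Paris"]))   # -&gt; 0 (non-hallucination)
        </preformat>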
      </sec>
      <sec id="sec-8-3">
        <title>A.3. Response Refusal Rate</title>
        <p>Depending on the LLM, and particularly with instruction fine-tuned models, the LLM may sometimes
refuse to respond to queries, providing an "I don’t know" type response instead. In our experiments, we
are concerned with surfacing a factually correct response, when one exists, and therefore model the
problem as a binary classification task of non-hallucination-vs-all, treating both true hallucinations
as well as refusal responses under the same label. In order to validate that the hallucination label
category is not dominated by refusal responses, in table 7, we analyse the % responses containing any
of the following common refusal phrases - ["don’t know", "do not know", "don’t have", "do not have",
"can’t", "cannot", "unable"]. We find that this is only a small proportion of responses and further manual
inspection in fact indicates that the numbers reported are slight over-estimations since the phrases are
also used in non-refusal responses such as "Q: Who is featured on Puff Daddy's Can't Hold Me Down?
A: jimmy page is featured on puff daddy's can't hold me down." For Llama-3.1-Instruct 8B, being much
more capable at answering STR (see % Hallucinations in table 6), the over-estimation is higher since
the chain-of-thought reasoning often contains these phrases, e.g. "Q: Do you have to pass through circle
of lust to find Saladin in Dante’s Inferno? A: dante’s inferno is in three main circles: lust, gluttony, and
the rest. saladin is mentioned in limbo. limbo is not one of the main circles. so the answer is no. in fact, it
seems you have to pass through lust to get away from saladin. however, this is not an explicitly clear path
in dante’s inferno. it seems more likely that you simply cannot pass through lust to find saladin." .</p>
      </sec>
      <sec id="sec-8-4">
        <title>A.4. Implementation Details</title>
        <p>All probes are trained with a batch size of 128, using AdamW optimiser with linear warm-up for 5
epochs and cosine annealing for a maximum of 50 epochs. For each dataset and method, learning rate
is selected from a coarse grid search ∈ [0.5, 0.05, 0.005, 0.0005, 0.00005] using a held-out validation set.</p>
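        <p>A sketch of this training schedule in PyTorch is shown below; the exact composition of the warm-up and annealing schedulers is an assumption.</p>
        <preformat>
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

def build_optimizer(probe, lr, steps_per_epoch, warmup_epochs=5, max_epochs=50):
    """AdamW with linear warm-up for 5 epochs, then cosine annealing to epoch 50."""
    opt = AdamW(probe.parameters(), lr=lr)
    warmup = LinearLR(opt, start_factor=0.01, total_iters=warmup_epochs * steps_per_epoch)
    cosine = CosineAnnealingLR(opt, T_max=(max_epochs - warmup_epochs) * steps_per_epoch)
    sched = SequentialLR(opt, schedulers=[warmup, cosine],
                         milestones=[warmup_epochs * steps_per_epoch])
    return opt, sched

# Learning rate selected per dataset/method from the coarse grid on a validation set
lr_grid = [0.5, 0.05, 0.005, 0.0005, 0.00005]
        </preformat>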
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Additional Results</title>
      <sec id="sec-9-1">
        <title>B.1. Hallucination Detection: Expanded Results for Table 1</title>
      </sec>
      <sec id="sec-9-2">
        <title>B.2. Analysis of Train-time Batching Strategy of Sampled Responses</title>
        <p>In table 10, we compare the effect of arranging sampled responses of the same prompt to be in the
same training batch, denoted as prompt-wise sampling (pw), against the strategy of randomly sampling
each batch from the set of all sampled responses across prompts, denoted as random sampling (rs). On
greedy responses, we observe minor improvements using the prompt-wise sampling strategy for each
method. On sampled responses, we observe significant gains using the prompt-wise sampling strategy.
For the experiments reported in the main-text we use the prompt-wise sampling strategy when training
on sampled responses.</p>
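        <p>The prompt-wise batching strategy can be implemented by shuffling prompts rather than individual responses, as in the sketch below; the data layout (one list of (features, label) pairs per prompt) is an assumption.</p>
        <preformat>
import random

def prompt_wise_batches(responses_per_prompt, batch_size=128, seed=0):
    """Yield training batches in which all responses to a prompt stay together.

    responses_per_prompt: list with one inner list of (features, label) pairs
    per prompt, covering the greedy response and the sampled responses.
    """
    rng = random.Random(seed)
    order = list(range(len(responses_per_prompt)))
    rng.shuffle(order)                       # shuffle prompts, not individual responses
    batch = []
    for idx in order:
        group = responses_per_prompt[idx]
        if batch and len(batch) + len(group) &gt; batch_size:
            yield batch
            batch = []
        batch.extend(group)
    if batch:
        yield batch
        </preformat>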
      </sec>
      <sec id="sec-9-3">
        <title>B.3. Hallucination Mitigation: CLAP with Random Sampling</title>
      </sec>
      <sec id="sec-9-4">
        <title>B.4. Hallucination Mitigation: CLAP versus Last Layer Probing</title>
        <p>Table 12 compares mitigation using +CLAP-I against using baseline probes. We see that +CLAP-I results
in better overall non-hallucination rates compared to the two baselines and that this stems from the
higher H-&gt;NH replacements using +CLAP-I. Tables 13 and 14 compare mitigation using +CLAP-II
against using baseline probes. We see that +CLAP-II results in better overall non-hallucination rates,
while maintaining comparable abstention rates, and that again the improvement stems from the higher
H-&gt;NH replacements using +CLAP-II.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>C. Design Ablations</title>
      <sec id="sec-10-1">
        <title>C.1. Comparing CLAP with Token-wise Attention-pooling</title>
        <p>
          In table 15 we compare CLAP with attention pooling [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which implements a learnable query vector
followed by softmax pooling to aggregate token-wise activations at each layer before training a logistic
regression probe on the pooled activation vector. Following the original work, we train 2L
attention-pooling probes, where L denotes the number of LLM decoder layers and probes are trained at both
layer output as well as attention output (after residual connection) positions. After training the 2L
probes, the individual probe weights are frozen and an ensemble logistic regression probe is trained
on the output of the individual probes. Att-Pool (MA) denotes the best individual probe out of 2L
probes (chosen using in-distribution validation data), while Att-Pool-Ens denotes the ensemble probe.
We implement attention pooling with 20 tokens, taking either the last 20 or padding to 20 with zero
vectors, as required. (We train all probes, including CLAP, on 2000 samples instead of the 5000 samples
used for the main experiments, due to the GPU memory constraint of loading token-wise activations for
all layers when training.) We find that while token-wise attention pooling slightly outperforms CLAP on
in-distribution testing, CLAP significantly outperforms it in the out-of-distribution setting, demonstrating
its superiority.
        </p>
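        <p>For reference, a minimal sketch of the token-wise attention-pooling operator described above (learnable query followed by softmax pooling over the token activations at one layer) is given below; the dimensions are illustrative.</p>
        <preformat>
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learnable-query softmax pooling over token-wise activations (sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))

    def forward(self, token_acts: torch.Tensor) -&gt; torch.Tensor:
        # token_acts: (batch, n_tokens, d_model) activations at one layer,
        # e.g. the last 20 tokens, zero-padded when the response is shorter.
        weights = torch.softmax(token_acts @ self.query, dim=1)    # (batch, n_tokens)
        return (weights.unsqueeze(-1) * token_acts).sum(dim=1)     # pooled (batch, d_model)

pooled = AttentionPool(4096)(torch.randn(8, 20, 4096))  # then fed to a logistic-regression probe
        </preformat>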
      </sec>
      <sec id="sec-10-2">
        <title>C.2. Analysis of Hyper-parameter Choices for CLAP</title>
        <p>Table 16 reports the effect of varying the two architectural hyper-parameters d_proj and n_enc on the
validation data for the Alpaca 7B, Vicuna 7B, Gemma 2B and Llama3.1-Instruct 8B models.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>D. Inference cost</title>
      <p>Table 17 shows the memory and computation cost at inference time for the compared hallucination
detection methods, measured in terms of the number of parameters and the number of floating point
operations (flops), respectively. For the black-box methods that involve additional response sampling,
flops for generating one output token is estimated using the standard formula for transformers - 2 x N,
where N denotes the number of parameters of the LLM. The total cost of detection then involves the
cost of generating k additional samples of T tokens each and the cost of NLI-based/prompt-based
comparison of the greedy response against each of the k sampled responses. For Self-Check NLI,
the recommended DeBERTa-v3-large-mnli model is assumed. For Self-Check Prompt, a single-token
YES/NO response is assumed. Unsurprisingly, the probing-based methods are significantly more compute
efficient than the black-box methods. Amongst the probing-based methods, while CLAP increases the
compute cost, this is still negligible compared to performing black-box detection.</p>
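      <p>The 2N-flops-per-token estimate translates into the following back-of-the-envelope calculation; the sample counts, token lengths and the treatment of the NLI checker are illustrative assumptions.</p>
      <preformat>
def blackbox_detection_flops(n_llm_params, k_samples, tokens_per_sample,
                             n_checker_params=0.0, checker_tokens=1):
    """Rough inference cost of sampling-based detection, counting ~2*N flops
    per token processed by a transformer with N parameters (illustrative)."""
    sampling = 2 * n_llm_params * k_samples * tokens_per_sample    # draw k extra samples
    checking = 2 * n_checker_params * k_samples * checker_tokens   # compare greedy vs. each sample
    return sampling + checking

# e.g. a 7B LLM, 5 samples of 32 tokens, checked with a 304M-parameter NLI model
print(f"{blackbox_detection_flops(7e9, 5, 32, 304e6, 64):.2e} flops")
      </preformat>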
      <p>Table 17 (# Params): AH: 128; LP/SEP/Most Accurate: 4K; NLP: 1.1M; Most Confident/Majority Voting: 135K; CLAP (n_enc = 1, d_proj = 128): 826K; CLAP (n_enc = 2, d_proj = 128): 1.1M; Self-Check NLI: 7B + 304M; Self-Check Prompt: 7B.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kossen</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>M.</given-names>
            <surname>Razzak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <article-title>Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024</article-title>
          . URL: https://arxiv.org/abs/2406.15927. arXiv:
          <volume>2406</volume>
          .
          <fpage>15927</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azaria</surname>
          </string-name>
          , T. Mitchell,
          <article-title>The internal state of an LLM knows when it's lying</article-title>
          ,
          <source>in: The 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=y2V6YgLaW7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Discovering latent knowledge in language models without supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03827</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Dola: Decoding by contrasting layers improves factuality in large language models</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=
          <fpage>Th6NyL07na</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Viégas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pfister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wattenberg</surname>
          </string-name>
          ,
          <article-title>Inference-time intervention: Eliciting truthful answers from a language model</article-title>
          ,
          <source>in: Thirty-seventh Conference on Neural Information Processing Systems</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=aLLuYpn83y.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          wen Dong,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation</article-title>
          ,
          <source>ArXiv abs/2311</source>
          .17911 (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:265498818.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <article-title>Confident adaptive language modeling</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>17456</fpage>
          -
          <lpage>17472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Caciularu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>45</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .emnlp-main.3/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .emnlp-main.
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Montasser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sous</surname>
          </string-name>
          , G. Velegkas,
          <article-title>(im)possibility of automated hallucination detection in large language models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2504.17004. arXiv:
          <volume>2504</volume>
          .
          <fpage>17004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          , Selfcheckgpt:
          <article-title>Zero-resource black-box hallucination detection for generative large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08896</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mündler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vechev</surname>
          </string-name>
          ,
          <article-title>Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation</article-title>
          ,
          <source>arXiv preprint arXiv:2305.15852</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dhuliawala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Komeili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Chain-of-verification reduces hallucination in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2309.11495</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Farquhar</surname>
          </string-name>
          ,
          <article-title>Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation</article-title>
          ,
          <source>ArXiv abs/2302</source>
          .09664 (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:257039062.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          , H. Cheng,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zavalny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kailkhura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Shifting attention to relevance: Towards the uncertainty estimation of large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2307</volume>
          .
          <fpage>01379</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Haloscope: Harnessing unlabeled LLM generations for hallucination detection</article-title>
          ,
          <source>in: The Thirty-eighth Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=nfK0ZXFFSn.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , C.-Y. Hsieh,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Lookback lens: Detecting and mitigating contextual hallucinations in large language models using only attention maps</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.07071. arXiv:2407.07071.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yuksekgonul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chandrasekaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Naik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palangi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Nushi</surname>
          </string-name>
          ,
          <article-title>Attention satisfies: A constraint-satisfaction lens on factual errors of language models</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=gfFVATfPd.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferrando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. B.</given-names>
            <surname>Obeso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajamanoharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nanda</surname>
          </string-name>
          ,
          <article-title>Do i know this entity? knowledge awareness and hallucinations in language models</article-title>
          ,
          <source>in: The Thirteenth International Conference on Learning Representations</source>
          ,
          <year>2025</year>
          . URL: https://openreview.net/forum?id=WCRQFlji2q.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsvetkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>Trusting your evidence: Hallucinate less with context-aware decoding</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.14739.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Latent retrieval for weakly supervised open domain question answering</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>6086</fpage>
          -
          <lpage>6096</lpage>
          . URL: https://www.aclweb.org/anthology/P19-1612. doi:10.18653/v1/P19-1612.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension</article-title>
          ,
          <source>arXiv e-prints</source>
          (
          <year>2017</year>
          ). arXiv:1705.03551.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Geva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Segal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics (TACL)</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Taori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gulrajani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dubois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Stanford Alpaca: An instruction-following LLaMA model</article-title>
          , https://github.com/tatsu-lab/stanford_alpaca,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality</article-title>
          ,
          <year>2023</year>
          . URL: https://lmsys.org/blog/2023-03-30-vicuna/.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mesnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hardin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dadashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhupatiraju</surname>
          </string-name>
          , et al.,
          <source>Gemma: Open models based on gemini research and technology</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.08295. arXiv:2403.08295.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          , et al.,
          <source>The llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . URL: https://doi.org/10.1145/2629489. doi:10.1145/2629489.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>CH-Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Durme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kedzie</surname>
          </string-name>
          ,
          <article-title>Do androids know they're only dreaming of electric sheep?</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2024</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>4401</fpage>
          -
          <lpage>4420</lpage>
          . URL: https://aclanthology.org/2024.findings-acl.260/. doi:10.18653/v1/2024.findings-acl.260.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Evaluating open-QA evaluation</article-title>
          ,
          <source>in: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=UErNpveP6R.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          Llama3.
          <fpage>1</fpage>
          -Instruct 8B
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>