<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chen Amiraz</string-name>
          <email>chen.amiraz@tii.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florin Cuconasu</string-name>
          <email>cuconasu@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Filice</string-name>
          <email>filice.simone@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zohar Karnin</string-name>
          <email>zohar.karnin@tii.ae</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza University</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technology Innovation Institute</institution>
          ,
          <addr-line>Haifa</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <fpage>3</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>Retrieval Augmented Generation (RAG) systems often struggle with irrelevant passages that mislead LLMs during answer generation. This work introduces a comprehensive framework for quantifying and understanding the distracting nature of such passages. We propose a novel metric to measure passage-level distraction effects, demonstrating its robustness across different models. Our methodology combines retrieval-based approaches with controlled synthetic generation techniques that create distracting content spanning multiple categories. Through experimental validation on standard question-answering benchmarks, we show that passages with higher distraction scores consistently degrade model effectiveness, even when relevant content is present. Leveraging this framework, we construct an enhanced training dataset featuring systematically curated distracting passages. When fine-tuned on this dataset, LLMs demonstrate substantial improvements, achieving up to 7.5% accuracy gains over baselines trained on standard RAG data. Our contributions provide both theoretical insights into distraction mechanisms in RAG and practical solutions for developing more robust retrieval-augmented language models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The integration of retrieval mechanisms with Large Language Models (LLMs) has become a cornerstone
approach for addressing knowledge-intensive tasks [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. By incorporating external knowledge
through retrieved passages, RAG systems effectively mitigate hallucination issues [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and provide
access to current information beyond the model’s training data. However, the retrieval process may
introduce distracting passages that can mislead the generation process [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. Unlike completely
unrelated content, distracting passages exhibit semantic similarity to the input query while failing to
contain the correct answer. This subtle relationship creates a particularly problematic scenario: when
no relevant content is present, LLMs may generate responses based on misleading information rather
than abstaining from answering; whereas, when relevant content is available, distracting passages may
prevent the LLM from focusing on the correct information, leading to erroneous responses despite
having access to the right answer.
      </p>
      <p>Our work addresses these challenges by developing a systematic approach to identify distracting
passages and quantify their distracting effect. We demonstrate that despite the apparent model-dependency
of distraction susceptibility, the relative distracting effects of passages show remarkable consistency
across different LLMs, as evidenced by high correlation scores between models. Furthermore, we
validate our measure by showing that passages with higher distracting effect scores cause more significant
degradation in answer accuracy, even when relevant information is also available to the model.</p>
      <p>
        Our investigation encompasses two complementary methodologies for fetching distracting content.
First, we analyze passages obtained through various retrieval strategies, including a novel answer-skewed
approach designed to surface topically related but answer-irrelevant content. Second, we
generate distracting passages across predefined categories, drawing inspiration from established
taxonomies of problematic content [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. This dual approach enables both empirical analysis of naturally
occurring distracting content and controlled examination of specific distraction mechanisms.
      </p>
      <p>
        ⋆ This is an extended abstract of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Finally, we demonstrate that LLMs fine-tuned using training sets enriched with our identified
distracting passages achieve substantially improved accuracy compared to models trained on conventional
RAG datasets derived through standard retrieval methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Quantifying the Distracting Effect</title>
      <p>We introduce a quantitative framework for measuring the distracting effect of irrelevant passages. Our
approach evaluates a passage’s ability to mislead an LLM when the passage does not contain the answer
to a given query. We test this by tasking the model to respond with “NO-RESPONSE” when the passage
content is insufficient to answer the question. For a given query q and an irrelevant passage p, we
define the distracting effect DE_q(p) as:</p>
      <p>DE_q(p) = 1 − P_LLM(NO-RESPONSE | q, p).
This formulation captures the probability that an LLM will attempt to answer a query based on an
irrelevant passage rather than appropriately abstaining. The metric ranges from 0 (no distraction) to 1
(maximum distraction), providing an interpretable measure of the passage’s distracting effect.</p>
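      <p>As a minimal illustration (not the authors’ code), the distracting effect can be computed from the log-probabilities an LLM assigns to the tokens of the abstention string; the tokenization and log-probability values below are hypothetical:</p>
      <preformat>
```python
import math

def no_response_probability(token_logprobs):
    """P_LLM(NO-RESPONSE | q, p): the probability of emitting the
    abstention string, given as per-token log-probabilities."""
    return math.exp(sum(token_logprobs))

def distracting_effect(token_logprobs):
    """DE_q(p) = 1 - P_LLM(NO-RESPONSE | q, p); 0 means no
    distraction, 1 means maximum distraction."""
    return 1.0 - no_response_probability(token_logprobs)

# Hypothetical log-probs for the tokens of "NO-RESPONSE" under two passages.
weak_distractor = [-0.05, -0.02]   # the model almost certainly abstains
hard_distractor = [-1.5, -1.2]     # the model rarely abstains
```
      </preformat>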
      <p>Our analysis demonstrates remarkable consistency in distracting effect measurements across different
language model architectures. Despite varying model sizes and training procedures, we observe strong
correlations (Spearman coefficients ranging from 0.47 to 0.76) in distraction assessments across models
from different families (Llama, Falcon, and Qwen) ranging from 3B to 70B parameters. This suggests
that the distracting effect represents an intrinsic property of passages and that different LLMs share the
same weaknesses.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Analysis of Distracting Passages</title>
      <p>
        This section presents our empirical investigation into different methods for obtaining distracting
passages and analyzes their effectiveness in RAG systems. In this paper, we show results on the Natural
Questions (NQ) dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], employing Llama-3.2-3B and Llama-3.1-8B instruct models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Our
retrieval pipeline utilizes E5-base [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] with optional reranking via the BGE-M3-v2 cross-encoder [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
applied to the top-20 retrieved passages. Additional results across different models and datasets can
be found in our full paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In our analysis, a passage is considered relevant if it either explicitly
contains the ground truth answer or entails the hypothesis “the answer to {question} is {answer}” using
the NLI model from Honovich et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We systematically exclude such relevant passages when
computing distracting effect scores to ensure our measurements focus purely on irrelevant content.
      </p>
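      <p>The relevance criterion above can be sketched as follows; <monospace>nli_entails</monospace> is a stand-in callable for the NLI model of Honovich et al. [15], an assumption of this sketch:</p>
      <preformat>
```python
def is_relevant(passage, question, answer, nli_entails=None):
    """A passage counts as relevant if it explicitly contains the
    ground-truth answer, or if an NLI model judges that it entails
    the hypothesis "the answer to {question} is {answer}"."""
    if answer.lower() in passage.lower():
        return True
    if nli_entails is not None:
        hypothesis = f"the answer to {question} is {answer}"
        return nli_entails(passage, hypothesis)
    return False
```
      </preformat>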
      <p>To comprehensively analyze distracting passages, we examine both standard retrieval approaches
and novel techniques for obtaining highly distracting content, addressing cases where standard retrieval
does not return distracting passages.</p>
      <p>Regarding retrieval, we introduce an answer-skewed retriever that seeks passages topically related to
queries while avoiding answer-relevant content. This approach modifies the standard dense retrieval
embedding by subtracting the answer representation from the query embedding: e_sub(q, a) = e(q) − α·e(a),
where α controls the aggressiveness of answer exclusion (we set α = 1 in our experiments).</p>
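      <p>A minimal sketch of the answer-skewed scoring, assuming dense embeddings are available as plain vectors (the toy vectors below are illustrative, not E5 outputs):</p>
      <preformat>
```python
def subtract(q_emb, a_emb, alpha=1.0):
    """e_sub(q, a) = e(q) - alpha * e(a): skew the query embedding
    away from the answer direction before retrieval."""
    return [qi - alpha * ai for qi, ai in zip(q_emb, a_emb)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def rank_passages(q_emb, a_emb, passage_embs, alpha=1.0):
    """Rank passage indices by similarity to the skewed query vector,
    so topically related but answer-irrelevant passages surface first."""
    skewed = subtract(q_emb, a_emb, alpha)
    scored = sorted(enumerate(passage_embs),
                    key=lambda kv: cosine(skewed, kv[1]), reverse=True)
    return [i for i, _ in scored]
```
      </preformat>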
      <p>
        Additionally, we generate irrelevant passages using Claude 3.5 Sonnet V2.0. Following established
categorizations [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ], we generate four distinct types of distracting content: Related Topic passages
discussing query-adjacent subjects, Hypothetical scenarios presenting alternative answers,
Negation statements providing incorrect information in negated form, and Modal statements
expressing uncertainty about incorrect answers.
      </p>
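      <p>One way to drive such category-controlled generation is a per-category instruction template; the wording below is a hypothetical sketch, not the prompts actually sent to Claude 3.5 Sonnet:</p>
      <preformat>
```python
# Hypothetical per-category instructions for the distractor generator.
DISTRACTOR_PROMPTS = {
    "related":      "Write a passage about a topic adjacent to the question "
                    "that never states the answer.",
    "hypothetical": "Write a passage describing a hypothetical scenario "
                    "with an alternative, incorrect answer.",
    "negation":     "Write a passage asserting, in negated form, incorrect "
                    "information about the question.",
    "modal":        "Write a passage expressing uncertainty about an "
                    "incorrect answer (may, might, possibly).",
}

def build_prompt(category, question):
    """Compose a generation prompt for one distractor category."""
    return f"{DISTRACTOR_PROMPTS[category]}\nQuestion: {question}"
```
      </preformat>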
      <sec id="sec-3-1">
        <title>3.1. Ranking Effects and Method Comparison</title>
        <p>Figure 1a demonstrates a critical finding: higher-ranked passages consistently show greater distracting
effects than lower-ranked passages across all retrieval strategies. This pattern holds for both standard
dense retrieval and our answer-skewed approach. Notably, integrating a reranking
module amplifies this phenomenon. While reranking improves overall retrieval quality,
it paradoxically elevates the most distracting passages to top positions. This suggests that
contemporary retrieval systems tend to surface passages that are semantically related to queries but factually
misleading.</p>
        <p>Figure 1b reveals the complementary nature of our different approaches to identifying distracting
passages. For the retrieval-based methods, the analysis focuses on the first non-relevant passage
returned by each method, which is also the most distracting passage accessible through that approach,
as shown in Figure 1a. The vertical bars show the percentage of queries for which each method contributes
the most distracting passage. The blue portions highlight cases where no other method reaches the same
distracting effect. The substantial contribution from both retrieval-based and generation-based
methods indicates that a hybrid approach yields superior coverage of distracting passage types.
This diversity is particularly valuable for creating robust training datasets, as we will show in Section 4.</p>
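        <p>Per query, the winning method in Figure 1b, and whether its win is unique, can be computed as follows (a sketch with hypothetical scores):</p>
        <preformat>
```python
def most_distracting_source(de_by_method):
    """Given {method: DE score} for one query, return the method
    contributing the most distracting passage and whether it strictly
    beats every other method (a 'unique' win)."""
    best = max(de_by_method, key=de_by_method.get)
    best_score = de_by_method[best]
    unique = all(score < best_score
                 for m, score in de_by_method.items() if m != best)
    return best, unique
```
        </preformat>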
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Impact on Answer Generation Quality</title>
        <p>To validate our distracting effect measure, we conducted controlled experiments examining how passages
with varying distraction levels influence answer accuracy when combined with relevant content. Our
experiments account for positional bias [16, 17] by testing both ordering configurations (gold-first
and distracting-first) and reporting averaged results. We establish three experimental conditions to
demonstrate the progressive impact of distraction. First, when prompts include only the relevant
passage, baseline accuracy reaches 82.6 and 80.6 for Llama-3.2-3B and Llama-3.1-8B, respectively.
Second, adding a weak distractor (distracting effect smaller than 0.2) alongside the relevant passage
causes modest performance degradation, with accuracy declining to 79.4 and 80.1. Third, incorporating
a hard distractor (distracting effect greater than 0.8) produces a substantially greater impact, with accuracy
dropping to 71.5 and 73.9 for the same models.</p>
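        <p>The position-debiased evaluation amounts to averaging accuracy over the two orderings; a sketch with hypothetical 0/1 correctness flags:</p>
        <preformat>
```python
def accuracy(outcomes):
    """Fraction of correct answers in a list of 0/1 flags."""
    return sum(outcomes) / len(outcomes)

def position_averaged_accuracy(gold_first, distractor_first):
    """Average accuracy over the two passage orderings (gold-first and
    distracting-first) to cancel positional bias."""
    return (accuracy(gold_first) + accuracy(distractor_first)) / 2.0
```
        </preformat>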
        <p>These results validate that our proposed metric effectively identifies truly distracting passages. The
substantial accuracy gap between weak and hard distractors, with hard distractors causing accuracy
decreases of 6 to 11 percentage points, demonstrates the metric’s ability to distinguish between different
levels of distracting content.</p>
        <p>[Table: Answer accuracy averaged over all 4 test sets (NQ and TriviaQA among them). None is the non-fine-tuned
baseline; Retrieve, Rerank, and Hard are fine-tuning strategies. Metrics: (1) accuracy on ungrounded instances,
(2) accuracy on grounded instances, and (3) overall accuracy. Bold values indicate the highest per model and dataset.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. RAG Fine-Tuning</title>
      <p>Leveraging insights from our distracting effect analysis, we propose a training strategy to fine-tune
LLMs for improved robustness in RAG applications. Our approach constructs a training dataset using
800 queries from NQ, where each query is paired with 5 passages selected according to three distinct
strategies: (1) Retrieve uses the top 5 passages from standard dense retrieval without reranking;
(2) Rerank is the same as Retrieve, but with reranking enabled; and (3) Hard applies a mixed
composition where 50% of instances contain the first relevant passage from the reranker plus the 4
most distracting passages, and 50% contain the 5 most distracting passages obtained with the methods
from Section 3. Lastly, all passages are randomly shuffled to eliminate undesired positional biases
during training.</p>
      <p>Fine-tuning on the Hard dataset yields the strongest results, with gains of up to 7 percentage points.
The benefits prove particularly pronounced for ungrounded examples,
where no relevant passage appears in the prompt and correct answers must be retrieved from the
model’s parametric memory. This pattern suggests that training with distracting passages enhances
models’ ability to resist misleading contextual information while appropriately relying on their internal
knowledge when external context appears unreliable.</p>
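      <p>The Hard mixture described above can be sketched as follows; the passage contents and the seeded RNG are illustrative:</p>
      <preformat>
```python
import random

def build_hard_instance(relevant, distractors, rng):
    """One Hard training instance: with probability 0.5, the first
    relevant passage (from the reranker) plus the 4 most distracting
    passages; otherwise the 5 most distracting passages. Passages are
    shuffled to avoid positional bias during training.
    `distractors` must be sorted by distracting effect, descending."""
    if rng.random() < 0.5:
        passages = [relevant] + distractors[:4]
    else:
        passages = distractors[:5]
    rng.shuffle(passages)
    return passages

rng = random.Random(0)
instance = build_hard_instance("gold", [f"d{i}" for i in range(10)], rng)
```
      </preformat>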
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work introduces a formal framework for quantifying the distracting effect of irrelevant passages
in RAG systems and demonstrates its effectiveness across multiple LLMs. We reveal that stronger
retrieval systems paradoxically surface more highly distracting passages, reducing accuracy by up to 11
percentage points when included in LLM contexts. Leveraging this insight, we develop a comprehensive
approach combining retrieved and generated passages to create a challenging dataset with distracting
content. Fine-tuning on this dataset enhances model robustness, achieving up to 7.5% accuracy
improvements on question-answering benchmarks compared to conventional training approaches. We
believe that our framework for quantifying distracting effects will enable new approaches to robust
information retrieval in LLM-based systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Claude Sonnet 4 to check grammar and spelling.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was conducted while Florin Cuconasu was enrolled in the Italian National Doctorate on
Artificial Intelligence at Sapienza University of Rome. The project received support from the PNRR MUR
project PE0000013-FAIR.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] C. Amiraz, F. Cuconasu, S. Filice, Z. Karnin, The distracting effect: Understanding irrelevant passages in RAG, in: W. Che, J. Nabende, E. Shutova, M. T. Pilehvar (Eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 18228-18258. URL: https://aclanthology.org/2025.acl-long.892/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Reading wikipedia to answer open-domain questions</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          Long Papers)
          ,
          <year>2017</year>
          , pp.
          <fpage>1870</fpage>
          -
          <lpage>1879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          , N. De Cao,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , S. Riedel,
          <article-title>KILT: a benchmark for knowledge intensive language tasks</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2523</fpage>
          -
          <lpage>2544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodolà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          , et al.,
          <article-title>Fauno: The italian large language model that will leave you senza parole!</article-title>
          ,
          <source>in: CEUR WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>3448</volume>
          ,
          CEUR-WS
          ,
          <year>2023</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          , T.-S. Chua,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A survey on RAG meeting LLMs: Towards retrieval-augmented large language models</article-title>
          ,
          <source>in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>6491</fpage>
          -
          <lpage>6501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Yoran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolfson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Making retrieval-augmented language models robust to irrelevant context</article-title>
          ,
          <source>in: The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cuconasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Filice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>The power of noise: Redefining retrieval for RAG systems</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>719</fpage>
          -
          <lpage>729</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cuconasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Siciliano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Filice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          , et al.,
          <article-title>Rethinking relevance: How noise and distractors impact retrieval-augmented generation</article-title>
          ,
          <source>in: CEUR WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>3802</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basmov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , R. Tsarfaty,
          <article-title>LLMs' reading comprehension is affected by parametric knowledge and struggles with hypothetical statements</article-title>
          ,
          <source>arXiv preprint arXiv:2404.06283</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Abdumalikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Minervini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kementchedjhieva</surname>
          </string-name>
          ,
          <article-title>Answerability in retrieval-augmented opendomain question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2403.01461</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Redfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>Natural questions: a benchmark for question answering research</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>453</fpage>
          -
          <lpage>466</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          , et al.,
          <source>The Llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Text embeddings by weakly-supervised contrastive pre-training</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03533</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.03216
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Honovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aharoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Taitelbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kukliansy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Szpektor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassidim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matias</surname>
          </string-name>
          ,
          <article-title>TRUE: Re-evaluating factual consistency evaluation</article-title>
          ,
          <source>in: Proceedings of the</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>