<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating the Robustness of Embedding Models on Noisy Input Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Dinzinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Caspari</string-name>
          <email>laura.caspari@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kanishka Ghosh Dastidar</string-name>
          <email>kanishka.ghoshdastidar@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jelena Mitrović</string-name>
          <email>jelena.mitrovic@uni-passau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Passau</institution>
          ,
          <addr-line>Innstraße 41, 94032 Passau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Integrating dense retrieval into Retrieval-Augmented Generation systems has significantly enhanced the power of generative language models, especially in the context of conversational search. Despite this advancement, the efect of noisy and poorly structured input text, often drawn from web pages covering multiple topics or languages, has not been thoroughly investigated. This paper addresses that gap by evaluating the robustness of embedding models in handling documents under realistic, noisy conditions beyond the clean and concise golden passage found in most academic datasets. We therefore introduce a synthetic dataset based on the MS-MARCO Passage Collection, in which the original passages are concatenated with text snippets extracted from extraneous web pages. This dataset consists of multiple subcorpora, whose concatenated context varies in length, positioning and relevance to the original query-relevant passage. Our findings reveal two critical limitations to model robustness: ifrst, a positional bias in processing long documents, and second, a significant drop in retrieval performance when models are faced with semantically inconsistent input documents that combine query-relevant information with thematically unrelated (Out-of-Domain) text passages. Motivated by our observations, we evaluate various chunking methods regarding their ability to preserve contextual and semantic integrity during index creation but find them falling short of the Naive approach in several instances. The paper calls for further research into advanced chunking techniques and alternative retrieval strategies, to improve the robustness of embedding models in RAG frameworks. Our code, datasets and results are made available on GitHub1 and HuggingFace.2</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Retrieval-Augmented Generation</kwd>
        <kwd>Dense Retrieval</kwd>
        <kwd>Text Embeddings</kwd>
        <kwd>Document Chunking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Retrieval-Augmented Generation (RAG) has emerged as a powerful framework for enhancing
generative language models by incorporating information retrieval techniques [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. A core component
of most RAG pipelines is dense retrieval, which relies on embedding models to retrieve relevant
information. However, the performance of these embedding models is typically evaluated on a limited
set of benchmarks, such as MTEB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], whose datasets primarily feature clean and curated documents.
These documents contain minimal content beyond the query-relevant information, making them an
unrealistic representation of the noisy, heterogeneous data found in real-world applications.
      </p>
      <p>
        In recent years, web search has been transitioning from traditional result lists to more interactive
modes of information retrieval. In such scenarios of conversational search, the web itself acts as a
dynamic corpus for retrieving relevant content. This shift introduces new challenges in handling the
inherent heterogeneity of web data extracted from various sources. The quality of this corpus is often
compromised by suboptimal HTML parsing tools [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As conversational search increasingly relies on
dense retrievers, the robustness of these retrieval models on low-quality data becomes a significant
concern. Despite their wide application, e.g. in RAG systems, the question of how dense retrievers handle
noisy documents – those containing a query-relevant passage alongside extraneous information, either
related (In-Domain) or unrelated (Out-of-Domain) – remains underexplored. Therefore, investigating
the resilience of dense retrievers to semantically inconsistent web texts is critical for maintaining
reasonable performance.
      </p>
      <p>With our work, we aim to fill this gap by studying the impact of noisy input text on the performance
of recent embedding models. Through a series of experiments using our tailor-made synthetic dataset,
we demonstrate the extent to which text quality and structure afect the accuracy of dense retrieval
models. This paper makes the following contributions:
∙ Synthetic Dataset:</p>
      <p>We introduce a tailor-made synthetic dataset based on the MS-MARCO Passage Collection. The
dataset includes multiple subcorpora simulating semantically inconsistent web texts.
∙ Robustness Analysis of Embedding Models:</p>
      <p>We present a study of recent embedding models, focusing on their robustness when exposed
to noisy input texts that combine query-relevant passages with extraneous information. Our
experiments reveal two key constraints:
– Handling long documents introduces a model-dependent positional bias.
– The presence of topically unrelated information (Out-of-Domain context) significantly
degrades retrieval performance.
∙ Evaluation of Chunking Methods:</p>
      <p>We compare various chunking methods and assess their ability to preserve contextual coherence
when segmenting text documents. However, our findings suggest that existing approaches are
either inadequate or impractical, highlighting the need for further research to develop more
robust solutions.</p>
      <p>The remainder of this paper is organized as follows. In Section 2, we review related work on
Retrieval-Augmented Generation and embedding models in dense retrieval. In Section 3, we describe
the construction of our synthetic dataset and provide an overview of the embedding models selected
for evaluation. Section 4 discusses the experimental results and presents two key limitations of current
models. Section 5 compares chunking methods and evaluates their efectiveness in preserving contextual
coherence as a potential solution to overcome the models’ limitations. Finally, Section 6 concludes the
paper with a summary of our contributions and a discussion of open research questions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Dense retrieval in RAG systems</title>
        <p>
          Dense retrieval originated from advancements in deep learning, particularly in the application of neural
networks to map queries and documents into dense vector representations using
SentenceTransformers [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. In many cases, such models allow for more efective and scalable information retrieval compared
to traditional sparse methods like term-based matching. Well known dense retrieval techniques are,
e.g., Contriever [6] and DPR [7]. These two as well as more recent embedding models applicable in
RAG systems, such as the series of Snowflake Arctic models [ 8], are pre-trained, fine-tuned and/or
benchmarked on the prominent BEIR collection [9]. The respective datasets contain mostly clean and
concise passages, which are on average shorter than 512 tokens and thus small enough to fit in the
context window of most common embedding models. Hence, most prior work assumes clean and
well-structured documents, whereas real-world scenarios involve noisy, heterogeneous web data. The
impact of such noise on dense retrievers remains underexplored, and our work addresses this gap by
systematically generating controlled noisy data.
        </p>
        <sec id="sec-2-1-1">
          <title>2.2. The Power of Noise</title>
          <p>Investigating further on the cleanliness of text chunks, Cuconasu et al. study the Power of Noise [10].
Their work analyzes which types of documents a RAG system should retrieve for efective prompt
formulation, regarding the relevance of the retrieved documents to the query, their position and the
number included in the context. The authors cannot conclude to any clear strategy for integrating
retrieved documents into the input prompt of generative language models. However, they observe
that accuracy is higher when the gold document is positioned near the query, lower when it is placed
furthest from the query, and lowest when it appears in the middle of the context. Interestingly, even
the inclusion of irrelevant chunks can unexpectedly boost accuracy by more than 30%.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Study of retrieval units</title>
        <p>Previous works on dense retrieval consider paragraphs as a reasonable retrieval unit. Recent studies
in the context of RAG, however, explore very short as well as long retrieval units, both as seemingly
efective measures to increase the performance of RAG systems in regards to their Question-Answering
capabilities. On the one hand, Chen et al [11] introduce propositions as the most fine-granular unit of
segmenting and indexing the retrieval corpus. Propositions are defined as atomic and self-contained
expressions, each encapsulating a distinct factoid in natural language. The authors conclude that
selecting the proper retrieval granularity for a given query at inference time, including propositions,
sentences and passages, can improve the dense retrievers’ performance.</p>
        <p>On the other hand, Jiang et al [12] propose a new framework called LongRAG, consisting of a
longcontext retriever and a long-context reader. The authors want RAG systems to fully leverage the
long-context capabilities of recent LLMs. First, by extending the retrieval granularity up to documents
or even groups of documents, the number of elements in the corpus decreases drastically, helping the
retrieval performance. Second, as the entire original document, e.g. the main content of a web page, is
put in the context of the LLM, the model performs well on multi-hop questions due to its extensive
capabilities to reason over long input text. Instead of leaving the task of finding the “needle in the
haystack” to the retriever, the authors pledge for long-context LLMs to do the job and extract the
relevant information from the retrieved text in order to answer the user query accurately.</p>
        <p>Another recent approach that contributes to the study of retrieval units is Landmark Embeddings,
introduced by Luo et al. [13]. Unlike conventional chunking, which breaks the text into discrete
segments and potentially loses contextual coherence, Landmark Embeddings use a chunking-free
strategy by attaching a special token at the end of each sentence, called a landmark. These tokens
capture the semantic information of the sentence and its surrounding context, preserving a more unified
representation of long texts. This method again leverages the long-context capabilities of recent LLMs,
to improve the quality of dense retrievals without sacrificing context. However, both approaches,
LongRAG and Landmarks Embeddings, operate under the assumption of relatively clean input data and
do not systematically evaluate robustness against heterogeneous or noisy documents as encountered in
real-world conversational search.</p>
        <sec id="sec-2-2-1">
          <title>2.4. The Dwell in the Beginning efect</title>
          <p>Coelho et al. study the existence of a Dwell in the Beginning efect in Transformer-based models [ 14].
Dwell in the Beginning describes a positional bias in text representation learning, particularly in the
context of web document retrieval, in which vector embeddings of long documents tend to capture
information located at the beginning of the input text better. The authors observe that this phenomenon
starts emerging during unsupervised contrastive pretraining, and furthermore heavily relies on the
data distribution during fine-tuning. Coelho’s findings emphasize that embedding quality for long
inputs depends on various – some even counter-intuitive – factors. One such factor, among others,
is the positioning of the relevant information within the input sequence. This positional bias further
motivates our investigation into how the placement of relevant information in noisy contexts afects
retrieval performance.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.5. Chunking strategies</title>
        <p>During indexing, the first phase of a vector-based RAG pipeline, documents are processed, segmented,
and transformed into embeddings to be stored in a vector database. In this context, chunking refers to
the process of breaking down input text into smaller, manageable pieces [15]. This is often necessary
to ensure that long documents are split into portions that fit in the embedding model’s token limit.
However, chunking does more than just accommodate model constraints; it also enhances contextual
coherence within each segment.</p>
        <p>Naive chunking methods, such as splitting the text into fixed-length segments, are widely used
because of their simplicity and efectiveness in many cases. These methods can be enhanced by
introducing chunk overlap or by setting breakpoints only at sentence boundaries. Furthermore, Günther
et al presented an innovative approach called Late Chunking, where text is split after being processed by
the Transformer model but before mean pooling [16]. This technique takes advantage of Transformer
models’ long-context capabilities, and ensures that each chunk captures the broader document’s context
within its embedding.</p>
        <p>Beyond naive approaches, more advanced techniques like Semantic Chunking, Agentic Chunking [17],
and Contextual Retrieval [18] have emerged using LMs in the chunking process. Such methods may
include dynamic breakpoints, where text is divided based on semantic coherence rather than fixed
length, or augmenting chunks with summarized information from neighboring segments. The goal
of these approaches is to preserve both the contextual richness and semantic integrity of the text,
resulting in embeddings that are better aligned with the retrieval task. Yet, while these methods improve
contextual coherence, it remains unclear to which degree they are capable of handling semantically
inconsistent and noisy web documents, which we systematically simulate in our synthetic dataset.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <sec id="sec-3-1">
        <title>3.1. Synthetic dataset</title>
        <p>To evaluate the robustness of embedding models and chunking methods under real-world conditions,
we created a synthetic dataset designed to simulate “noisy” documents. This noise is modeled to
intentionally break contextual coherence, in order to study its impact on retrieval performance. For
instance, we observe that, after applying respective parsing tools, text snippets extracted from web
documents still contain disconnected content such as user comments or boilerplate sections. Web
texts hence remain inconsistent in their structure and content, combining text segments which are not
necessarily related to each other.</p>
        <p>Figure 1 illustrates the composition of the Synthetic Noise dataset, which consists of 41,371 document
samples per subcorpus. It is constructed using passages from the MS-MARCO collection [19], augmented
by adding artificial context. The original passages are concise, averaging about 72 tokens (measured
using the BERT tokenizer), and typically, these snippets contain only the information directly relevant
to their corresponding query. In addition to the passage text, the MS-MARCO dataset includes URLs
linking to the source web pages, allowing us to scrape further content related to the original piece of
information. The Synthetic Noise samples are constructed by padding the given short passages with
extraneous content from web pages linked in the MS-MARCO collection. These synthetic documents
vary along three key dimensions, resulting in 36 distinct subcorpora derived from the 41,371 original
core passages:
∙ Document length:</p>
        <p>We vary the length of the synthetic documents to measure how the ratio of relevant to
queryirrelevant information impacts retrieval performance.
∙ Positioning of the original passage:</p>
        <p>We adjust the position of the original text snippet within the document to study positional biases
in its dense vector representation.</p>
        <p>∙ Context type:</p>
        <p>We include both related and unrelated contexts to assess how semantic consistency in the input
text (or lack thereof) influences retrieval performance.</p>
        <p>Our dataset thus includes samples with concatenated contexts in two distinct levels of relevance to
the core passage. In-Domain contexts are extracted from web pages that contain information related,
though not necessarily relevant to the query associated with the original passage1. While these contexts
do not directly address the query, they often discuss similar topics. This maintains – to a certain extent
– contextual coherence within the synthetic document. In contrast, Out-of-Domain contexts are taken
from web pages randomly selected from the MS-MARCO corpus and lack any objective relevance to
the original query or passage. Since these snippets typically cover entirely unrelated subjects, they
break the contextual coherence. The lack of coherence makes it more dificult for embedding models to
accurately represent the relevant information of the original passage in the document vector.</p>
        <p>As discussed, web texts often lack consistent contextual coherence. Using our synthetic dataset, we
systematically examine the impact of document length, positional bias, and the consistency of context
relevance on embedding models. By isolating these factors, we aim to understand how coherence
deteriorates and how noisy, real-world data afects both retrieval accuracy and eficiency.
1URLs are considered as In-Domain if marked as “not selected” in the MS-MARCO collection. Selecting the original, selected
web pages however yields similar results.</p>
        <p>Embedding
model</p>
        <p>Context
type
BGE small ID
baseline: 77.69 OOD
Arctic M ID
baseline: 79.64 OOD
Stella ID
baseline: 81.95 OOD
Jina ID
baseline: 77.33 OOD
GTE Q. 1.5B ID
baseline: 77.84 OOD
GTE Q. 7B ID
baseline: 83.04 OOD
start</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Embedding models</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Robustness of Embedding Models</title>
      <p>This section studies the robustness of embedding models as retrievers when exposed to documents
containing extraneous information varying in length, positioning and relevance to the original answer
text. Our experiment tested the performance of six models (listed in Table 1) on the Synthetic Noise
dataset (described in Section 3.1). Each subcorpus in the dataset consists of 41,371 synthetic documents
as well as additional passages from the MS-MARCO collection, increasing the total corpus size to 200,000
documents. These supplementary documents are not relevant to any of the 41,371 queries, hence they
introduce further dificulty to the retrieval task without contributing relevant samples.</p>
      <p>
        To efectively measure the impact of noise, the models were evaluated on a downstream retrieval
task, closely following evaluation setups like those used in MTEB [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The primary performance metric
used was nDCG@10.
      </p>
      <p>Retrieval results: In Table 2, we observe that when evaluating on datasets containing synthetic
documents of 100 tokens, the performance does not significantly deviate from the baseline. The
baseline retrieval performance is provided by the core passages without any dispensable information.
Interestingly, for four out of six models, performance with 100-token documents slightly improves over
the baseline. This observation regarding embedding models aligns with the findings of Cuconasu et
al [10], who concluded that the influence of noise in the input context of generative language models is
not yet fully understood and may, under certain conditions, even improve performance.</p>
      <p>As the length of the synthetic documents increases, the ratio of relevant information to manually
added context becomes smaller. For instance, in a 4,000-token document, the original relevant passage
accounts for only around 2% of the total text, meaning that around 98% of the input consists of potentially
irrelevant information. Such an overwhelming presence of noise can significantly distract the embedding
models from the query-relevant snippet, which is expected to be well represented in the final dense
vector. Consequently, this vast amount of noise inevitably leads to a drop in performance on downstream
retrieval tasks.</p>
      <p>Consequently, for most of the models nDCG@10 declines as documents become longer. While the
performance drop is relatively moderate when the noise is thematically related (In-Domain) to the core
passage, it becomes much more significant when the context is unrelated. OOD noise tends to disrupt
the semantic integrity of the document, making it harder for the models to retain and represent the key
information accurately. This suggests that the relevance of the context is a critical factor in determining
the models’ ability to preserve the semantics of the query-relevant information amidst noise.
Visualizations: Furthermore, we conduct an in-depth comparison between the Arctic M and GTE
Qwen2 1.5B models, illustrated in Figures 2 and 3. Arctic M, with a 512-token limit, and GTE Qwen2 1.5B,
optimized for long-context processing, exemplify their respective groups in Table 1 and demonstrate
good overall performance. Figure 2 depicts the nDCG@10 scores for both models, visualizing the
numeric retrieval metrics of Table 2. Beyond that, Figure 3 visualizes the normalized diferences in
similarity values, defined as:
nDif () =  − 
std( )
where  represents the cosine similarity between the query and the noisy document embedding.
Similarly,  is the cosine similarity2 between the query and the original passage embedding, and
std( ) is the standard deviation of similarity values across the corpus. By dividing by std( ),
we apply a paired Z-score normalization. Since diferent models may interpret similarity with varying
internal scales, this normalization ensures that diferences are comparable across models, regardless of
the specific range of their cosine similarity scores. Hence, it makes the comparison across models more
consistent.</p>
      <p>The analysis of Table 2 together with Figures 2 and 3 uncover two noteworthy observations:
Positional bias: A significant positional bias is observed, though the nature of this bias varies
between models. For Arctic M, the normalized similarity diferences in Figure 3 are consistently higher
when the query-relevant information is positioned at the beginning of the document. This model
relies on absolute positional encoding, a standard approach for Transformer models. In contrast, GTE
Qwen2 1.5B [22] employs a more advanced relative positional encoding system, which allows it to
better handle longer sequences. Interestingly, for this model the smallest similarity diference is found
when the query-relevant information is positioned at the end of the document. GTE Qwen2 1.5B thus
reverses the “Dwell in the Beginning” efect [ 14], which refers to the bias towards the beginning of a
document in traditional Transformers. In both models, the magnitude of the positional bias increases
as document length grows. This observation is consistent with the higher nDCG@10 scores at the
respective positions in Figure 2, to which the positional bias applies.</p>
      <p>Context relevance: In Figure 3, the diferences in similarity remain relatively small for In-Domain
noise, but become an order of magnitude larger for Out-of-Domain noise. This is again consistent
with the observed drop in retrieval performance for OOD contexts, depicted in Figure 2. Notably, in
the presence of In-Domain noise, both models perform near the baseline, even for long documents
approaching their maximum context length. This highlights the models’ ability in managing coherent
inputs of dynamic lengths. But it also demonstrates the major constraint of this robustness, namely
their inability in handling input texts combining query-relevant and unrelated information.</p>
      <p>Even though this observation is entirely intuitive, the extent to which retrieval performance declines,
is nevertheless remarkable. For example, nDCG@10 scores for Arctic M drop by at least 33 points for
500-token documents, and at least 50 points for GTE Qwen2 1.5B for documents of the same size.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Document Chunking</title>
      <p>In the previous section, we identified two critical limitations in the robustness of embedding models:
positional biases and the models’ inability in keeping good performance despite the existence of
queryirrelevant information in the input documents. In the last part of our paper, we turn our attention to
techniques that can potentially overcome these constraints. Such techniques may enhance retrieval
performance in RAG systems by creating contextually more coherent text documents during the
indexing phase. One of the most prominent approaches for optimizing index construction is the use of
efective chunking methods [23].</p>
      <p>In the following study, we evaluate how well current chunking strategies mitigate the impact of noisy
input texts. We compare full document retrieval (i.e., no chunking) against three popular chunking
techniques on several benchmark datasets using the Jina (v3) embedding model. The first four datasets
listed in Table 3 are standard datasets used in the MTEB Retrieval Evaluation. In addition, the study
includes the MS-MARCO Document Collection3 [19] and the NarrativeQA dataset from the LongEmbed
benchmark [24], both of which contain longer documents than typical academic retrieval benchmarks.
For the final dataset, we sampled 41,371 documents from the 18 OOD subcorpora of our Synthetic Noise
dataset.
2We repeated the analysis using dot product, yielding nearly identical results.
3To reduce comput. costs, we use a randomly sampled subset of 300,000 documents.
Fixed-length text splitting: In Table 3, we observe that choosing the appropriate chunking strategy
can significantly impact retrieval performance. For short-document datasets, naive chunking – splitting
the text based on a fixed token length – performs worse than full document retrieval. This is because
naive chunking often results in the loss of important contextual information across chunks, thereby
reducing retrieval accuracy. Late Chunking provides a solution in these cases by preserving contextual
information from surrounding chunks, leading to noticeably higher retrieval scores.</p>
      <p>However, for datasets with longer documents, such as MS-MARCO Document, NarrativeQA, and
Synthetic Noise, the trend reverses. These datasets contain documents that typically span beyond one
or two paragraphs, and under such conditions, naive chunking outperforms full document retrieval
as well as more advanced methods. Interestingly, it even slightly surpasses the performance of Late
Chunking. For longer documents surpassing 1000+ tokens, it is seemingly neither efective to embed
the full document nor to apply chunked mean pooling, as the contextual information captured from the
entire document dilutes the semantic vector representation.</p>
      <p>Semantic Chunking: Our study furthermore demonstrates that there is currently no chunking
strategy that serves as a one-size-fits-all solution, and while advanced methods may ofer theoretical
advantages, they are not always practical in real-world applications. For example, Semantic Chunking,
which involves dynamically setting breakpoints based on semantic similarity thresholds, is
computationally expensive and highly sensitive to hyperparameter tuning. We used, for instance, the NLTK
Sentence Splitter, applied Jina (v2, small) to score sentence similarity and tried three diferent breakpoint
thresholds4. But in our case – as well as in other scenarions where fine-tuning is dificult or impractical
–, Semantic Chunking often fails to come up with reasonable breakpoints between text segments, and
hence also fails to achieve the expected improvements in contextual coherence. Furthermore, due to its
high computational costs, we could not evaluate Semantic Chunking on the extensive long-document
datasets.</p>
      <p>Optimal text splitting: When analyzing the Synthetic Noise samples, we can precisely identify the
query-relevant passages that were injected into the synthetic documents, allowing us to benchmark
retrieval performance against ideal text splitting. In such a scenario, when text chunking is perfectly
applied to isolate query-relevant information, it significantly enhances retrieval performance. When
compared to existing chunking methods, the optimal approach achieves a substantial improvement,
with over 30% higher nDCG@10 scores.
4We experimented with the 50ℎ, 70ℎ and 95ℎ percentile as breakpoint thresholds for similarity values, with the median
giving the best results (listed in Table 3).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we have explored the robustness of embedding models within the framework of RAG,
focusing particularly on the challenges posed by noisy, heterogeneous web data. In line with Coelho
et al. [14], our results confirm the presence of positional biases on longer input texts. However, we
show that these biases have a stronger negative impact when combined with query-irrelevant content,
a scenario typical of web data, but less explored in prior work. This negative impact leads to a drop in
retrieval performance of at least 33 points in nDCG@10 across all embedding models.</p>
      <p>We further investigated document chunking as a potential solution to these issues by evaluating
several chunking strategies across multiple benchmarks, including long-document datasets and
subcorpora samples from our Synthetic Noise dataset. Our experiments revealed that no single strategy is
universally efective. Existing approaches either fail to produce semantically rich chunks or remain
impractical due to the need for extensive hyper-parameter tuning and high computational costs.</p>
      <p>Surprisingly, while prior works proposed advanced chunking strategies (e.g., Late and Semantic
Chunking) as solutions, our experiments reveal that these methods often fail to outperform simpler Naive
Chunking when applied to noisy data. Nevertheless, these recent ideas lay the conceptual foundation for
future exploration towards more advanced chunking and retrieval techniques. For instance, the recent
concept of Landmark Embeddings introduces a chunking-free approach that leverages large language
models’ long-context capabilities. Combining these approaches with innovative techniques, such as
Contextual Retrieval [18], may provide more robust solutions to the challenges posed by semantically
inconsistent web texts in retrieval-augmented systems.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is part of OpenWebSearch.eu, funded by the EU under GA 101070014, and part of CAROLL,
funded by the German Federal Ministry of Education and Research (BMBF) under 01|S20049.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Improve writing style,
Paraphrase and reword. After using this tool/service, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.
[6] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave, Unsupervised Dense</p>
      <p>Information Retrieval with Contrastive Learning, 2021. doi:10.48550/ARXIV.2112.09118.
[7] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense Passage
Retrieval for Open-Domain Question Answering, in: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Association for Computational
Linguistics, 2020. doi:10.18653/v1/2020.emnlp-main.550.
[8] L. Merrick, D. Xu, G. Nuti, D. Campos, Arctic-Embed: Scalable, Eficient, and Accurate Text</p>
      <p>Embedding Models, 2024. doi:10.48550/ARXIV.2405.05374.
[9] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A Heterogenous Benchmark for</p>
      <p>Zero-shot Evaluation of Information Retrieval Models, 2021. doi:10.48550/ARXIV.2104.08663.
[10] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto,
F. Silvestri, The Power of Noise: Redefining retrieval for rag systems, in: Proceedings of the 47th
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR 2024, ACM, 2024. doi:10.1145/3626772.3657834.
[11] T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao, H. Zhang, D. Yu, Dense X Retrieval: What</p>
      <p>Retrieval Granularity Should We Use?, 2023. doi:10.48550/ARXIV.2312.06648.
[12] Z. Jiang, X. Ma, W. Chen, LongRAG: Enhancing Retrieval-Augmented Generation with
Longcontext LLMs, 2024. doi:10.48550/ARXIV.2406.15319.
[13] K. Luo, Z. Liu, S. Xiao, K. Liu, BGE Landmark Embedding: A Chunking-Free Embedding Method
For Retrieval Augmented Long-Context Large Language Models, 2024. doi:10.48550/ARXIV.
2402.11573.
[14] J. Coelho, B. Martins, J. Magalhães, J. Callan, C. Xiong, Dwell in the Beginning: How language
models embed long documents for dense retrieval, 2024. doi:10.48550/ARXIV.2404.04163.
[15] B. Smith, A. Troynikov, Evaluating Chunking Strategies for Retrieval, Technical Report, Chroma,
2024. URL: https://research.trychroma.com/evaluating-chunking.
[16] M. Günther, I. Mohr, B. Wang, H. Xiao, Late Chunking: Contextual Chunk Embeddings Using</p>
      <p>Long-Context Embedding Models, 2024. doi:10.48550/ARXIV.2409.04701.
[17] G. Kamradt, 5 Levels of Text Splitting, https://github.com/FullStackRetrieval-com/</p>
      <p>RetrievalTutorials, 2024. Accessed: 2024-01-10.
[18] Anthropic AI, Introducing contextual retrieval, https://www.anthropic.com/news/
\contextual-retrieval, 2024. Accessed: 2024-01-10.
[19] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra,
T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A Human Generated
MAchine Reading COmprehension Dataset, 2016. doi:10.48550/ARXIV.1611.09268.
[20] S. Xiao, Z. Liu, P. Zhang, N. Muennighof, D. Lian, J.-Y. Nie, C-Pack: Packed Resources For General</p>
      <p>Chinese Embeddings, 2023. doi:10.48550/ARXIV.2309.07597.
[21] S. Sturua, I. Mohr, M. K. Akram, M. Günther, B. Wang, M. Krimmel, F. Wang, G. Mastrapas,
A. Koukounas, N. Wang, H. Xiao, jina-embeddings-v3: Multilingual Embeddings With Task LoRA,
2024. doi:10.48550/ARXIV.2409.10173.
[22] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, M. Zhang, Towards General Text Embeddings with</p>
      <p>Multi-stage Contrastive Learning, 2023. doi:10.48550/ARXIV.2308.03281.
[23] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-Augmented</p>
      <p>Generation for Large Language Models: A Survey, 2023. doi:10.48550/ARXIV.2312.10997.
[24] D. Zhu, L. Wang, N. Yang, Y. Song, W. Wu, F. Wei, S. Li, LongEmbed: Extending Embedding Models
for Long Context Retrieval, 2024. doi:10.48550/ARXIV.2404.12096.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Evaluation of Retrieval-Augmented Generation:</article-title>
          A Survey,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.2405.07437.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Magne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          , MTEB: Massive Text Embedding Benchmark,
          <source>in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          . eacl-main.
          <volume>148</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>An Empirical Comparison of Web Content Extraction Algorithms</article-title>
          ,
          <source>in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .1145/3539618.3591920.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          , Sentence-BERT:
          <article-title>Sentence Embeddings using Siamese BERT-</article-title>
          <string-name>
            <surname>Networks</surname>
          </string-name>
          ,
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1908</year>
          .
          <volume>10084</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>