<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving RAG Systems via Sentence Clustering and Reordering</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marco</forename><surname>Alessio</surname></persName>
							<email>marco.alessio@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies (ISTI)</orgName>
								<orgName type="institution">National Research Council of Italy (CNR)</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Guglielmo</forename><surname>Faggioli</surname></persName>
							<email>guglielmo.faggioli@unipd.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Engineering (DEI)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
							<email>nicola.ferro@unipd.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Engineering (DEI)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Franco</forename><forename type="middle">Maria</forename><surname>Nardini</surname></persName>
							<email>francomaria.nardini@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies (ISTI)</orgName>
								<orgName type="institution">National Research Council of Italy (CNR)</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raffaele</forename><surname>Perego</surname></persName>
							<email>raffaele.perego@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies (ISTI)</orgName>
								<orgName type="institution">National Research Council of Italy (CNR)</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improving RAG Systems via Sentence Clustering and Reordering</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9174BEB69D6338958B5B2BF9BCECBCB4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Retrieval Augmented Generation</term>
					<term>Conversational Search</term>
					<term>Positional Bias</term>
					<term>Arrangement Strategy</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have gained noteworthy importance and attention across different domains and fields in recent years. Information Retrieval (IR) is one of the domains they have impacted the most, as witnessed by the recent increase in the number of IR systems incorporating generative models. Specifically, Retrieval Augmented Generation (RAG) is the emerging paradigm that integrates existing knowledge from large-scale document corpora into the generation process, enabling the model to generate more coherent, contextually relevant, and accurate text across various tasks, including summarization, question answering, and dialogue systems. Recent studies have highlighted the significant positional dependence exhibited by RAG systems: the placement of information within the LLM input prompt drastically affects the generated output. We ground our study on this property by investigating alternative strategies for ordering sentences within the LLM prompt to improve the average quality of the responses generated in dialogues between users and the conversational system. We propose the architecture of an end-to-end RAG-based conversational assistant and empirically evaluate our strategies using the TREC CAsT 2022 collection. Our experiments highlight significant differences between distinct arrangement strategies. By employing an evaluation methodology based on RankVicuna, we show that our best approach achieves improvements of up to 54% in overall response quality over baseline methods.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Retrieval Augmented Generation (RAG) is an emerging paradigm in the field of Artificial Intelligence (AI) to enhance the accuracy and reliability of generative models by exploiting external data sources. In recent years, RAG has gained noteworthy importance and attention across different domains and fields <ref type="bibr" target="#b0">[1]</ref> as it combines the strengths of Information Retrieval (IR) systems and generative models, allowing each to overcome the other's limitations.</p><p>RAG can improve the output of a generative model in several ways. First, it allows the generation process to be grounded on information from trusted knowledge sources incorporated in the provided prompt, thus avoiding or at least mitigating the well-known Large Language Model (LLM) hallucination problem, i.e., when the model generates content that is not factually true or does not concern the prompted text <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. Second, RAG allows for continuous knowledge updates and the integration of domain-specific information: the LLM can successfully respond to facts and topics not covered in its training data; moreover, it is easily adapted to different scenarios and contexts, without retraining or fine-tuning the entire model using datasets that might be unavailable or limited in scope or size. Finally, grounding the generation process on external knowledge incorporated in the input permits linking the output to verifiable external documents, thus enhancing trustworthiness and transparency <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>Current RAG systems, however, suffer from some drawbacks highlighted in the literature. One of these issues originates from the notable positional sensitivity shown by LLMs. 
The placement of information within the input prompt significantly impacts the resulting output. Previous research <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref> has highlighted biases towards "primacy" and "recency", suggesting that generative models tend to prioritize information placed at the beginning or end of the input while neglecting the central portion.</p><p>In this paper, we advance over previous studies by investigating the positional bias in the context of RAG-based conversational systems. Specifically, we propose a novel strategy for arranging sentences within the input prompt of the LLM to improve the average quality of the generated responses over simpler methods. Our approach is based on the intuition that, just as coherent, fluent, and well-structured text is critical for successful communication between human beings, the same should also apply to LLMs: among all the possible arrangements of the input, those placing sentences with similar meaning closer together in the LLM prompt should generate, on average, better quality output. Therefore, we propose an end-to-end RAG architecture to test our hypothesis. The components of this architecture allow us to precisely identify which sentences are likely useful for answering user queries. To this end, we cluster sentences by their similarity and define alternative strategies for ordering them both inter- and intra-cluster. In this way, we can study the effect of these alternative prompt arrangements on the generated response. To our knowledge, this is the first work that explicitly considers this aspect and allows fine-tuning, in a principled way, the ordering of the input sentences provided to the generative component of a RAG system. We compare our proposed approach against competitive baselines that represent the solutions employed by current RAG systems. 
We experimentally evaluate the performance of our proposed approach using the TREC Conversational Assistance Track (CAsT) 2022 collection <ref type="bibr" target="#b7">[8]</ref>, which allows us to compare the results that different arrangement strategies can achieve in a widely accepted Conversational Search (CS) scenario. Results highlight remarkable differences among the tested sentence placement strategies, with improvements up to 8.66% w.r.t. the best baseline and 54.94% w.r.t. random ordering.</p><p>The remainder of this work is organized as follows: Section 2 surveys the current state-of-the-art about RAG systems and quality evaluation for their responses. Section 3 details the architecture of our RAG system. Section 4 and Section 5 detail the results of an experimental analysis, which aims to highlight how the ordering of clusters and sentences affects the quality of the generated response. Finally, Section 6 draws some conclusions and outlines future directions and extensions of our research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In the following, we survey the main works dealing with LLM positional dependencies and the difficulties of RAG systems in conciliating internal and external knowledge. Then, we analyze the challenges related to evaluating the quality of RAG responses and to the use of an "LLM-as-a-judge".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Retrieval Augmented Generation</head><p>RAG enhances LLMs by retrieving additional information from an external knowledge source, enabling them to successfully answer queries beyond the scope of the training data. At the same time, RAG mitigates the hallucination problem, i.e., generating factually incorrect text, by referencing the provided external knowledge.</p><p>The RAG paradigm is organized into two main stages: retrieval and generation. Upon receiving a query from the user, the relevant information is retrieved from an external knowledge source. This task is undertaken by a standard IR pipeline that outputs a ranked list of documents. Afterwards, in the generation phase, the LLM synthesizes the response to answer the user query using the information carried by the selected documents.</p><p>Despite its clear advantages, RAG has drawbacks and limitations, which spark several challenges. First, RAG systems employ the external knowledge as their main source of information, disregarding the internal knowledge memorized within the LLM <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>. This, in turn, may cause a decrease in the quality of the generated output when the provided content is not high-quality <ref type="bibr" target="#b9">[10]</ref>. It is not uncommon for RAG to obtain worse outputs w.r.t. what the LLM can achieve in the closed-book scenario, i.e., without supplying retrieved results <ref type="bibr" target="#b9">[10]</ref>. Along this line, it has been observed that the LLM produces better results without injecting external knowledge when the topic popularity is very high <ref type="bibr" target="#b8">[9]</ref>. In general, state-of-the-art LLMs provide good-quality responses for a wide range of questions but require assistance from an IR system when the internal knowledge of the model lacks information about the current topic. 
This phenomenon is likely to occur if the topic is not very popular, requires exceptional expertise, or when scaling the number of parameters of the generative model produces little to no effect <ref type="bibr" target="#b8">[9]</ref>. Another challenge lies in the significant positional dependence <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref> exhibited by LLMs, whereby the placement of information within the input prompt drastically affects the generated output. Prior research <ref type="bibr" target="#b4">[5]</ref> has identified "primacy" and "recency" biases, indicating the tendency of generative models to focus on information positioned either at the beginning or the end of the input while disregarding the central part. Therefore, performance degrades significantly when LLMs must rely on information in the middle of their input context, showing a characteristic U-shaped performance curve <ref type="bibr" target="#b4">[5]</ref>. This, in turn, means that most state-of-the-art generative models do not effectively exploit their longer contexts compared to their smaller and earlier counterparts. These phenomena can be observed in both open-source models, e.g., Llama <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> by Meta, and closed-source ones, e.g., GPT-4 <ref type="bibr" target="#b12">[13]</ref> by OpenAI. It is not advisable to directly input all the retrieved information to the LLM for generating the response. Redundant information and very long contextual data can interfere with the generation quality, leading to repetitive, disjointed, or incoherent outputs <ref type="bibr" target="#b0">[1]</ref>. Therefore, the retrieved content is typically further processed before being given as input to the LLM <ref type="bibr" target="#b13">[14]</ref>. 
A recent work in this direction systematically examines the retrieval strategy of RAG systems <ref type="bibr" target="#b14">[15]</ref>. The authors consider multiple retrieval factors affecting the generation process, such as the relevance of the passages in the prompt context, their position, and their number. One counter-intuitive finding is that the retriever's highest-scoring documents that are not directly relevant to the query, e.g., do not contain the answer, negatively impact the effectiveness of the LLM. Moreover, the authors discover that adding random documents to the prompt improves the LLM accuracy by up to 35%.</p><p>In this work, we rely on the intuition that the use of coherent, fluent, and well-structured inputs can improve RAG and we propose an end-to-end architecture for selecting and structuring the external information included in the LLM prompt for response generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Quality Evaluation</head><p>Another line of research concerns how to evaluate the overall quality of the generated output. Despite human assessment providing the most accurate and reliable measure for evaluating model performance, its high time and cost requirements severely limit its application. Therefore, there exists an ever-increasing demand for automated evaluation techniques that consistently align with human judgements while offering enhanced efficiency and cost-effectiveness.</p><p>In this paper, we focus on text-based generative models. Classical automatic evaluation metrics, such as BLEU <ref type="bibr" target="#b15">[16]</ref>, ROUGE <ref type="bibr" target="#b16">[17]</ref>, and METEOR <ref type="bibr" target="#b17">[18]</ref>, are designed to quantify the degree of similarity between a candidate text and one or more reference texts by assessing their n-gram matches. Their simplicity and explainability, along with a good correlation with human judgements, make these metrics widely used as baselines. However, these metrics exhibit several limitations <ref type="bibr" target="#b18">[19]</ref>: firstly, they cannot account for lexical diversity; secondly, they penalize variations in the semantic ordering of words; thirdly, they struggle to capture and match paraphrases effectively; lastly, they inadequately account for distant dependencies within the text. With the advent of word embeddings <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref> and neural models <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b23">24]</ref> based on Transformers <ref type="bibr" target="#b24">[25]</ref>, new learned metrics <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b25">26]</ref> have been developed. 
For example, BERTScore <ref type="bibr" target="#b18">[19]</ref> can capture the semantic similarity between the candidate and reference texts employing the contextual embeddings generated by an encoder model, such as BERT <ref type="bibr" target="#b21">[22]</ref>.</p><p>In recent years, the rapid advancement of LLMs, which show remarkable performance across many tasks, has sparked considerable interest in their potential application as annotators and evaluators. Due to their training using Reinforcement Learning from Human Feedback (RLHF), these models demonstrate significant human alignment. Much research has investigated leveraging state-of-the-art LLMs to automatically produce assessments serving as proxies for human judgments, a paradigm known as "LLM-as-a-judge". For example, Zheng et al. <ref type="bibr" target="#b23">[24]</ref> assessed the quality of conversations with various LLMs, both open and closed source, employing GPT-4 <ref type="bibr" target="#b12">[13]</ref> as judge. They experimented with various prompts and different approaches, such as single answer grading and pairwise comparisons both between responses and against a reference text. GPT-3.5 Turbo and GPT-4 <ref type="bibr" target="#b12">[13]</ref> have been employed as listwise rerankers <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref> for the TREC Deep Learning 2019 and 2020 <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref> and BEIR <ref type="bibr" target="#b28">[29]</ref> experimental collections, obtaining state-of-the-art performance <ref type="bibr" target="#b5">[6]</ref>. 
The same LLMs have also been employed as teacher models to fine-tune smaller open-source student models, such as Llama and Vicuna <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31]</ref> (e.g., RankVicuna <ref type="bibr" target="#b31">[32]</ref>).</p><p>In this work, we rely on state-of-the-art assessment methods and evaluate the quality of the responses generated by the different methods using RankVicuna <ref type="bibr" target="#b31">[32]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Proposed RAG Architecture</head><p>Generative models exhibit strong biases towards information positioned at the start or the end of the input while disregarding the middle part <ref type="bibr" target="#b4">[5]</ref>. This phenomenon motivates our research effort to determine how the order of the input sentences provided to a RAG-based conversational system affects the quality of the generated output and, in turn, the optimal ordering strategy to achieve the best response. This section describes each method and all variations considered in our experiments.</p><p>The architecture of our proposed RAG system is illustrated in Figure <ref type="figure">1</ref>. It includes an IR pipeline, which retrieves the top-𝑘 documents 𝐷 = {𝑑1, 𝑑2, ..., 𝑑𝑘} in response to each user utterance 𝑞. The retrieved documents are then processed by additional components responsible for splitting them into sentences, identifying the most relevant sentences, clustering such sentences based on their semantic similarity, and ordering them according to the various strategies analyzed. Finally, the selected and re-ordered sentences are provided as input to the LLM for response generation. These components are the focus of our research. Their functionalities are detailed in the remainder of this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Document Pre-processing and Splitting</head><p>As observed in the literature <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34]</ref>, the entire text of a relevant document rarely contains meaningful knowledge to satisfy the user information need expressed by a query 𝑞. In most cases, only one or a few portions of the document are relevant to the query, while the remaining parts contain irrelevant information. The proposed architecture aims to precisely identify the key information in the retrieved documents, i.e., the sentences, to reduce the noise in the prompt used for response generation.</p><p>Hereinafter, we consider sentences in the documents as the atomic units of information. Our pipeline, illustrated in Figure <ref type="figure">1</ref>, works as follows. First, for each query 𝑞 we consider only the top-𝑘 documents {𝑑1, 𝑑2, ..., 𝑑𝑘} retrieved by the IR system. Then, a state-of-the-art co-reference resolution model is applied to all documents to replace pronouns and other generic terms within a sentence with the fully specified entity mentioned in a previous sentence. This allows us to remove the contextual dependencies among sentences in a document so they can be considered self-explanatory. The third step splits each document 𝑑𝑖 into a sequence of sentences {𝑠𝑖,1, 𝑠𝑖,2, ..., 𝑠𝑖,𝑛𝑖}. Afterwards, near-duplicate removal is applied to the sentences originating from all documents, discarding sentences with a Jaccard similarity ≥ 0.9 between their Bag-of-Words (BoW) representations<ref type="foot" target="#foot_0">1</ref>.</p></div>
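The near-duplicate removal step can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenization for the Bag-of-Words sets is a simplifying assumption, and the function names are ours; only the 0.9 Jaccard threshold comes from the text.

```python
# Sketch of near-duplicate removal: a sentence is discarded when its
# Bag-of-Words Jaccard similarity with an already kept sentence reaches
# the 0.9 threshold reported in the paper. Tokenization is simplified
# to lowercased whitespace splitting (an assumption of this sketch).

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two Bag-of-Words sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(sentences: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each group of near-duplicate sentences."""
    kept, kept_bows = [], []
    for s in sentences:
        bow = set(s.lower().split())
        if all(jaccard(bow, k) < threshold for k in kept_bows):
            kept.append(s)
            kept_bows.append(bow)
    return kept
```

Since comparison is pairwise against all kept sentences, the cost is quadratic in the number of sentences; for the top-20 documents per query considered here, this remains negligible.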
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Sentence Selection</head><p>After the first pre-processing phase, we obtain a sentence candidate set for each query to be included in the LLM prompt of our RAG system (see Figure <ref type="figure">1</ref>). Since the cardinality of this set can be large and not all the sentences are useful for answering the query, we employ the BERT-based cross-encoder answer-in-the-sentence classifier 2 developed by Lajewska and Balog <ref type="bibr" target="#b34">[35]</ref> to rank the candidate sentences according to their predicted usefulness to (at least partially) answer the query, and we retain the top-𝑛 ranked sentences, discarding the remaining ones. As a possible limitation, please note that the model by Lajewska and Balog <ref type="bibr" target="#b34">[35]</ref> that we employ has been trained on the queries and passages used in our experiments. Therefore, it is very likely that the model performs significantly better on our data w.r.t. any other model, ensuring that top-ranked sentences are indeed relevant to the query. Even though such a model is not available in a real practical scenario, this choice is justified by our research effort being focused exclusively on comparing ordering strategies for sentences in the LLM input rather than on the absolute results achievable by our RAG system.</p></div>
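The selection step reduces to scoring each candidate with a query-sentence relevance model and keeping the top-𝑛. The sketch below abstracts the cross-encoder behind a pluggable `score_fn` argument (our naming, not the paper's); in the actual pipeline, `score_fn` would be the answer-in-the-sentence classifier of Lajewska and Balog.

```python
# Hedged sketch of sentence selection: a scoring model assigns each
# candidate sentence a usefulness score for the query; the top-n
# sentences are retained. `score_fn` is a stand-in for the real
# BERT-based cross-encoder used in the paper.

from typing import Callable

def select_top_sentences(query: str,
                         candidates: list[str],
                         score_fn: Callable[[str, str], float],
                         n: int = 10) -> list[str]:
    """Rank candidates by score_fn(query, sentence) and keep the top n."""
    ranked = sorted(candidates, key=lambda s: score_fn(query, s), reverse=True)
    return ranked[:n]
```

Any query-sentence scorer with this signature can be swapped in, e.g. a lexical-overlap baseline for debugging or a fine-tuned cross-encoder in production.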
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Sentence Clustering and Ordering</head><p>The previous steps of the pipeline constrain the number of sentences per query while increasing their expected utility in answering the query. Furthermore, they allow us to control other noise sources, such as the number or the variable length of the retrieved documents. Therefore, we can assess how the positional bias affects the generation process. We highlight again that the positional bias of LLMs has already been observed in prior research <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. However, it has been considered exclusively as a limitation of LLMs and RAG systems. Our research moves a step forward by investigating the best ordering strategy to maximize, on average, the quality of the generated responses over a testing query set 𝑄. We believe that logically organized text, where sentences with akin meanings are positioned closer in the LLM prompt, should, on average, yield superior output quality. Consequently, our sentence ordering strategies exploit the similarities among the sentences selected by the sentence selection step. To measure semantic inter-sentence similarity, we resort to the contextualized embeddings generated with the tct-colbert model 3 <ref type="bibr" target="#b35">[36]</ref>. We generate the representation of the 𝑛 selected sentences for each query and measure their pair-wise cosine similarity. Then, we progressively aggregate the most similar sentences by employing a hierarchical clustering algorithm. The maximum value of the Silhouette statistic is used as the criterion to determine the optimal clustering among all candidates. As a result, for each query 𝑞 ∈ 𝑄, the top-𝑛 sentences are grouped in a variable number 𝑁𝑐 ≥ 1 of clusters, each composed of one or more sentences with similar semantic meaning. 
To devise different strategies for ordering input sentences, we leverage the above clustering, which allows us to study the impact of sentence placement variations occurring both inter- and intra-cluster.</p><p>More formally, given a query, the set 𝑆 of the 𝑛 previously selected sentences, and the prompt 𝑝, we aim to find the ordering 𝑜𝑟𝑑 * of 𝑆 such that:</p><formula xml:id="formula_0">𝑜𝑟𝑑* = argmax_𝑜𝑟𝑑 ∑_{𝑞∈𝑄} 𝑠(𝑞, 𝐿𝐿𝑀(𝑝, 𝑞, 𝑜𝑟𝑑(𝑆))),</formula><p>where 𝑜𝑟𝑑(𝑆) is a sentence ordering strategy that returns an ordering of the sentences in 𝑆, 𝐿𝐿𝑀(𝑝, 𝑞, 𝑜𝑟𝑑(𝑆)) is the response generated by the LLM for prompt 𝑝, query 𝑞, and sentence ordering 𝑜𝑟𝑑(𝑆), and, finally, 𝑠(𝑞, 𝑟) is a scoring function evaluating the perceived quality of the generated response 𝑟 = 𝐿𝐿𝑀(𝑝, 𝑞, 𝑜𝑟𝑑(𝑆)) for query 𝑞. The order of clusters and the order of the sentences within the same cluster uniquely determine the possible global orderings of the 𝑛 sentences we consider for input to the LLM. Our experimental assessment will evaluate six different ordering strategies for placing the clusters of sentences in the input, and four different methods for ordering sentences within the same cluster. Cluster placements consider different aspects, such as the clusters' cardinality and similarity to the query. The orderings tested include a random one and those obtained by sorting on decreasing/increasing values of each aspect. Finally, the U-shaped order suggested in <ref type="bibr" target="#b4">[5]</ref> is also tested. Regarding the ordering within clusters, we consider random order, order by reranker score, visiting order, and the clustering aggregation order.</p></div>
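The clustering step described above can be sketched with off-the-shelf tools: agglomerative clustering over cosine distances, cutting the dendrogram at the number of clusters that maximizes the Silhouette statistic. This is an illustrative reconstruction under stated assumptions (average linkage, SciPy/scikit-learn implementations, random stand-in embeddings), not the authors' exact configuration.

```python
# Sketch of sentence clustering: hierarchical (agglomerative) clustering
# on cosine distances between sentence embeddings; the cut of the
# dendrogram is chosen by maximising the Silhouette score. The choice
# of 'average' linkage is an assumption of this sketch.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def cluster_sentences(embeddings: np.ndarray) -> np.ndarray:
    """Return one cluster label per sentence, chosen by best Silhouette."""
    Z = linkage(embeddings, method="average", metric="cosine")
    best_labels, best_score = None, -1.0
    n = embeddings.shape[0]
    for k in range(2, n):  # Silhouette requires 2 <= #clusters <= n-1
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

In the paper's pipeline, `embeddings` would be the tct-colbert representations of the top-𝑛 selected sentences; the loop over 𝑘 makes the number of clusters 𝑁𝑐 query-dependent, as in the text.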
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Evaluation</head><p>We can now formulate the research questions we aim to answer with our experimental framework.</p><p>Research Questions. Given the sentence selection and clustering steps discussed above, the two main aspects to consider for defining our ordering strategies 𝑜𝑟𝑑(•) are the order of placement in the LLM prompt of the clusters and of the sentences within the same cluster. They uniquely determine the global ordering 𝑜𝑟𝑑(•) of the top-𝑛 sentences given as input to the LLM for response generation. Our research questions assess which is the best solution among the alternatives considered. Specifically, RQ1 What is the best cluster ordering strategy? RQ2 What is the best ordering strategy for sentences within the same cluster? RQ3 Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?</p><p>Experimental Settings. We experiment with the TREC CAsT 2022 dataset, a standard experimental collection for CS <ref type="bibr" target="#b7">[8]</ref>. This choice is due to prior research that released additional datasets, models, and human judgments for this benchmark <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b34">35]</ref>. The corpus is composed of three document collections, MS-MARCO v2 <ref type="bibr" target="#b36">[37]</ref>, KILT <ref type="bibr" target="#b37">[38]</ref>, and Washington Post v4, which are subdivided into 106𝑀 short documents. CAsT 2022 includes 18 information needs (topics) and 205 user utterances (queries), with an average of 11.39 user utterances per topic. The number of utterances for which relevance judgements are provided is 163.</p><p>For our experiments, we employ the best-performing run originally submitted to TREC CAsT 2022 4 [39] as the output of the retrieval pipeline. This allows us to focus exclusively on the following steps of our pipeline. 
In all our experiments, we consider only the top-20 retrieved documents, leaving the investigation about the implications of this choice and possible alternatives as future work. To provide meaningful results, all queries where 𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛@20 &lt; 0.2, that is, having at most 3 relevant passages in the top-20 results, are discarded <ref type="foot" target="#foot_3">5</ref> , ensuring that enough relevant information is retrieved to answer the considered queries successfully.</p></div>
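The query-filtering rule above is a simple precision cutoff. The sketch below illustrates it; the function and variable names are ours, and binary relevance judgements are assumed.

```python
# Sketch of the query-filtering step: a query is kept only if its
# top-20 retrieved documents contain at least 4 relevant passages,
# i.e. Precision@20 >= 0.2. Names are illustrative.

def keep_query(ranked_docs: list[str], relevant: set[str],
               k: int = 20, min_precision: float = 0.2) -> bool:
    """True if Precision@k over the ranked list meets the threshold."""
    top_k = ranked_docs[:k]
    precision = sum(d in relevant for d in top_k) / k
    return precision >= min_precision
```

With k = 20, the 0.2 threshold corresponds exactly to the "at most 3 relevant passages discarded" rule of the text: 3/20 = 0.15 fails, 4/20 = 0.2 passes.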
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Comparisons between the six approaches proposed for RQ1: "What is the best ordering strategy for clusters?". In the top half, each row reports three numbers, which are the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported. Furthermore, in the steps of the pipeline where the query text is needed, i.e., sentence ranking and response generation, we employed the manually rewritten text for every query. This allows us to account for the possible bias introduced by different query rewriting approaches. Future developments will investigate the relationship between query rewriting approaches and RAG solutions.</p><formula xml:id="formula_1">A</formula><p>For co-reference resolution at the document level, i.e., removing co-references across different sentences in the "document processing" step, we use the "F-Coref" model<ref type="foot" target="#foot_4">6</ref>  <ref type="bibr" target="#b39">[40]</ref> based on the "LingMess" architecture <ref type="bibr" target="#b40">[41]</ref>. After this step, we use the well-known SpaCy Python library to divide each document into a sequence of independent sentences.</p><p>In the following section, we report two different metrics for each comparison. The former is the average score of every approach when assessing all 10 random permutations using RankVicuna. The latter, instead, is a pairwise metric, assessing the number of queries for which the first approach obtains higher/the same/lower score w.r.t. the other one. This information should better highlight the differences and provide a more comprehensive view than a single average value.</p><p>Response Generation. 
For the response generation, we employ Vicuna 7B<ref type="foot" target="#foot_5">7</ref> <ref type="bibr" target="#b23">[24]</ref>, an LLM based on Llama 2 <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> fine-tuned on 125K user conversations with ChatGPT, gathered from the ShareGPT.com website using public APIs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Quality Evaluation.</head><p>To evaluate the quality of the generated responses, we employ RankVicuna <ref type="bibr" target="#b31">[32]</ref> to perform listwise ranking of all responses being compared. To mitigate the positional bias intrinsic to RankVicuna, we assess 10 different random permutations of the same responses and average the results: a reasonable trade-off between evaluation accuracy and the required computational runtime. For each assessment, we assign (𝑁 + 1 − 𝑖)/𝑁 points to the 𝑖-th ranked response, where 1 ≤ 𝑖 ≤ 𝑁 and 𝑁 is the number of responses being compared. Furthermore, we also count the number of wins and ties between the pairs of responses considered. Whenever a valid judgment cannot be determined from the LLM output, the entire comparison is discarded from the evaluation.</p></div>
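The point-assignment and permutation-averaging scheme above can be sketched in a few lines. This is a minimal illustration of the scoring arithmetic only, not the RankVicuna evaluation harness itself:

```python
# Sketch of the evaluation scoring: the i-th ranked response (1-based)
# receives (N + 1 - i) / N points; scores are then averaged over the
# rankings obtained from several random permutations of the same responses.

def assessment_points(ranking):
    """Points for one assessment: first place gets 1.0, last gets 1/N."""
    n = len(ranking)
    return {resp: (n + 1 - i) / n for i, resp in enumerate(ranking, start=1)}

def average_scores(rankings):
    """Average each response's points over all permutation assessments."""
    totals = {}
    for ranking in rankings:
        for resp, pts in assessment_points(ranking).items():
            totals[resp] = totals.get(resp, 0.0) + pts
    return {resp: total / len(rankings) for resp, total in totals.items()}
```

For example, a response ranked first in both of two assessments of three responses scores 1.0, while two responses that swap second and third place both average 0.5.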
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">RQ1: Order of Clusters</head><p>For the first experiment, we evaluate the effects of different orderings of the clusters while keeping the order of sentences within the same cluster (based on the clustering aggregation</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Comparisons between the four approaches proposed for RQ2: "What is the best ordering strategy for sentences within the same cluster?". In the top half, each row reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.</p><p>order) fixed. We test six different strategies for ordering clusters: clusters selected in random order (strategy A); clusters selected in descending order of cardinality (strategy B); clusters selected in ascending order of similarity with the query<ref type="foot" target="#foot_6">8</ref> (strategy C); clusters selected in descending order of similarity with the query (strategy D); clusters selected in descending order of similarity with the query, using a ping-pong layout from top to bottom (strategy E)<ref type="foot" target="#foot_7">9</ref>; clusters selected in descending order of similarity with the query, using a ping-pong layout from bottom to top (strategy F)<ref type="foot" target="#foot_8">10</ref>.</p><p>As shown in Table <ref type="table">1</ref>, sorting the clusters in descending order of similarity with the query (strategy D) is the clear winner in this comparison, in terms of both score and pairwise wins. This approach performs 18.77%, 15.24%, 20.16%, 23.51%, and 14.81% better than the other options. These figures suggest that the LLM used to generate the responses exhibits a much stronger "primacy" than "recency" bias, as highlighted by option C being overall the worst performing among those considered. Methods E and F, instead, were designed to place the least important clusters towards the center, since LLMs struggle to effectively utilize information in the middle of their prompt. 
However, we can see that both approaches are ineffective: we suspect this is due to the length of the input text being much smaller than the maximum context window of the model. Different results may be observed when varying the amount of input data provided to the LLM for generation.</p></div>
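As an illustrative sketch of the ordering strategies discussed above, strategies D, E, and F can be implemented as follows. The cluster-to-query similarity function `sim` is an assumed callable (e.g., the maximum cosine similarity between the query and any sentence in the cluster, as in footnote 8):

```python
# Sketch of three cluster-ordering strategies. `sim(cluster)` is an assumed
# callable returning the cluster's similarity with the query.

def order_descending(clusters, sim):
    # Strategy D: most similar cluster first.
    return sorted(clusters, key=sim, reverse=True)

def ping_pong_top_to_bottom(clusters, sim):
    # Strategy E: clusters placed first, last, second, second-to-last, ...
    ranked = order_descending(clusters, sim)
    out = [None] * len(ranked)
    front, back = 0, len(ranked) - 1
    for i, cluster in enumerate(ranked):
        if i % 2 == 0:
            out[front] = cluster
            front += 1
        else:
            out[back] = cluster
            back -= 1
    return out

def ping_pong_bottom_to_top(clusters, sim):
    # Strategy F: clusters placed last, first, second-to-last, second, ...
    ranked = order_descending(clusters, sim)
    out = [None] * len(ranked)
    front, back = 0, len(ranked) - 1
    for i, cluster in enumerate(ranked):
        if i % 2 == 0:
            out[back] = cluster
            back -= 1
        else:
            out[front] = cluster
            front += 1
    return out
```

With clusters already ranked as [A, B, C, D, E], strategy E yields [A, C, E, D, B] and strategy F yields [B, D, E, C, A], matching the ping-pong layouts described in the footnotes.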
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">RQ2: Order of Sentences within the same Cluster</head><p>In this second experiment, we evaluate different sorting schemes for sentences within the same cluster, keeping the order of the clusters fixed at the best strategy determined in RQ1. We test four different strategies for ordering sentences within the same cluster: sentences selected in random order (strategy A); sentences selected in descending order of reranker score (strategy B); sentences selected in visiting order<ref type="foot" target="#foot_9">11</ref> (strategy C); sentences selected in aggregation order (strategy D).</p><p>As shown in Table <ref type="table">2</ref>, the best results are achieved by two</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Comparisons between the five approaches considered for RQ3: "Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?". In the top half, each row reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.</p><p>We note, however, that the differences in performance among the various strategies are not large, as the sentences are grouped into clusters by their similarity. The LLM response appears to be more impacted by the order of the clusters than by the order of sentences within each cluster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">RQ3: Comparison with Baselines</head><p>Our last experiment investigates whether our proposed approach is beneficial in enhancing the overall effectiveness of the RAG system w.r.t. four simpler baseline methods that may be used in practice by current state-of-the-art RAG systems. We test five different strategies: i) the top-5 retrieved documents; ii) the top-40 sentences taken in random order (B); iii) the top-40 sentences taken in descending order of re-ranker score (C); iv) the top-40 sentences selected in visiting order (D); v) the best clusterization-based approach determined from RQ1 and RQ2 (CL).</p><p>The results obtained are shown in Table <ref type="table">3</ref>. The clusterization-based approach demonstrates superior performance, emerging as the best strategy in this comparison. The four baselines yield notably lower results, by 15.14%, 54.94%, 8.66%, and 15.67%, respectively. Among the methods considered in this work, randomly sorting the top-ℎ sentences is by far the worst-performing approach. This, in turn, supports our starting intuition that coherent, fluent, and well-structured text is a critical factor for LLMs to generate high-quality output.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Additional Experiments</head><p>The clusterization-based ordering strategy proposed in this work is designed to position sentences sharing analogous semantic content close together in the LLM prompt. The results obtained in Section 4.3 show its effectiveness in our experimental settings. Nevertheless, in this section we answer two additional research questions to gain further insights. Specifically: RQ4: Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparisons between the seven approaches proposed for RQ4: "Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?". In the top half, each row reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.</p><p>Experimental Settings. We heuristically determine the two orderings 𝑜𝑟𝑑⁺ and 𝑜𝑟𝑑⁻ that maximize and minimize, respectively, the overall similarity between subsequent sentences. Let 𝑠𝑢𝑚⁺ and 𝑠𝑢𝑚⁻ be the sums of the similarities between subsequent sentences for 𝑜𝑟𝑑⁺ and 𝑜𝑟𝑑⁻, respectively. The similarity 𝑠𝑖𝑚(𝑝) of a sentence permutation 𝑝 is given by the following equation, where min-max normalization is used and 𝑠_𝑖 is the embedding representation of the 𝑖-th sentence:</p><formula xml:id="formula_3">𝑠𝑖𝑚(𝑝) = ( ∑_{𝑖=2}^{ℎ} cos(𝑠_{𝑖−1}, 𝑠_𝑖) − 𝑠𝑢𝑚⁻ ) / ( 𝑠𝑢𝑚⁺ − 𝑠𝑢𝑚⁻ )</formula><p>In our experiments, for each query we generate one million random permutations, and then determine the permutation whose similarity is closest to each of the following thresholds: 0.125, 0.250, 0.375, 0.500, and 0.625. We decided to stop at 0.625 because higher values are unlikely to be observed, given that the average similarity of these permutations is 0.3433 with a standard deviation of 0.0530.</p><p>Results. We determine how the quality of the generated response is influenced when varying the similarity between subsequent sentences across the predefined thresholds, as shown in Table <ref type="table">4</ref>. It is interesting to note that the highest results are obtained by permutations with 0.625 normalised similarity, rather than 1.000, which corresponds to the ordering maximising the similarity between subsequent sentences (𝑜𝑟𝑑⁺). This method achieves 4.47% and 26.67% more pairwise wins w.r.t. 𝑜𝑟𝑑⁺ and 𝑜𝑟𝑑⁻, respectively. 
To answer RQ5, we assess the responses generated using the best clustering strategy against the approach defined above. The average scores are 0.7652 and 0.7348, while the pairwise wins and ties are 38 - 46 - 31, respectively.</p><p>From these experiments, we can conclude that a positive correlation exists between the similarity of subsequent sentences and response quality, while also showing that sentence similarity may not be the only factor to consider. Moreover, subdividing and explicitly grouping sentences by subtopic is beneficial compared with considering sentence similarity only in a pairwise fashion, which lacks a global view of the retrieved knowledge.</p></div>
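The min-max-normalized adjacency similarity defined for RQ4 can be sketched as follows. This is an illustrative pure-Python version: the sentence embeddings and the heuristic search for the extreme orderings (whose sums are passed in as `sum_minus` and `sum_plus`) are assumptions outside this sketch.

```python
# Sketch of the min-max-normalized adjacency similarity sim(p).
# `emb` maps each sentence id to its embedding vector; `sum_minus` and
# `sum_plus` are the adjacency-similarity sums of ord- and ord+.

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def adjacency_similarity(perm, emb):
    """Sum of cosine similarities between subsequent sentences in perm."""
    return sum(cosine(emb[perm[i - 1]], emb[perm[i]]) for i in range(1, len(perm)))

def normalized_sim(perm, emb, sum_minus, sum_plus):
    """sim(p) = (adjacency sum - sum-) / (sum+ - sum-)."""
    return (adjacency_similarity(perm, emb) - sum_minus) / (sum_plus - sum_minus)
```

A permutation that places semantically identical sentences adjacently attains normalized similarity 1.0, while one that maximally separates them attains 0.0.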
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>In this work, we presented a novel pipelined RAG architecture aimed at selecting a set of relevant sentences for each query and arranging them in a specific order to optimize the quality of the responses generated by an LLM. For this purpose, sentences are first extracted from the top retrieved documents. Then, they are reranked, and the most relevant sentences are organized into clusters by similarity. We proposed different strategies for ordering clusters, and the sentences within clusters, in the input given to the LLM for response generation. To the best of our knowledge, this is the first work investigating sentence clustering and re-ordering to improve the quality of the responses generated by RAG systems. Our empirical assessment is based on a well-known public framework for conversational search. The results of the experiments show that different sequences of sentences in the LLM prompt significantly impact response quality, despite all methodologies processing identical information from the same set of sentences. Random permutations yield the lowest results, whereas our proposed approach based on sentence clusterization yields superior results. Additionally, we examined whether maximizing the similarity between consecutive sentences in the LLM prompt enhances response quality. While a positive correlation between these factors was observed, it is not the exclusive determinant. Consequently, while we infer that sentence similarity constitutes a pivotal aspect, other contributing factors remain unidentified, warranting further investigation. Moreover, although our experimental evaluation employs a well-known conversational collection, the methodology and results shown in this work are general. 
They could also be applied to other scenarios, such as ad-hoc search.</p><p>In future work, we intend to evaluate the impact of the number of clusters selected by our method on the generated response. Our intuition is that the number of clusters identified for a given query is a proxy for the difficulty of the query itself. Fewer clusters, or even a single large one, should characterize simple, closed queries. In contrast, difficult, multi-faceted queries are possibly characterized by more clusters, each addressing a different facet of the query. This intuition paves the way for extending the evaluation methodology by adopting diversification-based metrics <ref type="bibr" target="#b41">[42]</ref>, allowing us to understand how well the generated answers cover the query facets and the topical distribution of the clusters.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>3 https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">This step is particularly important in our setting because the CAsT 2022 corpus contains a multitude of near-duplicate documents. In particular, the same Wikipedia article is often replicated in documents retrieved from the KILT and MS-MARCO collections.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The model named "squad_snippets_unanswerable" is available at https://iai.group/downloads/emnlp2023-answerability_prediction.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">The run is identified as "udinfo_mi_b2021" from the "udel_fang" group, University of Delaware (USA).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">The number of queries considered in these experiments is 115 out of 163 evaluated in the official relevance judgments.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://huggingface.co/biu-nlp/f-coref</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://huggingface.co/lmsys/vicuna-7b-v1.5</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">The similarity between a cluster 𝐶 and the query is defined as the maximum cosine similarity between the query 𝑞 ∈ 𝑄 with any sentence 𝑠𝑖,𝑗 ∈ 𝐶 belonging to the cluster.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">The clusters are placed first, last, second, second-to-last, third, and so on, e.g., [A, B, C, D, E] becomes [A, C, E, D, B].</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">The clusters are placed last, first, second-to-last, second, third-to-last, and so on, e.g., [A, B, C, D, E] becomes [B, D, E, C, A].</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">The sentences are sorted based on the order in which they appear when sequentially scanning through the set of top-𝑘 retrieved documents.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<title level="m">Retrieval-augmented generation for large language models: A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2311.05232</idno>
		<idno type="arXiv">arXiv:2311.05232</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Siren&apos;s song in the AI ocean: A survey on hallucination in large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Luu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2309.01219</idno>
		<idno type="arXiv">arXiv:2309.01219</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2309.01219" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
		<ptr target="https://doi.org/10.1145/3571730" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Lost in the middle: How language models use long contexts</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paranjape</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bevilacqua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2307.03172</idno>
		<idno type="arXiv">arXiv:2307.03172</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Is chatgpt good at search? investigating large language models as re-ranking agents</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.EMNLP-MAIN.923</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.emnlp-main.923" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">December 6-10, 2023. 2023</date>
			<biblScope unit="page" from="14918" to="14937" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Found in the middle: Permutation self-consistency improves listwise ranking in large language models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ture</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2310.07712</idno>
		<idno type="arXiv">arXiv:2310.07712</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">TREC cast 2022: Going beyond user ask and system retrieve with initiative and response generation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Owoicho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dalton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Trippas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vakulenko</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec31/papers/Overview_cast.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Soboroff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the Thirty-First Text REtrieval Conference, TREC 2022</meeting>
		<imprint>
			<publisher>NIST Special Publication</publisher>
			<date type="published" when="2022">November 15-19, 2022. 2022</date>
			<biblScope unit="page" from="500" to="338" />
		</imprint>
		<respStmt>
			<orgName>National Institute of Standards and Technology (NIST)</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">When not to trust language models: Investigating effectiveness of parametric and nonparametric memories</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mallen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Asai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khashabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.ACL-LONG.546</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.546" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">July 9-14, 2023. 2023</date>
			<biblScope unit="page" from="9802" to="9822" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Investigating the factual knowledge boundary of large language models with retrieval augmentation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2307.11019</idno>
		<idno type="arXiv">arXiv:2307.11019</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2302.13971</idno>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.13971" />
		<title level="m">LLaMA: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Canton-Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2307.09288</idno>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.09288" />
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">GPT-4 technical report</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2303.08774</idno>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.08774" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">GPT-4 technical report</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">RECOMP: improving retrieval-augmented LMs with compression and selective augmentation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2310.04408</idno>
		<idno type="arXiv">arXiv:2310.04408</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2310.04408" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Cuconasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trappolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Siciliano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Filice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Campagnano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Maarek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silvestri</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.14887</idno>
		<title level="m">The power of noise: Redefining retrieval for RAG systems</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Philadelphia, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2002">July 6-12, 2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">METEOR: an automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005</title>
		<title level="s">Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Voss</surname></persName>
		</editor>
		<meeting>the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005<address><addrLine>Ann Arbor, Michigan, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005-06-29">June 29, 2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">BERTScore: Evaluating text generation with BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=SkeHuCVFDr" />
	</analytic>
	<monogr>
		<title level="m">8th International Conference on Learning Representations, ICLR 2020</title>
				<meeting><address><addrLine>Addis Ababa, Ethiopia</addrLine></address></meeting>
		<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2020">April 26-30, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1301.3781" />
	</analytic>
	<monogr>
		<title level="m">1st International Conference on Learning Representations, ICLR 2013</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</editor>
		<meeting><address><addrLine>Scottsdale, Arizona, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">May 2-4, 2013</date>
		</imprint>
	</monogr>
	<note>Workshop Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.3115/V1/D14-1162</idno>
		<ptr target="https://doi.org/10.3115/v1/d14-1162" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Pang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</editor>
		<meeting>the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014<address><addrLine>Doha, Qatar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 25-29, 2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
	<note>A meeting of SIGDAT, a Special Interest Group of the ACL</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/N19-1423</idno>
		<ptr target="https://doi.org/10.18653/v1/n19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019<address><addrLine>Minneapolis, MN, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">June 2-7, 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v21/20-074.html" />
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">67</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</editor>
		<meeting><address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">December 10-16, 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">V N</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<meeting><address><addrLine>Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">December 4-9, 2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rijhwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nikolaev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddhant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Parikh</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.EMNLP-MAIN.584</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.emnlp-main.584" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">December 6-10, 2023</date>
			<biblScope unit="page" from="9397" to="9413" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Overview of the TREC 2019 deep learning track</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
		<idno>CoRR abs/2003.07820</idno>
		<ptr target="https://arxiv.org/abs/2003.07820" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Overview of the TREC 2020 deep learning track</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Campos</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.DL.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event</title>
		<title level="s">NIST Special Publication</title>
		<editor>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event<address><addrLine>Gaithersburg, Maryland, USA</addrLine></address></meeting>
		<imprint>
			<publisher>National Institute of Standards and Technology (NIST)</publisher>
			<date type="published" when="2020">November 16-20, 2020</date>
			<biblScope unit="volume">1266</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</title>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rücklé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno>CoRR abs/2104.08663</idno>
		<ptr target="https://arxiv.org/abs/2104.08663" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Zero-shot listwise document reranking with a large language model</title>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pradeep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2305.02156</idno>
		<idno type="arXiv">arXiv:2305.02156</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.02156" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Instruction distillation makes large language models efficient zero-shot rankers</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2311.01555</idno>
		<idno type="arXiv">arXiv:2311.01555</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2311.01555" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">RankVicuna: Zero-shot listwise document reranking with open-source large language models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pradeep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sharifymoghaddam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2309.15088</idno>
		<idno type="arXiv">arXiv:2309.15088</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2309.15088" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Conversations with search engines: SERP-based conversational response generation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Monz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Rijke</surname></persName>
		</author>
		<idno type="DOI">10.1145/3432726</idno>
		<ptr target="https://doi.org/10.1145/3432726" />
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="page">29</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Towards filling the gap in conversational search: From passage retrieval to conversational response generation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lajewska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</author>
		<idno type="DOI">10.1145/3583780.3615132</idno>
		<ptr target="https://doi.org/10.1145/3583780.3615132" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023</title>
		<editor>
			<persName><forename type="first">I</forename><surname>Frommholz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Hopfgartner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Oakes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lalmas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><forename type="middle">L T</forename><surname>Santos</surname></persName>
		</editor>
		<meeting>the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023<address><addrLine>Birmingham, United Kingdom</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">October 21-25, 2023</date>
			<biblScope unit="page" from="5326" to="5330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Towards reliable and factual response generation: Detecting unanswerable questions in information-seeking conversations</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lajewska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-56063-7_25</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-56063-7_25" />
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval -46th European Conference on Information Retrieval, ECIR 2024</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lipani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>McDonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</editor>
		<meeting><address><addrLine>Glasgow, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">March 24-28, 2024</date>
			<biblScope unit="volume">14610</biblScope>
			<biblScope unit="page" from="336" to="344" />
		</imprint>
	</monogr>
	<note>Proceedings, Part III</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2021.REPL4NLP-1.17</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.repl4nlp-1.17" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Calixto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Vulic</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Saphra</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Kassner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Camburu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Bansal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Shwartz</surname></persName>
		</editor>
		<meeting>the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021-08-06">August 6, 2021</date>
			<biblScope unit="page" from="163" to="173" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">MS MARCO: A human generated machine reading comprehension dataset</title>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rosenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Besold</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Avila Garcez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Wayne</surname></persName>
		</editor>
		<meeting>the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016)<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-12-09">December 9, 2016</date>
			<biblScope unit="volume">1773</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">KILT: a benchmark for knowledge intensive language tasks</title>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S H</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yazdani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>De Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maillard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Plachouras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2021.NAACL-MAIN.200</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.naacl-main.200" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online</title>
		<editor>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rumshisky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hakkani-Tür</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</editor>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online</meeting>
		<imprint>
			<date type="published" when="2021">June 6-11, 2021</date>
			<biblScope unit="page" from="2523" to="2544" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">An exploration study of mixed-initiative query reformulation in conversational passage retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fang</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec31/papers/udel_fang.C.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022</title>
		<title level="s">NIST Special Publication</title>
		<editor>
			<persName><forename type="first">I</forename><surname>Soboroff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the Thirty-First Text REtrieval Conference, TREC 2022</meeting>
		<imprint>
			<publisher>National Institute of Standards and Technology (NIST)</publisher>
			<date type="published" when="2022">November 15-19, 2022</date>
			<biblScope unit="volume">500-338</biblScope>
		</imprint>
		<respStmt>
			<orgName>National Institute of Standards and Technology (NIST)</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">F-coref: Fast, accurate and easy to use coreference resolution</title>
		<author>
			<persName><forename type="first">S</forename><surname>Otmazgin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cattan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.aacl-demo.6" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2022 - System Demonstrations</title>
		<meeting>the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2022 - System Demonstrations<address><addrLine>Taipei, Taiwan</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2022">November 20-23, 2022</date>
			<biblScope unit="page" from="48" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Lingmess: Linguistically informed multi expert scorers for coreference resolution</title>
		<author>
			<persName><forename type="first">S</forename><surname>Otmazgin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cattan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.EACL-MAIN.202</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.eacl-main.202" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</editor>
		<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023<address><addrLine>Dubrovnik, Croatia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">May 2-6, 2023</date>
			<biblScope unit="page" from="2744" to="2752" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Novelty and diversity in information retrieval evaluation</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kolla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vechtomova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ashkan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Büttcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mackinnon</surname></persName>
		</author>
		<idno type="DOI">10.1145/1390334.1390446</idno>
		<ptr target="https://doi.org/10.1145/1390334.1390446" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;08</title>
		<meeting>the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;08<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="659" to="666" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
