Improving RAG Systems via Sentence Clustering and Reordering

Marco Alessio1,*, Guglielmo Faggioli2, Nicola Ferro2, Franco Maria Nardini1 and Raffaele Perego1

1 Institute of Information Science and Technologies (ISTI), National Research Council of Italy (CNR), Pisa, Italy
2 Department of Information Engineering (DEI), University of Padua, Padua, Italy

Information Retrieval's Role in RAG Systems (IR-RAG) 2024, July 18, 2024, Washington, DC
* Corresponding author: marco.alessio@isti.cnr.it (M. Alessio); guglielmo.faggioli@unipd.it (G. Faggioli); nicola.ferro@unipd.it (N. Ferro); francomaria.nardini@isti.cnr.it (F. M. Nardini); raffaele.perego@isti.cnr.it (R. Perego)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Large Language Models (LLMs) have gained noteworthy importance and attention across different domains and fields in recent years. Information Retrieval (IR) is one of the domains they have impacted the most, as witnessed by the recent increase in the number of IR systems incorporating generative models. Specifically, Retrieval Augmented Generation (RAG) is the emerging paradigm that integrates existing knowledge from large-scale document corpora into the generation process, enabling the model to generate more coherent, contextually relevant, and accurate text across various tasks, including summarization, question answering, and dialogue systems. Recent studies have highlighted the significant positional dependence exhibited by RAG systems: the placement of information within the LLM input prompt drastically affects the generated output. We ground our study on this property by investigating alternative strategies for ordering sentences within the LLM prompt to improve the average quality of the responses generated in dialogues between the user and the conversational system. We propose the architecture of an end-to-end RAG-based conversational assistant and empirically evaluate our strategies using the TREC CAsT 2022 collection. Our experiments highlight significant differences between distinct arrangement strategies. By employing an evaluation methodology based on RankVicuna, we show that our best approach achieves improvements up to 54% in terms of overall response quality over baseline methods.

Keywords
Retrieval Augmented Generation, Conversational Search, Positional Bias, Arrangement Strategy

1. Introduction

Retrieval Augmented Generation (RAG) is an emerging paradigm in the field of Artificial Intelligence (AI) to enhance the accuracy and reliability of generative models by exploiting external data sources. In recent years, RAG has gained noteworthy importance and attention across different domains and fields [1], as it combines the strengths of Information Retrieval (IR) systems and generative models to overcome each other's limitations.
RAG can improve the output of a generative model in several ways. First, it allows the generation process to be grounded on information from trusted knowledge sources incorporated in the provided prompt, thus avoiding, or at least mitigating, the well-known Large Language Model (LLM) hallucination problem, i.e., the model generating content that is not factually true or that does not concern the prompted text [2, 3, 4]. Second, RAG allows for continuous knowledge updates and the integration of domain-specific information: the LLM can successfully respond to facts and topics not covered in its training data; moreover, it is easily adapted to different scenarios and contexts, without retraining or fine-tuning the entire model using datasets that might be unavailable or limited in scope or size. Finally, grounding the generation process on external knowledge incorporated in the input permits linking the output to verifiable external documents, thus enhancing trustworthiness and transparency [2, 3, 4].

Current RAG systems, however, suffer from some drawbacks highlighted in the literature. One of these issues originates from the notable positional sensitivity shown by LLMs: the placement of information within the input prompt significantly impacts the resulting output. Previous research [5, 6, 7] has highlighted biases towards "primacy" and "recency", suggesting that generative models tend to prioritize information placed at the beginning or end of the input while neglecting the central portion.

In this paper, we advance over previous studies by investigating the positional bias in the context of RAG-based conversational systems. Specifically, we propose a novel strategy for arranging sentences within the input prompt of the LLM to improve the average quality of the generated responses over simpler methods. Our approach is based on the intuition that, as coherent, fluent, and well-structured text is a critical factor for successful communication between human beings, the same should also apply to LLMs: among all the possible arrangements of the input, those placing sentences with similar meaning closer in the LLM prompt should generate, on average, better quality output. Therefore, we propose an end-to-end RAG architecture to test our hypothesis. The components of this architecture allow us to precisely identify which sentences are likely useful for answering user queries. To this end, we cluster sentences by their similarity, and we define alternative strategies for ordering them both inter- and intra-cluster. In this way, we can study the effect of these alternatives for prompting the generative LLM on the generated response. To our knowledge, this is the first work that explicitly considers this aspect and allows us to tune in a principled way the ordering of the input sentences provided to the generative component of a RAG system. We compare our proposed approach against competitive baselines that represent the solutions employed by current RAG systems. We experimentally evaluate the performance of our proposed approach using the TREC Conversational Assistance Track (CAsT) 2022 collection [8], which allows us to compare the results that different arrangement strategies can achieve in a widely accepted Conversational Search (CS) scenario. Results highlight remarkable differences among the tested sentence placement strategies, with improvements up to 8.66% w.r.t. the best baseline and 54.94% w.r.t. random ordering.

The remainder of this work is organized as follows: Section 2 surveys the current state of the art about RAG systems and the quality evaluation of their responses. Section 3 details the architecture of our RAG system. Section 4 and Section 5 detail the results of an experimental analysis, which aims to highlight how the ordering of clusters and sentences affects the quality of the generated response.
Finally, Section 6 draws some conclusions and outlines future directions and extensions of our research.

2. Related Work

In the following, we survey the main works dealing with LLM positional dependencies and the difficulties of RAG systems in reconciling internal and external knowledge. Then, we analyze the challenges related to the evaluation of the quality of RAG responses and to the use of an "LLM-as-a-judge".

2.1. Retrieval Augmented Generation

RAG enhances LLMs by retrieving additional information from an external knowledge source, enabling them to successfully answer queries beyond the scope of the training data. At the same time, RAG mitigates the hallucination problem, i.e., the generation of factually incorrect text, by referencing the provided external knowledge.

The RAG paradigm is organized into two main stages: retrieval and generation. Upon receiving a query from the user, the relevant information is retrieved from an external knowledge source. This task is undertaken by a standard IR pipeline that outputs a ranked list of documents. Afterwards, in the generation phase, the LLM synthesizes the response to the user query using the information carried by the selected documents.

Despite its clear advantages, RAG has drawbacks and limitations, which spark several challenges. First, RAG systems employ the external knowledge as their main source of information, disregarding the internal knowledge memorized within the LLM [9, 10]. This, in turn, may determine a decrease in the quality of the generated output when the provided content is not of high quality [10]. It is not uncommon for RAG to obtain worse outputs w.r.t. what the LLM can achieve in the closed-book scenario, i.e., without supplying retrieved results [10]. In this line, it has been observed that the LLM produces better results without injecting external knowledge when the topic popularity is very high [9]. In general, state-of-the-art LLMs provide good quality responses for a wide range of questions but require assistance from an IR system when the internal knowledge of the model lacks information about the current topic. This phenomenon is likely to occur if the topic is not very popular, requires exceptional expertise, or when scaling the number of parameters of the generative model produces little to no effect [9]. Another challenge lies in the significant positional dependence [5, 6, 7] exhibited by LLMs, whereby the placement of information within the input prompt drastically affects the generated output. Prior research [5] has identified "primacy" and "recency" biases, indicating the tendency of generative models to focus on information positioned either at the beginning or at the end of the input while disregarding the central part. Therefore, performance degrades significantly when LLMs should rely on information in the middle of their input context, showing a characteristic U-shaped performance curve [5]. This, in turn, means that most state-of-the-art generative models do not use their longer contexts more effectively than their smaller and earlier counterparts. These phenomena can be observed both in open-source models, e.g., Llama [11, 12] by Meta, and closed-source ones, e.g., GPT-4 [13] by OpenAI.

It is not advisable to directly input all the retrieved information to the LLM for generating the response. Redundant information and very long contextual data can interfere with the generation quality, leading to repetitive, disjointed, or incoherent outputs [1]. Therefore, the retrieved content is typically further processed before being given in input to the LLM [14]. A recent work in this direction systematically examines the retrieval strategy of RAG systems [15]. The authors consider multiple retrieval factors affecting the generation process, such as the relevance of the passages in the prompt context, their position, and their number. One counter-intuitive finding is that the retriever's highest-scoring documents that are not directly relevant to the query, e.g., that do not contain the answer, negatively impact the effectiveness of the LLM. Moreover, the authors discover that adding random documents to the prompt improves the LLM accuracy by up to 35%.

In this work, we rely on the intuition that the use of coherent, fluent, and well-structured inputs can improve RAG, and we propose an end-to-end architecture for selecting and structuring the external information included in the LLM prompt for response generation.
2.2. Quality Evaluation

Another line of research concerns how to evaluate the overall quality of the generation output. Despite human assessment providing the most accurate and reliable measure for evaluating model performance, its high time and cost requirements severely limit its application. Therefore, there exists an ever-increasing demand for automated evaluation techniques that consistently align with human judgements while offering enhanced efficiency and cost-effectiveness.

In this paper, we focus on text-based generative models. Classical automatic evaluation metrics, such as BLEU [16], ROUGE [17], and METEOR [18], are designed to quantify the degree of similarity between a candidate text and one or more reference texts by assessing their n-gram overlap. Their simplicity and explainability, along with a good correlation with human judgements, make these metrics widely used as baselines. However, these metrics exhibit several limitations [19]: firstly, they cannot account for lexical diversity; secondly, they penalize variations in the semantic ordering of words; thirdly, they struggle to capture and match paraphrases effectively; lastly, they inadequately account for distant dependencies within the text. With the advent of word embeddings [20, 21] and neural models [22, 23, 11, 12, 24] based on Transformers [25], new learned metrics [19, 26] have been developed. For example, BERTScore [19] can capture the semantic similarity between the candidate and reference texts by employing the contextual embeddings generated by an encoder model, such as BERT [22].
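As an illustration of how such a learned metric is computed in practice, the following minimal sketch uses the bert-score Python package; the candidate and reference sentences are invented placeholders, not examples from our experiments.

```python
# Minimal sketch: computing BERTScore with the `bert-score` package.
# The texts below are illustrative placeholders.
from bert_score import score

candidates = ["The cat sat quietly on the mat."]
references = ["A cat was sitting on the mat."]

# P, R, F1 are tensors with one entry per candidate/reference pair;
# they measure precision, recall, and F1 over contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0]:.4f}")
```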
In recent years, the rapid advancements of LLMs, showing remarkable performance across many tasks, have attracted considerable interest in their potential application also as annotators and evaluators. Due to their training using Reinforcement Learning from Human Feedback (RLHF), these models demonstrate significant human alignment. Much research has investigated leveraging state-of-the-art LLMs to automatically produce assessments serving as proxies for human judgments, a paradigm known as "LLM-as-a-judge". For example, Zheng et al. [24] assessed the quality of conversations with various LLMs, both open and closed source, employing GPT-4 [13] as the judge. They experimented with various prompts and different approaches, such as single-answer grading and pairwise comparisons, both between responses and against a reference text. GPT-3.5 Turbo and GPT-4 [13] have been employed as listwise rerankers [6, 7] for the TREC Deep Learning 2019 and 2020 [27, 28] and BEIR [29] experimental collections, obtaining state-of-the-art performance [6]. The same LLMs have also been employed as teacher models to fine-tune smaller open-source student models, such as Llama and Vicuna [30, 31] (i.e., RankVicuna [32]).

In this work, we rely on state-of-the-art assessment methods and evaluate the quality of the responses generated by the different methods using RankVicuna [32].

[Figure 1 here: a block diagram of the pipeline. The conversation and the current query undergo co-reference resolution and are submitted to the retrieval system; the retrieved passages go through sentence splitting, near-duplicate removal, sentence reranking, and top-k selection; the selected sentences are clustered, reordered according to the cluster-order and within-cluster-order strategies, and assembled into the prompt for the large language model that generates the response.]
Figure 1: Architecture of our proposed RAG system.

3. The Proposed RAG Architecture

Generative models exhibit strong biases towards information positioned at the start or at the end of the input while disregarding the middle part [5]. This phenomenon motivates our research effort to determine how the order of the input sentences provided to a RAG-based conversational system affects the quality of the generated output and, in turn, the optimal ordering strategy to achieve the best response. This section describes each method and all the variations considered in our experiments.

The architecture of our proposed RAG system is illustrated in Figure 1. It includes an IR pipeline, which retrieves the top-k documents D = {d1, d2, ..., dk} in response to each user utterance q. The retrieved documents are then processed by additional components responsible for splitting them into sentences, identifying the most relevant sentences, clustering such sentences based on their semantic similarity, and ordering them according to the various strategies analyzed. Finally, the selected, re-ordered sentences are provided as input to the LLM for response generation. These components are the focus of our research. Their functionalities are detailed in the remainder of this section.

3.1. Document Pre-processing and Splitting

As observed in the literature [33, 34], the entire text of a relevant document rarely contains meaningful knowledge to satisfy the user information need expressed by a query q. In most cases, only one or a few portions of the document are relevant to the query, while the remaining parts contain irrelevant information. The proposed architecture aims to precisely identify the key information in the retrieved documents, i.e., the sentences, to reduce the noise in the prompt used for response generation.

Hereinafter, we consider sentences in the documents as the atomic units of information. Our pipeline, illustrated in Figure 1, works as follows. First, for each query q we consider only the top-k documents {d1, d2, ..., dk} retrieved by the IR system. Then, a state-of-the-art co-reference resolution model is applied to all documents to replace pronouns and other generic terms within a sentence with the fully specified entity mentioned in a previous sentence. This allows us to remove the contextual dependencies among sentences in a document so that they can be considered self-explanatory. The third step splits each document di into a sequence of sentences {si,1, si,2, ..., si,ni}. Afterwards, near-duplicate removal is applied to the sentences originating from all documents by discarding sentences with a Jaccard similarity ≥ 0.9 between their Bag-of-Words (BoW) representations. This step is particularly important in our setting because the CAsT 2022 corpus contains a multitude of near-duplicate documents; in particular, the same Wikipedia article is often replicated in documents retrieved from the KILT and MS-MARCO collections.
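The following sketch illustrates the near-duplicate removal step under our reading of the description above; the whitespace tokenization is an assumption, as the exact Bag-of-Words construction is not specified.

```python
# Sketch of near-duplicate sentence removal: a sentence is discarded when its
# Bag-of-Words set has Jaccard similarity >= 0.9 with an already kept one.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def remove_near_duplicates(sentences: list[str], threshold: float = 0.9) -> list[str]:
    kept, kept_bows = [], []
    for sentence in sentences:
        bow = set(sentence.lower().split())  # BoW as a set of tokens (assumption)
        if all(jaccard(bow, other) < threshold for other in kept_bows):
            kept.append(sentence)
            kept_bows.append(bow)
    return kept
```

The pairwise scan is quadratic in the number of sentences, which is acceptable here since only the sentences of the top-k retrieved documents are compared.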
3.2. Sentence Selection

After the first pre-processing phase, we obtain a sentence candidate set for each query to be included in the LLM prompt of our RAG system (see Figure 1). Since the cardinality of this set can be large and not all the sentences are useful for answering the query, we employ the BERT-based cross-encoder answer-in-the-sentence classifier developed by Lajewska and Balog [35] (the model named "squad_snippets_unanswerable", available at https://iai.group/downloads/emnlp2023-answerability_prediction) to rank the candidate sentences according to their predicted usefulness to (at least partially) answer the query, and we retain the top-n ranked sentences, discarding the remaining ones. As a possible limitation, please note that the model by Lajewska and Balog [35] has been trained on the queries and passages used in our experiments. Therefore, it is very likely that the model performs significantly better on our data w.r.t. any other model, ensuring that the top-ranked sentences are indeed relevant to the query. Even though such a model is not available in a real practical scenario, this choice is justified by our research effort being focused exclusively on comparing the ordering strategies for sentences in the LLM input rather than on the absolute results achievable by our RAG system.

3.3. Sentence Clustering and Ordering

The previous steps of the pipeline constrain the number of sentences per query while increasing their expected utility in answering the query. Furthermore, they allow us to control other noise sources, such as the number or the variable length of the retrieved documents. Therefore, we can assess how the positional bias affects the generation process. We highlight again that the positional bias of LLMs has already been observed in prior research [5, 6, 7]. However, it has been considered exclusively as a limitation of LLMs and RAG systems. Our research moves a step forward by investigating the best ordering strategy to maximize, on average, the quality of the generated responses over a testing query set Q. We believe that logically organized text, where sentences with akin meanings are positioned closer in the LLM prompt, should, on average, yield superior output quality. Consequently, our sentence ordering strategies exploit the similarities among the sentences selected by the sentence selection step. To measure semantic inter-sentence similarity, we resort to the contextualized embeddings generated with the tct-colbert model [36] (https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco). We generate the representation of the n selected sentences for each query and measure their pair-wise cosine similarity. Then, we progressively aggregate the most similar sentences by employing a hierarchical clustering algorithm. The maximum value of the Silhouette statistic is used as the criterion to determine the optimal clustering among all the possible ones. As a result, for each query q ∈ Q, the top-n sentences are grouped into a variable number Nc ≥ 1 of clusters, each composed of one or more sentences with similar semantic meaning. To devise different strategies for ordering the input sentences, we leverage the above clustering, which allows us to study the impact of sentence placement variations occurring both across and within clusters.
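One way to implement this step is sketched below, assuming average-linkage agglomerative clustering over cosine distances; the text only states that a hierarchical algorithm with Silhouette-based model selection is used, so the linkage criterion is our assumption.

```python
# Sketch of the clustering step: the n sentence embeddings are clustered
# hierarchically, and the cut maximizing the Silhouette statistic is kept.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

def cluster_sentences(embeddings: np.ndarray) -> np.ndarray:
    """Return one cluster label per row of the (n x d) embedding matrix."""
    n = embeddings.shape[0]
    tree = linkage(embeddings, method="average", metric="cosine")
    best_labels, best_score = np.ones(n, dtype=int), -1.0  # fallback: one cluster
    for n_clusters in range(2, n):  # Silhouette needs 2 <= clusters <= n - 1
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

The single-cluster fallback mirrors the Nc ≥ 1 behaviour described above: when no multi-cluster cut is usable, all sentences end up in one cluster.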
More formally, given a query, the set S of the n previously selected sentences, and the prompt p, we aim to find the ordering ord* of S such that:

$$ord^{*} = \operatorname*{argmax}_{ord} \sum_{q \in Q} s\big(q,\, LLM(p, q, ord(S))\big),$$

where ord(S) is a sentence ordering strategy that returns an ordering of the sentences in S, LLM(p, q, ord(S)) is the response generated by the LLM for prompt p, query q, and sentence ordering ord(S), and, finally, s(q, r) is a scoring function evaluating the perceived quality of the generated response r = LLM(p, q, ord(S)) for query q.

The order of the clusters and the order of the sentences within the same cluster uniquely determine the possible global orderings of the n sentences we consider for inputting the LLM. Our experimental assessment will evaluate six different ordering strategies for placing the clusters of sentences in the input, and four different methods for ordering sentences within the same cluster. Cluster placements consider different aspects, such as the clusters' cardinality and their similarity to the query. The orderings tested include the random one and those obtained by decreasing/increasing the value of each aspect. Finally, the U-shaped order suggested in [5] is also tested. Regarding the ordering within clusters, we consider random order, order by reranker score, visiting order, and the clustering aggregation order.
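Operationally, the objective above amounts to comparing candidate ordering strategies by the total quality of the responses they induce over the query set. The sketch below makes this reading explicit; the generator and scorer are passed in as callables and are hypothetical stand-ins for the Vicuna-based generation and the RankVicuna-based scoring used later in the paper.

```python
# Sketch of the optimization objective: pick, among candidate ordering
# strategies, the one maximizing the total response quality over Q.
# `generate(q, ordered_sentences)` and `score(q, response)` stand in for
# LLM(p, q, ord(S)) and s(q, r), respectively.
def best_ordering_strategy(strategies, queries, sentences_for, generate, score):
    def total_quality(ord_fn):
        return sum(
            score(q, generate(q, ord_fn(sentences_for(q))))  # s(q, LLM(p, q, ord(S)))
            for q in queries
        )
    return max(strategies, key=total_quality)
```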
4. Experimental Evaluation

We can now formulate the research questions we aim to answer with our experimental framework.

Research Questions. Given the sentence selection and clustering steps discussed above, the two main aspects to consider for defining our ordering strategies ord(·) are the order of placement in the LLM prompt of the clusters and of the sentences within the same cluster. They uniquely determine the global ordering ord(·) of the top-n sentences given in input to the LLM for response generation. Our research questions assess which is the best solution among the alternatives considered. Specifically,

RQ1 What is the best cluster ordering strategy?

RQ2 What is the best ordering strategy for sentences within the same cluster?

RQ3 Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?

Experimental Settings. We experiment with the TREC CAsT 2022 dataset, a standard experimental collection for CS [8]. This choice is due to prior research having released additional datasets, models, and human judgments for this benchmark [34, 35]. The corpus is composed of three document collections, MS-MARCO v2 [37], KILT [38], and Washington Post v4, which are subdivided into 106M short documents. CAsT 2022 includes 18 information needs (topics) and 205 user utterances (queries), with an average of 11.39 user utterances per topic. The number of utterances for which relevance judgements are provided is 163.

For our experiments, we employ as the output of the retrieval pipeline the best-performing run originally submitted to TREC CAsT 2022 [39] (the run identified as "udinfo_mi_b2021" from the "udel_fang" group, University of Delaware, USA). This allows us to focus exclusively on the subsequent steps of our pipeline. In all our experiments, we consider only the top-20 retrieved documents, leaving the investigation of the implications of this choice and of possible alternatives as future work. To provide meaningful results, all queries where Precision@20 < 0.2, that is, having at most 3 relevant passages in the top-20 results, are discarded, ensuring that enough relevant information is retrieved to answer the considered queries successfully (the number of queries considered in these experiments is 115 out of the 163 evaluated in the official relevance judgments).

Furthermore, in the steps of the pipeline where the query text is needed, i.e., sentence ranking and response generation, we employed the manually rewritten text for every query. This allows us to account for the possible bias introduced by different query rewriting approaches. Future developments will investigate the relationship between query rewriting approaches and RAG solutions.

For co-reference resolution at the document level, i.e., removing co-references across different sentences in the "document processing" step, we use the "F-Coref" model [40] (https://huggingface.co/biu-nlp/f-coref), based on the "LingMess" architecture [41]. After this step, we use the well-known spaCy Python library to divide each document into a sequence of independent sentences.
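The sketch below shows how these two pre-processing tools can be invoked; the fastcoref calls follow the package's published interface, while turning the predicted mention clusters into fully resolved text (replacing each pronoun with its fully specified antecedent) requires additional span-rewriting logic that we only indicate.

```python
# Sketch of the pre-processing tools named above: F-Coref for co-reference
# resolution and spaCy for sentence splitting.
import spacy
from fastcoref import FCoref

nlp = spacy.load("en_core_web_sm")   # any spaCy English pipeline with a parser
coref_model = FCoref()               # the F-Coref model of Otmazgin et al. [40]

def split_into_sentences(document: str) -> list[str]:
    # Predict co-reference clusters: lists of co-referring mention strings.
    preds = coref_model.predict(texts=[document])
    clusters = preds[0].get_clusters()
    # Rewriting each later mention with the fully specified entity of its
    # cluster is omitted here; `clusters` carries the needed span information.
    return [sent.text.strip() for sent in nlp(document).sents]
```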
In the following, we report two different metrics for each comparison. The former is the average score of every approach when assessing all 10 random permutations using RankVicuna. The latter, instead, is a pairwise metric, assessing the number of queries for which the first approach obtains a higher/the same/a lower score w.r.t. the other one. This information should better highlight the differences and provide a more comprehensive view than a single average value.

Response Generation. For the response generation, we employ Vicuna 7B [24] (https://huggingface.co/lmsys/vicuna-7b-v1.5), an LLM based on Llama 2 [11, 12] fine-tuned on 125K user conversations with ChatGPT gathered using public APIs from the ShareGPT.com website.

Quality Evaluation. To evaluate the quality of the generated responses, we employ RankVicuna [32] to perform a listwise ranking of all the responses being compared. To mitigate the positional bias intrinsic to RankVicuna, we assess 10 different random permutations of the same responses, averaging the results obtained. This is a reasonable trade-off between evaluation accuracy and the computational runtime required. For each assessment, we assign (N + 1 − i)/N points to the i-th ranked response, where 1 ≤ i ≤ N and N is the number of responses being compared. Furthermore, we also evaluate the number of wins and ties between the pairs of responses considered. Whenever a valid judgment from the LLM cannot be determined, the entire comparison is discarded from the evaluation.
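The following sketch mirrors this protocol; judge_rank is a hypothetical stand-in for a RankVicuna call that returns the ranked positions of the presented responses, or None when no valid judgment can be parsed from its output.

```python
# Sketch of the evaluation protocol: the i-th ranked of N responses earns
# (N + 1 - i) / N points, and scores are averaged over several random
# presentation orders to mitigate the judge's own positional bias.
import random

def average_scores(responses, judge_rank, permutations=10, seed=0):
    rng = random.Random(seed)
    n = len(responses)
    totals, valid = [0.0] * n, 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # random presentation order shown to the judge
        ranking = judge_rank([responses[i] for i in order])
        if ranking is None:  # unparsable judgment: discard the comparison
            continue
        for rank, position in enumerate(ranking, start=1):
            totals[order[position]] += (n + 1 - rank) / n
        valid += 1
    return [t / valid for t in totals] if valid else None
```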
4.1. RQ1: Order of Clusters

For the first experiment, we evaluate the effects of different orderings of the clusters while keeping the order of the sentences within the same cluster (based on the clustering aggregation order) fixed. We test six different strategies for ordering clusters: clusters selected in random order (strategy A); clusters selected in descending order of cardinality (strategy B); clusters selected in ascending order of similarity with the query (strategy C); clusters selected in descending order of similarity with the query (strategy D); clusters selected in descending order of similarity with the query using a ping-pong layout from top to bottom (strategy E); clusters selected in descending order of similarity with the query using a ping-pong layout from bottom to top (strategy F). The similarity between a cluster C and the query is defined as the maximum cosine similarity between the query and any sentence si,j ∈ C belonging to the cluster. In the top-to-bottom ping-pong layout, the clusters are placed first, last, second, second-to-last, third, and so on, e.g., [A, B, C, D, E] becomes [A, C, E, D, B]; in the bottom-to-top layout, they are placed last, first, second-to-last, second, third-to-last, and so on, e.g., [A, B, C, D, E] becomes [B, D, E, C, A]. A sketch of strategies D, E, and F is given below, after Table 1.

Table 1
Comparisons between the six approaches proposed for RQ1: "What is the best ordering strategy for clusters?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results (pairwise wins-ties and average score) are reported.

            A vs.     B vs.     C vs.     D vs.     E vs.     F vs.
A           -         56-4-51   57-2-52   62-4-45   51-1-59   55-2-54
B           51-4-56   -         47-8-56   61-4-46   52-5-54   52-2-57
C           52-2-57   56-8-47   -         58-3-50   55-0-56   59-4-48
D           45-4-62   46-4-61   50-3-58   -         44-1-66   47-1-63
E           59-1-51   54-5-52   56-0-55   66-1-44   -         57-5-49
F           54-2-55   57-2-52   48-4-59   63-1-47   49-5-57   -
Overall     261-13    269-23    258-17    310-13    251-12    270-14
Avg. Score  0.5723    0.5844    0.5510    0.6219    0.5736    0.5969
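The sketch below illustrates strategy D and the two ping-pong layouts built on top of it (strategies E and F); representing each cluster as a list of sentence embeddings is our assumption.

```python
# Sketch of cluster placement: strategy D sorts clusters by descending query
# similarity (maximum cosine similarity between the query embedding and any
# sentence embedding in the cluster); strategies E and F then lay the sorted
# clusters out in ping-pong order.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_by_query_similarity(clusters, query_emb):  # strategy D
    # `clusters` is a list of clusters, each a list of sentence embeddings.
    similarity = lambda cluster: max(cosine(query_emb, s) for s in cluster)
    return sorted(clusters, key=similarity, reverse=True)

def ping_pong_top_down(items):  # strategy E: [A, B, C, D, E] -> [A, C, E, D, B]
    out, lo, hi = [None] * len(items), 0, len(items) - 1
    for i, item in enumerate(items):
        if i % 2 == 0:
            out[lo], lo = item, lo + 1  # even-indexed items fill from the front
        else:
            out[hi], hi = item, hi - 1  # odd-indexed items fill from the back
    return out

def ping_pong_bottom_up(items):  # strategy F: [A, B, C, D, E] -> [B, D, E, C, A]
    return ping_pong_top_down(items)[::-1]
```

Both ping-pong functions reproduce the worked examples given above, placing the least query-similar clusters towards the middle of the prompt.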
As shown in Table 1, sorting the clusters in descending order of their similarity with the query (strategy D) is the clear winner in this comparison, in terms of both score and pairwise wins. This approach achieves 18.77%, 15.24%, 20.16%, 23.51%, and 14.81% more overall pairwise wins than strategies A, B, C, E, and F, respectively. These figures suggest that the LLM used to generate the responses exhibits a much stronger "primacy" than "recency" bias, as highlighted by option C being overall the worst performing among those considered. Methods E and F, instead, were designed to place the least important clusters towards the center, since LLMs struggle to effectively utilize the information in the middle of their prompt. However, we can see that both approaches are ineffective: we suspect this is due to the length of the input text being much smaller than the maximum context window of the model. Different results may be observed when varying the amount of input data provided to the LLM for generation.

4.2. RQ2: Order of Sentences within the same Cluster

In this second experiment, we evaluate different sorting schemes for sentences within the same cluster, keeping the order of the clusters fixed at the best strategy determined in RQ1. We test four different strategies for ordering sentences within the same cluster: sentences selected in random order (strategy A); sentences selected in descending order of reranker score (strategy B); sentences selected by visiting order, i.e., sorted based on the order in which they appear when sequentially scanning the set of top-k retrieved documents (strategy C); sentences selected by aggregation order (strategy D).

Table 2
Comparisons between the four approaches proposed for RQ2: "What is the best ordering strategy for sentences within the same cluster?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results are reported.

            A vs.     B vs.     C vs.     D vs.
A           -         53-3-59   48-8-59   55-4-56
B           59-3-53   -         54-3-58   60-7-48
C           59-8-48   58-3-54   -         57-6-52
D           56-4-55   48-7-60   52-6-57   -
Overall     174-15    159-13    154-17    172-17
Avg. Score  0.6281    0.6143    0.6124    0.6451

As shown in Table 2, the best results are achieved by two different strategies: option D, sorting sentences within the same cluster based on the aggregation order, and, interestingly, option A, randomly sorting the sentences. Both strategies are preferable to the other two methods considered, with option D achieving 8.18% and 11.69% more overall pairwise wins w.r.t. options B and C, respectively. We note, however, that the differences in performance among the various strategies are not large, as the sentences are grouped in the clusters by their similarity. The LLM response appears to be more impacted by the order of the clusters than by the order of the sentences within each cluster.

4.3. RQ3: Comparison with Baselines

Our last experiment investigates whether our proposed approach is beneficial in enhancing the overall effectiveness of the RAG system w.r.t. four simpler baseline methods that may be used in practice by current state-of-the-art RAG systems. We test five different strategies: i) the top-5 retrieved documents (A); ii) the top-40 sentences taken in random order (B); iii) the top-40 sentences taken in descending order of re-ranker score (C); iv) the top-40 sentences selected by visiting order (D); v) the best clusterization-based approach determined from RQ1 and RQ2 (CL).

Table 3
Comparisons between the five approaches considered for RQ3: "Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results are reported.

            A vs.     B vs.     C vs.     D vs.     CL vs.
A           -         45-4-62   54-1-56   54-0-57   66-2-43
B           62-4-45   -         71-1-39   64-8-39   67-5-39
C           56-1-54   39-1-71   -         50-4-57   59-3-49
D           57-0-54   39-8-64   57-4-50   -         59-3-49
CL          43-2-66   39-5-67   49-3-59   49-3-59   -
Overall     218-7     162-18    231-9     217-15    251-13
Avg. Score  0.5882    0.5533    0.6177    0.6016    0.6392

The results obtained are shown in Table 3. The clusterization-based approach demonstrates superior performance, resulting as the best strategy in this comparison: it achieves 15.14%, 54.94%, 8.66%, and 15.67% more overall pairwise wins than baselines A, B, C, and D, respectively. Among the methods considered in this work, randomly sorting the top-h sentences is by far the worst-performing approach. This, in turn, supports our starting intuition about coherent, fluent, and well-structured text being a critical factor for LLMs to generate high-quality output.

5. Additional Experiments

The clusterization-based ordering strategy proposed in this work is designed to position sentences sharing analogous semantic content close together in the LLM prompt. Given the results obtained in Section 4.3, we have shown its effectiveness in our experimental settings. Nevertheless, in this section we answer two additional research questions to gain further insights. Specifically,

RQ4 Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?

RQ5 Is the proposed clusterization strategy more effective than directly optimising the similarity of subsequent sentences?

Experimental Settings. We heuristically determine the two orderings ord+ and ord−, which maximize and minimize, respectively, the overall similarity between subsequent sentences. Let sum+ and sum− be the sums of the similarities between subsequent sentences for ord+ and ord−, respectively. The similarity sim(p) for a sentence permutation p of h sentences is given by the following equation, where min-max normalization is used and si is the embedding representation of the i-th sentence in p:

$$sim(p) = \frac{\left(\sum_{i=2}^{h} \cos(s_{i-1}, s_i)\right) - sum^{-}}{sum^{+} - sum^{-}}$$

In our experiments, for each query, we generate one million random permutations; then, we determine the permutations whose similarity is closest to each of the following thresholds: 0.125, 0.250, 0.375, 0.500, and 0.625. We decided to stop at 0.625 because higher values are unlikely to be observed, given that the average similarity of these permutations is 0.3433 with a standard deviation of 0.0530.
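A sketch of how sim(p) can be computed, and of how the permutation closest to a target threshold can be selected among random candidates, follows; the embeddings are assumed to be given in permutation order, and the extremes sum+ and sum− are taken as already determined heuristically.

```python
# Sketch of the normalized adjacent-sentence similarity sim(p). The heuristic
# extremes sum_plus and sum_minus (obtained from ord+ and ord-) are given.
import numpy as np

def adjacent_similarity(embs: list[np.ndarray]) -> float:
    # Sum of cosine similarities between each pair of subsequent sentences.
    return sum(
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(embs, embs[1:])
    )

def sim(embs: list[np.ndarray], sum_minus: float, sum_plus: float) -> float:
    # Min-max normalization against the heuristic extremes.
    return (adjacent_similarity(embs) - sum_minus) / (sum_plus - sum_minus)

def closest_to_threshold(permutations, sum_minus, sum_plus, target):
    # Among candidate permutations (each a list of embeddings), pick the one
    # whose normalized similarity is closest to the target threshold.
    return min(
        permutations,
        key=lambda p: abs(sim(p, sum_minus, sum_plus) - target),
    )
```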
Results. We determine how the quality of the generated response is influenced when varying the similarity between subsequent sentences at the predefined thresholds, as shown in Table 4. It is interesting to note that the highest results are obtained by the permutations with 0.625 normalised similarity, rather than by 1.000, which corresponds to the ordering maximising the similarity between subsequent sentences (ord+). This method achieves 4.47% and 26.67% more pairwise wins w.r.t. ord+ and ord−, respectively. To answer RQ5, we assess the responses generated using the best clustering strategy against the approach defined above. The average scores are 0.7652 and 0.7348, respectively, while the pairwise comparison yields 38 wins, 46 ties, and 31 losses.

Table 4
Comparisons between the seven approaches proposed for RQ4: "Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results are reported.

            1.000 vs.  0.625 vs.  0.500 vs.  0.375 vs.  0.250 vs.  0.125 vs.  0.000 vs.
1.000       -          46-2-43    38-2-51    45-0-46    40-2-49    40-1-50    38-1-52
0.625       43-2-46    -          37-2-52    42-2-47    41-1-49    36-0-55    35-1-55
0.500       51-2-38    52-2-37    -          51-2-38    52-0-39    37-0-54    44-2-45
0.375       46-0-45    47-2-42    38-2-51    -          42-2-47    37-3-51    37-1-53
0.250       49-2-40    49-1-41    39-0-52    47-2-42    -          43-3-45    42-1-48
0.125       50-1-40    55-0-36    54-0-37    51-3-37    45-3-43    -          44-1-46
0.000       52-1-38    55-1-35    45-2-44    53-1-37    48-1-42    46-1-44    -
Overall     291-8      304-8      251-8      289-10     268-9      239-8      240-7
Avg. Score  0.5731     0.5866     0.5480     0.5617     0.5516     0.5349     0.5143

From these experiments, we can conclude that a positive correlation exists between the similarity of subsequent sentences and the response quality, while also showing that sentence similarity may not be the only factor that should be considered. Moreover, subdividing and explicitly grouping sentences by subtopic is beneficial w.r.t. considering the sentence similarity only in a pairwise fashion, which lacks a global vision of the retrieved knowledge.

6. Conclusions and Future Work

In this work, we presented a novel pipelined RAG architecture aimed at selecting a set of relevant sentences for each query and arranging them in a specific order to optimize the quality of the responses generated by an LLM. For this purpose, sentences are first extracted from the top documents retrieved. Then, they are reranked, and the most relevant sentences are organized into clusters by similarity. We proposed different strategies for ordering the clusters and the sentences within the clusters in the input given to the LLM for response generation. To the best of our knowledge, this is the first work investigating sentence clustering and re-ordering to improve the quality of the responses generated by RAG systems. Our empirical assessment is based on a well-known, public framework for conversational search. The results of the experiments show that different sequences of sentences in the LLM prompt significantly impact response quality, despite all methodologies processing identical information from the same set of sentences. Random permutations yield the lowest results, whereas our proposed approach based on sentence clusterization yields superior results. Additionally, we examined whether maximizing the similarity between consecutive sentences in the LLM prompt enhances response quality. While a positive correlation between these factors was observed, it is not the exclusive determinant. Consequently, while we infer that sentence similarity constitutes a pivotal aspect, other contributing factors remain unidentified, warranting further investigation. Moreover, although our experimental evaluation employs a well-known conversational collection, the methodology and results shown in this work are general. They could also be applied to other scenarios, such as ad-hoc search.
In future work, we intend to evaluate the impact of the number of clusters selected by our method for generating the response. Our intuition is that the number of clusters identified for a given query is a proxy of the difficulty of the query itself. Fewer clusters, or even a single large one, should characterize simple and closed queries. In contrast, difficult, multi-faceted queries are possibly characterized by more clusters, each addressing a different facet of the query. This intuition paves the way for extending the evaluation methodology by adopting diversification-based metrics [42], allowing us to understand how well the generated answers cover the query facets and the topical distribution of the clusters.

References

[1] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[2] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, CoRR abs/2311.05232 (2023). doi:10.48550/ARXIV.2311.05232.
[3] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, S. Shi, Siren's song in the AI ocean: A survey on hallucination in large language models, CoRR abs/2309.01219 (2023). doi:10.48550/ARXIV.2309.01219.
[4] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 248:1–248:38. doi:10.1145/3571730.
[5] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, CoRR abs/2307.03172 (2023). doi:10.48550/ARXIV.2307.03172.
[6] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is ChatGPT good at search? Investigating large language models as re-ranking agents, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Association for Computational Linguistics, 2023, pp. 14918–14937. doi:10.18653/v1/2023.emnlp-main.923.
[7] R. Tang, X. Zhang, X. Ma, J. Lin, F. Ture, Found in the middle: Permutation self-consistency improves listwise ranking in large language models, CoRR abs/2310.07712 (2023). doi:10.48550/ARXIV.2310.07712.
[8] P. Owoicho, J. Dalton, M. Aliannejadi, L. Azzopardi, J. R. Trippas, S. Vakulenko, TREC CAsT 2022: Going beyond user ask and system retrieve with initiative and response generation, in: Proceedings of the Thirty-First Text REtrieval Conference (TREC 2022), volume 500-338 of NIST Special Publication, NIST, 2022.
[9] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023, pp. 9802–9822. doi:10.18653/v1/2023.acl-long.546.
[10] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian, H. Wu, J. Wen, H. Wang, Investigating the factual knowledge boundary of large language models with retrieval augmentation, CoRR abs/2307.11019 (2023). doi:10.48550/ARXIV.2307.11019.
[11] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, CoRR abs/2302.13971 (2023). doi:10.48550/ARXIV.2302.13971.
[12] H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, CoRR abs/2307.09288 (2023). doi:10.48550/ARXIV.2307.09288.
[13] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023). doi:10.48550/ARXIV.2303.08774.
[14] F. Xu, W. Shi, E. Choi, RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation, CoRR abs/2310.04408 (2023). doi:10.48550/ARXIV.2310.04408.
[15] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, F. Silvestri, The power of noise: Redefining retrieval for RAG systems, arXiv preprint arXiv:2401.14887 (2024).
[16] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 2002, pp. 311–318. doi:10.3115/1073083.1073135.
[17] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81.
[18] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[19] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations (ICLR 2020), 2020.
[20] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: 1st International Conference on Learning Representations (ICLR 2013), Workshop Track Proceedings, 2013.
[21] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014, pp. 1532–1543. doi:10.3115/v1/d14-1162.
[22] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT 2019, Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67.
[24] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017, pp. 5998–6008.
[26] E. Clark, S. Rijhwani, S. Gehrmann, J. Maynez, R. Aharoni, V. Nikolaev, T. Sellam, A. Siddhant, D. Das, A. P. Parikh, SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation, in: Proceedings of EMNLP 2023, Association for Computational Linguistics, 2023, pp. 9397–9413. doi:10.18653/v1/2023.emnlp-main.584.
[27] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 deep learning track, CoRR abs/2003.07820 (2020).
[28] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track, in: Proceedings of the Twenty-Ninth Text REtrieval Conference (TREC 2020), volume 1266 of NIST Special Publication, NIST, 2020.
[29] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, CoRR abs/2104.08663 (2021).
[30] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, CoRR abs/2305.02156 (2023). doi:10.48550/ARXIV.2305.02156.
[31] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Instruction distillation makes large language models efficient zero-shot rankers, CoRR abs/2311.01555 (2023). doi:10.48550/ARXIV.2311.01555.
[32] R. Pradeep, S. Sharifymoghaddam, J. Lin, RankVicuna: Zero-shot listwise document reranking with open-source large language models, CoRR abs/2309.15088 (2023). doi:10.48550/ARXIV.2309.15088.
[33] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines: SERP-based conversational response generation, ACM Trans. Inf. Syst. 39 (2021) 47:1–47:29. doi:10.1145/3432726.
[34] W. Lajewska, K. Balog, Towards filling the gap in conversational search: From passage retrieval to conversational response generation, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023), ACM, 2023, pp. 5326–5330. doi:10.1145/3583780.3615132.
[35] W. Lajewska, K. Balog, Towards reliable and factual response generation: Detecting unanswerable questions in information-seeking conversations, in: Advances in Information Retrieval, 46th European Conference on Information Retrieval (ECIR 2024), Part III, volume 14610 of Lecture Notes in Computer Science, Springer, 2024, pp. 336–344. doi:10.1007/978-3-031-56063-7_25.
[36] S. Lin, J. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP 2021), Association for Computational Linguistics, 2021, pp. 163–173. doi:10.18653/v1/2021.repl4nlp-1.17.
Go- 23, 2022, Association for Computational Linguistics, harian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, 2022, pp. 48–56. URL: https://aclanthology.org/2022. C. Macdonald, I. Ounis (Eds.), Advances in Information aacl-demo.6. Retrieval - 46th European Conference on Information [41] S. Otmazgin, A. Cattan, Y. Goldberg, Lingmess: Lin- Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, guistically informed multi expert scorers for coref- Proceedings, Part III, volume 14610 of Lecture Notes erence resolution, in: A. Vlachos, I. Augenstein in Computer Science, Springer, 2024, pp. 336–344. URL: (Eds.), Proceedings of the 17th Conference of the https://doi.org/10.1007/978-3-031-56063-7_25. doi:10. European Chapter of the Association for Computa- 1007/978-3-031-56063-7\_25. tional Linguistics, EACL 2023, Dubrovnik, Croatia, [36] S. Lin, J. Yang, J. Lin, In-batch negatives for knowledge May 2-6, 2023, Association for Computational Lin- distillation with tightly-coupled teachers for dense re- guistics, 2023, pp. 2744–2752. URL: https://doi.org/ trieval, in: A. Rogers, I. Calixto, I. Vulic, N. Saphra, 10.18653/v1/2023.eacl-main.202. doi:10.18653/V1/ N. Kassner, O. Camburu, T. Bansal, V. Shwartz (Eds.), 2023.EACL-MAIN.202. [42] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, mation Retrieval, SIGIR ’08, Association for Comput- A. Ashkan, S. Büttcher, I. MacKinnon, Novelty and ing Machinery, New York, NY, USA, 2008, p. 659–666. diversity in information retrieval evaluation, in: Pro- URL: https://doi.org/10.1145/1390334.1390446. doi:10. ceedings of the 31st Annual International ACM SIGIR 1145/1390334.1390446. Conference on Research and Development in Infor-