Improving RAG Systems via Sentence Clustering and Reordering

Marco Alessio1,*, Guglielmo Faggioli2, Nicola Ferro2, Franco Maria Nardini1 and Raffaele Perego1

1 Institute of Information Science and Technologies (ISTI), National Research Council of Italy (CNR), Pisa, Italy
2 Department of Information Engineering (DEI), University of Padua, Padua, Italy

Information Retrieval's Role in RAG Systems (IR-RAG) 2024, July 18, 2024, Washington, DC
* Corresponding author: marco.alessio@isti.cnr.it (M. Alessio); guglielmo.faggioli@unipd.it (G. Faggioli); nicola.ferro@unipd.it (N. Ferro); francomaria.nardini@isti.cnr.it (F. M. Nardini); raffaele.perego@isti.cnr.it (R. Perego)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Large Language Models (LLMs) have gained noteworthy importance and attention across different domains and fields in recent years. Information Retrieval (IR) is one of the domains they have impacted the most, as witnessed by the recent increase in the number of IR systems incorporating generative models. Specifically, Retrieval Augmented Generation (RAG) is the emerging paradigm that integrates existing knowledge from large-scale document corpora into the generation process, enabling the model to generate more coherent, contextually relevant, and accurate text across various tasks, including summarization, question answering, and dialogue systems. Recent studies have highlighted the significant positional dependence exhibited by RAG systems: the placement of information within the LLM input prompt drastically affects the generated output. We ground our study on this property by investigating alternative strategies for ordering sentences within the LLM prompt to improve the average quality of the responses generated in dialogues between the user and the conversational system. We propose the architecture of an end-to-end RAG-based conversational assistant and empirically evaluate our strategies using the TREC CAsT 2022 collection. Our experiments highlight significant differences between distinct arrangement strategies. By employing an evaluation methodology based on RankVicuna, we show that our best approach achieves improvements up to 54% in terms of overall response quality over baseline methods.

Keywords
Retrieval Augmented Generation, Conversational Search, Positional Bias, Arrangement Strategy

1. Introduction

Retrieval Augmented Generation (RAG) is an emerging paradigm in the field of Artificial Intelligence (AI) to enhance the accuracy and reliability of generative models by exploiting external data sources. In recent years, RAG has gained noteworthy importance and attention across different domains and fields [1], as it combines the strengths of Information Retrieval (IR) systems and generative models to overcome each other's limitations.
RAG can improve the output of a generative model in several ways. First, it allows the generation process to be grounded on information from trusted knowledge sources incorporated in the provided prompt, thus avoiding, or at least mitigating, the well-known Large Language Model (LLM) hallucination problem, i.e., the model generating content that is not factually true or that does not concern the prompted text [2, 3, 4]. Second, RAG allows for continuous knowledge updates and the integration of domain-specific information: the LLM can successfully respond to facts and topics not covered in its training data; moreover, it is easily adapted to different scenarios and contexts, without retraining or fine-tuning the entire model using datasets that might be unavailable or limited in scope or size. Finally, grounding the generation process on external knowledge incorporated in the input permits linking the output to verifiable external documents, thus enhancing trustworthiness and transparency [2, 3, 4].

Current RAG systems, however, suffer from some drawbacks highlighted in the literature. One of these issues originates from the notable positional sensitivity shown by LLMs: the placement of information within the input prompt significantly impacts the resulting output. Previous research [5, 6, 7] has highlighted biases towards "primacy" and "recency", suggesting that generative models tend to prioritize information placed at the beginning or end of the input while neglecting the central portion.

In this paper, we advance over previous studies by investigating the positional bias in the context of RAG-based conversational systems. Specifically, we propose a novel strategy for arranging sentences within the input prompt of the LLM to improve the average quality of the generated responses over simpler methods. Our approach is based on the intuition that, as coherent, fluent, and well-structured text is a critical factor for successful communication between human beings, the same should also apply to LLMs: among all the possible arrangements of the input, those placing sentences with similar meaning closer in the LLM prompt should generate, on average, better quality output. Therefore, we propose an end-to-end RAG architecture to test our hypothesis. The components of this architecture allow us to precisely identify which sentences are likely useful for answering user queries. To this end, we cluster sentences by their similarity, and we define alternative strategies for ordering them both inter- and intra-cluster. In this way, we can study the effect of these alternatives for prompting the generative LLM on the generated response. To our knowledge, this is the first work that explicitly considers this aspect and allows us to tune in a principled way the ordering of the input sentences provided to the generative component of a RAG system. We compare our proposed approach against competitive baselines that represent the solutions employed by current RAG systems. We experimentally evaluate the performance of our proposed approach using the TREC Conversational Assistance Track (CAsT) 2022 collection [8], which allows us to compare the results that different arrangement strategies can achieve in a widely accepted Conversational Search (CS) scenario. Results highlight remarkable differences among the tested sentence placement strategies, with improvements up to 8.66% w.r.t. the best baseline and 54.94% w.r.t. random ordering.

The remainder of this work is organized as follows: Section 2 surveys the current state of the art about RAG systems and the quality evaluation of their responses. Section 3 details the architecture of our RAG system. Section 4 and Section 5 detail the results of an experimental analysis, which aims to highlight how the ordering of clusters and sentences affects the quality of the generated response.
Finally, Section 6 draws some conclusions and outlines future directions and extensions of our research.

2. Related Work

In the following, we survey the main works dealing with LLM positional dependencies and the difficulties of RAG systems in reconciling internal and external knowledge. Then, we analyze the challenges related to the evaluation of the quality of RAG responses and to the use of an "LLM-as-a-judge".

2.1. Retrieval Augmented Generation

RAG enhances LLMs by retrieving additional information from an external knowledge source, enabling them to successfully answer queries beyond the scope of the training data. At the same time, RAG mitigates the hallucination problem, i.e., the generation of factually incorrect text, by referencing the provided external knowledge.

The RAG paradigm is organized into two main stages: retrieval and generation. Upon receiving a query from the user, the relevant information is retrieved from an external knowledge source. This task is undertaken by a standard IR pipeline that outputs a ranked list of documents. Afterwards, in the generation phase, the LLM synthesizes the response to the user query using the information carried by the selected documents.

Despite its clear advantages, RAG has drawbacks and limitations, which spark several challenges. First, RAG systems employ the external knowledge as their main source of information, disregarding the internal knowledge memorized within the LLM [9, 10]. This, in turn, may determine a decrease in the quality of the generated output when the provided content is not of high quality [10]. It is not uncommon for RAG to obtain worse outputs w.r.t. what the LLM can achieve in the closed-book scenario, i.e., without supplying retrieved results [10]. In this line, it has been observed that the LLM produces better results without injecting external knowledge when the topic popularity is very high [9]. In general, state-of-the-art LLMs provide good quality responses for a wide range of questions but require assistance from an IR system when the internal knowledge of the model lacks information about the current topic. This phenomenon is likely to occur if the topic is not very popular, requires exceptional expertise, or when scaling the number of parameters of the generative model produces little to no effect [9]. Another challenge lies in the significant positional dependence [5, 6, 7] exhibited by LLMs, whereby the placement of information within the input prompt drastically affects the generated output. Prior research [5] has identified "primacy" and "recency" biases, indicating the tendency of generative models to focus on information positioned either at the beginning or at the end of the input while disregarding the central part. Therefore, performance degrades significantly when LLMs should rely on information in the middle of their input context, showing a characteristic U-shaped performance curve [5]. This, in turn, means that most state-of-the-art generative models do not use their longer contexts more effectively than their smaller and earlier counterparts. These phenomena can be observed both in open-source models, e.g., Llama [11, 12] by Meta, and closed-source ones, e.g., GPT-4 [13] by OpenAI.

It is not advisable to directly input all the retrieved information to the LLM for generating the response. Redundant information and very long contextual data can interfere with the generation quality, leading to repetitive, disjointed, or incoherent outputs [1]. Therefore, the retrieved content is typically further processed before being given in input to the LLM [14]. A recent work in this direction systematically examines the retrieval strategy of RAG systems [15]. The authors consider multiple retrieval factors affecting the generation process, such as the relevance of the passages in the prompt context, their position, and their number. One counter-intuitive finding is that the retriever's highest-scoring documents that are not directly relevant to the query, e.g., that do not contain the answer, negatively impact the effectiveness of the LLM. Moreover, the authors discover that adding random documents to the prompt improves the LLM accuracy by up to 35%.

In this work, we rely on the intuition that the use of coherent, fluent, and well-structured inputs can improve RAG, and we propose an end-to-end architecture for selecting and structuring the external information included in the LLM prompt for response generation.
2.2. Quality Evaluation

Another line of research concerns how to evaluate the overall quality of the generation output. Despite human assessment providing the most accurate and reliable measure for evaluating model performance, its high time and cost requirements severely limit its application. Therefore, there exists an ever-increasing demand for automated evaluation techniques that consistently align with human judgements while offering enhanced efficiency and cost-effectiveness.

In this paper, we focus on text-based generative models. Classical automatic evaluation metrics, such as BLEU [16], ROUGE [17], and METEOR [18], are designed to quantify the degree of similarity between a candidate text and one or more reference texts by assessing their n-gram overlap. Their simplicity and explainability, along with a good correlation with human judgements, make these metrics widely used as baselines. However, these metrics exhibit several limitations [19]: firstly, they cannot account for lexical diversity; secondly, they penalize variations in the semantic ordering of words; thirdly, they struggle to capture and match paraphrases effectively; lastly, they inadequately account for distant dependencies within the text. With the advent of word embeddings [20, 21] and neural models [22, 23, 11, 12, 24] based on Transformers [25], new learned metrics [19, 26] have been developed. For example, BERTScore [19] can capture the semantic similarity between the candidate and reference texts by employing the contextual embeddings generated by an encoder model, such as BERT [22].
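As an illustration of how such a learned metric is computed in practice, the following minimal sketch uses the bert-score Python package; the candidate and reference sentences are invented placeholders, not examples from our experiments.

```python
# Minimal sketch: computing BERTScore with the `bert-score` package.
# The texts below are illustrative placeholders.
from bert_score import score

candidates = ["The cat sat quietly on the mat."]
references = ["A cat was sitting on the mat."]

# P, R, F1 are tensors with one entry per candidate/reference pair;
# they measure precision, recall, and F1 over contextual token embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0]:.4f}")
```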
In recent years, the rapid advancements of LLMs, showing remarkable performance across many tasks, have attracted considerable interest in their potential application also as annotators and evaluators. Due to their training using Reinforcement Learning from Human Feedback (RLHF), these models demonstrate significant human alignment. Much research has investigated leveraging state-of-the-art LLMs to automatically produce assessments serving as proxies for human judgments, a paradigm known as "LLM-as-a-judge". For example, Zheng et al. [24] assessed the quality of conversations with various LLMs, both open and closed source, employing GPT-4 [13] as the judge. They experimented with various prompts and different approaches, such as single-answer grading and pairwise comparisons, both between responses and against a reference text. GPT-3.5 Turbo and GPT-4 [13] have been employed as listwise rerankers [6, 7] for the TREC Deep Learning 2019 and 2020 [27, 28] and BEIR [29] experimental collections, obtaining state-of-the-art performance [6]. The same LLMs have also been employed as teacher models to fine-tune smaller open-source student models, such as Llama and Vicuna [30, 31] (i.e., RankVicuna [32]).

In this work, we rely on state-of-the-art assessment methods and evaluate the quality of the responses generated by the different methods using RankVicuna [32].

[Figure 1 here: a block diagram of the pipeline. The conversation and the current query undergo co-reference resolution and are submitted to the retrieval system; the retrieved passages go through sentence splitting, near-duplicate removal, sentence reranking, and top-k selection; the selected sentences are clustered, reordered according to the cluster-order and within-cluster-order strategies, and assembled into the prompt for the large language model that generates the response.]
Figure 1: Architecture of our proposed RAG system.

3. The Proposed RAG Architecture

Generative models exhibit strong biases towards information positioned at the start or at the end of the input while disregarding the middle part [5]. This phenomenon motivates our research effort to determine how the order of the input sentences provided to a RAG-based conversational system affects the quality of the generated output and, in turn, the optimal ordering strategy to achieve the best response. This section describes each method and all the variations considered in our experiments.

The architecture of our proposed RAG system is illustrated in Figure 1. It includes an IR pipeline, which retrieves the top-k documents D = {d1, d2, ..., dk} in response to each user utterance q. The retrieved documents are then processed by additional components responsible for splitting them into sentences, identifying the most relevant sentences, clustering such sentences based on their semantic similarity, and ordering them according to the various strategies analyzed. Finally, the selected, re-ordered sentences are provided as input to the LLM for response generation. These components are the focus of our research. Their functionalities are detailed in the remainder of this section.

3.1. Document Pre-processing and Splitting

As observed in the literature [33, 34], the entire text of a relevant document rarely contains meaningful knowledge to satisfy the user information need expressed by a query q. In most cases, only one or a few portions of the document are relevant to the query, while the remaining parts contain irrelevant information. The proposed architecture aims to precisely identify the key information in the retrieved documents, i.e., the sentences, to reduce the noise in the prompt used for response generation.

Hereinafter, we consider sentences in the documents as the atomic units of information. Our pipeline, illustrated in Figure 1, works as follows. First, for each query q we consider only the top-k documents {d1, d2, ..., dk} retrieved by the IR system. Then, a state-of-the-art co-reference resolution model is applied to all documents to replace pronouns and other generic terms within a sentence with the fully specified entity mentioned in a previous sentence. This allows us to remove the contextual dependencies among sentences in a document so that they can be considered self-explanatory. The third step splits each document di into a sequence of sentences {si,1, si,2, ..., si,ni}. Afterwards, near-duplicate removal is applied to the sentences originating from all documents by discarding sentences with a Jaccard similarity ≥ 0.9 between their Bag-of-Words (BoW) representations. This step is particularly important in our setting because the CAsT 2022 corpus contains a multitude of near-duplicate documents; in particular, the same Wikipedia article is often replicated in documents retrieved from the KILT and MS-MARCO collections.
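The following sketch illustrates the near-duplicate removal step under our reading of the description above; the whitespace tokenization is an assumption, as the exact Bag-of-Words construction is not specified.

```python
# Sketch of near-duplicate sentence removal: a sentence is discarded when its
# Bag-of-Words set has Jaccard similarity >= 0.9 with an already kept one.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def remove_near_duplicates(sentences: list[str], threshold: float = 0.9) -> list[str]:
    kept, kept_bows = [], []
    for sentence in sentences:
        bow = set(sentence.lower().split())  # BoW as a set of tokens (assumption)
        if all(jaccard(bow, other) < threshold for other in kept_bows):
            kept.append(sentence)
            kept_bows.append(bow)
    return kept
```

The pairwise scan is quadratic in the number of sentences, which is acceptable here since only the sentences of the top-k retrieved documents are compared.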
3.2. Sentence Selection

After the first pre-processing phase, we obtain a sentence candidate set for each query to be included in the LLM prompt of our RAG system (see Figure 1). Since the cardinality of this set can be large and not all the sentences are useful for answering the query, we employ the BERT-based cross-encoder answer-in-the-sentence classifier developed by Lajewska and Balog [35] (the model named "squad_snippets_unanswerable", available at https://iai.group/downloads/emnlp2023-answerability_prediction) to rank the candidate sentences according to their predicted usefulness to (at least partially) answer the query, and we retain the top-n ranked sentences, discarding the remaining ones. As a possible limitation, please note that the model by Lajewska and Balog [35] has been trained on the queries and passages used in our experiments. Therefore, it is very likely that the model performs significantly better on our data w.r.t. any other model, ensuring that the top-ranked sentences are indeed relevant to the query. Even though such a model is not available in a real practical scenario, this choice is justified by our research effort being focused exclusively on comparing the ordering strategies for sentences in the LLM input rather than on the absolute results achievable by our RAG system.

3.3. Sentence Clustering and Ordering

The previous steps of the pipeline constrain the number of sentences per query while increasing their expected utility in answering the query. Furthermore, they allow us to control other noise sources, such as the number or the variable length of the retrieved documents. Therefore, we can assess how the positional bias affects the generation process. We highlight again that the positional bias of LLMs has already been observed in prior research [5, 6, 7]. However, it has been considered exclusively as a limitation of LLMs and RAG systems. Our research moves a step forward by investigating the best ordering strategy to maximize, on average, the quality of the generated responses over a testing query set Q. We believe that logically organized text, where sentences with akin meanings are positioned closer in the LLM prompt, should, on average, yield superior output quality. Consequently, our sentence ordering strategies exploit the similarities among the sentences selected by the sentence selection step. To measure semantic inter-sentence similarity, we resort to the contextualized embeddings generated with the tct-colbert model [36] (https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco). We generate the representation of the n selected sentences for each query and measure their pair-wise cosine similarity. Then, we progressively aggregate the most similar sentences by employing a hierarchical clustering algorithm. The maximum value of the Silhouette statistic is used as the criterion to determine the optimal clustering among all the possible ones. As a result, for each query q ∈ Q, the top-n sentences are grouped into a variable number Nc ≥ 1 of clusters, each composed of one or more sentences with similar semantic meaning. To devise different strategies for ordering the input sentences, we leverage the above clustering, which allows us to study the impact of sentence placement variations occurring both across and within clusters.
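One way to implement this step is sketched below, assuming average-linkage agglomerative clustering over cosine distances; the text only states that a hierarchical algorithm with Silhouette-based model selection is used, so the linkage criterion is our assumption.

```python
# Sketch of the clustering step: the n sentence embeddings are clustered
# hierarchically, and the cut maximizing the Silhouette statistic is kept.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import silhouette_score

def cluster_sentences(embeddings: np.ndarray) -> np.ndarray:
    """Return one cluster label per row of the (n x d) embedding matrix."""
    n = embeddings.shape[0]
    tree = linkage(embeddings, method="average", metric="cosine")
    best_labels, best_score = np.ones(n, dtype=int), -1.0  # fallback: one cluster
    for n_clusters in range(2, n):  # Silhouette needs 2 <= clusters <= n - 1
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

The single-cluster fallback mirrors the Nc ≥ 1 behaviour described above: when no multi-cluster cut is usable, all sentences end up in one cluster.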
More formally, given a query, the set S of the n previously selected sentences, and the prompt p, we aim to find the ordering ord* of S such that:

$$ord^{*} = \operatorname*{argmax}_{ord} \sum_{q \in Q} s\big(q,\, LLM(p, q, ord(S))\big),$$

where ord(S) is a sentence ordering strategy that returns an ordering of the sentences in S, LLM(p, q, ord(S)) is the response generated by the LLM for prompt p, query q, and sentence ordering ord(S), and, finally, s(q, r) is a scoring function evaluating the perceived quality of the generated response r = LLM(p, q, ord(S)) for query q.

The order of the clusters and the order of the sentences within the same cluster uniquely determine the possible global orderings of the n sentences we consider for inputting the LLM. Our experimental assessment will evaluate six different ordering strategies for placing the clusters of sentences in the input, and four different methods for ordering sentences within the same cluster. Cluster placements consider different aspects, such as the clusters' cardinality and their similarity to the query. The orderings tested include the random one and those obtained by decreasing/increasing the value of each aspect. Finally, the U-shaped order suggested in [5] is also tested. Regarding the ordering within clusters, we consider random order, order by reranker score, visiting order, and the clustering aggregation order.
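Operationally, the objective above amounts to comparing candidate ordering strategies by the total quality of the responses they induce over the query set. The sketch below makes this reading explicit; the generator and scorer are passed in as callables and are hypothetical stand-ins for the Vicuna-based generation and the RankVicuna-based scoring used later in the paper.

```python
# Sketch of the optimization objective: pick, among candidate ordering
# strategies, the one maximizing the total response quality over Q.
# `generate(q, ordered_sentences)` and `score(q, response)` stand in for
# LLM(p, q, ord(S)) and s(q, r), respectively.
def best_ordering_strategy(strategies, queries, sentences_for, generate, score):
    def total_quality(ord_fn):
        return sum(
            score(q, generate(q, ord_fn(sentences_for(q))))  # s(q, LLM(p, q, ord(S)))
            for q in queries
        )
    return max(strategies, key=total_quality)
```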
4. Experimental Evaluation

We can now formulate the research questions we aim to answer with our experimental framework.

Research Questions. Given the sentence selection and clustering steps discussed above, the two main aspects to consider for defining our ordering strategies ord(·) are the order of placement in the LLM prompt of the clusters and of the sentences within the same cluster. They uniquely determine the global ordering ord(·) of the top-n sentences given in input to the LLM for response generation. Our research questions assess which is the best solution among the alternatives considered. Specifically,

RQ1 What is the best cluster ordering strategy?

RQ2 What is the best ordering strategy for sentences within the same cluster?

RQ3 Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?

Experimental Settings. We experiment with the TREC CAsT 2022 dataset, a standard experimental collection for CS [8]. This choice is due to prior research having released additional datasets, models, and human judgments for this benchmark [34, 35]. The corpus is composed of three document collections, MS-MARCO v2 [37], KILT [38], and Washington Post v4, which are subdivided into 106M short documents. CAsT 2022 includes 18 information needs (topics) and 205 user utterances (queries), with an average of 11.39 user utterances per topic. The number of utterances for which relevance judgements are provided is 163.

For our experiments, we employ as the output of the retrieval pipeline the best-performing run originally submitted to TREC CAsT 2022 [39] (the run identified as "udinfo_mi_b2021" from the "udel_fang" group, University of Delaware, USA). This allows us to focus exclusively on the subsequent steps of our pipeline. In all our experiments, we consider only the top-20 retrieved documents, leaving the investigation of the implications of this choice and of possible alternatives as future work. To provide meaningful results, all queries where Precision@20 < 0.2, that is, having at most 3 relevant passages in the top-20 results, are discarded, ensuring that enough relevant information is retrieved to answer the considered queries successfully (the number of queries considered in these experiments is 115 out of the 163 evaluated in the official relevance judgments).

Furthermore, in the steps of the pipeline where the query text is needed, i.e., sentence ranking and response generation, we employed the manually rewritten text for every query. This allows us to account for the possible bias introduced by different query rewriting approaches. Future developments will investigate the relationship between query rewriting approaches and RAG solutions.

For co-reference resolution at the document level, i.e., removing co-references across different sentences in the "document processing" step, we use the "F-Coref" model [40] (https://huggingface.co/biu-nlp/f-coref), based on the "LingMess" architecture [41]. After this step, we use the well-known spaCy Python library to divide each document into a sequence of independent sentences.
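The sketch below shows how these two pre-processing tools can be invoked; the fastcoref calls follow the package's published interface, while turning the predicted mention clusters into fully resolved text (replacing each pronoun with its fully specified antecedent) requires additional span-rewriting logic that we only indicate.

```python
# Sketch of the pre-processing tools named above: F-Coref for co-reference
# resolution and spaCy for sentence splitting.
import spacy
from fastcoref import FCoref

nlp = spacy.load("en_core_web_sm")   # any spaCy English pipeline with a parser
coref_model = FCoref()               # the F-Coref model of Otmazgin et al. [40]

def split_into_sentences(document: str) -> list[str]:
    # Predict co-reference clusters: lists of co-referring mention strings.
    preds = coref_model.predict(texts=[document])
    clusters = preds[0].get_clusters()
    # Rewriting each later mention with the fully specified entity of its
    # cluster is omitted here; `clusters` carries the needed span information.
    return [sent.text.strip() for sent in nlp(document).sents]
```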
In the following, we report two different metrics for each comparison. The former is the average score of every approach when assessing all 10 random permutations using RankVicuna. The latter, instead, is a pairwise metric, assessing the number of queries for which the first approach obtains a higher/the same/a lower score w.r.t. the other one. This information should better highlight the differences and provide a more comprehensive view than a single average value.

Response Generation. For the response generation, we employ Vicuna 7B [24] (https://huggingface.co/lmsys/vicuna-7b-v1.5), an LLM based on Llama 2 [11, 12] fine-tuned on 125K user conversations with ChatGPT gathered using public APIs from the ShareGPT.com website.

Quality Evaluation. To evaluate the quality of the generated responses, we employ RankVicuna [32] to perform a listwise ranking of all the responses being compared. To mitigate the positional bias intrinsic to RankVicuna, we assess 10 different random permutations of the same responses, averaging the results obtained. This is a reasonable trade-off between evaluation accuracy and the computational runtime required. For each assessment, we assign (N + 1 − i)/N points to the i-th ranked response, where 1 ≤ i ≤ N and N is the number of responses being compared. Furthermore, we also evaluate the number of wins and ties between the pairs of responses considered. Whenever a valid judgment from the LLM cannot be determined, the entire comparison is discarded from the evaluation.
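The following sketch mirrors this protocol; judge_rank is a hypothetical stand-in for a RankVicuna call that returns the ranked positions of the presented responses, or None when no valid judgment can be parsed from its output.

```python
# Sketch of the evaluation protocol: the i-th ranked of N responses earns
# (N + 1 - i) / N points, and scores are averaged over several random
# presentation orders to mitigate the judge's own positional bias.
import random

def average_scores(responses, judge_rank, permutations=10, seed=0):
    rng = random.Random(seed)
    n = len(responses)
    totals, valid = [0.0] * n, 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # random presentation order shown to the judge
        ranking = judge_rank([responses[i] for i in order])
        if ranking is None:  # unparsable judgment: discard the comparison
            continue
        for rank, position in enumerate(ranking, start=1):
            totals[order[position]] += (n + 1 - rank) / n
        valid += 1
    return [t / valid for t in totals] if valid else None
```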
4.1. RQ1: Order of Clusters

For the first experiment, we evaluate the effects of different orderings of the clusters while keeping the order of the sentences within the same cluster (based on the clustering aggregation order) fixed. We test six different strategies for ordering clusters: clusters selected in random order (strategy A); clusters selected in descending order of cardinality (strategy B); clusters selected in ascending order of similarity with the query (strategy C); clusters selected in descending order of similarity with the query (strategy D); clusters selected in descending order of similarity with the query using a ping-pong layout from top to bottom (strategy E); clusters selected in descending order of similarity with the query using a ping-pong layout from bottom to top (strategy F). The similarity between a cluster C and the query is defined as the maximum cosine similarity between the query and any sentence si,j ∈ C belonging to the cluster. In the top-to-bottom ping-pong layout, the clusters are placed first, last, second, second-to-last, third, and so on, e.g., [A, B, C, D, E] becomes [A, C, E, D, B]; in the bottom-to-top layout, they are placed last, first, second-to-last, second, third-to-last, and so on, e.g., [A, B, C, D, E] becomes [B, D, E, C, A]. A sketch of strategies D, E, and F is given below, after Table 1.

Table 1
Comparisons between the six approaches proposed for RQ1: "What is the best ordering strategy for clusters?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results (pairwise wins-ties and average score) are reported.

            A vs.     B vs.     C vs.     D vs.     E vs.     F vs.
A           -         56-4-51   57-2-52   62-4-45   51-1-59   55-2-54
B           51-4-56   -         47-8-56   61-4-46   52-5-54   52-2-57
C           52-2-57   56-8-47   -         58-3-50   55-0-56   59-4-48
D           45-4-62   46-4-61   50-3-58   -         44-1-66   47-1-63
E           59-1-51   54-5-52   56-0-55   66-1-44   -         57-5-49
F           54-2-55   57-2-52   48-4-59   63-1-47   49-5-57   -
Overall     261-13    269-23    258-17    310-13    251-12    270-14
Avg. Score  0.5723    0.5844    0.5510    0.6219    0.5736    0.5969
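The sketch below illustrates strategy D and the two ping-pong layouts built on top of it (strategies E and F); representing each cluster as a list of sentence embeddings is our assumption.

```python
# Sketch of cluster placement: strategy D sorts clusters by descending query
# similarity (maximum cosine similarity between the query embedding and any
# sentence embedding in the cluster); strategies E and F then lay the sorted
# clusters out in ping-pong order.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def order_by_query_similarity(clusters, query_emb):  # strategy D
    # `clusters` is a list of clusters, each a list of sentence embeddings.
    similarity = lambda cluster: max(cosine(query_emb, s) for s in cluster)
    return sorted(clusters, key=similarity, reverse=True)

def ping_pong_top_down(items):  # strategy E: [A, B, C, D, E] -> [A, C, E, D, B]
    out, lo, hi = [None] * len(items), 0, len(items) - 1
    for i, item in enumerate(items):
        if i % 2 == 0:
            out[lo], lo = item, lo + 1  # even-indexed items fill from the front
        else:
            out[hi], hi = item, hi - 1  # odd-indexed items fill from the back
    return out

def ping_pong_bottom_up(items):  # strategy F: [A, B, C, D, E] -> [B, D, E, C, A]
    return ping_pong_top_down(items)[::-1]
```

Both ping-pong functions reproduce the worked examples given above, placing the least query-similar clusters towards the middle of the prompt.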
As shown in Table 1, sorting the clusters in descending order of their similarity with the query (strategy D) is the clear winner in this comparison, in terms of both score and pairwise wins. This approach achieves 18.77%, 15.24%, 20.16%, 23.51%, and 14.81% more overall pairwise wins than strategies A, B, C, E, and F, respectively. These figures suggest that the LLM used to generate the responses exhibits a much stronger "primacy" than "recency" bias, as highlighted by option C being overall the worst performing among those considered. Methods E and F, instead, were designed to place the least important clusters towards the center, since LLMs struggle to effectively utilize the information in the middle of their prompt. However, we can see that both approaches are ineffective: we suspect this is due to the length of the input text being much smaller than the maximum context window of the model. Different results may be observed when varying the amount of input data provided to the LLM for generation.

4.2. RQ2: Order of Sentences within the same Cluster

In this second experiment, we evaluate different sorting schemes for sentences within the same cluster, keeping the order of the clusters fixed at the best strategy determined in RQ1. We test four different strategies for ordering sentences within the same cluster: sentences selected in random order (strategy A); sentences selected in descending order of reranker score (strategy B); sentences selected by visiting order, i.e., sorted based on the order in which they appear when sequentially scanning the set of top-k retrieved documents (strategy C); sentences selected by aggregation order (strategy D).

Table 2
Comparisons between the four approaches proposed for RQ2: "What is the best ordering strategy for sentences within the same cluster?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results are reported.

            A vs.     B vs.     C vs.     D vs.
A           -         53-3-59   48-8-59   55-4-56
B           59-3-53   -         54-3-58   60-7-48
C           59-8-48   58-3-54   -         57-6-52
D           56-4-55   48-7-60   52-6-57   -
Overall     174-15    159-13    154-17    172-17
Avg. Score  0.6281    0.6143    0.6124    0.6451

As shown in Table 2, the best results are achieved by two different strategies: option D, sorting sentences within the same cluster based on the aggregation order, and, interestingly, option A, randomly sorting the sentences. Both strategies are preferable to the other two methods considered, with option D achieving 8.18% and 11.69% more overall pairwise wins w.r.t. options B and C, respectively. We note, however, that the differences in performance among the various strategies are not large, as the sentences are grouped in the clusters by their similarity. The LLM response appears to be more impacted by the order of the clusters than by the order of the sentences within each cluster.

4.3. RQ3: Comparison with Baselines

Our last experiment investigates whether our proposed approach is beneficial in enhancing the overall effectiveness of the RAG system w.r.t. four simpler baseline methods that may be used in practice by current state-of-the-art RAG systems. We test five different strategies: i) the top-5 retrieved documents (A); ii) the top-40 sentences taken in random order (B); iii) the top-40 sentences taken in descending order of re-ranker score (C); iv) the top-40 sentences selected by visiting order (D); v) the best clusterization-based approach determined from RQ1 and RQ2 (CL).

Table 3
Comparisons between the five approaches considered for RQ3: "Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results are reported.

            A vs.     B vs.     C vs.     D vs.     CL vs.
A           -         45-4-62   54-1-56   54-0-57   66-2-43
B           62-4-45   -         71-1-39   64-8-39   67-5-39
C           56-1-54   39-1-71   -         50-4-57   59-3-49
D           57-0-54   39-8-64   57-4-50   -         59-3-49
CL          43-2-66   39-5-67   49-3-59   49-3-59   -
Overall     218-7     162-18    231-9     217-15    251-13
Avg. Score  0.5882    0.5533    0.6177    0.6016    0.6392

The results obtained are shown in Table 3. The clusterization-based approach demonstrates superior performance, resulting as the best strategy in this comparison: it achieves 15.14%, 54.94%, 8.66%, and 15.67% more overall pairwise wins than baselines A, B, C, and D, respectively. Among the methods considered in this work, randomly sorting the top-h sentences is by far the worst-performing approach. This, in turn, supports our starting intuition about coherent, fluent, and well-structured text being a critical factor for LLMs to generate high-quality output.

5. Additional Experiments

The clusterization-based ordering strategy proposed in this work is designed to position sentences sharing analogous semantic content close together in the LLM prompt. Given the results obtained in Section 4.3, we have shown its effectiveness in our experimental settings. Nevertheless, in this section we answer two additional research questions to gain further insights. Specifically,

RQ4 Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?

RQ5 Is the proposed clusterization strategy more effective than directly optimising the similarity of subsequent sentences?

Experimental Settings. We heuristically determine the two orderings ord+ and ord−, which maximize and minimize, respectively, the overall similarity between subsequent sentences. Let sum+ and sum− be the sums of the similarities between subsequent sentences for ord+ and ord−, respectively. The similarity sim(p) for a sentence permutation p of h sentences is given by the following equation, where min-max normalization is used and si is the embedding representation of the i-th sentence in p:

$$sim(p) = \frac{\left(\sum_{i=2}^{h} \cos(s_{i-1}, s_i)\right) - sum^{-}}{sum^{+} - sum^{-}}$$

In our experiments, for each query, we generate one million random permutations; then, we determine the permutations whose similarity is closest to each of the following thresholds: 0.125, 0.250, 0.375, 0.500, and 0.625. We decided to stop at 0.625 because higher values are unlikely to be observed, given that the average similarity of these permutations is 0.3433 with a standard deviation of 0.0530.
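A sketch of how sim(p) can be computed, and of how the permutation closest to a target threshold can be selected among random candidates, follows; the embeddings are assumed to be given in permutation order, and the extremes sum+ and sum− are taken as already determined heuristically.

```python
# Sketch of the normalized adjacent-sentence similarity sim(p). The heuristic
# extremes sum_plus and sum_minus (obtained from ord+ and ord-) are given.
import numpy as np

def adjacent_similarity(embs: list[np.ndarray]) -> float:
    # Sum of cosine similarities between each pair of subsequent sentences.
    return sum(
        float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in zip(embs, embs[1:])
    )

def sim(embs: list[np.ndarray], sum_minus: float, sum_plus: float) -> float:
    # Min-max normalization against the heuristic extremes.
    return (adjacent_similarity(embs) - sum_minus) / (sum_plus - sum_minus)

def closest_to_threshold(permutations, sum_minus, sum_plus, target):
    # Among candidate permutations (each a list of embeddings), pick the one
    # whose normalized similarity is closest to the target threshold.
    return min(
        permutations,
        key=lambda p: abs(sim(p, sum_minus, sum_plus) - target),
    )
```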
Results. We determine how the quality of the generated response is influenced when varying the similarity between subsequent sentences at the predefined thresholds, as shown in Table 4. It is interesting to note that the highest results are obtained by the permutations with 0.625 normalised similarity, rather than by 1.000, which corresponds to the ordering maximising the similarity between subsequent sentences (ord+). This method achieves 4.47% and 26.67% more pairwise wins w.r.t. ord+ and ord−, respectively. To answer RQ5, we assess the responses generated using the best clustering strategy against the approach defined above. The average scores are 0.7652 and 0.7348, respectively, while the pairwise comparison yields 38 wins, 46 ties, and 31 losses.

Table 4
Comparisons between the seven approaches proposed for RQ4: "Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?". In the top half, each cell reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label. In the bottom half, the overall results are reported.

            1.000 vs.  0.625 vs.  0.500 vs.  0.375 vs.  0.250 vs.  0.125 vs.  0.000 vs.
1.000       -          46-2-43    38-2-51    45-0-46    40-2-49    40-1-50    38-1-52
0.625       43-2-46    -          37-2-52    42-2-47    41-1-49    36-0-55    35-1-55
0.500       51-2-38    52-2-37    -          51-2-38    52-0-39    37-0-54    44-2-45
0.375       46-0-45    47-2-42    38-2-51    -          42-2-47    37-3-51    37-1-53
0.250       49-2-40    49-1-41    39-0-52    47-2-42    -          43-3-45    42-1-48
0.125       50-1-40    55-0-36    54-0-37    51-3-37    45-3-43    -          44-1-46
0.000       52-1-38    55-1-35    45-2-44    53-1-37    48-1-42    46-1-44    -
Overall     291-8      304-8      251-8      289-10     268-9      239-8      240-7
Avg. Score  0.5731     0.5866     0.5480     0.5617     0.5516     0.5349     0.5143

From these experiments, we can conclude that a positive correlation exists between the similarity of subsequent sentences and the response quality, while also showing that sentence similarity may not be the only factor that should be considered. Moreover, subdividing and explicitly grouping sentences by subtopic is beneficial w.r.t. considering the sentence similarity only in a pairwise fashion, which lacks a global vision of the retrieved knowledge.

6. Conclusions and Future Work

In this work, we presented a novel pipelined RAG architecture aimed at selecting a set of relevant sentences for each query and arranging them in a specific order to optimize the quality of the responses generated by an LLM. For this purpose, sentences are first extracted from the top documents retrieved. Then, they are reranked, and the most relevant sentences are organized into clusters by similarity. We proposed different strategies for ordering the clusters and the sentences within the clusters in the input given to the LLM for response generation. To the best of our knowledge, this is the first work investigating sentence clustering and re-ordering to improve the quality of the responses generated by RAG systems. Our empirical assessment is based on a well-known, public framework for conversational search. The results of the experiments show that different sequences of sentences in the LLM prompt significantly impact response quality, despite all methodologies processing identical information from the same set of sentences. Random permutations yield the lowest results, whereas our proposed approach based on sentence clusterization yields superior results. Additionally, we examined whether maximizing the similarity between consecutive sentences in the LLM prompt enhances response quality. While a positive correlation between these factors was observed, it is not the exclusive determinant. Consequently, while we infer that sentence similarity constitutes a pivotal aspect, other contributing factors remain unidentified, warranting further investigation. Moreover, although our experimental evaluation employs a well-known conversational collection, the methodology and results shown in this work are general. They could also be applied to other scenarios, such as ad-hoc search.
In future work, we intend to evaluate the impact of the number of clusters selected by our method for generating the response. Our intuition is that the number of clusters identified for a given query is a proxy of the difficulty of the query itself. Fewer clusters, or even a single large one, should characterize simple and closed queries. In contrast, difficult, multi-faceted queries are possibly characterized by more clusters, each addressing a different facet of the query. This intuition paves the way for extending the evaluation methodology by adopting diversification-based metrics [42], allowing us to understand how well the generated answers cover the query facets and the topical distribution of the clusters.

References

[1] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[2] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, CoRR abs/2311.05232 (2023). doi:10.48550/ARXIV.2311.05232.
[3] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, S. Shi, Siren's song in the AI ocean: A survey on hallucination in large language models, CoRR abs/2309.01219 (2023). doi:10.48550/ARXIV.2309.01219.
[4] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 248:1–248:38. doi:10.1145/3571730.
[5] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, CoRR abs/2307.03172 (2023). doi:10.48550/ARXIV.2307.03172.
[6] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is ChatGPT good at search? Investigating large language models as re-ranking agents, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Association for Computational Linguistics, 2023, pp. 14918–14937. doi:10.18653/v1/2023.emnlp-main.923.
[7] R. Tang, X. Zhang, X. Ma, J. Lin, F. Ture, Found in the middle: Permutation self-consistency improves listwise ranking in large language models, CoRR abs/2310.07712 (2023). doi:10.48550/ARXIV.2310.07712.
[8] P. Owoicho, J. Dalton, M. Aliannejadi, L. Azzopardi, J. R. Trippas, S. Vakulenko, TREC CAsT 2022: Going beyond user ask and system retrieve with initiative and response generation, in: Proceedings of the Thirty-First Text REtrieval Conference (TREC 2022), volume 500-338 of NIST Special Publication, NIST, 2022.
[9] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), 2023, pp. 9802–9822. doi:10.18653/v1/2023.acl-long.546.
[10] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian, H. Wu, J. Wen, H. Wang, Investigating the factual knowledge boundary of large language models with retrieval augmentation, CoRR abs/2307.11019 (2023). doi:10.48550/ARXIV.2307.11019.
[11] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, CoRR abs/2302.13971 (2023). doi:10.48550/ARXIV.2302.13971.
[12] H. Touvron, L. Martin, K. Stone, et al., Llama 2: Open foundation and fine-tuned chat models, CoRR abs/2307.09288 (2023). doi:10.48550/ARXIV.2307.09288.
[13] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023). doi:10.48550/ARXIV.2303.08774.
[14] F. Xu, W. Shi, E. Choi, RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation, CoRR abs/2310.04408 (2023). doi:10.48550/ARXIV.2310.04408.
[15] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, F. Silvestri, The power of noise: Redefining retrieval for RAG systems, arXiv preprint arXiv:2401.14887 (2024).
[16] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), 2002, pp. 311–318. doi:10.3115/1073083.1073135.
[17] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81.
[18] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[19] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations (ICLR 2020), 2020.
[20] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: 1st International Conference on Learning Representations (ICLR 2013), Workshop Track Proceedings, 2013.
[21] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014, pp. 1532–1543. doi:10.3115/v1/d14-1162.
[22] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT 2019, Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67.
[24] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, in: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017, pp. 5998–6008.
[26] E. Clark, S. Rijhwani, S. Gehrmann, J. Maynez, R. Aharoni, V. Nikolaev, T. Sellam, A. Siddhant, D. Das, A. P. Parikh, SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation, in: Proceedings of EMNLP 2023, Association for Computational Linguistics, 2023, pp. 9397–9413. doi:10.18653/v1/2023.emnlp-main.584.
[27] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 deep learning track, CoRR abs/2003.07820 (2020).
[28] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track, in: Proceedings of the Twenty-Ninth Text REtrieval Conference (TREC 2020), volume 1266 of NIST Special Publication, NIST, 2020.
[29] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, CoRR abs/2104.08663 (2021).
[30] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, CoRR abs/2305.02156 (2023). doi:10.48550/ARXIV.2305.02156.
[31] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Instruction distillation makes large language models efficient zero-shot rankers, CoRR abs/2311.01555 (2023). doi:10.48550/ARXIV.2311.01555.
[32] R. Pradeep, S. Sharifymoghaddam, J. Lin, RankVicuna: Zero-shot listwise document reranking with open-source large language models, CoRR abs/2309.15088 (2023). doi:10.48550/ARXIV.2309.15088.
[33] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines: SERP-based conversational response generation, ACM Trans. Inf. Syst. 39 (2021) 47:1–47:29. doi:10.1145/3432726.
[34] W. Lajewska, K. Balog, Towards filling the gap in conversational search: From passage retrieval to conversational response generation, in: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM 2023), ACM, 2023, pp. 5326–5330. doi:10.1145/3583780.3615132.
[35] W. Lajewska, K. Balog, Towards reliable and factual response generation: Detecting unanswerable questions in information-seeking conversations, in: Advances in Information Retrieval, 46th European Conference on Information Retrieval (ECIR 2024), Part III, volume 14610 of Lecture Notes in Computer Science, Springer, 2024, pp. 336–344. doi:10.1007/978-3-031-56063-7_25.
[36] S. Lin, J. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP 2021), Association for Computational Linguistics, 2021, pp. 163–173. doi:10.18653/v1/2021.repl4nlp-1.17.
Go- 23, 2022, Association for Computational Linguistics, harian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, 2022, pp. 48–56. URL: https://aclanthology.org/2022. C. Macdonald, I. Ounis (Eds.), Advances in Information aacl-demo.6. Retrieval - 46th European Conference on Information [41] S. Otmazgin, A. Cattan, Y. Goldberg, Lingmess: Lin- Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, guistically informed multi expert scorers for coref- Proceedings, Part III, volume 14610 of Lecture Notes erence resolution, in: A. Vlachos, I. Augenstein in Computer Science, Springer, 2024, pp. 336–344. URL: (Eds.), Proceedings of the 17th Conference of the https://doi.org/10.1007/978-3-031-56063-7_25. doi:10. European Chapter of the Association for Computa- 1007/978-3-031-56063-7\_25. tional Linguistics, EACL 2023, Dubrovnik, Croatia, [36] S. Lin, J. Yang, J. Lin, In-batch negatives for knowledge May 2-6, 2023, Association for Computational Lin- distillation with tightly-coupled teachers for dense re- guistics, 2023, pp. 2744–2752. URL: https://doi.org/ trieval, in: A. Rogers, I. Calixto, I. Vulic, N. Saphra, 10.18653/v1/2023.eacl-main.202. doi:10.18653/V1/ N. Kassner, O. Camburu, T. Bansal, V. Shwartz (Eds.), 2023.EACL-MAIN.202. [42] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, mation Retrieval, SIGIR ’08, Association for Comput- A. Ashkan, S. Büttcher, I. MacKinnon, Novelty and ing Machinery, New York, NY, USA, 2008, p. 659–666. diversity in information retrieval evaluation, in: Pro- URL: https://doi.org/10.1145/1390334.1390446. doi:10. ceedings of the 31st Annual International ACM SIGIR 1145/1390334.1390446. Conference on Research and Development in Infor-