=Paper= {{Paper |id=Vol-3784/paper4 |storemode=property |title=Improving RAG Systems via Sentence Clustering and Reordering |pdfUrl=https://ceur-ws.org/Vol-3784/paper4.pdf |volume=Vol-3784 |authors=Marco Alessio,Guglielmo Faggioli,Nicola Ferro,Franco Maria Nardini,Raffaele Perego |dblpUrl=https://dblp.org/rec/conf/ir-rag/AlessioF0N024 }} ==Improving RAG Systems via Sentence Clustering and Reordering== https://ceur-ws.org/Vol-3784/paper4.pdf

Improving RAG Systems via Sentence Clustering and Reordering
Marco Alessio¹,*, Guglielmo Faggioli², Nicola Ferro², Franco Maria Nardini¹ and Raffaele Perego¹
¹ Institute of Information Science and Technologies (ISTI), National Research Council of Italy (CNR), Pisa, Italy
² Department of Information Engineering (DEI), University of Padua, Padua, Italy

Abstract
Large Language Models (LLMs) have gained noteworthy importance and attention across different domains and fields in recent years.
Information Retrieval (IR) is one of the domains they impacted the most, as witnessed by the recent increase in the number of IR
systems incorporating generative models. Specifically, Retrieval Augmented Generation (RAG) is the emerging paradigm that integrates
existing knowledge from large-scale document corpora into the generation process, enabling the model to generate more coherent,
contextually relevant, and accurate text across various tasks. Such tasks include summarization, question answering, and dialogue
systems. Recent studies have highlighted the significant positional dependence exhibited by RAG systems. Such studies observed
how the placement of information within the LLM input prompt drastically affects the generated output. We ground our study on
this property by investigating alternative strategies for ordering sentences within the LLM prompt to improve the average quality of
the generated responses in the user and conversational system dialogues. We propose the architecture of an end-to-end RAG-based
conversational assistant and empirically evaluate our strategies using the TREC CAsT 2022 collection. Our experiments highlight
significant differences between distinct arrangement strategies. By employing an evaluation methodology based on RankVicuna, we
show that our best approach achieves improvements up to 54% in terms of overall response quality over baseline methods.

Keywords
Retrieval Augmented Generation, Conversational Search, Positional Bias, Arrangement Strategy

Information Retrieval’s Role in RAG Systems (IR-RAG) - 2024, July 18, 2024, Washington, DC
* Corresponding author.
marco.alessio@isti.cnr.it (M. Alessio); guglielmo.faggioli@unipd.it (G. Faggioli); nicola.ferro@unipd.it (N. Ferro); francomaria.nardini@isti.cnr.it (F. M. Nardini); raffaele.perego@isti.cnr.it (R. Perego)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Retrieval Augmented Generation (RAG) is an emerging paradigm in the field of Artificial Intelligence (AI) to enhance the accuracy and reliability of generative models by exploiting external data sources. In recent years, RAG has gained noteworthy importance and attention across different domains and fields [1], as it allows combining the strengths of Information Retrieval (IR) systems and generative models to overcome each other’s limitations.

RAG can improve the output of a generative model in several ways. First, it allows the generation process to be grounded on information from trusted knowledge sources incorporated in the provided prompt, thus avoiding or at least mitigating the well-known Large Language Model (LLM) hallucination problem, i.e., when the model generates content that is not factually true or that does not concern the prompted text [2, 3, 4]. Second, RAG allows for continuous knowledge updates and integration of domain-specific information: the LLM can successfully respond to facts and topics not covered in its training data; moreover, it is easily adapted to different scenarios and contexts, without retraining or fine-tuning the entire model using datasets that might be unavailable or limited in scope or size. Finally, grounding the generation process on external knowledge incorporated in the input permits linking the output to verifiable external documents, thus enhancing trustworthiness and transparency [2, 3, 4].

Current RAG systems, however, suffer from some drawbacks highlighted in the literature. One of these issues originates from the notable positional sensitivity shown by LLMs. The placement of information within the input prompt significantly impacts the resulting output. Previous research [5, 6, 7] has highlighted biases towards “primacy” and “recency”, suggesting that generative models tend to prioritize information placed at the beginning or end of the input while neglecting the central portion.

In this paper, we advance over previous studies by investigating the positional bias in the context of RAG-based conversational systems. Specifically, we propose a novel strategy for arranging sentences within the input prompt of the LLM to improve the average quality of the generated responses over simpler methods. Our approach is based on the intuition that, since coherent, fluent, and well-structured text is a critical factor for successful communication between human beings, the same should also apply to LLMs: among all the possible arrangements of the input, those having sentences with similar meaning placed closer in the LLM prompt should generate, on average, better quality output. Therefore, we propose an end-to-end RAG architecture to test our hypothesis. The components of this architecture allow us to precisely identify which sentences are likely to be useful for answering user queries. To this end, we cluster sentences by their similarity and we define alternative strategies for ordering them both across and within clusters. In this way, we can study the effect on the generated response of these alternatives for prompting the generative LLM. To our knowledge, this is the first work that explicitly considers this aspect and allows us to fine-tune in a principled way the ordering of input sentences provided to the generative component of a RAG system. We compare our proposed approach against competitive baselines that represent the solutions employed by current RAG systems.

We experimentally evaluate the performance of our proposed approach using the TREC Conversational Assistance Track (CAsT) 2022 collection [8], which allows us to compare the results that different arrangement strategies can achieve in a widely accepted Conversational Search (CS) scenario. Results highlight remarkable differences among the tested sentence placement strategies, with improvements up to 8.66% w.r.t. the best baseline and 54.94% w.r.t. random ordering.

The remainder of this work is organized as follows: Section 2 surveys the current state-of-the-art about RAG systems and quality evaluation for their responses. Section 3 details the architecture of our RAG system. Section 4 and Section 5 detail the results of an experimental analysis,
which aims to highlight how the ordering of clusters and sentences affects the quality of the generated response. Finally, Section 6 draws some conclusions and outlines future directions and extensions of our research.

2. Related Work

In the following, we survey the main works dealing with LLM positional dependencies and the difficulties of RAG systems in reconciling internal and external knowledge. Then, we analyze the challenges related to the evaluation of the quality of RAG responses and to the use of an “LLM-as-a-judge”.

2.1. Retrieval Augmented Generation

RAG enhances LLMs by retrieving additional information from an external knowledge source, enabling them to successfully answer queries beyond the scope of the training data. At the same time, RAG mitigates the hallucination problem, i.e., the generation of factually incorrect text, by referencing the provided external knowledge.

The RAG paradigm is organized into two main stages: retrieval and generation. Upon receiving a query from the user, the relevant information is retrieved from an external knowledge source. This task is undertaken by a standard IR pipeline that outputs a ranked list of documents. Afterwards, in the generation phase, the LLM synthesizes the response to answer the user query using the information carried by the selected documents.

Despite its clear advantages, RAG has drawbacks and limitations, which spark several challenges. First, RAG systems employ the external knowledge as their main source of information, disregarding the internal knowledge memorized within the LLM [9, 10]. This, in turn, may cause a decrease in the quality of the generated output when the provided content is not high-quality [10]. It is not uncommon for RAG to obtain worse outputs w.r.t. what the LLM can achieve in the closed-book scenario, i.e., without supplying retrieved results [10]. Along this line, it has been observed that the LLM produces better results without injecting external knowledge when the topic popularity is very high [9]. In general, state-of-the-art LLMs provide good quality responses for a wide range of questions but require assistance from an IR system when the internal knowledge of the model lacks information about the current topic. This phenomenon is likely to occur if the topic is not very popular, requires exceptional expertise, or when scaling the number of parameters of the generative model produces little to no effect [9]. Another challenge lies in the significant positional dependence [5, 6, 7] exhibited by LLMs, whereby the placement of information within the input prompt drastically affects the generated output. Prior research [5] has identified “primacy” and “recency” biases, indicating the tendency of generative models to focus on information positioned either at the beginning or the end of the input while disregarding the central part. Therefore, the performance degrades significantly when LLMs must rely on information in the middle of their input context, showing a characteristic U-shaped performance curve [5]. This, in turn, means that most state-of-the-art generative models do not use their longer contexts effectively compared with their smaller and earlier counterparts. These phenomena can be observed both in open-source, e.g., Llama [11, 12] by Meta, and closed-source, e.g., GPT-4 [13] by OpenAI, models. It is not advisable to directly input all the retrieved information to the LLM for generating the response. Redundant information and very long contextual data can interfere with the generation quality, leading to repetitive, disjointed, or incoherent outputs [1]. Therefore, the retrieved content is typically further processed before being given as input to the LLM [14]. A recent work in this direction systematically examines the retrieval strategy of RAG systems [15]. The authors consider multiple retrieval factors affecting the generation process, such as the relevance of the passages in the prompt context, their position, and their number. One counter-intuitive finding is that the retriever’s highest-scoring documents that are not directly relevant to the query, e.g., do not contain the answer, negatively impact the effectiveness of the LLM. Moreover, the authors discover that adding random documents in the prompt improves the LLM accuracy by up to 35%.

In this work, we rely on the intuition that the use of coherent, fluent, and well-structured inputs can improve RAG and we propose an end-to-end architecture for selecting and structuring the external information included in the LLM prompt for response generation.

2.2. Quality Evaluation

Another line of research is how to evaluate the overall quality of the generation output. Although human assessment provides the most accurate and reliable measure for evaluating model performance, its high time and cost requirements severely limit its application. Therefore, there exists an ever-increasing demand for automated evaluation techniques that consistently align with human judgements while offering enhanced efficiency and cost-effectiveness.

In this paper, we focus on text-based generative models. Classical automatic evaluation metrics, such as BLEU [16], ROUGE [17], and METEOR [18], are designed to quantify the degree of similarity between a candidate text and one or more reference texts by assessing how well their n-grams match. Their simplicity and explainability, along with the good correlation with human judgements, make these metrics widely used as baselines. However, these metrics exhibit several limitations [19]: firstly, they cannot account for lexical diversity; secondly, they penalize variations in the semantic ordering of words; thirdly, they struggle to capture and match paraphrases effectively; lastly, they inadequately account for distant dependencies within the text. With the advent of word embeddings [20, 21] and neural models [22, 23, 11, 12, 24] based on Transformers [25], new learned metrics [19, 26] have been developed. For example, BERTScore [19] can capture the semantic similarity between the candidate and reference texts employing the contextual embeddings generated by an encoder model, such as BERT [22].

In recent years, the rapid advancements of LLMs, which show remarkable performance across many tasks, have attracted considerable interest in their potential application also as annotators and evaluators. Due to their training using Reinforcement Learning from Human Feedback (RLHF), these models demonstrate significant human alignment. Many studies have investigated leveraging state-of-the-art LLMs to automatically produce assessments serving as proxies for human judgments, a paradigm known as “LLM-as-a-judge”.
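To make the n-gram matching behind the classical metrics of Section 2.2 concrete, the following minimal sketch computes a clipped n-gram precision between a candidate and a reference text. It only illustrates the kind of overlap that BLEU builds on; it is not full BLEU (no brevity penalty, no combination of n-gram orders), and the whitespace tokenization and function names are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (clipping). Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_precision = clipped_ngram_precision(candidate, reference, 1)
bigram_precision = clipped_ngram_precision(candidate, reference, 2)
```

A paraphrase that shares few surface n-grams with the reference scores low under this measure even when its meaning is the same, which is one of the limitations listed above.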
[Figure 1 diagram. Labelled components: Query, Conversation, Retrieval System, Retrieved Passages (p1, ..., pN), Co-ref. Resolution, Sentence Splitter, Duplicates Removal, Retrieved Sentences, Sentence Reranking, Top-K Selection, Sentence Clustering, Clustered Sentences (C1, ..., CK), Clusters Order Strategy, Sentences Within Cluster Order Strategy, Sentence Reordering, Reordered Sentences, Prompt, Large Language Model, Response.]

Figure 1: Architecture of our proposed RAG system.

Furthermore, in recent years LLMs have gained popularity also as evaluators. For example, Zheng et al. [24] assessed the quality of conversations with various LLMs, both open and closed source, employing GPT-4 [13] as judge. They experimented with various prompts and different approaches, such as single answer grading and pairwise comparisons both between responses and against a reference text. GPT-3.5 Turbo and GPT-4 [13] have been employed as listwise rerankers [6, 7] for the TREC Deep Learning 2019 and 2020 [27, 28] and BEIR [29] experimental collections, obtaining state-of-the-art performance [6]. The same LLMs have also been employed as teacher models to fine-tune smaller open-source student models, such as Llama and Vicuna [30, 31] (e.g., RankVicuna [32]).

In this work, we rely on state-of-the-art assessment methods and evaluate the quality of the responses generated by the different methods using RankVicuna [32].

3. The Proposed RAG Architecture

Generative models exhibit strong biases towards information positioned at the start or the end of the input while disregarding the middle part [5]. This phenomenon motivates our research effort to determine how the order of the input sentences provided to a RAG-based conversational system affects the quality of the generated output and, in turn, the optimal ordering strategy to achieve the best response. This section describes each method and all variations considered in our experiments.

The architecture of our proposed RAG system is illustrated in Figure 1. It includes an IR pipeline, which retrieves top-$k$ documents $D = \{d_1, d_2, \ldots, d_k\}$ in response to each user utterance $q$. The retrieved documents are then processed by additional components responsible for splitting them into sentences, identifying the most relevant sentences, clustering such sentences based on their semantic similarity, and ordering them according to the various strategies analyzed. Finally, the selected, re-ordered sentences are provided as input to the LLM for response generation. These components are the focus of our research. Their functionalities are detailed in the remainder of this section.

3.1. Document Pre-processing and Splitting

As observed in the literature [33, 34], a relevant document rarely consists entirely of knowledge that is meaningful for satisfying the user information need expressed by a query $q$. In most cases, only one or a few portions of the document are relevant to the query, while the remaining parts contain irrelevant information. The proposed architecture aims to precisely identify the key information in the retrieved documents, i.e., the sentences, to reduce the noise in the prompt used for response generation.

Hereinafter, we consider sentences in the documents as the atomic units of information. Our pipeline, illustrated in Figure 1, works as follows. First, for each query $q$ we consider only the top-$k$ documents $\{d_1, d_2, \ldots, d_k\}$ retrieved by the IR system. Then, a state-of-the-art co-reference resolution model is applied to all documents to replace pronouns and other generic terms within a sentence with the fully specified entity mentioned in a previous sentence. This allows us to remove the contextual dependencies among sentences in a document so they can be considered self-explanatory. The third step splits each document $d_i$ into a sequence of sentences $\{s_{i,1}, s_{i,2}, \ldots, s_{i,n_i}\}$. Afterwards, near-duplicate removal is applied to the sentences originating from all documents by discarding sentences with a Jaccard similarity $\geq 0.9$ between their Bag-of-Words (BoW) representations¹.

3.2. Sentence Selection

After the first pre-processing phase, we obtain, for each query, a set of candidate sentences to be included in the LLM prompt of our RAG system (see Figure 1). Since the cardinality of this set can be large and not all the sentences are useful for answering the query, we employ the BERT-based cross encoder answer-in-the-sentence classifier² developed by Lajewska and Balog [35] to rank the candidate sentences according to their predicted usefulness to (at least partially) answer the query and we retain the top-$n$ ranked sentences, thus discarding the remaining ones. As a possible limitation, please note that the model by Lajewska and Balog [35] that we employ has been trained on queries and passages used in our experiments. Therefore, it is very likely that the model performs significantly better on our data w.r.t. any other model, ensuring that top-ranked sentences are indeed relevant to the query. Even though such a model is not available in a real practical scenario, this choice is justified by our research effort being focused exclusively on comparing the ordering strategy for sentences in the LLM input rather than on the absolute results achievable by our RAG system.

3.3. Sentence Clustering and Ordering

The previous steps of the pipeline constrain the number of sentences per query while increasing their expected utility in answering the query. Furthermore, they allow us to control other noise sources, such as the number or the variable length of the retrieved documents. Therefore, we can assess how the positional bias affects the generation process. We highlight again that the positional bias of LLMs has already been observed in prior research [5, 6, 7]. However, it has been considered exclusively as a limitation of LLMs and RAG systems. Our research moves a step forward by investigating the best ordering strategy to maximize, on average, the quality of the generated responses over a testing query set $Q$. We believe that logically organized text where sentences with akin meanings are positioned closer in the LLM prompt should, on average, yield superior output quality. Consequently, our sentence ordering strategies exploit the similarities among sentences selected by the sentence selection step. To measure semantic inter-sentence similarity, we resort to the contextualized embeddings generated with the tct-colbert model³ [36]. We generate the representation of the $n$ selected sentences for each query and measure their pair-wise cosine similarity. Then, we progressively aggregate the most similar sentences by employing a hierarchical clustering algorithm. The maximum value of the Silhouette statistic is used as the criterion to determine the optimal clustering among all possible ones. As a result, for each query $q \in Q$, the top-$n$ sentences are grouped in a variable number $N_c \geq 1$ of clusters, each composed of one or more sentences with similar semantic meaning. To devise different strategies for ordering input sentences, we leverage the above clustering, which allows us to study the impact of sentence placement variations occurring both across and within clusters.

More formally, given a query, the set $S$ of the $n$ previously selected sentences, and the prompt $p$, we aim to find the ordering $ord^{*}$ of $S$ such that:

$$
ord^{*} = \operatorname*{argmax}_{ord} \sum_{q \in Q} s(q, LLM(p, q, ord(S))),
$$

where $ord(S)$ is a sentence ordering strategy that returns an ordering of the sentences in $S$, $LLM(p, q, ord(S))$ is the response generated by the LLM used for prompt $p$, query $q$ and sentence ordering $ord(S)$, and, finally, $s(q, r)$ is a scoring function evaluating the perceived quality of the generated response $r = LLM(p, q, ord(S))$ for query $q$.

The order of clusters and the order of the sentences within the same cluster uniquely determine the possible global ordering of the $n$ sentences we consider as input to the LLM. Our experimental assessment will evaluate six different ordering strategies for placing the clusters of sentences in the input, and four different methods for ordering sentences within the same cluster. Cluster placements consider different aspects, such as the clusters’ cardinality and similarity to the query. The orderings tested include the random one and those obtained by decreasing/increasing the value of each aspect. Finally, the U-shaped order suggested in [5] is also tested. Regarding the ordering within clusters, we consider random order, order by reranker score, visiting order, and the clustering aggregation order.

4. Experimental Evaluation

We can now formulate the research questions we aim to answer with our experimental framework.

Research Questions. Given the sentence selection and clustering steps discussed above, the two main aspects to consider for defining our ordering strategies $ord(\cdot)$ are the order of placement in the LLM prompt of the clusters and of the sentences within the same cluster. They uniquely determine the global ordering $ord(\cdot)$ of the top-$n$ sentences given as input to the LLM for response generation. Our research questions assess which is the best solution among the alternatives considered. Specifically,

RQ1 What is the best cluster ordering strategy?
RQ2 What is the best ordering strategy for sentences within the same cluster?
RQ3 Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?

Experimental Settings. We experiment with the TREC CAsT 2022 dataset, a standard experimental collection for CS [8]. This choice is due to prior research that released additional datasets, models, and human judgments for this benchmark [34, 35]. The corpus is composed of three document collections, MS-MARCO v2 [37], KILT [38], and Washington Post v4, which are subdivided into 106M short documents. CAsT 2022 includes 18 information needs (topics) and 205 user utterances (queries), with an average length of 11.39 user utterances per topic. The number of utterances for which relevance judgements are provided is 163.

For our experiments, as the output of the retrieval pipeline, we employ the best-performing run originally submitted to TREC CAsT 2022⁴ [39]. This allows us to focus exclusively on the following steps of our pipeline. In all our experiments, we consider only the top-20 retrieved documents, leaving the investigation about the implications of this choice and possible alternatives as future work. To provide meaningful results, all queries where $Precision@20 < 0.2$, that is, having at most 3 relevant passages in the top-20 results, are discarded⁵, ensuring that enough relevant information is retrieved to answer the considered queries successfully.

¹ This step is particularly important in our setting because the CAsT 2022 corpus contains a multitude of near-duplicate documents. In particular, the same Wikipedia article is often replicated in documents retrieved from the KILT and MS-MARCO collections.
² The model named “squad_snippets_unanswerable” is available at https://iai.group/downloads/emnlp2023-answerability_prediction.
³ https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco
⁴ The run is identified as “udinfo_mi_b2021” from the “udel_fang” group, University of Delaware (USA).
⁵ The number of queries considered in these experiments is 115 out of 163 evaluated in the official relevance judgments.
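As a rough illustration of the clustering step of Section 3.3, the sketch below groups sentence embeddings by agglomerative clustering over cosine distances and keeps the partition with the highest mean Silhouette. It is not the authors' implementation. The average-linkage criterion, the zero score assigned to singleton clusters, the restriction to partitions with between 2 and n-1 clusters (the Silhouette is undefined otherwise), and all function names are assumptions made for this example. Real inputs would be tct-colbert sentence embeddings rather than toy vectors.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_silhouette(dist, clusters):
    """Mean Silhouette of a partition, given a full distance matrix."""
    scores = []
    for idx, cluster in enumerate(clusters):
        others = [c for k, c in enumerate(clusters) if k != idx]
        for i in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # assumed convention for singletons
                continue
            a = sum(dist[i][j] for j in cluster if j != i) / (len(cluster) - 1)
            b = min(sum(dist[i][j] for j in o) / len(o) for o in others)
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def cluster_sentences(embeddings):
    """Return the partition (lists of sentence indices) with the best Silhouette."""
    n = len(embeddings)
    if n < 3:
        # Too few sentences to compare partitions: keep a single cluster.
        return [list(range(n))] if n else []
    dist = [[cosine_distance(u, v) for v in embeddings] for u in embeddings]

    def average_linkage(pair):
        # Average pairwise distance between two current clusters (assumed linkage).
        left, right = clusters[pair[0]], clusters[pair[1]]
        total = sum(dist[i][j] for i in left for j in right)
        return total / (len(left) * len(right))

    clusters = [[i] for i in range(n)]
    best, best_score = None, float("-inf")
    while len(clusters) > 2:
        # Merge the two closest clusters, then score the resulting partition.
        x, y = min(combinations(range(len(clusters)), 2), key=average_linkage)
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
        score = mean_silhouette(dist, clusters)
        if score > best_score:
            best, best_score = [sorted(c) for c in clusters], score
    return best


toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
toy_clusters = cluster_sentences(toy_embeddings)
```

On the toy vectors, the two pairs of nearly parallel embeddings end up in separate clusters. The quadratic search for the closest pair keeps the sketch short and is only adequate for the small number of sentences selected per query.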
Table 1
Comparisons between the six approaches proposed for RQ1: “What is the best ordering strategy for clusters?”. In the top half, each row reports three numbers, which are the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.

             A vs.     B vs.     C vs.     D vs.     E vs.     F vs.
A            -         56-4-51   57-2-52   62-4-45   51-1-59   55-2-54
B            51-4-56   -         47-8-56   61-4-46   52-5-54   52-2-57
C            52-2-57   56-8-47   -         58-3-50   55-0-56   59-4-48
D            45-4-62   46-4-61   50-3-58   -         44-1-66   47-1-63
E            59-1-51   54-5-52   56-0-55   66-1-44   -         57-5-49
F            54-2-55   57-2-52   48-4-59   63-1-47   49-5-57   -
Overall      261-13    269-23    258-17    310-13    251-12    270-14
Avg. Score   0.5723    0.5844    0.5510    0.6219    0.5736    0.5969

Table 2
Comparisons between the four approaches proposed for RQ2: “What is the best ordering strategy for sentences within the same cluster?”. In the top half, each row reports three numbers, which are the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.

             A vs.     B vs.     C vs.     D vs.
A            -         53-3-59   48-8-59   55-4-56
B            59-3-53   -         54-3-58   60-7-48
C            59-8-48   58-3-54   -         57-6-52
D            56-4-55   48-7-60   52-6-57   -
Overall      174-15    159-13    154-17    172-17
Avg. Score   0.6281    0.6143    0.6124    0.6451
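The captions do not spell out how the “Overall” row is derived. The numbers are consistent with summing, for each column approach, its wins and its ties over all pairwise comparisons. The sketch below checks this reading on the cells of Table 2; the data layout and the helper name are assumptions made for this example.

```python
# Top half of Table 2, transcribed as cells[row][column] =
# (wins for the column approach, ties, wins for the row approach).
TABLE_2 = {
    "A": {"B": (53, 3, 59), "C": (48, 8, 59), "D": (55, 4, 56)},
    "B": {"A": (59, 3, 53), "C": (54, 3, 58), "D": (60, 7, 48)},
    "C": {"A": (59, 8, 48), "B": (58, 3, 54), "D": (57, 6, 52)},
    "D": {"A": (56, 4, 55), "B": (48, 7, 60), "C": (52, 6, 57)},
}


def overall(table):
    """Sum wins and ties of every column approach over all its comparisons."""
    totals = {}
    for cells in table.values():
        for column, (column_wins, ties, _row_wins) in cells.items():
            wins_so_far, ties_so_far = totals.get(column, (0, 0))
            totals[column] = (wins_so_far + column_wins, ties_so_far + ties)
    return totals


overall_row = overall(TABLE_2)
```

Under this reading, `overall_row["A"]` is `(174, 15)`, matching the “174-15” entry reported for approach A.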

order) fixed. We test six different strategies for ordering
Furthermore, in the steps of the pipeline where the query clusters: clusters selected in random order (strategy A);
text is needed, i.e., sentence ranking and response gener- clusters selected in descending order of cardinality (strategy
ation, we employed the manually rewritten text for every B); clusters selected in ascending order of similarity with the
query. This allows us to account for the possible bias intro- query8 (strategy C); clusters selected in descending order
duced by different query rewriting approaches. Future de- of similarity with the query (strategy D); clusters selected
velopments will investigate the relationship between query in descending order by similarity with the query using a
rewriting approaches and RAG solutions. ping-pong layout from top to bottom (strategy E)9 ; clusters
For co-reference resolution at the document level, i.e., re- selected by similarity with the query in descending order,
moving co-references across different sentences in the “doc- using a ping-pong layout from bottom to top (strategy F)10 .
ument processing” step, we use the “F-Coref” model6 [40] As shown in Table 1, sorting the clusters in descend-
based on the “LingMess” architecture [41]. After this step, ing order by their similarity with the query (strategy D)
we use the well-known SpaCy Python library to divide each is the clear winner in this comparison, in terms of both
document into a sequence of independent sentences. score and pairwise wins. This approach performs 18.77%,
In the following section, we report two different metrics 15.24%, 20.16%, 23.51%, and 14.81% better than other
for each comparison. The former is the average score of options. This figures suggest that the LLM used to gener-
every approach when assessing all 10 random permutations ate the responses exhibit a much stronger “primacy” rather
using RankVicuna. The latter, instead, is a pairwise metric, than “recency” biases, as highlighted by option C being over-
assessing the number of queries for which the first approach all the worst performing among those considered. Instead,
obtains higher/the same/lower score w.r.t. the other one. methods E and F were designed to place the least important
This information should better highlight the differences and clusters towards the center, since LLMs struggle to utilize
provide a more comprehensive view than a single average the information in the middle of their prompt effectively.
value. However, we can see that both approaches are ineffective:
Response Generation. For the response generation, we we suspect this is due to the length of the input text being
employ Vicuna 7B7 [24], a LLM based on Llama 2 [11, 12] much smaller than the maximum context window of the
fine-tuned on 125K user conversations with ChatGPT gath- model. Different results may be observed when varying the
ered using public APIs from the ShareGPT.com website. amount of input data provided to the LLM for generation.
Quality Evaluation. To evaluate the quality of the gener-
ated responses, we employ RankVicuna [32] to perform listwise ranking between all responses being compared. To mitigate the positional bias intrinsic in RankVicuna, we assess 10 different random permutations of the same responses, averaging the results obtained. This is a reasonable trade-off between evaluation accuracy and the computational runtime required. For each assessment, we assign $\frac{N+1-i}{N}$ points to the $i$-th ranked response, where $1 \le i \le N$ and $N$ is the number of responses being compared. Furthermore, we also evaluate the number of wins and ties between pairs of responses considered. Whenever a valid judgment from the LLM cannot be determined, the entire comparison is discarded from the evaluation.

4.1. RQ1: Order of Clusters

For the first experiment, we evaluate the effects of different orderings of the clusters while keeping the order of sentences within the same cluster (based on the clustering aggregation

4.2. RQ2: Order of Sentences within the same Cluster

In this second experiment, we evaluate different sorting schemes for sentences within the same cluster, keeping the cluster order fixed at the best strategy determined in RQ1. We test four different strategies for ordering sentences within the same cluster: sentences selected in random order (strategy A); sentences selected in descending order by re-ranker score (strategy B); sentences selected by visiting order¹¹ (strategy C); sentences selected by aggregation order (strategy D).
As shown in Table 2, the best results are achieved by two

⁶ https://huggingface.co/biu-nlp/f-coref
⁷ https://huggingface.co/lmsys/vicuna-7b-v1.5
⁸ The similarity between a cluster $C$ and the query is defined as the maximum cosine similarity between the query $q \in Q$ and any sentence $s_{i,j} \in C$ belonging to the cluster.
⁹ The clusters are placed first, last, second, second-to-last, third, and so on, e.g., [A, B, C, D, E] becomes [A, C, E, D, B].
¹⁰ The clusters are placed last, first, second-to-last, second, third-to-last, and so on, e.g., [A, B, C, D, E] becomes [B, D, E, C, A].
¹¹ The sentences are sorted based on the order in which they appear when sequentially scanning through the set of top-$k$ retrieved documents.
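The two alternating placements described in footnotes 9 and 10 can be sketched as follows. This is only an illustrative reading of the footnotes, not the authors' code: the function names `ends_first` and `ends_last` are hypothetical, and the input is assumed to be a list of clusters already sorted by some priority (for instance, the cluster-to-query similarity of footnote 8).

```python
def _alternate(items, start_at_front):
    """Fill a list from both ends, alternating sides for consecutive items."""
    result = [None] * len(items)
    left, right = 0, len(items) - 1
    to_front = start_at_front
    for item in items:
        if to_front:
            result[left] = item
            left += 1
        else:
            result[right] = item
            right -= 1
        to_front = not to_front
    return result


def ends_first(items):
    """Footnote 9: first, last, second, second-to-last, third, ..."""
    return _alternate(items, start_at_front=True)


def ends_last(items):
    """Footnote 10: last, first, second-to-last, second, third-to-last, ..."""
    return _alternate(items, start_at_front=False)


clusters = ["A", "B", "C", "D", "E"]
print(ends_first(clusters))  # ['A', 'C', 'E', 'D', 'B'], as in footnote 9
print(ends_last(clusters))   # ['B', 'D', 'E', 'C', 'A'], as in footnote 10
```

Both variants push the lowest-priority items towards the middle of the sequence; they differ only in which end receives the highest-priority item.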
Table 3
Comparisons between the five approaches considered for RQ3: “Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?”. In the top half, each row reports three numbers, which are the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.

            A vs.     B vs.     C vs.     D vs.     CL vs.
A           -         45-4-62   54-1-56   54-0-57   66-2-43
B           62-4-45   -         71-1-39   64-8-39   67-5-39
C           56-1-54   39-1-71   -         50-4-57   59-3-49
D           57-0-54   39-8-64   57-4-50   -         59-3-49
CL          43-2-66   39-5-67   49-3-59   49-3-59   -
Overall     218-7     162-18    231-9     217-15    251-13
Avg. Score  0.5882    0.5533    0.6177    0.6016    0.6392

Table 4
Comparisons between the seven approaches proposed for RQ4: “Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?”. In the top half, each row reports three numbers, which are the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.

            1.000 vs.  0.625 vs.  0.500 vs.  0.375 vs.  0.250 vs.  0.125 vs.  0.000 vs.
1.000       -          46-2-43    38-2-51    45-0-46    40-2-49    40-1-50    38-1-52
0.625       43-2-46    -          37-2-52    42-2-47    41-1-49    36-0-55    35-1-55
0.500       51-2-38    52-2-37    -          51-2-38    52-0-39    37-0-54    44-2-45
0.375       46-0-45    47-2-42    38-2-51    -          42-2-47    37-3-51    37-1-53
0.250       49-2-40    49-1-41    39-0-52    47-2-42    -          43-3-45    42-1-48
0.125       50-1-40    55-0-36    54-0-37    51-3-37    45-3-43    -          44-1-46
0.000       52-1-38    55-1-35    45-2-44    53-1-37    48-1-42    46-1-44    -
Overall     291-8      304-8      251-8      289-10     268-9      239-8      240-7
Avg. Score  0.5731     0.5866     0.5480     0.5617     0.5516     0.5349     0.5143

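As a worked check of how the bottom half of Table 3 relates to its pairwise cells, the sketch below sums the column wins and ties and derives the relative gap in overall wins between CL and each baseline. The cell values are copied from Table 3; the names `TABLE3`, `overall`, and `relative_gain` are hypothetical helpers introduced only for this illustration. The resulting gaps coincide with the percentages quoted in Section 4.3.

```python
# TABLE3[row][col] = (wins for the column approach, ties, wins for the row approach),
# copied from Table 3.
TABLE3 = {
    "A":  {"B": (45, 4, 62), "C": (54, 1, 56), "D": (54, 0, 57), "CL": (66, 2, 43)},
    "B":  {"A": (62, 4, 45), "C": (71, 1, 39), "D": (64, 8, 39), "CL": (67, 5, 39)},
    "C":  {"A": (56, 1, 54), "B": (39, 1, 71), "D": (50, 4, 57), "CL": (59, 3, 49)},
    "D":  {"A": (57, 0, 54), "B": (39, 8, 64), "C": (57, 4, 50), "CL": (59, 3, 49)},
    "CL": {"A": (43, 2, 66), "B": (39, 5, 67), "C": (49, 3, 59), "D": (49, 3, 59)},
}


def overall(table):
    """Return {approach: (total wins, total ties)} by summing each column."""
    totals = {}
    for col in table:
        wins = sum(row[col][0] for row in table.values() if col in row)
        ties = sum(row[col][1] for row in table.values() if col in row)
        totals[col] = (wins, ties)
    return totals


def relative_gain(totals, best, other):
    """Percentage by which `best` exceeds `other` in overall pairwise wins."""
    return 100.0 * (totals[best][0] / totals[other][0] - 1.0)


totals = overall(TABLE3)
for name in ("A", "B", "C", "D"):
    print(name, totals[name], f"CL gain: {relative_gain(totals, 'CL', name):.2f}%")
```

The same column-sum bookkeeping applies to Table 4, where the 0.625 column totals 304 wins against 291 for 1.000 and 240 for 0.000.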
different strategies: option D, sorting sentences within the same cluster based on aggregation order, and, interestingly, option A, randomly sorting the sentences. Both strategies are preferable to the other two methods considered, performing 8.18% and 11.69% better w.r.t. options B and C, respectively. We note, however, that the differences in performance among the various strategies are not large, as the sentences are already grouped into clusters by their similarity. The LLM response appears to be more impacted by the order of the clusters than by the order of sentences within each cluster.

4.3. RQ3: Comparison with Baselines

Our last experiment investigates whether our proposed approach is beneficial in enhancing the overall effectiveness of the RAG system w.r.t. four simpler baseline methods that may be used in practice by current state-of-the-art RAG systems. We test five different strategies: i) the top-5 retrieved documents (A), ii) the top-40 sentences taken in random order (B), iii) the top-40 sentences taken in descending order by re-ranker score (C), iv) the top-40 sentences selected by visiting order (D), v) the best clusterization-based approach determined from RQ1 and RQ2 (CL).
The results obtained are shown in Table 3. The clusterization-based approach demonstrates superior performance and is the best strategy in this comparison. Measured by overall pairwise wins, it outperforms baselines A, B, C, and D by 15.14%, 54.94%, 8.66%, and 15.67%, respectively. Among the methods considered in this work, randomly sorting the top-$h$ sentences is by far the least performing approach. This, in turn, supports our starting intuition about coherent, fluent, and well-structured text being critical factors for LLMs to generate high-quality output.

5. Additional Experiments

The clusterization-based ordering strategy proposed in this work is designed to position sentences sharing analogous semantic content close together in the LLM prompt. Given the results obtained in Section 4.3, we have shown its effectiveness in our experimental settings. Nevertheless, we answer two additional research questions in this section to gain additional insights. Specifically,

RQ4 Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?

RQ5 Is the proposed clusterization strategy more effective than directly optimising the similarity of subsequent sentences?

Experimental Settings. We determine heuristically the two orderings $\mathit{ord}^{+}$ and $\mathit{ord}^{-}$, which maximize and minimize the overall similarity between subsequent sentences. Let $\mathit{sum}^{+}$ and $\mathit{sum}^{-}$ be the sums of the similarities between subsequent sentences for $\mathit{ord}^{+}$ and $\mathit{ord}^{-}$, respectively. The similarity $\mathit{sim}(p)$ for a sentence permutation $p$ is given by the following equation, where min-max normalization is used, and $s_i$ are the embedding representations of the respective sentences:

$$\mathit{sim}(p) = \frac{\left(\sum_{i=2}^{h} \cos(s_{i-1}, s_i)\right) - \mathit{sum}^{-}}{\mathit{sum}^{+} - \mathit{sum}^{-}}$$

In our experiments, for each query, we generate one million random permutations, then we determine the permutation whose similarity is closest to each of the following thresholds: 0.125, 0.250, 0.375, 0.500, and 0.625. We decided to stop at 0.625 because higher values are unlikely to be observed, given that the average similarity of these permutations is 0.3433 with standard deviation 0.0530.

Results. We determine how the quality of the generated response is influenced when varying the similarity between subsequent sentences at various predefined thresholds, as shown in Table 4. It is interesting to note that the highest results are obtained by permutations with 0.625 normalised similarity, rather than 1.000, which corresponds to the ordering maximising the similarity between subsequent sentences ($\mathit{ord}^{+}$). This method achieves 4.47% and 26.67% more pairwise wins w.r.t. $\mathit{ord}^{+}$ and $\mathit{ord}^{-}$, respectively. To answer RQ5, we assess the responses generated using the best clustering strategy against the approach defined above. The average scores are 0.7652 and 0.7348 while the pairwise wins and ties are 38-46-31, respectively.
From these experiments, we can conclude that a positive correlation exists between the similarity of subsequent sentences and response quality, while also showing that sentence similarity may not be the only factor that should be considered. Moreover, subdividing and explicitly grouping together sentences by subtopic is beneficial w.r.t. considering the sentence similarity only in a pairwise fashion and thus lacking a global vision of the retrieved knowledge.
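The min-max normalised similarity of a sentence permutation, and the selection of the sampled permutation closest to a target threshold, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the two-dimensional "embeddings" are toy values, and the extremes sum− and sum+ are found here by exhaustive search, which is only feasible for a handful of sentences and merely stands in for the heuristic used in the paper (not detailed in this section).

```python
import itertools
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjacent_sum(order, emb):
    """Sum of cos(s_{i-1}, s_i) over consecutive sentences in `order`."""
    return sum(cosine(emb[a], emb[b]) for a, b in zip(order, order[1:]))


def sim(order, emb, sum_minus, sum_plus):
    """Min-max normalised similarity of a permutation (0 = ord-, 1 = ord+)."""
    return (adjacent_sum(order, emb) - sum_minus) / (sum_plus - sum_minus)


def closest_to_threshold(emb, threshold, sum_minus, sum_plus, n_samples, seed=0):
    """Sample random permutations; keep the one whose sim is closest to `threshold`."""
    rng = random.Random(seed)
    order = list(range(len(emb)))
    best_order, best_gap = None, math.inf
    for _ in range(n_samples):
        rng.shuffle(order)
        gap = abs(sim(order, emb, sum_minus, sum_plus) - threshold)
        if gap < best_gap:
            best_order, best_gap = list(order), gap
    return best_order


# Toy data: hypothetical 2-d embeddings for h = 4 sentences.
EMB = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
# Assumption: brute force replaces the paper's heuristic search for ord+ and ord-.
ALL_SUMS = [adjacent_sum(p, EMB) for p in itertools.permutations(range(len(EMB)))]
SUM_MINUS, SUM_PLUS = min(ALL_SUMS), max(ALL_SUMS)
PICK = closest_to_threshold(EMB, 0.625, SUM_MINUS, SUM_PLUS, n_samples=1000)
print(PICK, sim(PICK, EMB, SUM_MINUS, SUM_PLUS))
```

With one million samples per query, as in the experimental settings, the same loop applies unchanged; only `n_samples` and the way the two extremes are obtained would differ.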
6. Conclusions and Future Work

In this work, we presented a novel pipelined RAG architecture aimed at selecting a set of relevant sentences for each query and arranging them in a specific order to optimize the quality of responses generated by an LLM. For this purpose, sentences are first extracted from the top documents retrieved. Then, they are reranked, and the most relevant sentences are organized in clusters by similarity. We proposed different strategies for ordering clusters and the sentences within clusters in the input given to the LLM for response generation. To the best of our knowledge, this is the first work investigating sentence clustering and re-ordering to improve the quality of the response generated by RAG systems. Our empirical assessment is based on a well-known, public framework for conversational search. The results of the experiments show that different sequences of sentences in the LLM prompt significantly impact response quality despite all methodologies processing identical information from the same set of sentences. Random permutations yield the lowest results, whereas our proposed approach based on sentence clusterization yields superior results. Additionally, we examined whether maximizing the similarity between consecutive sentences in the LLM prompt enhances response quality. While a positive correlation between these factors was observed, it is not the exclusive determinant. Consequently, while we infer that sentence similarity constitutes a pivotal aspect, other contributing factors remain unidentified, warranting further investigation. Moreover, although our experimental evaluation employs a well-known conversational collection, the methodology and results shown in this work are general. They could also be applied to other scenarios, such as ad-hoc search.

In future work, we intend to evaluate the impact of the number of clusters selected by our method for generating the response. Our intuition is that the number of clusters identified for a given query is a proxy of the difficulty of the query itself. Fewer clusters or even a single large one should characterize simple and close queries. In contrast, difficult, multi-faceted queries are possibly characterized by more clusters, each addressing a different facet of the query. This intuition paves the way for the extension of the evaluation methodology by adopting diversification-based metrics [42], allowing us to understand how well the generated answers cover the query facets and the topical distribution of the clusters.

References

[1] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[2] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, CoRR abs/2311.05232 (2023). URL: https://doi.org/10.48550/arXiv.2311.05232. doi:10.48550/ARXIV.2311.05232. arXiv:2311.05232.
[3] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, S. Shi, Siren’s song in the AI ocean: A survey on hallucination in large language models, CoRR abs/2309.01219 (2023). URL: https://doi.org/10.48550/arXiv.2309.01219. doi:10.48550/ARXIV.2309.01219. arXiv:2309.01219.
[4] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Comput. Surv. 55 (2023) 248:1–248:38. URL: https://doi.org/10.1145/3571730. doi:10.1145/3571730.
[5] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, CoRR abs/2307.03172 (2023). URL: https://doi.org/10.48550/arXiv.2307.03172. doi:10.48550/ARXIV.2307.03172. arXiv:2307.03172.
[6] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is chatgpt good at search? investigating large language models as re-ranking agents, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 14918–14937. URL: https://doi.org/10.18653/v1/2023.emnlp-main.923. doi:10.18653/V1/2023.EMNLP-MAIN.923.
[7] R. Tang, X. Zhang, X. Ma, J. Lin, F. Ture, Found in the middle: Permutation self-consistency improves listwise ranking in large language models, CoRR abs/2310.07712 (2023). URL: https://doi.org/10.48550/arXiv.2310.07712. doi:10.48550/ARXIV.2310.07712. arXiv:2310.07712.
[8] P. Owoicho, J. Dalton, M. Aliannejadi, L. Azzopardi, J. R. Trippas, S. Vakulenko, TREC cast 2022: Going beyond user ask and system retrieve with initiative and response generation, in: I. Soboroff, A. Ellis (Eds.), Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022, online, November 15-19, 2022, volume 500-338 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2022. URL: https://trec.nist.gov/pubs/trec31/papers/Overview_cast.pdf.
[9] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 9802–9822. URL: https://doi.org/10.18653/v1/2023.acl-long.546. doi:10.18653/V1/2023.ACL-LONG.546.
[10] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian, H. Wu, J. Wen, H. Wang, Investigating the factual knowledge boundary of large language models with retrieval augmentation, CoRR abs/2307.11019 (2023). URL: https://doi.org/10.48550/arXiv.2307.11019. doi:10.48550/ARXIV.2307.11019. arXiv:2307.11019.
[11] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and efficient foundation language models, CoRR abs/2302.13971 (2023). URL: https://doi.org/10.48550/arXiv.2302.13971. doi:10.48550/ARXIV.2302.13971. arXiv:2302.13971.
[12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, CoRR abs/2307.09288 (2023). URL: https://doi.org/10.48550/arXiv.2307.09288. doi:10.48550/ARXIV.2307.09288. arXiv:2307.09288.
[13] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023). URL: https://doi.org/10.48550/arXiv.2303.08774. doi:10.48550/ARXIV.2303.08774. arXiv:2303.08774.
[14] F. Xu, W. Shi, E. Choi, RECOMP: improving retrieval-augmented lms with compression and selective augmentation, CoRR abs/2310.04408 (2023). URL: https://doi.org/10.48550/arXiv.2310.04408. doi:10.48550/ARXIV.2310.04408. arXiv:2310.04408.
[15] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, F. Silvestri, The power of noise: Redefining retrieval for rag systems, arXiv preprint arXiv:2401.14887 (2024).
[16] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, ACL, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
[17] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[18] S. Banerjee, A. Lavie, METEOR: an automatic metric for MT evaluation with improved correlation with human judgments, in: J. Goldstein, A. Lavie, C. Lin, C. R. Voss (Eds.), Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, Association for Computational Linguistics, 2005, pp. 65–72. URL: https://aclanthology.org/W05-0909/.
[19] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with BERT, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.
[20] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Y. Bengio, Y. LeCun (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL: http://arxiv.org/abs/1301.3781.
[21] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: A. Moschitti, B. Pang, W. Daelemans (Eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, 2014, pp. 1532–1543. URL: https://doi.org/10.3115/v1/d14-1162. doi:10.3115/V1/D14-1162.
[22] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. 21 (2020) 140:1–140:67. URL: http://jmlr.org/papers/v21/20-074.html.
[24] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
[25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[26] E. Clark, S. Rijhwani, S. Gehrmann, J. Maynez, R. Aharoni, V. Nikolaev, T. Sellam, A. Siddhant, D. Das, A. P. Parikh, SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 9397–9413. URL: https://doi.org/10.18653/v1/2023.emnlp-main.584. doi:10.18653/V1/2023.EMNLP-MAIN.584.
[27] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees, Overview of the TREC 2019 deep learning track, CoRR abs/2003.07820 (2020). URL: https://arxiv.org/abs/2003.07820. arXiv:2003.07820.
[28] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event [Gaithersburg, Maryland, USA], November 16-20, 2020, volume 1266 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2020. URL: https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.DL.pdf.
[29] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models, CoRR abs/2104.08663 (2021). URL: https://arxiv.org/abs/2104.08663. arXiv:2104.08663.
[30] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, CoRR abs/2305.02156 (2023). URL: https://doi.org/10.48550/arXiv.2305.02156. doi:10.48550/ARXIV.2305.02156. arXiv:2305.02156.
[31] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Instruction distillation makes large language models efficient zero-shot rankers, CoRR abs/2311.01555 (2023). URL: https://doi.org/10.48550/arXiv.2311.01555. doi:10.48550/ARXIV.2311.01555. arXiv:2311.01555.
[32] R. Pradeep, S. Sharifymoghaddam, J. Lin, Rankvicuna: Zero-shot listwise document reranking with open-source large language models, CoRR abs/2309.15088 (2023). URL: https://doi.org/10.48550/arXiv.2309.15088. doi:10.48550/ARXIV.2309.15088. arXiv:2309.15088.
[33] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines: Serp-based conversational response generation, ACM Trans. Inf. Syst. 39 (2021) 47:1–47:29. URL: https://doi.org/10.1145/3432726. doi:10.1145/3432726.
[34] W. Lajewska, K. Balog, Towards filling the gap in conversational search: From passage retrieval to conversational response generation, in: I. Frommholz, F. Hopfgartner, M. Lee, M. Oakes, M. Lalmas, M. Zhang, R. L. T. Santos (Eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, ACM, 2023, pp. 5326–5330. URL: https://doi.org/10.1145/3583780.3615132. doi:10.1145/3583780.3615132.
[35] W. Lajewska, K. Balog, Towards reliable and factual response generation: Detecting unanswerable questions in information-seeking conversations, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part III, volume 14610 of Lecture Notes in Computer Science, Springer, 2024, pp. 336–344. URL: https://doi.org/10.1007/978-3-031-56063-7_25. doi:10.1007/978-3-031-56063-7_25.
[36] S. Lin, J. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: A. Rogers, I. Calixto, I. Vulic, N. Saphra, N. Kassner, O. Camburu, T. Bansal, V. Shwartz (Eds.), Proceedings of the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online, August 6, 2021, Association for Computational Linguistics, 2021, pp. 163–173. URL: https://doi.org/10.18653/v1/2021.repl4nlp-1.17. doi:10.18653/V1/2021.REPL4NLP-1.17.
[37] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, in: T. R. Besold, A. Bordes, A. S. d’Avila Garcez, G. Wayne (Eds.), Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
[38] F. Petroni, A. Piktus, A. Fan, P. S. H. Lewis, M. Yazdani, N. D. Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, V. Plachouras, T. Rocktäschel, S. Riedel, KILT: a benchmark for knowledge intensive language tasks, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 2523–2544. URL: https://doi.org/10.18653/v1/2021.naacl-main.200. doi:10.18653/V1/2021.NAACL-MAIN.200.
[39] D. Yang, Y. Zhang, H. Fang, An exploration study of mixed-initiative query reformulation in conversational passage retrieval, in: I. Soboroff, A. Ellis (Eds.), Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022, online, November 15-19, 2022, volume 500-338 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2022. URL: https://trec.nist.gov/pubs/trec31/papers/udel_fang.C.pdf.
[40] S. Otmazgin, A. Cattan, Y. Goldberg, F-coref: Fast, accurate and easy to use coreference resolution, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2022 - System Demonstrations, Taipei, Taiwan, November 20 - 23, 2022, Association for Computational Linguistics, 2022, pp. 48–56. URL: https://aclanthology.org/2022.aacl-demo.6.
[41] S. Otmazgin, A. Cattan, Y. Goldberg, Lingmess: Linguistically informed multi expert scorers for coreference resolution, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, Association for Computational Linguistics, 2023, pp. 2744–2752. URL: https://doi.org/10.18653/v1/2023.eacl-main.202. doi:10.18653/V1/2023.EACL-MAIN.202.
[42] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, I. MacKinnon, Novelty and diversity in information retrieval evaluation, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 659–666. URL: https://doi.org/10.1145/1390334.1390446. doi:10.1145/1390334.1390446.