<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving RAG Systems via Sentence Clustering and Reordering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Alessio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco Maria Nardini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Perego</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering (DEI), University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Science and Technologies (ISTI), National Research Council of Italy (CNR)</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) have gained noteworthy importance and attention across different domains and fields in recent years. Information Retrieval (IR) is one of the domains they impacted the most, as witnessed by the recent increase in the number of IR systems incorporating generative models. Specifically, Retrieval Augmented Generation (RAG) is the emerging paradigm that integrates existing knowledge from large-scale document corpora into the generation process, enabling the model to generate more coherent, contextually relevant, and accurate text across various tasks. Such tasks include summarization, question answering, and dialogue systems. Recent studies have highlighted the significant positional dependence exhibited by RAG systems. Such studies observed how the placement of information within the LLM input prompt drastically affects the generated output. We ground our study on this property by investigating alternative strategies for ordering sentences within the LLM prompt to improve the average quality of the generated responses in dialogues between users and conversational systems. We propose the architecture of an end-to-end RAG-based conversational assistant and empirically evaluate our strategies using the TREC CAsT 2022 collection. Our experiments highlight significant differences between distinct arrangement strategies. By employing an evaluation methodology based on RankVicuna, we show that our best approach achieves improvements of up to 54% in terms of overall response quality over baseline methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>Conversational Search</kwd>
        <kwd>Positional Bias</kwd>
        <kwd>Arrangement Strategy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Retrieval Augmented Generation (RAG) is an emerging
paradigm in the field of Artificial Intelligence (AI) to
enhance the accuracy and reliability of generative models by
exploiting external data sources. In recent years, RAG has
gained noteworthy importance and attention across
different domains and fields [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] as it makes it possible to combine the
strengths of Information Retrieval (IR) systems and
generative models so that each compensates for the other’s limitations.
      </p>
      <p>
        RAG can improve the output of a generative model in
several ways. First, it allows the generation process to be
grounded on information from trusted knowledge sources
incorporated in the provided prompt, thus avoiding or at
least mitigating the well-known Large Language Model
(LLM) hallucination problem, i.e., when the model
generates contents not factually true or that do not concern the
prompted text [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. Second, RAG allows for continuous
knowledge updates and integration of domain-specific
information: the LLM can successfully respond to facts and
topics not covered in its training data; moreover, it is
easily adapted to different scenarios and contexts, without
retraining or fine-tuning the entire model using datasets that
might be unavailable or limited in scope or size. Finally,
grounding the generation process on external knowledge
incorporated in the input permits linking the output to
verifiable external documents, thus enhancing trustworthiness
and transparency [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ].
      </p>
      <p>
        Current RAG systems, however, suffer from some
drawbacks highlighted in the literature. One of these issues
originates from the notable positional sensitivity shown
by LLMs. The placement of information within the input
prompt significantly impacts the resulting output. Previous
research [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ] has highlighted biases towards “primacy”
and “recency”, suggesting that generative models tend to
prioritize information placed at the beginning or end of the
input while neglecting the central portion.
      </p>
      <p>
        In this paper, we advance over previous studies by
investigating the positional bias in the context of RAG-based
conversational systems. Specifically, we propose a novel
strategy for arranging sentences within the input prompt of
the LLM to improve the average quality of the generated
responses over simpler methods. Our approach is based on the
intuition that as coherent, fluent, and well-structured text
are critical factors for successful communication between
human beings, the same should also apply to LLMs: among
all the possible arrangements of the input, those having
sentences with similar meaning placed closer in the LLM
prompt should generate, on average, better quality output.
Therefore, we propose an end-to-end RAG architecture to
test our hypothesis. The components of this architecture
allow us to precisely identify which sentences are likely
useful for answering user queries. To this end, we
cluster sentences by their similarity and we define alternative
strategies for ordering them both inter and intra-cluster.
In this way, we can study the effect on the generated
response of these alternatives for prompting the generative
LLM. To our knowledge, this is the first work that
explicitly considers this aspect and allows us to fine-tune in a
principled way the ordering of input sentences provided to
the generative component of a RAG system. We compare
our proposed approach against competitive baselines that
represent the solutions employed by current RAG systems.
We experimentally evaluate the performance of our
proposed approach using the TREC Conversational Assistance
Track (CAsT) 2022 collection [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which allows us to
compare the results that different arrangement strategies can
achieve in a widely accepted Conversational Search (CS)
scenario. Results highlight remarkable differences among the
tested sentence placement strategies, with improvements up
to 8.66% w.r.t. the best baseline and 54.94% w.r.t. random
ordering.
      </p>
      <p>The remainder of this work is organized as follows:
Section 2 surveys the current state-of-the-art about RAG
systems and quality evaluation for their responses. Section
3 details the architecture of our RAG system. Section 4
and Section 5 detail the results of an experimental analysis,
which aims to highlight how the ordering of clusters and
sentences affects the quality of the generated response.
Finally, Section 6 draws some conclusions and outlines future
directions and extensions of our research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In the following, we survey the main works dealing with
LLM positional dependencies and the difficulties of RAG
systems in reconciling internal and external knowledge.
Then, we analyze the challenges related to the evaluation
of the quality of RAG responses and to the use of an
“LLM-as-a-judge”.</p>
      <sec id="sec-2-1">
        <title>2.1. Retrieval Augmented Generation</title>
        <p>RAG enhances LLMs by retrieving additional information
from an external knowledge source, enabling them to
successfully answer queries beyond the scope of the training
data. At the same time, RAG mitigates the hallucination
problem, which is generating factually incorrect text, by
referencing the provided external knowledge.</p>
        <p>The RAG paradigm is organized into two main stages:
retrieval and generation. Upon receiving a query from the
user, the relevant information is retrieved from an external
knowledge source. This task is undertaken by a standard IR
pipeline that outputs a ranked list of documents. Afterwards,
in the generation phase, the LLM synthesizes the response
to answer the user query using the information carried by
the selected documents.</p>
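        <p>The two-stage flow described above can be sketched in a few lines of code. This is a minimal illustration, not the paper’s system: the <code>search</code> and <code>generate</code> functions are hypothetical stand-ins for a real IR pipeline and a real LLM.</p>

```python
# Minimal sketch of the two-stage RAG flow: retrieve, then generate.
# `search` and `generate` are illustrative placeholders, NOT the
# components used in the paper.

def search(query, corpus, k=3):
    """Toy lexical retriever: rank documents by term overlap with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def generate(query, context):
    """Placeholder for the LLM call: here it only reports the prompt layout."""
    return f"Answer to {query!r} grounded on {len(context)} retrieved documents."

def rag_answer(query, corpus):
    # Stage 1: retrieval from the external knowledge source.
    context = search(query, corpus)
    # Stage 2: generation conditioned on the retrieved documents.
    return generate(query, context)
```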
        <p>
          Despite its clear advantages, RAG has drawbacks and
limitations, which spark several challenges. First, RAG
systems employ the external knowledge as their main source
of information, disregarding the internal knowledge
memorized within the LLM [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ]. This, in turn, may determine
a decrease in the quality of the generated output when
the provided content is not high-quality [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It is not
uncommon for RAG to obtain worse outputs w.r.t. what the
LLM can achieve in the closed-book scenario, i.e., without
supplying retrieved results [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In this line, it has been
observed that the LLM produces better results without
injecting external knowledge when the topic popularity is
very high [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In general, state-of-the-art LLMs provide
good quality responses for a wide range of questions but
require assistance from an IR system when the internal
knowledge of the model lacks information about the
current topic. This phenomenon is likely to occur if the topic
is not very popular, requires exceptional expertise, or when
scaling the number of parameters of the generative model
produces little to no effect [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Another challenge lies in
the significant positional dependence [
          <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
          ] exhibited by
LLMs, whereby the placement of information within the
input prompt drastically affects the generated output. Prior
research [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] has identified “primacy” and “recency” biases,
indicating the tendency of generative models to focus
toward information positioned either at the beginning or the
end of the input while disregarding the central part.
Therefore, the performance degrades significantly when LLMs
should rely on information in the middle of its input context,
showing a characteristic U-shaped performance curve [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
This, in turn, means that most state-of-the-art generative
models do not use their longer contexts as effectively as
their smaller and earlier counterparts. These phenomena can be
observed both in open-source, e.g., Llama [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ] by Meta,
and closed-source, e.g., GPT-4 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] by OpenAI, models. It is
not advisable to directly input all the retrieved information
to the LLM for generating the response. Redundant
information and very long contextual data can interfere with
the generation quality, leading to repetitive, disjointed, or
incoherent outputs [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Therefore, the retrieved content is
typically further processed before being given in input to
the LLM [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. A recent work in this direction
systematically examines the retrieval strategy of RAG systems [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
The authors consider multiple retrieval factors affecting the
generation process, such as the relevance of the passages
in the prompt context, their position, and their number.
One counter-intuitive finding is that the retriever’s
highest-scoring documents that are not directly relevant to the query,
e.g., do not contain the answer, negatively impact the
effectiveness of the LLM. Moreover, the authors discover that
adding random documents in the prompt improves the LLM
accuracy by up to 35%.
        </p>
        <p>In this work, we rely on the intuition that the use of
coherent, fluent, and well-structured inputs can improve RAG
and we propose an end-to-end architecture for selecting and
structuring the external information included in the LLM
prompt for response generation.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Quality Evaluation</title>
        <p>Another line of research concerns how to evaluate the overall
quality of the generation output. Although human assessment
provides the most accurate and reliable measure for
evaluating model performance, its high time and cost
requirements severely limit its application. Therefore, there exists
an ever-increasing demand for automated evaluation
techniques that consistently align with human judgements while
offering enhanced efficiency and cost-effectiveness.</p>
        <p>
          In this paper, we focus on textual-based generative
models. Classical automatic evaluation metrics, such as
BLEU [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], ROUGE [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], and METEOR [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], are designed to
quantify the degree of similarity between a candidate text
and one or more reference texts by assessing their n-gram
overlap. The simplicity and explainability, along with
the good correlation with human judgements, make these
metrics widely used as baselines. However, these metrics
exhibit several limitations [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]: firstly, they cannot account
for lexical diversity; secondly, they penalize variations in
the semantic ordering of words; thirdly, they struggle to
capture and match paraphrases effectively; lastly, they
inadequately account for distant dependencies within the text.
With the advent of word embeddings [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ] and neural
models [
          <xref ref-type="bibr" rid="ref11 ref12 ref22 ref23 ref24">22, 23, 11, 12, 24</xref>
          ] based on Transformers [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], new
learned metrics [
          <xref ref-type="bibr" rid="ref19 ref26">19, 26</xref>
          ] have been developed. For example,
BERTScore [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] can capture the semantic similarity between
the candidate and reference texts employing the
contextual embeddings generated by an encoder model, such as
BERT [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>In recent years, the rapid advancements of LLMs,
showing remarkable performance across many tasks, have generated
considerable interest in their potential application also as
annotators and evaluators. Due to their training using
Reinforcement Learning from Human Feedback (RLHF),
these models demonstrate significant human alignment.
Much research has investigated leveraging
state-of-the-art LLMs to automatically produce assessments serving as
proxies for human judgments, a paradigm known as
“LLM-as-a-judge”.</p>
        <p>
          Furthermore, in recent years LLMs have gained
popularity also as evaluators. For example, Zheng et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]
assessed the quality of conversations with various LLMs,
both open and closed source, employing GPT-4 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] as judge.
They experimented with various prompts and different
approaches, such as single answer grading and pairwise
comparisons both between responses and against a reference
text. GPT-3.5 Turbo and GPT-4 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] have been employed as
listwise rerankers [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ] for the TREC Deep Learning 2019
and 2020 [
          <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
          ] and BEIR [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] experimental collections,
obtaining state-of-the-art performance [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The same LLMs
have also been employed as teacher models to fine-tune
smaller open-source student models, such as Llama and
Vicuna [
          <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
          ] (i.e., RankVicuna [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]).
        </p>
        <p>
          In this work, we rely on state-of-the-art assessment
methods and evaluate the quality of the responses generated by
the different methods using RankVicuna [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Proposed RAG Architecture</title>
      <p>
        Generative models exhibit strong biases towards
information positioned at the start or the end of the input while
disregarding the middle part [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This phenomenon motivates
our research effort to determine how the order of the input
sentences provided to a RAG-based conversational system
affects the quality of the generated output and, in turn, the
optimal ordering strategy to achieve the best response. This
section describes each method and all variations considered
in our experiments.
      </p>
      <p>The architecture of our proposed RAG system is
illustrated in Figure 1. It includes an IR pipeline, which retrieves
the top-n documents D = {d_1, d_2, ..., d_n} in response to each
user utterance u. The retrieved documents are then
processed by additional components responsible for splitting
them into sentences, identifying the most relevant sentences,
clustering such sentences based on their semantic
similarity, and ordering them according to the various strategies
analyzed. Finally, the selected and re-ordered sentences are
provided as input to the LLM for response generation. These
components are the focus of our research. Their
functionalities are detailed in the remainder of this section.</p>
      <sec id="sec-3-1">
        <title>3.1. Document Pre-processing and Splitting</title>
        <p>
          As observed in the literature [
          <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
          ], the entire text of a
relevant document rarely contains meaningful knowledge to
satisfy the user information need expressed by a query q.
In most cases, only one or a few portions of the document
are relevant to the query, while the remaining parts contain
irrelevant information. The proposed architecture aims to
precisely identify the key information in the retrieved
documents, i.e., the sentences, to reduce the noise in the prompt
used for response generation.
        </p>
        <p>Hereinafter, we consider sentences in the documents as
the atomic units of information. Our pipeline, illustrated in
Figure 1, works as follows. First, for each query q we consider
only the top-n documents {d_1, d_2, ..., d_n} retrieved by the
IR system. Then, a state-of-the-art co-reference resolution
model is applied to all documents to replace pronouns and
other generic terms within a sentence with the fully
specified entity mentioned in a previous sentence. This allows
us to remove the contextual dependencies among sentences
in a document so they can be considered self-explanatory.
The third step splits each document d_i into a sequence of
sentences {s_i,1, s_i,2, ..., s_i,m}. Afterwards, near-duplicate
removal is applied to the sentences originating from all
documents by discarding sentences with a Jaccard similarity
≥ 0.9 between their Bag-of-Words (BoW) representations1.</p>
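        <p>The near-duplicate removal step can be sketched as follows: sentences are kept greedily, and a sentence is discarded if its Jaccard similarity with any already-kept sentence, computed on Bag-of-Words sets, reaches the 0.9 threshold. The tokenization here (lowercased whitespace split) is a simplifying assumption.</p>

```python
# Sketch of near-duplicate removal via Jaccard similarity over
# Bag-of-Words sets, with the 0.9 threshold used in the paper.
# Lowercased whitespace tokenization is an assumption for illustration.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; identical empty sets count as 1."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(sentences, threshold=0.9):
    """Keep a sentence only if it is not a near-duplicate of a kept one."""
    kept, kept_bows = [], []
    for sent in sentences:
        bow = set(sent.lower().split())
        if all(jaccard(bow, prev) < threshold for prev in kept_bows):
            kept.append(sent)
            kept_bows.append(bow)
    return kept
```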
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sentence Selection</title>
        <p>
          After the first pre-processing phase, we obtain a sentence
candidate set for each query to be included in the LLM
prompt of our RAG system (see Figure 1). Since the
cardinality of this set can be large and not all the sentences are
useful for answering the query, we employ the BERT-based
cross-encoder answer-in-the-sentence classifier2 developed
by Lajewska and Balog [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] to rank the candidate sentences
according to their predicted usefulness to (at least partially)
answer the query, and we retain the top-k ranked sentences,
discarding the remaining ones. As a possible
limitation, please note that the model by Lajewska and Balog
[
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] has been trained on the queries and passages
used in our experiments. Therefore, it is very likely that
the model performs significantly better on our data w.r.t.
any other model, ensuring that top-ranked sentences are
indeed relevant to the query. Even though such a model
is not available in a real practical scenario, this choice is
justified by our research effort being focused exclusively on
comparing the ordering strategies for sentences in the LLM
input rather than on the absolute results achievable by our
RAG system.
1This step is particularly important in our setting because the CAsT 2022
corpus contains a multitude of near-duplicate documents. In particular,
the same Wikipedia article is often replicated in documents retrieved
from the KILT and MS-MARCO collections.
2The model named “squad_snippets_unanswerable” is available at https:
//iai.group/downloads/emnlp2023-answerability_prediction.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Sentence Clustering and Ordering</title>
        <p>
          The previous steps of the pipeline constrain the number of
sentences per query while increasing their expected utility
in answering the query. Furthermore, they allow us to
control other noise sources, such as the number or the variable
length of the retrieved documents. Therefore, we can
assess how the positional bias affects the generation process.
We highlight again that the positional bias of LLMs has
already been observed in prior research [
          <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
          ]. However,
it has been considered exclusively as a limitation of LLMs
and RAG systems. Our research moves a step forward by
investigating the best ordering strategy to maximize, on
average, the quality of the generated responses over a
testing query set Q. We believe that logically organized text,
where sentences with akin meanings are positioned closer
in the LLM prompt, should, on average, yield superior
output quality. Consequently, our sentence ordering strategies
exploit the similarities among sentences selected by the
sentence selection step. To measure semantic inter-sentence
similarity, we resort to the contextualized embeddings
generated with the tct-colbert model3 [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]. We generate the
representations of the k selected sentences for each query
and measure their pair-wise cosine similarity. Then, we
progressively aggregate the most similar sentences by
employing a hierarchical clustering algorithm. The maximum
value of the Silhouette statistic is used as the criterion to
determine the optimal clustering among all possible ones. As a result,
for each query q ∈ Q, the top-k sentences are grouped into
a variable number c ≥ 1 of clusters, each composed of
one or more sentences with similar semantic meaning. To
devise different strategies for ordering input sentences, we
leverage the above clustering, which allows us to study the
impact of sentence placement variations occurring both
inter- and intra-cluster.
        </p>
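        <p>The clustering step described above can be sketched in pure Python: average-linkage agglomerative clustering over cosine distances, keeping the cut whose Silhouette statistic is highest. The toy two-dimensional vectors stand in for the tct-colbert sentence embeddings; the concrete linkage implementation is a simplified illustration, not the exact one used in the paper.</p>

```python
# Sketch of hierarchical sentence clustering with a Silhouette-based cut.
# Toy vectors replace the tct-colbert embeddings used in the paper.
import math

def cosine_dist(u, v):
    """Cosine distance between two vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def avg_link(c1, c2, dist):
    """Average-linkage distance between two clusters of indices."""
    return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

def silhouette(clusters, dist):
    """Mean Silhouette score of a clustering (singletons score 0)."""
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            a = sum(dist[i][j] for j in c if j != i) / (len(c) - 1)
            b = min(sum(dist[i][j] for j in o) / len(o)
                    for o in clusters if o is not c)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def cluster_sentences(vectors):
    """Agglomerate greedily; return the cut maximizing the Silhouette."""
    n = len(vectors)
    dist = [[cosine_dist(u, v) for v in vectors] for u in vectors]
    clusters = [[i] for i in range(n)]
    best = ([list(c) for c in clusters], -1.0)
    while len(clusters) > 1:
        # Merge the closest pair of clusters under average linkage.
        pairs = [(avg_link(clusters[x], clusters[y], dist), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        _, x, y = min(pairs)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
        if 2 <= len(clusters) < n:  # Silhouette needs 2..n-1 clusters.
            s = silhouette(clusters, dist)
            if s > best[1]:
                best = ([list(c) for c in clusters], s)
    return best
```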
        <p>More formally, given a query q, the set S of the k previously
selected sentences, and the prompt P, we aim to find the
ordering σ* of S such that:</p>
        <p>σ* = argmax_σ Σ_{q ∈ Q} M(q, LLM(P, q, σ(S))),
where σ(S) is a sentence ordering strategy that returns
an ordering of the sentences in S, LLM(P, q, σ(S)) is the
response generated by the LLM for prompt P, query q,
and sentence ordering σ(S), and, finally, M(q, r) is a
scoring function evaluating the perceived quality of the
generated response r = LLM(P, q, σ(S)) for query q.</p>
        <sec id="sec-3-3-1">
          <title>Footnote 3: https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco</title>
          <p>
            The order of clusters and the order of the sentences within
the same cluster uniquely determine the possible global
ordering of the k sentences we consider for inputting to the LLM.
Our experimental assessment will evaluate six different
ordering strategies for placing the clusters of sentences in the
input, and four different methods for ordering sentences
within the same cluster. Cluster placements consider
different aspects, such as the clusters’ cardinality and similarity
to the query. The orderings tested include the random one
and those obtained by decreasing/increasing the value of
each aspect. Finally, the U-shaped order suggested in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]
is also tested. Regarding the ordering within clusters, we
consider random order, order by reranker score, visiting
order, and the clustering aggregation order.
          </p>
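        <p>Composing a global ordering from a cluster-level strategy and a within-cluster strategy can be sketched as follows. The key functions and the triple representation of a sentence are illustrative assumptions, not the paper’s implementation.</p>

```python
# Sketch: a global prompt ordering is the composition of a cluster-level
# ordering and a within-cluster ordering. The key functions below are
# hypothetical placeholders for the strategies evaluated in the paper.

def global_order(clusters, cluster_key, sentence_key):
    """Order clusters, then sentences inside each cluster, and flatten."""
    ordered = []
    for cluster in sorted(clusters, key=cluster_key):
        ordered.extend(sorted(cluster, key=sentence_key))
    return ordered

# Example: clusters of (sentence_id, reranker_score, query_similarity).
clusters = [
    [("s1", 0.2, 0.4), ("s2", 0.9, 0.4)],
    [("s3", 0.7, 0.8)],
]
# Analogue of strategy D: clusters by decreasing query similarity;
# within each cluster: sentences by decreasing reranker score.
prompt_order = global_order(
    clusters,
    cluster_key=lambda c: -max(sim for _, _, sim in c),
    sentence_key=lambda s: -s[1],
)
```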
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>We can now formulate the research questions we aim to
answer with our experimental framework.</p>
      <p>Research Questions. Given the sentence selection and
clustering steps discussed above, the two main aspects to
consider for defining our ordering strategies σ(·) are the
order of placement in the LLM prompt of the clusters and
of the sentences within the same cluster. They uniquely
determine the global ordering σ(·) of the top-k sentences
given in input to the LLM for response generation. Our
research questions assess which is the best solution among
the alternatives considered. Specifically,</p>
      <sec id="sec-4-1">
        <title>RQ1 What is the best cluster ordering strategy?</title>
        <p>
          RQ2 What is the best ordering strategy for sentences
within the same cluster?
RQ3 Can our proposed strategy enhance the effectiveness
of the RAG system w.r.t. baseline methods?
Experimental Settings. We experiment with the TREC
CAsT 2022 dataset, a standard experimental collection for
CS [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This choice is due to prior research that released
additional datasets, models, and human judgments for this
benchmark [
          <xref ref-type="bibr" rid="ref34 ref35">34, 35</xref>
          ]. The corpus is composed of three
document collections, MS-MARCO v2 [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], KILT [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], and
Washington Post v4, which are subdivided into 106 short
documents. CAsT 2022 includes 18 information needs
(topics) and 205 user utterances (queries), with an average
of 11.39 user utterances per topic. The number of
utterances for which relevance judgements are provided is
163.
        </p>
        <p>
          For our experiments, as the retrieval system, we employ
the best-performing run originally submitted to TREC CAsT 20224 [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] as the output of the retrieval pipeline. This
allows us to focus exclusively on the subsequent steps of our
pipeline. In all our experiments, we consider only the top-20
retrieved documents, leaving the investigation of the
implications of this choice and possible alternatives as
future work. To provide meaningful results, all queries where
P@20 &lt; 0.2, that is, having at most 3 relevant
passages in the top-20 results, are discarded5, ensuring that
enough relevant information is retrieved to answer the
considered queries successfully.
4The run is identified as “udinfo_mi_b2021” from the “udel_fang” group,
University of Delaware (USA).
5The number of queries considered in these experiments is 115 out of the
163 evaluated in the official relevance judgments.
        </p>
        <p>Furthermore, in the steps of the pipeline where the query
text is needed, i.e., sentence ranking and response
generation, we employed the manually rewritten text for every
query. This allows us to account for the possible bias
introduced by different query rewriting approaches. Future
developments will investigate the relationship between query
rewriting approaches and RAG solutions.</p>
        <p>
          For co-reference resolution at the document level, i.e.,
removing co-references across different sentences in the
“document processing” step, we use the “F-Coref” model6 [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]
based on the “LingMess” architecture [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ]. After this step,
we use the well-known SpaCy Python library to divide each
document into a sequence of independent sentences.
        </p>
        <p>In the following section, we report two different metrics
for each comparison. The former is the average score of
every approach when assessing all 10 random permutations
using RankVicuna. The latter, instead, is a pairwise metric,
assessing the number of queries for which the first approach
obtains a higher/the same/a lower score w.r.t. the other one.
This information should better highlight the differences and
provide a more comprehensive view than a single average
value.</p>
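        <p>The pairwise metric described above reduces to a simple count over per-query scores, as in this sketch (the function name is ours, for illustration):</p>

```python
# Sketch of the pairwise metric: for each query, count whether approach A
# scores higher than, the same as, or lower than approach B.

def pairwise_record(scores_a, scores_b):
    """Return (wins, ties, losses) of approach A versus approach B."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    return wins, ties, losses
```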
        <p>
          Response Generation. For the response generation, we
employ Vicuna 7B7 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], an LLM based on Llama 2 [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ]
fine-tuned on 125K user conversations with ChatGPT
gathered using public APIs from the ShareGPT.com website.
Quality Evaluation. To evaluate the quality of the
generated responses, we employ RankVicuna [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] to perform
listwise ranking between all responses being compared. To
mitigate the positional bias intrinsic to RankVicuna, we assess
10 different random permutations of the same responses,
averaging the results obtained. This is a reasonable trade-off
between evaluation accuracy and the computational
runtime required. For each assessment, we assign n + 1 − i points
to the i-th ranked response, where 1 ≤ i ≤ n and n is
the number of responses being compared. Furthermore, we
also evaluate the number of wins and ties between pairs
of responses considered. Whenever a valid judgment from
the LLM cannot be determined, the entire comparison is
discarded from the evaluation.
        </p>
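The permutation-averaged scoring above can be sketched as follows; `rank_fn` and `permutation_scores` are hypothetical names, and `rank_fn` stands in for a listwise LLM judge such as RankVicuna. The point assignment follows the n + 1 − i rule, and comparisons without a valid judgment are discarded.

```python
import random

def permutation_scores(responses, rank_fn, n_perms=10, seed=0):
    # responses: dict mapping response id -> response text.
    # rank_fn: hypothetical listwise judge; given a list of
    # (id, text) pairs, it returns the ids in ranked order, or
    # None when no valid judgment can be parsed from the LLM.
    rng = random.Random(seed)
    n = len(responses)
    totals = {rid: 0.0 for rid in responses}
    valid = 0
    for _ in range(n_perms):
        items = list(responses.items())
        rng.shuffle(items)                 # new random presentation order
        ranking = rank_fn(items)
        if ranking is None:                # invalid judgment: discard comparison
            continue
        valid += 1
        for i, rid in enumerate(ranking, start=1):
            totals[rid] += n + 1 - i       # i-th place earns n + 1 - i points
    if valid == 0:
        return None
    return {rid: s / valid for rid, s in totals.items()}
```

Averaging over several random presentation orders is what mitigates the judge's positional bias: a response that benefits from a lucky slot in one permutation loses it in another.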
        <sec id="sec-4-1-1">
          <title>4.1. RQ1: Order of Clusters</title>
          <p>For the first experiment, we evaluate the effects of different
orderings of the clusters while keeping the order of sentences
within the same cluster (based on the clustering aggregation
order) fixed. We test six different strategies for ordering
clusters: clusters selected in random order (strategy A);
clusters selected in descending order of cardinality (strategy
B); clusters selected in ascending order of similarity with the
query8 (strategy C); clusters selected in descending order
of similarity with the query (strategy D); clusters selected
in descending order of similarity with the query using a
ping-pong layout from top to bottom (strategy E)9; clusters
selected by similarity with the query in descending order,
using a ping-pong layout from bottom to top (strategy F)10.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>(Footnotes: 6 https://huggingface.co/biu-nlp/f-coref;
7 https://huggingface.co/lmsys/vicuna-7b-v1.5)</p>
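As an illustration, the two ping-pong layouts (strategies E and F) can be sketched as follows; the function names are ours, and the input list is assumed to be already sorted in descending order of similarity with the query.

```python
def ping_pong_top_down(clusters):
    # Strategy E sketch: place clusters first, last, second,
    # second-to-last, third, ... (footnote 9),
    # e.g. [A, B, C, D, E] -> [A, C, E, D, B].
    out = [None] * len(clusters)
    lo, hi = 0, len(clusters) - 1
    for i, c in enumerate(clusters):
        if i % 2 == 0:
            out[lo] = c
            lo += 1
        else:
            out[hi] = c
            hi -= 1
    return out

def ping_pong_bottom_up(clusters):
    # Strategy F sketch: place clusters last, first, second-to-last,
    # second, third-to-last, ... (footnote 10),
    # e.g. [A, B, C, D, E] -> [B, D, E, C, A].
    out = [None] * len(clusters)
    lo, hi = 0, len(clusters) - 1
    for i, c in enumerate(clusters):
        if i % 2 == 0:
            out[hi] = c
            hi -= 1
        else:
            out[lo] = c
            lo += 1
    return out
```

Both layouts push the least similar clusters towards the middle of the prompt, the region where LLMs reportedly attend least.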
        <p>As shown in Table 1, sorting the clusters in
descending order by their similarity with the query (strategy D)
is the clear winner in this comparison, in terms of both
score and pairwise wins. This approach performs 18.77%,
15.24%, 20.16%, 23.51%, and 14.81% better than the other
options. These figures suggest that the LLM used to
generate the responses exhibits a much stronger “primacy”
than “recency” bias, as highlighted by option C being
overall the worst performing among those considered. Instead,
methods E and F were designed to place the least important
clusters towards the center, since LLMs struggle to utilize
the information in the middle of their prompt effectively.
However, we can see that both approaches are ineffective:
we suspect this is due to the length of the input text being
much smaller than the maximum context window of the
model. Different results may be observed when varying the
amount of input data provided to the LLM for generation.</p>
        <sec id="sec-4-2-1">
          <title>4.2. RQ2: Order of Sentences within the same Cluster</title>
          <p>In this second experiment, we evaluate different sorting
schemes for sentences within the same cluster, keeping
the cluster order fixed at the best strategy determined in
RQ1. We test four different strategies for ordering sentences
within the same cluster: sentences selected in random order
(strategy A); sentences selected in descending order by
reranker score (strategy B); sentences selected by visiting
order11 (strategy C); sentences selected by aggregation order
(strategy D).</p>
          <p>As shown in Table 2, the best results are achieved by two</p>
          <p>(Footnotes: 8 The similarity between a cluster and the
query is defined as the maximum cosine similarity between
the query and any sentence belonging to the cluster. 9 The
clusters are placed first, last, second, second-to-last, third,
and so on, e.g., [A, B, C, D, E] becomes [A, C, E, D, B].
10 The clusters are placed last, first, second-to-last, second,
third-to-last, and so on, e.g., [A, B, C, D, E] becomes
[B, D, E, C, A]. 11 The sentences are sorted based on the
order in which they appear when sequentially scanning
through the set of top retrieved documents.)</p>
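The cluster-query similarity of footnote 8 can be sketched as follows; the function names are illustrative, and query and sentence embeddings are assumed to be given as plain vectors.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_similarity(query_vec, cluster_vecs):
    # Footnote 8: the similarity between a cluster and the query is
    # the maximum cosine similarity between the query and any
    # sentence in the cluster.
    return max(cosine(query_vec, s) for s in cluster_vecs)

def order_clusters(query_vec, clusters):
    # Strategy D sketch: clusters in descending order of
    # similarity with the query.
    return sorted(clusters,
                  key=lambda c: cluster_similarity(query_vec, c),
                  reverse=True)
```

Taking the maximum (rather than the mean) makes a cluster count as relevant as its single best sentence, so one strong match suffices to pull the whole cluster forward.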
        </sec>
        <sec id="sec-4-2-2">
          <title>4.3. RQ3: Comparison with Baselines</title>
          <p>Our last experiment investigates whether our proposed
approach is beneficial in enhancing the overall effectiveness
of the RAG system w.r.t. four simpler baseline methods that
may be used in practice by current state-of-the-art RAG
systems. We test five different strategies: i) the top-5
retrieved documents (A); ii) the top-40 sentences taken in
random order (B); iii) the top-40 sentences taken in
descending order by re-ranker score (C); iv) the top-40
sentences selected by visiting order (D); v) the best
clusterization-based approach determined from RQ1 and
RQ2 (CL).</p>
          <p>The results obtained are shown in Table 3. The
clusterization-based approach demonstrates superior
performance, resulting as the best strategy in this comparison. The
four baselines yield notably lower results, by 15.14%, 54.94%,
8.66%, and 15.67%, respectively. Among the methods
considered in this work, randomly sorting the top-ranked
sentences is by far the least performing approach. This, in
turn, confirms our starting intuition that coherent, fluent,
and well-structured text is a critical factor for LLMs to
generate high-quality output.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Additional Experiments</title>
      <p>The clusterization-based ordering strategy proposed in this
work is designed to position sentences sharing analogous
semantic content close together in the LLM prompt. Given
the results obtained in Section 4.3, we have shown its
effectiveness in our experimental settings. Nevertheless, we
answer two additional research questions in this section to
gain additional insights. Specifically,</p>
      <p>RQ4 Is there a correlation between the similarity of
subsequent sentences in the LLM prompt and the quality
of the generated response?</p>
      <p>[Table caption: Comparisons between the seven approaches
proposed for RQ4: “Is there a correlation between the similarity
of subsequent sentences in the LLM prompt and the quality of
the generated response?”. In the top half, each row reports three
numbers, which are the wins for the approach in the column
label, the ties, and the wins for the approach in the row label,
respectively. In the bottom half, the overall results are reported.]</p>
      <p>sidered. Moreover, subdividing and explicitly grouping
together sentences by subtopic is beneficial w.r.t. considering
the sentence similarity only in a pairwise fashion and thus
lacking a global vision of the retrieved knowledge.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>In this work, we presented a novel pipelined RAG
architecture aimed at selecting a set of relevant sentences for
each query and arranging them in a specific order to
optimize the quality of responses generated by an LLM. For
this purpose, sentences are first extracted from the top
documents retrieved. Then, they are reranked, and the most
relevant sentences are organized in clusters by similarity.
We proposed different strategies for ordering clusters and
the sentences within clusters in the input given to the LLM
for response generation. To the best of our knowledge,
this is the first work investigating sentence clustering and
re-ordering to improve the quality of the response
generated by RAG systems. Our empirical assessment is based
on a well-known, public framework for conversational
search. The results of the experiments show that different
sequences of sentences in the LLM prompt significantly
impact response quality despite all methodologies
processing identical information from the same set of sentences.
Random permutations yield the lowest results, whereas our
proposed approach based on sentence clusterization yields
superior results. Additionally, we examined whether
maximizing the similarity between consecutive sentences in the
LLM prompt enhances response quality. While a positive
correlation between these factors was observed, it is not
the exclusive determinant. Consequently, while we infer
that sentence similarity constitutes a pivotal aspect, other
contributing factors remain unidentified, warranting
further investigation. Moreover, although our experimental
evaluation employs a well-known conversational collection,
the methodology and results shown in this work are
general. They could also be applied to other scenarios, such as
ad-hoc search.</p>
      <p>
        In future work, we intend to evaluate the impact of the
number of clusters selected by our method for generating
the response. Our intuition is that the number of clusters
identified for a given query is a proxy for the difficulty of
the query itself. Fewer clusters, or even a single large one, should
characterize simple and closed queries. In contrast, difficult,
multi-faceted queries are possibly characterized by more
clusters, each addressing a different facet of the query. This
intuition paves the way for the extension of the evaluation
methodology by adopting diversification-based metrics [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ],
allowing us to understand how well the generated answers
cover the query facets and the topical distribution of the
clusters.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2312</volume>
          .
          <fpage>10997</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          , W. Ma,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          , T. Liu,
          <article-title>A survey on hallucination in large language models: Principles, taxonomy</article-title>
          , challenges, and open questions,
          <source>CoRR abs/2311</source>
          .05232 (
          <year>2023</year>
          ). URL: https: //doi.org/10.48550/arXiv.2311.05232. doi:
          <volume>10</volume>
          .48550/ ARXIV.2311.05232. arXiv:
          <volume>2311</volume>
          .
          <fpage>05232</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          , L. Liu,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Siren's song in the AI ocean: A survey on hallucination in large language models</article-title>
          ,
          <source>CoRR abs/2309</source>
          .01219 (
          <year>2023</year>
          ). URL: https: //doi.org/10.48550/arXiv.2309.01219. doi:
          <volume>10</volume>
          .48550/ ARXIV.2309.01219. arXiv:
          <volume>2309</volume>
          .
          <fpage>01219</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of hallucination in natural language generation</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <volume>248</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>248</lpage>
          :
          <fpage>38</fpage>
          . URL: https://doi.org/10.1145/ 3571730. doi:
          <volume>10</volume>
          .1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hewitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paranjape</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bevilacqua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Lost in the middle: How language models use long contexts</article-title>
          ,
          <source>CoRR abs/2307</source>
          .03172 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307. 03172. doi:
          <volume>10</volume>
          .48550/ARXIV.2307.03172. arXiv:
          <volume>2307</volume>
          .
          <fpage>03172</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          , S. Wang,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Is ChatGPT good at search? investigating large language models as re-ranking agents</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP</source>
          <year>2023</year>
          , Singapore, December 6-
          <issue>10</issue>
          ,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>14918</fpage>
          -
          <lpage>14937</lpage>
          . URL: https://doi.org/10. 18653/v1/
          <year>2023</year>
          .emnlp-main.
          <volume>923</volume>
          . doi:
          <volume>10</volume>
          .18653/V1/
          <year>2023</year>
          .EMNLP-MAIN.
          <year>923</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , X. Ma,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ture</surname>
          </string-name>
          ,
          <article-title>Found in the middle: Permutation self-consistency improves listwise ranking in large language models</article-title>
          ,
          <source>CoRR abs/2310</source>
          .07712 (
          <year>2023</year>
          ). URL: https: //doi.org/10.48550/arXiv.2310.07712. doi:
          <volume>10</volume>
          .48550/ ARXIV.2310.07712. arXiv:
          <volume>2310</volume>
          .
          <fpage>07712</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Owoicho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Trippas</surname>
          </string-name>
          , S. Vakulenko, TREC CAsT
          <year>2022</year>
          :
          <article-title>Going beyond user ask and system retrieve with initiative and response generation</article-title>
          , in: I.
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Ellis (Eds.),
          <source>Proceedings of the Thirty-First Text REtrieval Conference</source>
          , TREC
          <year>2022</year>
          ,
          <article-title>online</article-title>
          ,
          <source>November 15-19</source>
          ,
          <year>2022</year>
          , volume
          <volume>500</volume>
          - 338 of NIST Special Publication,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>2022</year>
          . URL: https:// trec.nist.gov/pubs/trec31/papers/Overview_cast.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Khashabi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>When not to trust language models: Investigating effectiveness of parametric and nonparametric memories</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          <string-name>
            <surname>BoydGraber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>ACL</source>
          <year>2023</year>
          , Toronto, Canada, July 9-
          <issue>14</issue>
          ,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>9802</fpage>
          -
          <lpage>9822</lpage>
          . URL: https://doi.org/10.18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>546</volume>
          . doi:
          <volume>10</volume>
          . 18653/V1/
          <year>2023</year>
          .
          <article-title>ACL-LONG</article-title>
          .
          <year>546</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Investigating the factual knowledge boundary of large language models with retrieval augmentation</article-title>
          ,
          <source>CoRR abs/2307</source>
          .11019 (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307. 11019. doi:
          <volume>10</volume>
          .48550/ARXIV.2307.11019. arXiv:
          <volume>2307</volume>
          .
          <fpage>11019</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , E. Grave, G. Lample,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>CoRR abs/2302</source>
          .13971 (
          <year>2023</year>
          ). URL: https: //doi.org/10.48550/arXiv.2302.13971. doi:
          <volume>10</volume>
          .48550/ ARXIV.2302.13971. arXiv:
          <volume>2302</volume>
          .
          <fpage>13971</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>CoRR abs/2307.09288</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2307.09288. doi:10.48550/ARXIV.2307.09288. arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          OpenAI,
          <article-title>GPT-4 technical report</article-title>
          ,
          <source>CoRR abs/2303.08774</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2303.08774. doi:10.48550/ARXIV.2303.08774. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          F. Xu, W. Shi, E. Choi,
          <article-title>RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation</article-title>
          ,
          <source>CoRR abs/2310.04408</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2310.04408. doi:10.48550/ARXIV.2310.04408. arXiv:2310.04408.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, F. Silvestri,
          <article-title>The power of noise: Redefining retrieval for RAG systems</article-title>
          ,
          <source>arXiv preprint arXiv:2401.14887</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          K. Papineni, S. Roukos, T. Ward, W. Zhu,
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          , in:
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</source>
          , July 6-12,
          <year>2002</year>
          , Philadelphia, PA, USA, ACL,
          <year>2002</year>
          , pp. 311-318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          C.-Y. Lin,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in:
          <source>Text Summarization Branches Out</source>
          , Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp. 74-81. URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          S. Banerjee, A. Lavie,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          , in: J. Goldstein, A. Lavie, C. Lin, C. R. Voss (Eds.),
          <source>Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005</source>
          , Ann Arbor, Michigan, USA, June 29,
          <year>2005</year>
          , Association for Computational Linguistics,
          <year>2005</year>
          , pp. 65-72. URL: https://aclanthology.org/W05-0909/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          , in:
          <source>8th International Conference on Learning Representations, ICLR 2020</source>
          , Addis Ababa, Ethiopia, April 26-30,
          <year>2020</year>
          , OpenReview.net,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=SkeHuCVFDr.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          T. Mikolov, K. Chen, G. Corrado, J. Dean,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          , in: Y. Bengio, Y. LeCun (Eds.),
          <source>1st International Conference on Learning Representations, ICLR 2013</source>
          , Scottsdale, Arizona, USA, May 2-4,
          <year>2013</year>
          , Workshop Track Proceedings,
          <year>2013</year>
          . URL: http://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          J. Pennington, R. Socher, C. D. Manning,
          <article-title>GloVe: Global vectors for word representation</article-title>
          , in: A. Moschitti, B. Pang, W. Daelemans (Eds.),
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014</source>
          , October 25-29,
          <year>2014</year>
          , Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL,
          <year>2014</year>
          , pp. 1532-1543. URL: https://doi.org/10.3115/v1/d14-1162. doi:10.3115/V1/D14-1162.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          J. Devlin, M. Chang, K. Lee, K. Toutanova,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</source>
          , Minneapolis, MN, USA, June 2-7,
          <year>2019</year>
          , Volume 1 (Long and Short Papers), Association for Computational Linguistics,
          <year>2019</year>
          , pp. 4171-4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/V1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>J. Mach. Learn. Res.</source>
          21 (
          <year>2020</year>
          ) 140:1-140:67. URL: http://jmlr.org/papers/v21/20-074.html.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica,
          <article-title>Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</article-title>
          , in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.),
          <source>Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023</source>
          , New Orleans, LA, USA, December 10-16,
          <year>2023</year>
          ,
          <year>2023</year>
          . URL: http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
          <article-title>Attention is all you need</article-title>
          , in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</source>
          , December 4-9,
          <year>2017</year>
          , Long Beach, CA, USA,
          <year>2017</year>
          , pp. 5998-6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          E. Clark, S. Rijhwani, S. Gehrmann, J. Maynez, R. Aharoni, V. Nikolaev, T. Sellam, A. Siddhant, D. Das, A. P. Parikh,
          <article-title>SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</source>
          , Singapore, December 6-10,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp. 9397-9413. URL: https://doi.org/10.18653/v1/2023.emnlp-main.584. doi:10.18653/V1/2023.EMNLP-MAIN.584.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. M. Voorhees,
          <article-title>Overview of the TREC 2019 deep learning track</article-title>
          ,
          <source>CoRR abs/2003.07820</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2003.07820. arXiv:2003.07820.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          N. Craswell, B. Mitra, E. Yilmaz, D. Campos,
          <article-title>Overview of the TREC 2020 deep learning track</article-title>
          , in: E. M. Voorhees, A. Ellis (Eds.),
          <source>Proceedings of the Twenty-Ninth Text REtrieval Conference, TREC 2020</source>
          , Virtual Event [Gaithersburg, Maryland, USA], November 16-20,
          <year>2020</year>
          , volume 1266 of NIST Special Publication, National Institute of Standards and Technology (NIST),
          <year>2020</year>
          . URL: https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.DL.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models, CoRR abs/2104.08663 (2021). URL: https://arxiv.org/abs/2104.08663. arXiv:2104.08663.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, CoRR abs/2305.02156 (2023). URL: https://doi.org/10.48550/arXiv.2305.02156. doi:10.48550/ARXIV.2305.02156. arXiv:2305.02156.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Instruction distillation makes large language models efficient zero-shot rankers, CoRR abs/2311.01555 (2023). URL: https://doi.org/10.48550/arXiv.2311.01555. doi:10.48550/ARXIV.2311.01555. arXiv:2311.01555.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] R. Pradeep, S. Sharifymoghaddam, J. Lin, RankVicuna: Zero-shot listwise document reranking with open-source large language models, CoRR abs/2309.15088 (2023). URL: https://doi.org/10.48550/arXiv.2309.15088. doi:10.48550/ARXIV.2309.15088. arXiv:2309.15088.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines: SERP-based conversational response generation, ACM Trans. Inf. Syst. 39 (2021) 47:1-47:29. URL: https://doi.org/10.1145/3432726. doi:10.1145/3432726.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] W. Lajewska, K. Balog, Towards filling the gap in conversational search: From passage retrieval to conversational response generation, in: I. Frommholz, F. Hopfgartner, M. Lee, M. Oakes, M. Lalmas, M. Zhang, R. L. T. Santos (Eds.), Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023, ACM, 2023, pp. 5326-5330. URL: https://doi.org/10.1145/3583780.3615132. doi:10.1145/3583780.3615132.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] W. Lajewska, K. Balog, Towards reliable and factual response generation: Detecting unanswerable questions in information-seeking conversations, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part III, volume 14610 of Lecture Notes in Computer Science, Springer, 2024, pp. 336-344. URL: https://doi.org/10.1007/978-3-031-56063-7_25. doi:10.1007/978-3-031-56063-7_25.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] S. Lin, J. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: A. Rogers, I. Calixto, I. Vulic, N. Saphra, N. Kassner, O. Camburu, T. Bansal, V. Shwartz (Eds.), Proceedings of the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online, August 6, 2021, Association for Computational Linguistics, 2021, pp. 163-173. URL: https://doi.org/10.18653/v1/2021.repl4nlp-1.17. doi:10.18653/V1/2021.REPL4NLP-1.17.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, in: T. R. Besold, A. Bordes, A. S. d'Avila Garcez, G. Wayne (Eds.), Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] F. Petroni, A. Piktus, A. Fan, P. S. H. Lewis, M. Yazdani, N. D. Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, V. Plachouras, T. Rocktäschel, S. Riedel, KILT: a benchmark for knowledge intensive language tasks, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 2523-2544. URL: https://doi.org/10.18653/v1/2021.naacl-main.200. doi:10.18653/V1/2021.NAACL-MAIN.200.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] D. Yang, Y. Zhang, H. Fang, An exploration study of mixed-initiative query reformulation in conversational passage retrieval, in: I. Soboroff, A. Ellis (Eds.), Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022, online, November 15-19, 2022, volume 500-338 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2022. URL: https://trec.nist.gov/pubs/trec31/papers/udel_fang.C.pdf.</mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>[40] S. Otmazgin, A. Cattan, Y. Goldberg, F-coref: Fast, accurate and easy to use coreference resolution, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2022 - System Demonstrations, Taipei, Taiwan, November 20-23, 2022, Association for Computational Linguistics, 2022, pp. 48-56. URL: https://aclanthology.org/2022.aacl-demo.6.</mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>[41] S. Otmazgin, A. Cattan, Y. Goldberg, LingMess: Linguistically informed multi expert scorers for coreference resolution, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, Association for Computational Linguistics, 2023, pp. 2744-2752. URL: https://doi.org/10.18653/v1/2023.eacl-main.202. doi:10.18653/V1/2023.EACL-MAIN.202.</mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>[42] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, I. MacKinnon, Novelty and diversity in information retrieval evaluation, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 659-666. URL: https://doi.org/10.1145/1390334.1390446. doi:10.1145/1390334.1390446.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>