<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Improving RAG Systems via Sentence Clustering and Reordering</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marco</forename><surname>Alessio</surname></persName>
							<email>marco.alessio@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies (ISTI)</orgName>
								<orgName type="institution">National Research Council of Italy (CNR)</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Guglielmo</forename><surname>Faggioli</surname></persName>
							<email>guglielmo.faggioli@unipd.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Engineering (DEI)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nicola</forename><surname>Ferro</surname></persName>
							<email>nicola.ferro@unipd.it</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Information Engineering (DEI)</orgName>
								<orgName type="institution">University of Padua</orgName>
								<address>
									<settlement>Padua</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Franco</forename><forename type="middle">Maria</forename><surname>Nardini</surname></persName>
							<email>francomaria.nardini@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies (ISTI)</orgName>
								<orgName type="institution">National Research Council of Italy (CNR)</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Raffaele</forename><surname>Perego</surname></persName>
							<email>raffaele.perego@isti.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department">Institute of Information Science and Technologies (ISTI)</orgName>
								<orgName type="institution">National Research Council of Italy (CNR)</orgName>
								<address>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Improving RAG Systems via Sentence Clustering and Reordering</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9174BEB69D6338958B5B2BF9BCECBCB4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Retrieval Augmented Generation</term>
					<term>Conversational Search</term>
					<term>Positional Bias</term>
					<term>Arrangement Strategy</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have gained noteworthy importance and attention across different domains and fields in recent years. Information Retrieval (IR) is one of the domains they have impacted the most, as witnessed by the recent increase in the number of IR systems incorporating generative models. Specifically, Retrieval Augmented Generation (RAG) is the emerging paradigm that integrates existing knowledge from large-scale document corpora into the generation process, enabling the model to generate more coherent, contextually relevant, and accurate text across various tasks, including summarization, question answering, and dialogue systems. Recent studies have highlighted the significant positional dependence exhibited by RAG systems: the placement of information within the LLM input prompt drastically affects the generated output. We ground our study on this property by investigating alternative strategies for ordering sentences within the LLM prompt to improve the average quality of the responses generated in dialogues between users and the conversational system. We propose the architecture of an end-to-end RAG-based conversational assistant and empirically evaluate our strategies using the TREC CAsT 2022 collection. Our experiments highlight significant differences between distinct arrangement strategies. By employing an evaluation methodology based on RankVicuna, we show that our best approach achieves improvements of up to 54% in overall response quality over baseline methods.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Retrieval Augmented Generation (RAG) is an emerging paradigm in the field of Artificial Intelligence (AI) to enhance the accuracy and reliability of generative models by exploiting external data sources. In recent years, RAG has gained noteworthy importance and attention across different domains and fields <ref type="bibr" target="#b0">[1]</ref> as it combines the strengths of Information Retrieval (IR) systems and generative models, allowing each to overcome the other's limitations.</p><p>RAG can improve the output of a generative model in several ways. First, it allows the generation process to be grounded on information from trusted knowledge sources incorporated in the provided prompt, thus avoiding or at least mitigating the well-known Large Language Model (LLM) hallucination problem, i.e., when the model generates content that is not factually true or does not concern the prompted text <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>. Second, RAG allows for continuous knowledge updates and the integration of domain-specific information: the LLM can successfully respond to facts and topics not covered in its training data; moreover, it is easily adapted to different scenarios and contexts, without retraining or fine-tuning the entire model using datasets that might be unavailable or limited in scope or size. Finally, grounding the generation process on external knowledge incorporated in the input permits linking the output to verifiable external documents, thus enhancing trustworthiness and transparency <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>.</p><p>Current RAG systems, however, suffer from some drawbacks highlighted in the literature. One of these issues originates from the notable positional sensitivity shown by LLMs. 
The placement of information within the input prompt significantly impacts the resulting output. Previous research <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref> has highlighted biases towards "primacy" and "recency", suggesting that generative models tend to prioritize information placed at the beginning or end of the input while neglecting the central portion.</p><p>In this paper, we advance over previous studies by investigating the positional bias in the context of RAG-based conversational systems. Specifically, we propose a novel strategy for arranging sentences within the input prompt of the LLM to improve the average quality of the generated responses over simpler methods. Our approach is based on the intuition that, just as coherent, fluent, and well-structured text is critical for successful communication between human beings, the same should also apply to LLMs: among all the possible arrangements of the input, those placing sentences with similar meaning closer together in the LLM prompt should generate, on average, better quality output. Therefore, we propose an end-to-end RAG architecture to test our hypothesis. The components of this architecture allow us to precisely identify which sentences are likely useful for answering user queries. To this end, we cluster sentences by their similarity and define alternative strategies for ordering them both inter- and intra-cluster. In this way, we can study the effect of these alternative prompt arrangements on the generated response. To our knowledge, this is the first work that explicitly considers this aspect and allows fine-tuning, in a principled way, the ordering of the input sentences provided to the generative component of a RAG system. We compare our proposed approach against competitive baselines that represent the solutions employed by current RAG systems. 
We experimentally evaluate the performance of our proposed approach using the TREC Conversational Assistance Track (CAsT) 2022 collection <ref type="bibr" target="#b7">[8]</ref>, which allows us to compare the results that different arrangement strategies can achieve in a widely accepted Conversational Search (CS) scenario. Results highlight remarkable differences among the tested sentence placement strategies, with improvements up to 8.66% w.r.t. the best baseline and 54.94% w.r.t. random ordering.</p><p>The remainder of this work is organized as follows: Section 2 surveys the current state-of-the-art about RAG systems and quality evaluation for their responses. Section 3 details the architecture of our RAG system. Section 4 and Section 5 detail the results of an experimental analysis, which aims to highlight how the ordering of clusters and sentences affects the quality of the generated response. Finally, Section 6 draws some conclusions and outlines future directions and extensions of our research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>In the following, we survey the main works dealing with LLM positional dependencies and the difficulties of RAG systems in conciliating internal and external knowledge. Then, we analyze the challenges related to evaluating the quality of RAG responses and to the use of an "LLM-as-a-judge".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Retrieval Augmented Generation</head><p>RAG enhances LLMs by retrieving additional information from an external knowledge source, enabling them to successfully answer queries beyond the scope of the training data. At the same time, RAG mitigates the hallucination problem, i.e., generating factually incorrect text, by referencing the provided external knowledge.</p><p>The RAG paradigm is organized into two main stages: retrieval and generation. Upon receiving a query from the user, the relevant information is retrieved from an external knowledge source. This task is undertaken by a standard IR pipeline that outputs a ranked list of documents. Afterwards, in the generation phase, the LLM synthesizes the response to answer the user query using the information carried by the selected documents.</p><p>Despite its clear advantages, RAG has drawbacks and limitations, which spark several challenges. First, RAG systems employ the external knowledge as their main source of information, disregarding the internal knowledge memorized within the LLM <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>. This, in turn, may cause a decrease in the quality of the generated output when the provided content is not high-quality <ref type="bibr" target="#b9">[10]</ref>. It is not uncommon for RAG to obtain worse outputs w.r.t. what the LLM can achieve in the closed-book scenario, i.e., without supplying retrieved results <ref type="bibr" target="#b9">[10]</ref>. Along this line, it has been observed that the LLM produces better results without injecting external knowledge when the topic popularity is very high <ref type="bibr" target="#b8">[9]</ref>. In general, state-of-the-art LLMs provide good-quality responses for a wide range of questions but require assistance from an IR system when the internal knowledge of the model lacks information about the current topic. 
This phenomenon is likely to occur if the topic is not very popular, requires exceptional expertise, or when scaling the number of parameters of the generative model produces little to no effect <ref type="bibr" target="#b8">[9]</ref>. Another challenge lies in the significant positional dependence <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref> exhibited by LLMs, whereby the placement of information within the input prompt drastically affects the generated output. Prior research <ref type="bibr" target="#b4">[5]</ref> has identified "primacy" and "recency" biases, indicating the tendency of generative models to focus on information positioned either at the beginning or the end of the input while disregarding the central part. Therefore, performance degrades significantly when LLMs must rely on information in the middle of their input context, showing a characteristic U-shaped performance curve <ref type="bibr" target="#b4">[5]</ref>. This, in turn, means that most state-of-the-art generative models do not effectively exploit their longer contexts compared to their smaller and earlier counterparts. These phenomena can be observed in both open-source models, e.g., Llama <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> by Meta, and closed-source ones, e.g., GPT-4 <ref type="bibr" target="#b12">[13]</ref> by OpenAI. It is not advisable to directly input all the retrieved information to the LLM for generating the response. Redundant information and very long contextual data can interfere with the generation quality, leading to repetitive, disjointed, or incoherent outputs <ref type="bibr" target="#b0">[1]</ref>. Therefore, the retrieved content is typically further processed before being given as input to the LLM <ref type="bibr" target="#b13">[14]</ref>. 
A recent work in this direction systematically examines the retrieval strategy of RAG systems <ref type="bibr" target="#b14">[15]</ref>. The authors consider multiple retrieval factors affecting the generation process, such as the relevance of the passages in the prompt context, their position, and their number. One counter-intuitive finding is that the retriever's highest-scoring documents that are not directly relevant to the query, e.g., do not contain the answer, negatively impact the effectiveness of the LLM. Moreover, the authors discover that adding random documents to the prompt improves the LLM accuracy by up to 35%.</p><p>In this work, we rely on the intuition that the use of coherent, fluent, and well-structured inputs can improve RAG and we propose an end-to-end architecture for selecting and structuring the external information included in the LLM prompt for response generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Quality Evaluation</head><p>Another line of research concerns how to evaluate the overall quality of the generated output. Despite human assessment providing the most accurate and reliable measure for evaluating model performance, its high time and cost requirements severely limit its application. Therefore, there exists an ever-increasing demand for automated evaluation techniques that consistently align with human judgements while offering enhanced efficiency and cost-effectiveness.</p><p>In this paper, we focus on text-based generative models. Classical automatic evaluation metrics, such as BLEU <ref type="bibr" target="#b15">[16]</ref>, ROUGE <ref type="bibr" target="#b16">[17]</ref>, and METEOR <ref type="bibr" target="#b17">[18]</ref>, are designed to quantify the degree of similarity between a candidate text and one or more reference texts by assessing their n-gram matches. Their simplicity and explainability, along with a good correlation with human judgements, make these metrics widely used as baselines. However, these metrics exhibit several limitations <ref type="bibr" target="#b18">[19]</ref>: firstly, they cannot account for lexical diversity; secondly, they penalize variations in the semantic ordering of words; thirdly, they struggle to capture and match paraphrases effectively; lastly, they inadequately account for distant dependencies within the text. With the advent of word embeddings <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21]</ref> and neural models <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b23">24]</ref> based on Transformers <ref type="bibr" target="#b24">[25]</ref>, new learned metrics <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b25">26]</ref> have been developed. 
For example, BERTScore <ref type="bibr" target="#b18">[19]</ref> can capture the semantic similarity between the candidate and reference texts employing the contextual embeddings generated by an encoder model, such as BERT <ref type="bibr" target="#b21">[22]</ref>.</p><p>In recent years, the rapid advancement of LLMs, which show remarkable performance across many tasks, has sparked considerable interest in their potential application as annotators and evaluators. Due to their training using Reinforcement Learning from Human Feedback (RLHF), these models demonstrate significant human alignment. Much research has investigated leveraging state-of-the-art LLMs to automatically produce assessments serving as proxies for human judgments, a paradigm known as "LLM-as-a-judge". For example, Zheng et al. <ref type="bibr" target="#b23">[24]</ref> assessed the quality of conversations with various LLMs, both open and closed source, employing GPT-4 <ref type="bibr" target="#b12">[13]</ref> as judge. They experimented with various prompts and different approaches, such as single answer grading and pairwise comparisons both between responses and against a reference text. GPT-3.5 Turbo and GPT-4 <ref type="bibr" target="#b12">[13]</ref> have been employed as listwise rerankers <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref> for the TREC Deep Learning 2019 and 2020 <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref> and BEIR <ref type="bibr" target="#b28">[29]</ref> experimental collections, obtaining state-of-the-art performance <ref type="bibr" target="#b5">[6]</ref>. 
The same LLMs have also been employed as teacher models to fine-tune smaller open-source student models, such as Llama and Vicuna <ref type="bibr" target="#b29">[30,</ref><ref type="bibr" target="#b30">31]</ref> (e.g., RankVicuna <ref type="bibr" target="#b31">[32]</ref>).</p><p>In this work, we rely on state-of-the-art assessment methods and evaluate the quality of the responses generated by the different methods using RankVicuna <ref type="bibr" target="#b31">[32]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Proposed RAG Architecture</head><p>Generative models exhibit strong biases towards information positioned at the start or the end of the input while disregarding the middle part <ref type="bibr" target="#b4">[5]</ref>. This phenomenon motivates our research effort to determine how the order of the input sentences provided to a RAG-based conversational system affects the quality of the generated output and, in turn, the optimal ordering strategy to achieve the best response. This section describes each method and all variations considered in our experiments.</p><p>The architecture of our proposed RAG system is illustrated in Figure <ref type="figure">1</ref>. It includes an IR pipeline, which retrieves the top-𝑘 documents 𝐷 = {𝑑1, 𝑑2, ..., 𝑑𝑘} in response to each user utterance 𝑞. The retrieved documents are then processed by additional components responsible for splitting them into sentences, identifying the most relevant sentences, clustering such sentences based on their semantic similarity, and ordering them according to the various strategies analyzed. Finally, the selected and re-ordered sentences are provided as input to the LLM for response generation. These components are the focus of our research. Their functionalities are detailed in the remainder of this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Document Pre-processing and Splitting</head><p>As observed in the literature <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34]</ref>, the entire text of a relevant document rarely contains meaningful knowledge to satisfy the user information need expressed by a query 𝑞. In most cases, only one or a few portions of the document are relevant to the query, while the remaining parts contain irrelevant information. The proposed architecture aims to precisely identify the key information in the retrieved documents, i.e., the sentences, to reduce the noise in the prompt used for response generation.</p><p>Hereinafter, we consider sentences in the documents as the atomic units of information. Our pipeline, illustrated in Figure <ref type="figure">1</ref>, works as follows. First, for each query 𝑞 we consider only the top-𝑘 documents {𝑑1, 𝑑2, ..., 𝑑𝑘} retrieved by the IR system. Then, a state-of-the-art co-reference resolution model is applied to all documents to replace pronouns and other generic terms within a sentence with the fully specified entity mentioned in a previous sentence. This allows us to remove the contextual dependencies among sentences in a document so they can be considered self-explanatory. The third step splits each document 𝑑𝑖 into a sequence of sentences {𝑠𝑖,1, 𝑠𝑖,2, ..., 𝑠𝑖,𝑛𝑖}. Afterwards, near-duplicate removal is applied to the sentences originating from all documents, discarding sentences with a Jaccard similarity ≥ 0.9 between their Bag-of-Words (BoW) representations<ref type="foot" target="#foot_0">1</ref>.</p></div>
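The near-duplicate removal step can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenization for the Bag-of-Words sets is a simplifying assumption, and the function names are ours; only the 0.9 Jaccard threshold comes from the text.

```python
# Sketch of near-duplicate removal: a sentence is discarded when its
# Bag-of-Words Jaccard similarity with an already kept sentence reaches
# the 0.9 threshold reported in the paper. Tokenization is simplified
# to lowercased whitespace splitting (an assumption of this sketch).

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two Bag-of-Words sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(sentences: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each group of near-duplicate sentences."""
    kept, kept_bows = [], []
    for s in sentences:
        bow = set(s.lower().split())
        if all(jaccard(bow, k) < threshold for k in kept_bows):
            kept.append(s)
            kept_bows.append(bow)
    return kept
```

Since comparison is pairwise against all kept sentences, the cost is quadratic in the number of sentences; for the top-20 documents per query considered here, this remains negligible.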
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Sentence Selection</head><p>After the first pre-processing phase, we obtain a sentence candidate set for each query to be included in the LLM prompt of our RAG system (see Figure <ref type="figure">1</ref>). Since the cardinality of this set can be large and not all the sentences are useful for answering the query, we employ the BERT-based cross-encoder answer-in-the-sentence classifier 2 developed by Lajewska and Balog <ref type="bibr" target="#b34">[35]</ref> to rank the candidate sentences according to their predicted usefulness to (at least partially) answer the query, and we retain the top-𝑛 ranked sentences, discarding the remaining ones. As a possible limitation, please note that the model by Lajewska and Balog <ref type="bibr" target="#b34">[35]</ref> that we employ has been trained on the queries and passages used in our experiments. Therefore, it is very likely that the model performs significantly better on our data w.r.t. any other model, ensuring that top-ranked sentences are indeed relevant to the query. Even though such a model is not available in a real practical scenario, this choice is justified by our research effort being focused exclusively on comparing ordering strategies for sentences in the LLM input rather than on the absolute results achievable by our RAG system.</p></div>
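The selection step reduces to scoring each candidate with a query-sentence relevance model and keeping the top-𝑛. The sketch below abstracts the cross-encoder behind a pluggable `score_fn` argument (our naming, not the paper's); in the actual pipeline, `score_fn` would be the answer-in-the-sentence classifier of Lajewska and Balog.

```python
# Hedged sketch of sentence selection: a scoring model assigns each
# candidate sentence a usefulness score for the query; the top-n
# sentences are retained. `score_fn` is a stand-in for the real
# BERT-based cross-encoder used in the paper.

from typing import Callable

def select_top_sentences(query: str,
                         candidates: list[str],
                         score_fn: Callable[[str, str], float],
                         n: int = 10) -> list[str]:
    """Rank candidates by score_fn(query, sentence) and keep the top n."""
    ranked = sorted(candidates, key=lambda s: score_fn(query, s), reverse=True)
    return ranked[:n]
```

Any query-sentence scorer with this signature can be swapped in, e.g. a lexical-overlap baseline for debugging or a fine-tuned cross-encoder in production.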
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Sentence Clustering and Ordering</head><p>The previous steps of the pipeline constrain the number of sentences per query while increasing their expected utility in answering the query. Furthermore, they allow us to control other noise sources, such as the number or the variable length of the retrieved documents. Therefore, we can assess how the positional bias affects the generation process. We highlight again that the positional bias of LLMs has already been observed in prior research <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. However, it has been considered exclusively as a limitation of LLMs and RAG systems. Our research moves a step forward by investigating the best ordering strategy to maximize, on average, the quality of the generated responses over a testing query set 𝑄. We believe that logically organized text, where sentences with akin meanings are positioned closer in the LLM prompt, should, on average, yield superior output quality. Consequently, our sentence ordering strategies exploit the similarities among the sentences selected by the sentence selection step. To measure semantic inter-sentence similarity, we resort to the contextualized embeddings generated with the tct-colbert model 3 <ref type="bibr" target="#b35">[36]</ref>. We generate the representation of the 𝑛 selected sentences for each query and measure their pair-wise cosine similarity. Then, we progressively aggregate the most similar sentences by employing a hierarchical clustering algorithm. The maximum value of the Silhouette statistic is used as the criterion to determine the optimal clustering among all candidates. As a result, for each query 𝑞 ∈ 𝑄, the top-𝑛 sentences are grouped in a variable number 𝑁𝑐 ≥ 1 of clusters, each composed of one or more sentences with similar semantic meaning. 
To devise different strategies for ordering input sentences, we leverage the above clustering, which allows us to study the impact of sentence placement variations occurring both inter- and intra-cluster.</p><p>More formally, given a query, the set 𝑆 of the 𝑛 previously selected sentences, and the prompt 𝑝, we aim to find the ordering 𝑜𝑟𝑑 * of 𝑆 such that:</p><formula xml:id="formula_0">𝑜𝑟𝑑* = argmax_𝑜𝑟𝑑 ∑_{𝑞∈𝑄} 𝑠(𝑞, 𝐿𝐿𝑀(𝑝, 𝑞, 𝑜𝑟𝑑(𝑆))),</formula><p>where 𝑜𝑟𝑑(𝑆) is a sentence ordering strategy that returns an ordering of the sentences in 𝑆, 𝐿𝐿𝑀(𝑝, 𝑞, 𝑜𝑟𝑑(𝑆)) is the response generated by the LLM for prompt 𝑝, query 𝑞, and sentence ordering 𝑜𝑟𝑑(𝑆), and, finally, 𝑠(𝑞, 𝑟) is a scoring function evaluating the perceived quality of the generated response 𝑟 = 𝐿𝐿𝑀(𝑝, 𝑞, 𝑜𝑟𝑑(𝑆)) for query 𝑞. The order of clusters and the order of the sentences within the same cluster uniquely determine the possible global orderings of the 𝑛 sentences we consider for input to the LLM. Our experimental assessment will evaluate six different ordering strategies for placing the clusters of sentences in the input, and four different methods for ordering sentences within the same cluster. Cluster placements consider different aspects, such as the clusters' cardinality and similarity to the query. The orderings tested include a random one and those obtained by sorting on decreasing/increasing values of each aspect. Finally, the U-shaped order suggested in <ref type="bibr" target="#b4">[5]</ref> is also tested. Regarding the ordering within clusters, we consider random order, order by reranker score, visiting order, and the clustering aggregation order.</p></div>
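The clustering step described above can be sketched with off-the-shelf tools: agglomerative clustering over cosine distances, cutting the dendrogram at the number of clusters that maximizes the Silhouette statistic. This is an illustrative reconstruction under stated assumptions (average linkage, SciPy/scikit-learn implementations, random stand-in embeddings), not the authors' exact configuration.

```python
# Sketch of sentence clustering: hierarchical (agglomerative) clustering
# on cosine distances between sentence embeddings; the cut of the
# dendrogram is chosen by maximising the Silhouette score. The choice
# of 'average' linkage is an assumption of this sketch.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def cluster_sentences(embeddings: np.ndarray) -> np.ndarray:
    """Return one cluster label per sentence, chosen by best Silhouette."""
    Z = linkage(embeddings, method="average", metric="cosine")
    best_labels, best_score = None, -1.0
    n = embeddings.shape[0]
    for k in range(2, n):  # Silhouette requires 2 <= #clusters <= n-1
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```

In the paper's pipeline, `embeddings` would be the tct-colbert representations of the top-𝑛 selected sentences; the loop over 𝑘 makes the number of clusters 𝑁𝑐 query-dependent, as in the text.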
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Evaluation</head><p>We can now formulate the research questions we aim to answer with our experimental framework.</p><p>Research Questions. Given the sentence selection and clustering steps discussed above, the two main aspects to consider for defining our ordering strategies 𝑜𝑟𝑑(•) are the order of placement in the LLM prompt of the clusters and of the sentences within the same cluster. They uniquely determine the global ordering 𝑜𝑟𝑑(•) of the top-𝑛 sentences given as input to the LLM for response generation. Our research questions assess which is the best solution among the alternatives considered. Specifically, RQ1 What is the best cluster ordering strategy? RQ2 What is the best ordering strategy for sentences within the same cluster? RQ3 Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?</p><p>Experimental Settings. We experiment with the TREC CAsT 2022 dataset, a standard experimental collection for CS <ref type="bibr" target="#b7">[8]</ref>. This choice is due to prior research that released additional datasets, models, and human judgments for this benchmark <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b34">35]</ref>. The corpus is composed of three document collections, MS-MARCO v2 <ref type="bibr" target="#b36">[37]</ref>, KILT <ref type="bibr" target="#b37">[38]</ref>, and Washington Post v4, which are subdivided into 106𝑀 short documents. CAsT 2022 includes 18 information needs (topics) and 205 user utterances (queries), with an average of 11.39 user utterances per topic. The number of utterances for which relevance judgements are provided is 163.</p><p>For our experiments, we employ the best-performing run originally submitted to TREC CAsT 2022 4 [39] as the output of the retrieval pipeline. This allows us to focus exclusively on the following steps of our pipeline. 
In all our experiments, we consider only the top-20 retrieved documents, leaving the investigation about the implications of this choice and possible alternatives as future work. To provide meaningful results, all queries where 𝑃 𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛@20 &lt; 0.2, that is, having at most 3 relevant passages in the top-20 results, are discarded <ref type="foot" target="#foot_3">5</ref> , ensuring that enough relevant information is retrieved to answer the considered queries successfully.</p></div>
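The query-filtering rule above is a simple precision cutoff. The sketch below illustrates it; the function and variable names are ours, and binary relevance judgements are assumed.

```python
# Sketch of the query-filtering step: a query is kept only if its
# top-20 retrieved documents contain at least 4 relevant passages,
# i.e. Precision@20 >= 0.2. Names are illustrative.

def keep_query(ranked_docs: list[str], relevant: set[str],
               k: int = 20, min_precision: float = 0.2) -> bool:
    """True if Precision@k over the ranked list meets the threshold."""
    top_k = ranked_docs[:k]
    precision = sum(d in relevant for d in top_k) / k
    return precision >= min_precision
```

With k = 20, the 0.2 threshold corresponds exactly to the "at most 3 relevant passages discarded" rule of the text: 3/20 = 0.15 fails, 4/20 = 0.2 passes.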
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Comparisons between the six approaches proposed for RQ1: "What is the best ordering strategy for clusters?". In the top half, each row reports three numbers, which are the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported. Furthermore, in the steps of the pipeline where the query text is needed, i.e., sentence ranking and response generation, we employed the manually rewritten text for every query. This allows us to account for the possible bias introduced by different query rewriting approaches. Future developments will investigate the relationship between query rewriting approaches and RAG solutions.</p><formula xml:id="formula_1">A</formula><p>For co-reference resolution at the document level, i.e., removing co-references across different sentences in the "document processing" step, we use the "F-Coref" model<ref type="foot" target="#foot_4">6</ref>  <ref type="bibr" target="#b39">[40]</ref> based on the "LingMess" architecture <ref type="bibr" target="#b40">[41]</ref>. After this step, we use the well-known SpaCy Python library to divide each document into a sequence of independent sentences.</p><p>In the following section, we report two different metrics for each comparison. The former is the average score of every approach when assessing all 10 random permutations using RankVicuna. The latter, instead, is a pairwise metric, assessing the number of queries for which the first approach obtains higher/the same/lower score w.r.t. the other one. This information should better highlight the differences and provide a more comprehensive view than a single average value.</p><p>Response Generation. 
For the response generation, we employ Vicuna 7B<ref type="foot" target="#foot_5">7</ref> <ref type="bibr" target="#b23">[24]</ref>, an LLM based on Llama 2 <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref> fine-tuned on 125K user conversations with ChatGPT, gathered from the ShareGPT.com website using public APIs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Quality Evaluation.</head><p>To evaluate the quality of the generated responses, we employ RankVicuna <ref type="bibr" target="#b31">[32]</ref> to perform listwise ranking of all responses being compared. To mitigate the positional bias intrinsic to RankVicuna, we assess 10 different random permutations of the same responses and average the results: a reasonable trade-off between evaluation accuracy and the required computational runtime. For each assessment, we assign (𝑁 + 1 − 𝑖)/𝑁 points to the 𝑖-th ranked response, where 1 ≤ 𝑖 ≤ 𝑁 and 𝑁 is the number of responses being compared. Furthermore, we also count the number of wins and ties between the pairs of responses considered. Whenever a valid judgment cannot be determined from the LLM output, the entire comparison is discarded from the evaluation.</p></div>
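The point-assignment and permutation-averaging scheme above can be sketched in a few lines. This is a minimal illustration of the scoring arithmetic only, not the RankVicuna evaluation harness itself:

```python
# Sketch of the evaluation scoring: the i-th ranked response (1-based)
# receives (N + 1 - i) / N points; scores are then averaged over the
# rankings obtained from several random permutations of the same responses.

def assessment_points(ranking):
    """Points for one assessment: first place gets 1.0, last gets 1/N."""
    n = len(ranking)
    return {resp: (n + 1 - i) / n for i, resp in enumerate(ranking, start=1)}

def average_scores(rankings):
    """Average each response's points over all permutation assessments."""
    totals = {}
    for ranking in rankings:
        for resp, pts in assessment_points(ranking).items():
            totals[resp] = totals.get(resp, 0.0) + pts
    return {resp: total / len(rankings) for resp, total in totals.items()}
```

For example, a response ranked first in both of two assessments of three responses scores 1.0, while two responses that swap second and third place both average 0.5.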
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">RQ1: Order of Clusters</head><p>For the first experiment, we evaluate the effects of different orderings of the clusters while keeping the order of sentences within the same cluster (based on the clustering aggregation</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Comparisons between the four approaches proposed for RQ2: "What is the best ordering strategy for sentences within the same cluster?". In the top half, each row reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.</p><p>order) fixed. We test six different strategies for ordering clusters: clusters selected in random order (strategy A); clusters selected in descending order of cardinality (strategy B); clusters selected in ascending order of similarity with the query<ref type="foot" target="#foot_6">8</ref> (strategy C); clusters selected in descending order of similarity with the query (strategy D); clusters selected in descending order of similarity with the query, using a ping-pong layout from top to bottom (strategy E)<ref type="foot" target="#foot_7">9</ref>; clusters selected in descending order of similarity with the query, using a ping-pong layout from bottom to top (strategy F)<ref type="foot" target="#foot_8">10</ref>.</p><p>As shown in Table <ref type="table">1</ref>, sorting the clusters in descending order of similarity with the query (strategy D) is the clear winner in this comparison, in terms of both score and pairwise wins. This approach performs 18.77%, 15.24%, 20.16%, 23.51%, and 14.81% better than the other options. These figures suggest that the LLM used to generate the responses exhibits a much stronger "primacy" than "recency" bias, as highlighted by option C being overall the worst performing among those considered. Methods E and F, instead, were designed to place the least important clusters towards the center, since LLMs struggle to effectively utilize information in the middle of their prompt. 
However, we can see that both approaches are ineffective: we suspect this is due to the length of the input text being much smaller than the maximum context window of the model. Different results may be observed when varying the amount of input data provided to the LLM for generation.</p></div>
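As an illustrative sketch of the ordering strategies discussed above, strategies D, E, and F can be implemented as follows. The cluster-to-query similarity function `sim` is an assumed callable (e.g., the maximum cosine similarity between the query and any sentence in the cluster, as in footnote 8):

```python
# Sketch of three cluster-ordering strategies. `sim(cluster)` is an assumed
# callable returning the cluster's similarity with the query.

def order_descending(clusters, sim):
    # Strategy D: most similar cluster first.
    return sorted(clusters, key=sim, reverse=True)

def ping_pong_top_to_bottom(clusters, sim):
    # Strategy E: clusters placed first, last, second, second-to-last, ...
    ranked = order_descending(clusters, sim)
    out = [None] * len(ranked)
    front, back = 0, len(ranked) - 1
    for i, cluster in enumerate(ranked):
        if i % 2 == 0:
            out[front] = cluster
            front += 1
        else:
            out[back] = cluster
            back -= 1
    return out

def ping_pong_bottom_to_top(clusters, sim):
    # Strategy F: clusters placed last, first, second-to-last, second, ...
    ranked = order_descending(clusters, sim)
    out = [None] * len(ranked)
    front, back = 0, len(ranked) - 1
    for i, cluster in enumerate(ranked):
        if i % 2 == 0:
            out[back] = cluster
            back -= 1
        else:
            out[front] = cluster
            front += 1
    return out
```

With clusters already ranked as [A, B, C, D, E], strategy E yields [A, C, E, D, B] and strategy F yields [B, D, E, C, A], matching the ping-pong layouts described in the footnotes.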
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">RQ2: Order of Sentences within the same Cluster</head><p>In this second experiment, we evaluate different sorting schemes for sentences within the same cluster, keeping the order of the clusters fixed at the best strategy determined in RQ1. We test four different strategies for ordering sentences within the same cluster: sentences selected in random order (strategy A); sentences selected in descending order of reranker score (strategy B); sentences selected in visiting order<ref type="foot" target="#foot_9">11</ref> (strategy C); sentences selected in aggregation order (strategy D).</p><p>As shown in Table <ref type="table">2</ref>, the best results are achieved by two</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Comparisons between the five approaches considered for RQ3: "Can our proposed strategy enhance the effectiveness of the RAG system w.r.t. baseline methods?". In the top half, each row reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.</p><p>We note, however, that the differences in performance among the various strategies are not large, as the sentences are grouped into clusters by their similarity. The LLM response appears to be more impacted by the order of the clusters than by the order of sentences within each cluster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">RQ3: Comparison with Baselines</head><p>Our last experiment investigates whether our proposed approach is beneficial in enhancing the overall effectiveness of the RAG system w.r.t. four simpler baseline methods that may be used in practice by current state-of-the-art RAG systems. We test five different strategies: i) the top-5 retrieved documents; ii) the top-40 sentences taken in random order (B); iii) the top-40 sentences taken in descending order of re-ranker score (C); iv) the top-40 sentences selected in visiting order (D); v) the best clusterization-based approach determined from RQ1 and RQ2 (CL).</p><p>The results obtained are shown in Table <ref type="table">3</ref>. The clusterization-based approach demonstrates superior performance, emerging as the best strategy in this comparison. The four baselines yield notably lower results, by 15.14%, 54.94%, 8.66%, and 15.67%, respectively. Among the methods considered in this work, randomly sorting the top-ℎ sentences is by far the worst-performing approach. This, in turn, supports our starting intuition that coherent, fluent, and well-structured text is a critical factor for LLMs to generate high-quality output.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Additional Experiments</head><p>The clusterization-based ordering strategy proposed in this work is designed to position sentences sharing analogous semantic content close together in the LLM prompt. The results obtained in Section 4.3 show its effectiveness in our experimental settings. Nevertheless, in this section we answer two additional research questions to gain further insights. Specifically: RQ4: Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Comparisons between the seven approaches proposed for RQ4: "Is there a correlation between the similarity of subsequent sentences in the LLM prompt and the quality of the generated response?". In the top half, each row reports three numbers: the wins for the approach in the column label, the ties, and the wins for the approach in the row label, respectively. In the bottom half, the overall results are reported.</p><p>Experimental Settings. We heuristically determine the two orderings 𝑜𝑟𝑑⁺ and 𝑜𝑟𝑑⁻ that maximize and minimize, respectively, the overall similarity between subsequent sentences. Let 𝑠𝑢𝑚⁺ and 𝑠𝑢𝑚⁻ be the sums of the similarities between subsequent sentences for 𝑜𝑟𝑑⁺ and 𝑜𝑟𝑑⁻, respectively. The similarity 𝑠𝑖𝑚(𝑝) of a sentence permutation 𝑝 is given by the following equation, where min-max normalization is used and 𝑠_𝑖 is the embedding representation of the 𝑖-th sentence:</p><formula xml:id="formula_3">𝑠𝑖𝑚(𝑝) = ( ∑_{𝑖=2}^{ℎ} cos(𝑠_{𝑖−1}, 𝑠_𝑖) − 𝑠𝑢𝑚⁻ ) / ( 𝑠𝑢𝑚⁺ − 𝑠𝑢𝑚⁻ )</formula><p>In our experiments, for each query we generate one million random permutations, and then determine the permutation whose similarity is closest to each of the following thresholds: 0.125, 0.250, 0.375, 0.500, and 0.625. We decided to stop at 0.625 because higher values are unlikely to be observed, given that the average similarity of these permutations is 0.3433 with a standard deviation of 0.0530.</p><p>Results. We determine how the quality of the generated response is influenced when varying the similarity between subsequent sentences across the predefined thresholds, as shown in Table <ref type="table">4</ref>. It is interesting to note that the highest results are obtained by permutations with 0.625 normalised similarity, rather than 1.000, which corresponds to the ordering maximising the similarity between subsequent sentences (𝑜𝑟𝑑⁺). This method achieves 4.47% and 26.67% more pairwise wins w.r.t. 𝑜𝑟𝑑⁺ and 𝑜𝑟𝑑⁻, respectively. 
To answer RQ5, we assess the responses generated using the best clustering strategy against the approach defined above. The average scores are 0.7652 and 0.7348, while the pairwise wins and ties are 38 - 46 - 31, respectively.</p><p>From these experiments, we can conclude that a positive correlation exists between the similarity of subsequent sentences and response quality, while also showing that sentence similarity may not be the only factor to consider. Moreover, subdividing and explicitly grouping sentences by subtopic is beneficial compared with considering sentence similarity only in a pairwise fashion, which lacks a global view of the retrieved knowledge.</p></div>
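The min-max-normalized adjacency similarity defined for RQ4 can be sketched as follows. This is an illustrative pure-Python version: the sentence embeddings and the heuristic search for the extreme orderings (whose sums are passed in as `sum_minus` and `sum_plus`) are assumptions outside this sketch.

```python
# Sketch of the min-max-normalized adjacency similarity sim(p).
# `emb` maps each sentence id to its embedding vector; `sum_minus` and
# `sum_plus` are the adjacency-similarity sums of ord- and ord+.

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def adjacency_similarity(perm, emb):
    """Sum of cosine similarities between subsequent sentences in perm."""
    return sum(cosine(emb[perm[i - 1]], emb[perm[i]]) for i in range(1, len(perm)))

def normalized_sim(perm, emb, sum_minus, sum_plus):
    """sim(p) = (adjacency sum - sum-) / (sum+ - sum-)."""
    return (adjacency_similarity(perm, emb) - sum_minus) / (sum_plus - sum_minus)
```

A permutation that places semantically identical sentences adjacently attains normalized similarity 1.0, while one that maximally separates them attains 0.0.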
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Work</head><p>In this work, we presented a novel pipelined RAG architecture aimed at selecting a set of relevant sentences for each query and arranging them in a specific order to optimize the quality of the responses generated by an LLM. For this purpose, sentences are first extracted from the top retrieved documents. Then, they are reranked, and the most relevant sentences are organized into clusters by similarity. We proposed different strategies for ordering clusters, and the sentences within clusters, in the input given to the LLM for response generation. To the best of our knowledge, this is the first work investigating sentence clustering and re-ordering to improve the quality of the responses generated by RAG systems. Our empirical assessment is based on a well-known public framework for conversational search. The results of the experiments show that different sequences of sentences in the LLM prompt significantly impact response quality, despite all methodologies processing identical information from the same set of sentences. Random permutations yield the lowest results, whereas our proposed approach based on sentence clusterization yields superior results. Additionally, we examined whether maximizing the similarity between consecutive sentences in the LLM prompt enhances response quality. While a positive correlation between these factors was observed, it is not the exclusive determinant. Consequently, while we infer that sentence similarity constitutes a pivotal aspect, other contributing factors remain unidentified, warranting further investigation. Moreover, although our experimental evaluation employs a well-known conversational collection, the methodology and results shown in this work are general. 
They could also be applied to other scenarios, such as ad-hoc search.</p><p>In future work, we intend to evaluate the impact of the number of clusters selected by our method on the generated response. Our intuition is that the number of clusters identified for a given query is a proxy for the difficulty of the query itself. Fewer clusters, or even a single large one, should characterize simple, closed queries. In contrast, difficult, multi-faceted queries are possibly characterized by more clusters, each addressing a different facet of the query. This intuition paves the way for extending the evaluation methodology by adopting diversification-based metrics <ref type="bibr" target="#b41">[42]</ref>, allowing us to understand how well the generated answers cover the query facets and the topical distribution of the clusters.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>3 https://huggingface.co/castorini/tct_colbert-v2-hnp-msmarco</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">This step is particularly important in our setting because the CAsT 2022 corpus contains a multitude of near-duplicate documents. In particular, the same Wikipedia article is often replicated in documents retrieved from the KILT and MS-MARCO collections.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The model named "squad_snippets_unanswerable" is available at https://iai.group/downloads/emnlp2023-answerability_prediction.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">The run is identified as "udinfo_mi_b2021" from the "udel_fang" group, University of Delaware (USA).</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">The number of queries considered in these experiments is 115 out of 163 evaluated in the official relevance judgments.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://huggingface.co/biu-nlp/f-coref</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://huggingface.co/lmsys/vicuna-7b-v1.5</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">The similarity between a cluster 𝐶 and the query is defined as the maximum cosine similarity between the query 𝑞 ∈ 𝑄 with any sentence 𝑠𝑖,𝑗 ∈ 𝐶 belonging to the cluster.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">The clusters are placed first, last, second, second-to-last, third, and so on, e.g., [A, B, C, D, E] becomes [A, C, E, D, B].</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">The clusters are placed last, first, second-to-last, second, third-to-last, and so on, e.g., [A, B, C, D, E] becomes [B, D, E, C, A].</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">The sentences are sorted based on the order in which they appear when sequentially scanning through the set of top-𝑘 retrieved documents.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<title level="m">Retrieval-augmented generation for large language models: A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</title>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2311.05232</idno>
		<idno type="arXiv">arXiv:2311.05232</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Siren&apos;s song in the AI ocean: A survey on hallucination in large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Luu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2309.01219</idno>
		<idno type="arXiv">arXiv:2309.01219</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2309.01219" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
		<ptr target="https://doi.org/10.1145/3571730" />
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Lost in the middle: How language models use long contexts</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paranjape</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bevilacqua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2307.03172</idno>
		<idno type="arXiv">arXiv:2307.03172</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Is chatgpt good at search? investigating large language models as re-ranking agents</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.EMNLP-MAIN.923</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.emnlp-main.923" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">December 6-10, 2023. 2023</date>
			<biblScope unit="page" from="14918" to="14937" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Found in the middle: Permutation self-consistency improves listwise ranking in large language models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ture</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2310.07712</idno>
		<idno type="arXiv">arXiv:2310.07712</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">TREC cast 2022: Going beyond user ask and system retrieve with initiative and response generation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Owoicho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dalton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Aliannejadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Trippas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vakulenko</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec31/papers/Overview_cast.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Soboroff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the Thirty-First Text REtrieval Conference, TREC 2022</meeting>
		<imprint>
			<publisher>NIST Special Publication</publisher>
			<date type="published" when="2022">November 15-19, 2022. 2022</date>
			<biblScope unit="page" from="500" to="338" />
		</imprint>
		<respStmt>
			<orgName>National Institute of Standards and Technology (NIST)</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">When not to trust language models: Investigating effectiveness of parametric and nonparametric memories</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mallen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Asai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khashabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.ACL-LONG.546</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.acl-long.546" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">July 9-14, 2023. 2023</date>
			<biblScope unit="page" from="9802" to="9822" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Investigating the factual knowledge boundary of large language models with retrieval augmentation</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2307.11019</idno>
		<idno type="arXiv">arXiv:2307.11019</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2302.13971</idno>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2302.13971" />
		<title level="m">LLaMA: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Blecher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Canton-Ferrer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Cucurull</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Esiobu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fuller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Goswami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hartshorn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Inan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kardas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kerkez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Khabsa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Kloumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Korenev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liskovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Molybog</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Poulton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Reizenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rungta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Saladi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Schelten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Silva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Subramanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">E</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Taylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Kuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Zarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kambadur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stojnic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2307.09288</idno>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2307.09288" />
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">GPT-4 technical report</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2303.08774</idno>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2303.08774" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">GPT-4 technical report</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">RECOMP: improving retrieval-augmented LMs with compression and selective augmentation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Choi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2310.04408</idno>
		<idno type="arXiv">arXiv:2310.04408</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2310.04408" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">F</forename><surname>Cuconasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trappolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Siciliano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Filice</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Campagnano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Maarek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silvestri</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.14887</idno>
		<title level="m">The power of noise: Redefining retrieval for RAG systems</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Philadelphia, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACL</publisher>
			<date type="published" when="2002">July 6-12, 2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">METEOR: an automatic metric for MT evaluation with improved correlation with human judgments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W05-0909/" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005</title>
		<title level="s">Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Voss</surname></persName>
		</editor>
		<meeting>the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005<address><addrLine>Ann Arbor, Michigan, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2005-06-29">June 29, 2005</date>
			<biblScope unit="page" from="65" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">BERTScore: Evaluating text generation with BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=SkeHuCVFDr" />
	</analytic>
	<monogr>
		<title level="m">8th International Conference on Learning Representations, ICLR 2020</title>
				<meeting><address><addrLine>Addis Ababa, Ethiopia</addrLine></address></meeting>
		<imprint>
			<publisher>OpenReview</publisher>
			<date type="published" when="2020">April 26-30, 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1301.3781" />
	</analytic>
	<monogr>
		<title level="m">1st International Conference on Learning Representations, ICLR 2013</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</editor>
		<meeting><address><addrLine>Scottsdale, Arizona, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013">May 2-4, 2013</date>
		</imprint>
	</monogr>
	<note>Workshop Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="DOI">10.3115/V1/D14-1162</idno>
		<ptr target="https://doi.org/10.3115/v1/d14-1162" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Moschitti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Pang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</editor>
		<meeting>the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014<address><addrLine>Doha, Qatar</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">October 25-29, 2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
	<note>A meeting of SIGDAT, a Special Interest Group of the ACL</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/N19-1423</idno>
		<ptr target="https://doi.org/10.18653/v1/n19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019<address><addrLine>Minneapolis, MN, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">June 2-7, 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v21/20-074.html" />
	</analytic>
	<monogr>
		<title level="j">J. Mach. Learn. Res</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">67</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Oh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Naumann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Globerson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Saenko</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Levine</surname></persName>
		</editor>
		<meeting><address><addrLine>New Orleans, LA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">December 10-16, 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Guyon</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">U</forename><surname>Luxburg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bengio</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><forename type="middle">M</forename><surname>Wallach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">V N</forename><surname>Vishwanathan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Garnett</surname></persName>
		</editor>
		<meeting><address><addrLine>Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">December 4-9, 2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">SEAHORSE: A multilingual, multifaceted dataset for summarization evaluation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rijhwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gehrmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Aharoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Nikolaev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Sellam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Siddhant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Parikh</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.EMNLP-MAIN.584</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.emnlp-main.584" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">December 6-10, 2023</date>
			<biblScope unit="page" from="9397" to="9413" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Overview of the TREC 2019 deep learning track</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
		<idno>CoRR abs/2003.07820</idno>
		<ptr target="https://arxiv.org/abs/2003.07820" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Overview of the TREC 2020 deep learning track</title>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Campos</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec29/papers/OVERVIEW.DL.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event</title>
		<title level="s">NIST Special Publication</title>
		<editor>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the Twenty-Ninth Text REtrieval Conference, TREC 2020, Virtual Event<address><addrLine>Gaithersburg, Maryland, USA</addrLine></address></meeting>
		<imprint>
			<publisher>National Institute of Standards and Technology (NIST)</publisher>
			<date type="published" when="2020">November 16-20, 2020</date>
			<biblScope unit="volume">1266</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</title>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Reimers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rücklé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gurevych</surname></persName>
		</author>
		<idno>CoRR abs/2104.08663</idno>
		<ptr target="https://arxiv.org/abs/2104.08663" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Zero-shot listwise document reranking with a large language model</title>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pradeep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2305.02156</idno>
		<idno type="arXiv">arXiv:2305.02156</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2305.02156" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">Instruction distillation makes large language models efficient zero-shot rankers</title>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2311.01555</idno>
		<idno type="arXiv">arXiv:2311.01555</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2311.01555" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">RankVicuna: Zero-shot listwise document reranking with open-source large language models</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pradeep</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sharifymoghaddam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2309.15088</idno>
		<idno type="arXiv">arXiv:2309.15088</idno>
		<ptr target="https://doi.org/10.48550/arXiv.2309.15088" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Conversations with search engines: SERP-based conversational response generation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Monz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Rijke</surname></persName>
		</author>
		<idno type="DOI">10.1145/3432726</idno>
		<ptr target="https://doi.org/10.1145/3432726" />
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="page">29</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Towards filling the gap in conversational search: From passage retrieval to conversational response generation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lajewska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</author>
		<idno type="DOI">10.1145/3583780.3615132</idno>
		<ptr target="https://doi.org/10.1145/3583780.3615132" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023</title>
		<editor>
			<persName><forename type="first">I</forename><surname>Frommholz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Hopfgartner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Oakes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lalmas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><forename type="middle">L T</forename><surname>Santos</surname></persName>
		</editor>
		<meeting>the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023<address><addrLine>Birmingham, United Kingdom</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">October 21-25, 2023</date>
			<biblScope unit="page" from="5326" to="5330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Towards reliable and factual response generation: Detecting unanswerable questions in information-seeking conversations</title>
		<author>
			<persName><forename type="first">W</forename><surname>Lajewska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Balog</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-56063-7_25</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-56063-7_25" />
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval -46th European Conference on Information Retrieval, ECIR 2024</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<editor>
			<persName><forename type="first">N</forename><surname>Goharian</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Tonellotto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lipani</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>McDonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Macdonald</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</editor>
		<meeting><address><addrLine>Glasgow, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">March 24-28, 2024</date>
			<biblScope unit="volume">14610</biblScope>
			<biblScope unit="page" from="336" to="344" />
		</imprint>
	</monogr>
	<note>Proceedings, Part III</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2021.REPL4NLP-1.17</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.repl4nlp-1.17" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Calixto</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Vulic</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Saphra</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Kassner</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">O</forename><surname>Camburu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Bansal</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Shwartz</surname></persName>
		</editor>
		<meeting>the 6th Workshop on Representation Learning for NLP, RepL4NLP@ACL-IJCNLP 2021, Online</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021-08-06">August 6, 2021</date>
			<biblScope unit="page" from="163" to="173" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">MS MARCO: A human generated machine reading comprehension dataset</title>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rosenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">T</forename><forename type="middle">R</forename><surname>Besold</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Avila Garcez</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Wayne</surname></persName>
		</editor>
		<meeting>the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016)<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-12-09">December 9, 2016</date>
			<biblScope unit="volume">1773</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">KILT: a benchmark for knowledge intensive language tasks</title>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S H</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yazdani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>De Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Maillard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Plachouras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2021.NAACL-MAIN.200</idno>
		<ptr target="https://doi.org/10.18653/v1/2021.naacl-main.200" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online</title>
		<editor>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rumshisky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hakkani-Tür</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</editor>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online</meeting>
		<imprint>
			<date type="published" when="2021">June 6-11, 2021</date>
			<biblScope unit="page" from="2523" to="2544" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">An exploration study of mixed-initiative query reformulation in conversational passage retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fang</surname></persName>
		</author>
		<ptr target="https://trec.nist.gov/pubs/trec31/papers/udel_fang.C.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022</title>
		<title level="s">NIST Special Publication</title>
		<editor>
			<persName><forename type="first">I</forename><surname>Soboroff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Ellis</surname></persName>
		</editor>
		<meeting>the Thirty-First Text REtrieval Conference, TREC 2022</meeting>
		<imprint>
			<publisher>National Institute of Standards and Technology (NIST)</publisher>
			<date type="published" when="2022">November 15-19, 2022</date>
			<biblScope unit="volume">500-338</biblScope>
		</imprint>
		<respStmt>
			<orgName>National Institute of Standards and Technology (NIST)</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">F-coref: Fast, accurate and easy to use coreference resolution</title>
		<author>
			<persName><forename type="first">S</forename><surname>Otmazgin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cattan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.aacl-demo.6" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2022 - System Demonstrations</title>
		<meeting>the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, AACL/IJCNLP 2022 - System Demonstrations<address><addrLine>Taipei, Taiwan</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2022">November 20-23, 2022</date>
			<biblScope unit="page" from="48" to="56" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Lingmess: Linguistically informed multi expert scorers for coreference resolution</title>
		<author>
			<persName><forename type="first">S</forename><surname>Otmazgin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cattan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</author>
		<idno type="DOI">10.18653/V1/2023.EACL-MAIN.202</idno>
		<ptr target="https://doi.org/10.18653/v1/2023.eacl-main.202" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023</title>
		<editor>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Augenstein</surname></persName>
		</editor>
		<meeting>the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023<address><addrLine>Dubrovnik, Croatia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">May 2-6, 2023</date>
			<biblScope unit="page" from="2744" to="2752" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Novelty and diversity in information retrieval evaluation</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kolla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vechtomova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ashkan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Büttcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mackinnon</surname></persName>
		</author>
		<idno type="DOI">10.1145/1390334.1390446</idno>
		<ptr target="https://doi.org/10.1145/1390334.1390446" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;08</title>
		<meeting>the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR &apos;08<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="659" to="666" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
