<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zackary</forename><surname>Rackauckas</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Columbia University</orgName>
								<address>
									<settlement>New York</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arthur</forename><surname>Câmara</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Zeta Alpha</orgName>
								<address>
									<settlement>Amsterdam</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jakub</forename><surname>Zavrel</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Zeta Alpha</orgName>
								<address>
									<settlement>Amsterdam</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">DE55414E9F4DE6AC19C183FE8734E14D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Retrieval-augmented generation</term>
					<term>Elo-based evaluation</term>
					<term>LLM-as-a-judge</term>
					<term>RAG-Fusion</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold-standard benchmarks for company-internal tasks. This makes it difficult to evaluate RAG variations, such as RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of RAG agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain-expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo aligns positively with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The text-generating capabilities of LLMs, together with their text understanding abilities, have allowed conversational Question-Answering (QA) systems to experience a considerable leap in performance, with near-human text quality and reasoning capabilities <ref type="bibr" target="#b0">[1]</ref>. However, these systems can be prone to hallucinations <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, as they sometimes produce seemingly plausible but factually incorrect answers.</p><p>LLM4Eval 2024: The First Workshop on Large Language Models for Evaluation in Information Retrieval, 18 July 2024, Washington, DC. Contact: zcr2105@columbia.edu (Z. Rackauckas); camara@zeta-alpha.com (A. Câmara); zavrel@zeta-alpha.com (J. Zavrel).</p><p>The general inability of such models to identify unanswerable questions <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref> can exacerbate hallucinations, especially in enterprise settings. In such scenarios, user questions may require specific domain knowledge to be answered properly. This knowledge is usually out-of-domain for most LLMs, but is present in private and confidential internal documents from the company.</p><p>One such company is Infineon, a leading manufacturer of semiconductors. Given its wide range of equipment, information about its products is spread across multiple, highly technical documents, including datasheets and selection guides of hundreds of pages. Therefore, Infineon developed an internal retrieval augmented conversational QA system for internal users such as account managers, field application engineers, and sales operations specialists. 
This system allows professionals to ask questions about products from the whole catalog while in the field.</p><p>One of the features of Infineon's conversational agent is the usage of RAG-Fusion (RAGF), a technique for increasing the quality of the generated answers by generating variations of the user question and combining the rankings produced by these variations using rank-fusion methods (i.e., reciprocal rank fusion (RRF) <ref type="bibr" target="#b5">[6]</ref>) into a single ranking that is both more diverse and of higher quality.</p><p>However, evaluating these systems brings complications common to retrieval augmented agents, especially in enterprise settings, stemming from the lack of comprehensive test datasets. Ideally, such a test set would comprise a large set of real user questions from a query log, paired with "golden answers" provided by experts. The lack of such a test set leads to two main issues. First, evaluating answers generated by LLMs with traditional n-gram evaluation metrics such as ROUGE <ref type="bibr" target="#b6">[7]</ref>, BLEU <ref type="bibr" target="#b7">[8]</ref>, and METEOR <ref type="bibr" target="#b8">[9]</ref> is not possible, given the lack of ground-truth answers. Second, and as a consequence, evaluating the quality of the answers generated by the LLM systems would require in-domain experts (potentially from within the company) in a process that is both slow and costly <ref type="bibr" target="#b9">[10]</ref>.</p><p>One approach for tackling the lack of an extensive test set is to use synthetic queries generated by LLMs as a proxy for user queries <ref type="bibr" target="#b10">[11]</ref>. 
However, the lack of in-domain knowledge of LLMs makes queries naively generated by these models unreliable and prone to hallucinations, especially when generating queries about specific products and their specifications (cf. Table <ref type="table">1</ref> for examples of real users' questions submitted to the system).</p><p>To solve this, we propose to use a process similar to InPars <ref type="bibr" target="#b11">[12]</ref> to create a set of synthetic evaluation queries. We ask LLMs to generate queries based on portions of existing documentation injected into the prompt. To increase similarity to real user queries, we include existing user questions as few-shot examples in the prompt. With this process, we are able to generate a large set of high-quality synthetic queries for evaluating our systems. Figure <ref type="figure" target="#fig_2">2</ref> describes the process of generating synthetic queries and the output of a search agent. Table <ref type="table" target="#tab_0">2</ref> shows a sample of these queries.</p><p>To tackle the second issue, the lack of ground-truth "golden answers," we leverage an LLM-as-a-judge process, where a strong LLM is used to evaluate the quality of the answers generated by the RAG agent's LLM <ref type="bibr" target="#b12">[13]</ref>. We then follow the practice of judging generated answers in a pairwise fashion <ref type="bibr" target="#b13">[14]</ref>, prompting the judge LLM to select the better answer between two candidates generated by different RAG pipelines (cf. Section 6 for details of our pipelines).</p><p>Finally, to mitigate the lack of in-domain knowledge of the judging LLM, we also annotate the relevance of the documents retrieved by the pipelines being evaluated and inject the relevant documents into the context used by the judging LLM. 
This allows the judging LLM to better detect hallucinations, assess completeness, and align the quality of the evaluations with those conducted by experts.</p><p>This process is mediated by RAGElo 1 , a toolkit for evaluating RAG systems inspired by the Elo rating system. RAGElo provides an easy-to-use CLI and Python library for using LLMs to evaluate retrieval results and answers produced by RAG pipelines. By combining a retrieval evaluator, a pairwise answer annotator, and an Elo-inspired tournament, RAGElo leverages powerful LLMs to agnostically annotate and rank different RAG pipelines. We note that, although noisy, the LLM annotations generated by RAGElo are generally well aligned with experts' judgments of relative system quality, allowing for fast experimentation and comparison between different RAG implementations without the frequent intervention of experts as annotators.</p><p>This paper evaluates multiple implementations of Infineon's retrieval augmented conversational agent using RAGElo: a traditional Retrieval-Augmented Generation pipeline and a RAG-Fusion implementation. RAG-Fusion generates multiple variations of the user question and combines the rankings produced by these queries into a more diverse set of documents. The documents are then fed into the LLM. We also analyze these same agents under a keyword-based retrieval regime (i.e., the retriever uses BM25 to retrieve and rank documents), a dense retriever, and a hybrid retriever that combines the rankings generated by the BM25 and dense retrievers using RRF. Our goal is to answer the following questions:</p><p>• Does the evaluation framework proposed by RAGElo align with the preferences of human annotators for answers generated by RAG-based conversational agents? 
• Does the RAGF approach of submitting multiple variations of the user question and combining their rankings lead to better answers?</p><p>Table <ref type="table">1</ref>: Sample of questions submitted by users to the Infineon RAG-Fusion system. User-submitted queries: "What is the country of origin of IM72D128, and how does geopolitical exposure affect the market and my SAM for the microphone?"; "What is the IP rating of mounted IM72D128?"; "Tell me microphones that have been released since January 2023 based on the datasheet revision history."; "We need to confirm whether the IFX waterproof MIC has a sleeping mode and wake-up functions."</p><p>Figure 1 caption: While a traditional RAG agent submits only the original query to the search system, a RAGF agent first generates variations of the user query and combines the rankings induced by these queries into a final ranking using RRF. The resulting top-k passages are fed into the LLM for generating the answer to the user's query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Several evaluation systems for RAG have been proposed to address flaws in current evaluation methods. For instance, Facts as a Function (FaaF) <ref type="bibr" target="#b14">[15]</ref> is an end-to-end factual evaluation algorithm specially created for RAG pipelines. By creating functions from ground truth facts, FaaF focuses on the quality of generation and retrieval by calling LLMs. FaaF has substantially increased efficiency and cost-effectiveness, achieving reduced error rates compared to traditional evaluation methods. FaaF's reliance on a set of ground truths, however, does not meet our goal of applying an automated evaluation toolkit to our pipelines. Recently, researchers have moved to eliminate the need for ground truths. This is especially important when automatically evaluating agents that retrieve highly technical documents from a large database, such as the Infineon RAGF conversational agent. RAGElo eliminates this reliance by using an LLM-as-a-judge, a method studied in numerous recent works.</p><p>SelfCheckGPT demonstrates the ability to leverage LLMs to detect and rank factuality with zero resources <ref type="bibr" target="#b15">[16]</ref>. In addition, it has been demonstrated that GPT-3.5 Turbo outperforms ground-truth baselines in fact-checking with a "1/2-shot" method <ref type="bibr" target="#b16">[17]</ref>. A model built to classify statements as true or false based on the activations of an LLM's hidden layers had up to 83% classification accuracy <ref type="bibr" target="#b17">[18]</ref>. This evidence supports RAGElo's usage of LLM-as-a-judge.</p><p>Automated evaluation metrics can also be applied to RAG-based agents. BARTScore, an automated metric based on the BART architecture, has also outperformed most metrics on categories including factuality <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>. 
Besides automated evaluation metrics, several automated evaluation frameworks have been created with a similar goal to RAGElo. Focusing on faithfulness, answer relevance, and content relevance, RAGAS leverages LLM prompting to focus on situations where ground truths and human annotations are not present in a dataset <ref type="bibr" target="#b20">[21]</ref>. Prediction-powered inference aims to decrease the number of human annotations needed for machine-learning prediction; it was demonstrated on a dataset of galaxy images with approximately 300,000 annotations <ref type="bibr" target="#b21">[22]</ref>. The ARES toolkit leverages prediction-powered inference to evaluate RAG systems with fewer human annotations. Like RAGElo, ARES automatically evaluates RAG systems using synthetically generated data <ref type="bibr" target="#b22">[23]</ref>.</p><p>ARAGOG highlights Hypothetical Document Embedding (HyDE) and LLM reranking as effective methods for enhancing retrieval precision while also exploring the effectiveness of Sentence Window Retrieval and the potential of the Document Summary Index in improving RAG systems <ref type="bibr" target="#b23">[24]</ref>.</p><p>While the aforementioned frameworks evaluate answers on relevance, faithfulness, and correctness metrics, RAG can also be evaluated on noise and counterfactual robustness, negative rejection, and information integration <ref type="bibr" target="#b24">[25]</ref>.</p><p>In addition to answers, frameworks have also been created to evaluate documents. Corrective Retrieval Augmented Generation (CRAG) builds on RAG by employing a retrieval evaluator to ensure that only the optimal documents are fed into the LLM prompt prior to the answer generation phase <ref type="bibr" target="#b25">[26]</ref>.</p><p>Due to its Elo-based ranking system for answers, its use of LLM-as-a-judge, and its relevance evaluation of the intermediate retrieval steps in a RAG pipeline, RAGElo is a unique evaluation toolkit. 
In this study, we use it to compare a simple RAG system against a more sophisticated RAGF system in a knowledge-intensive, industry-specific domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Retrieval Augmented QA with rank fusion</head><p>While answers generated by traditional retrieval augmented systems are based on a number of documents retrieved from a single query, RAGF introduces additional variation into the retrieval process. Upon receiving a query from the user, a RAGF agent leverages a large language model to generate a set of queries based on the original <ref type="bibr" target="#b26">[27]</ref>. Table <ref type="table">3</ref> shows examples of queries generated by the agent based on the query, "How to cross-sell a MEMS microphone and a XENSIV sensor to customers?".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Queries Generated from "How to cross-sell a MEMS microphone and a XENSIV sensor to customers?"</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LLM-Generated Query</head><p>What are the key features of Infineon's MEMS microphones and XENSIV sensors that can be highlighted while cross-selling? How can Infineon's MEMS microphones and XENSIV sensors be integrated for enhanced audio and motion sensing capabilities in various applications? What are the most suitable applications and industries for Infineon's MEMS microphones and XENSIV sensors to maximize cross-selling potential?</p><p>After generating the variations for the user query, the RAGF agent submits the original and the generated queries to a retrieval system <ref type="bibr" target="#b27">[28]</ref> that returns the top-𝑘 relevant documents 𝑑_1, 𝑑_2, …, 𝑑_𝑘 from the set of all documents 𝐷 for each query. The rankings 𝑅 induced by these queries are then combined using reciprocal rank fusion (RRF) <ref type="bibr" target="#b5">[6]</ref> into a final, higher-quality set of passages:</p><formula xml:id="formula_0">𝑅𝑅𝐹 𝑆𝑐𝑜𝑟𝑒(𝑑 ∈ 𝐷) = ∑_{𝑟∈𝑅} 1 / (𝑟(𝑑) + 𝑘).<label>(1)</label></formula><p>The intuition behind RAGF is that submitting variations of the same query and combining the final rankings increases the likelihood of relevant passages being injected into the LLM prompt, while non-relevant passages retrieved by a single query are discarded. Figure <ref type="figure" target="#fig_0">1</ref> describes how RAG and RAGF differ.</p><p>As previously discussed, one of the main issues when evaluating the quality of a QA system in an enterprise setting is that, frequently, companies do not have a large enough existing collection of queries to evaluate such systems' quality. Therefore, in this work, we propose to adopt a strategy previously used by methods for generating synthetic queries for training retrieval systems, such as InPars <ref type="bibr" target="#b11">[12]</ref> and Promptagator <ref type="bibr" target="#b28">[29]</ref>.</p></div>
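The RRF combination in Eq. (1) is straightforward to implement. A minimal Python sketch follows; the smoothing constant k = 60 is the common default from the original RRF paper, not necessarily the value used by the Infineon agent:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids with RRF (Eq. 1).

    rankings: list of ranked lists, each ordered best-first.
    k: smoothing constant (60 is a common default; illustrative here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # enumerate from 1 so the top document gets rank 1
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank + k)
    # Return document ids sorted by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by several query variations accumulate the largest fused scores, which is what pushes relevant passages into the LLM prompt while single-query noise falls to the bottom.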
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Development of a synthetic test set</head><p>Similar to these approaches, we randomly sample passages from documents within our collection and prompt an LLM to generate questions that users may ask about these portions. However, one difference in our approach to generating training queries is the size of these passages. When generating queries for training a retrieval system, we ideally want to keep the passages short to fit in the dense encoder's relatively short context windows. However, when generating queries for evaluating QA systems (including retrieval augmented), we are not bound to the limit of the embedding model used for retrieval. Rather, a longer passage may yield questions that require multiple shorter passages to be answered. Therefore, we submit relatively long passages to the LLMs. Specifically, each passage is extracted from up to ten pages of PDF documents (about 2000 tokens 2 ).</p><p>To keep the questions generated as diverse as possible, we prompt four different LLMs to generate up to ten questions based on the same documents. Our test set collection contains a mix of queries generated by OpenAI's GPT-4 Turbo <ref type="bibr" target="#b29">[30]</ref> and Anthropic's Claude-3 <ref type="bibr" target="#b30">[31]</ref> Opus, Sonnet, and Haiku models 3 . From a set of 𝑁 = 840 queries, we sampled 200 queries across all four models. Half of the queries are selected from GPT-4-generated queries, and the other half from Claude 3 queries. Among the Claude 3 queries, to ensure quality and diversity, we again sample according to model size. 
Ultimately, our test set contains 100 queries from GPT-4 Turbo, 50 from Claude 3 Opus, 30 from Sonnet, and 20 from Haiku.</p><p>Finally, to increase the quality of the generated queries, we asked an account manager, a sales operations specialist, a marketing representative, and a business development manager to create queries that they would submit to the conversational agent from the perspective of their role. They were instructed to produce queries regarding products from the XENSIV sensor product line, consisting of MEMS microphones, radar, current, magnetic, pressure, and environmental sensors. We compiled a list of 23 of these queries to use as a base for experimentation and used them as few-shot examples in the query generation prompt. Figure <ref type="figure" target="#fig_2">2</ref> illustrates our method for generating synthetic queries based on existing user queries and document passages.</p></div>
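The few-shot query-generation step described above can be sketched as a prompt builder: real user queries serve as few-shot examples, followed by the sampled in-domain passage. The wording below is illustrative, not the exact prompt used in the paper:

```python
def build_query_generation_prompt(passage, example_queries, n_questions=10):
    """Assemble an InPars-style prompt for synthetic query generation.

    passage: long in-domain passage sampled from the document collection.
    example_queries: real user queries used as few-shot examples.
    n_questions: how many questions to request (paper uses up to ten).
    """
    examples = "\n".join(f"- {q}" for q in example_queries)
    return (
        "You are a user of a product QA assistant. Here are examples of "
        "questions real users have asked:\n"
        f"{examples}\n\n"
        "Based only on the following documentation passage, write up to "
        f"{n_questions} questions such a user might ask:\n\n{passage}"
    )
```

The same prompt template can be sent to each of the four LLMs (GPT-4 Turbo and the three Claude 3 sizes) to obtain a diverse query pool.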
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">LLM-as-a-Judge for RAG pipelines</head><p>Even with a suitable set of synthetic questions for evaluating our RAG conversational agent, assessing whether a given answer properly addresses a question is not trivial. If a ground-truth "golden answer" is available, one can use traditional syntactic-based 2 All LLMs used in our experiments had long context windows of 128k or 200k tokens.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3</head><p>We did not use GPT-3.5 or open-source models due to their shorter context windows at the time of writing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Search agents</head><p>Are there any self-diagnosis features available in the KP23x analog sensor?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RAGElo Retrieval Evaluator RAGElo Retrieval Evaluator</head><p>Agent B: The KP23x analog sensor series includes self-diagnosis capabilities, such as built-in monitoring functionality and ISO 26262 (…)</p><p>Agent A: Yes, the KP23x sensor has some self-diagnosis features.</p><p>Both assistants correctly answer the question (…) according to document &lt;doc_id&gt;, the sensors have features such as (…) Based on these observations, Assistant B provides a more accurate, relevant, and focused response to the user's question regarding the KP23x series sensors. The RAGElo evaluation pipeline. First, documents retrieved by the agents are evaluated pointwise according to their relevance to the user's question. Then, the agents' answers are evaluated pairwise, using the retrieved relevant documents from both agents as reference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Final Verdict: [[B]]</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Retrieved documents and generated answers</head><p>metrics such as BLEU, METEOR, or ROUGE <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b6">7]</ref>. Without such reference answers, one would require human annotators with a considerable understanding of the question's topic to manually assess the quality of the answers produced by each system. However, this is a costly process.</p><p>Alternatively, several LLM-as-a-Judge methods have been proposed, where another LLM is asked to evaluate the quality of answers generated by other LLMs. Nevertheless, in an enterprise setting, the answers usually require the LLM to access knowledge not present in its training dataset but rather contained in documents internal to the company. This is usually accomplished using a RAG pipeline like the one described above. Therefore, the judging LLM also needs access to similar knowledge to accurately evaluate the quality of the agents' answers.</p><p>Therefore, in this work, we rely on RAGElo, an open-source RAG evaluation toolkit that evaluates both the answers generated by each agent and the documents retrieved by them. By injecting the annotations of retrieved documents, pooled from the agents being evaluated, into the answer evaluation step, this method allows the judging LLM to evaluate whether the generated answer properly used all the information available about the question and to check for any hallucinations. As the documents used for generating the answers are included in the answer evaluation prompt, an agent that incorrectly cites information from a source or refers to information not present in these documents is likely hallucinating and should have its evaluation adjusted accordingly. 
As we explore in Section 8, this two-step process results in a high correlation between human expert annotators and the judging LLM, enabling higher reliability and trust when evaluating different RAG pipelines. This process is also illustrated in Figure <ref type="figure" target="#fig_3">3</ref>.</p></div>
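The document-injection idea can be sketched as follows: only documents that the retrieval evaluator marked relevant are templated into the pairwise judging prompt, so the judge can check grounding and spot hallucinations. Function names, relevance labels, and prompt wording are illustrative; RAGElo ships its own prompt templates:

```python
def build_pairwise_judge_prompt(question, doc_annotations, answer_a, answer_b):
    """Build a pairwise answer-comparison prompt with relevant documents injected.

    doc_annotations: {doc_id: (relevance_label, doc_text)} produced by a
    retrieval evaluator, pooled from both agents being compared.
    """
    relevant = [
        f"[{doc_id}] {text}"
        for doc_id, (label, text) in doc_annotations.items()
        if label in ("somewhat relevant", "very relevant")
    ]
    context = "\n".join(relevant) if relevant else "(no relevant documents)"
    return (
        f"User question: {question}\n\n"
        f"Reference documents (pooled from both agents):\n{context}\n\n"
        f"Answer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Compare the answers. Penalize claims not supported by the "
        "reference documents. End with a verdict: [[A]], [[B]], or [[C]] for a tie."
    )
```

An agent citing facts absent from the injected documents can then be flagged as hallucinating, mirroring the two-step pipeline in Figure 3.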
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Evaluation aspects</head><p>While our main evaluation focuses on the pairwise comparison between the two agents, RAGElo also allows us to evaluate answers pointwise. In this setting, similar to other works <ref type="bibr" target="#b31">[32]</ref>, we prompt the judging LLM to evaluate the answers according to multiple criteria:</p><p>• Relevance: Does the answer address the user's question?</p><p>• Accuracy: Is the answer factually correct, based on the documents provided? • Completeness: Does the answer provide all the information needed to answer the user's question? • Precision: If the user's question is about a specific product, does the answer provide the answer for that specific product?</p></div>
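A pointwise rubric over these four criteria might be templated as below. The criteria text follows Section 5.1; the 0–2 scale and the prompt scaffolding are illustrative assumptions, not RAGElo's actual prompt:

```python
# The four pointwise criteria from Section 5.1 (wording from the paper).
POINTWISE_CRITERIA = {
    "relevance": "Does the answer address the user's question?",
    "accuracy": "Is the answer factually correct, based on the documents provided?",
    "completeness": "Does the answer provide all the information needed to answer the user's question?",
    "precision": "If the user's question is about a specific product, does the answer cover that specific product?",
}

def build_pointwise_prompt(question, answer, documents):
    """Ask the judge LLM for one score per criterion (0-2 scale is illustrative)."""
    doc_block = "\n".join(documents)
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in POINTWISE_CRITERIA.items())
    return (
        f"Documents:\n{doc_block}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        f"Rate the answer from 0 to 2 on each criterion:\n{rubric}"
    )
```

The judge's per-criterion scores are what get compared against the expert annotations in Section 7.1.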
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Retrieval pipelines</head><p>We not only experiment with different search agents (i.e., RAG and RAGF); we are also interested in how different retrieval methods may impact the quality of the final answers generated by these agents.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Retrieval methods</head><p>Our corpus consists of passages extracted from the Infineon XENSIV Product Selection Guide, a 117-page document with detailed information on every product in the XENSIV family. This document includes technical information about all Infineon XENSIV sensors, consumer and automotive sensor applications, guidance on selecting the correct sensor, and other comprehensive and detailed information about the product line.</p><p>The passages are embedded using multilingual-e5-base <ref type="bibr" target="#b32">[33]</ref> <ref type="foot" target="#foot_1">4</ref> and indexed using OpenSearch, allowing us to perform KNN-based vector search, keyword-based search with BM25 <ref type="bibr" target="#b33">[34]</ref>, and RRF-based hybrids thereof.</p></div>
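An OpenSearch index supporting both retrieval modes can be declared roughly as follows. Field names are illustrative (the paper does not publish its mapping); 768 is the output dimension of multilingual-e5-base, and E5 models expect "query: "/"passage: " prefixes on the text at embedding time:

```python
# Sketch of an OpenSearch index body covering both retrieval modes:
# BM25 over a text field and k-NN search over an embedding field.
INDEX_BODY = {
    "settings": {"index": {"knn": True}},       # enable k-NN on this index
    "mappings": {
        "properties": {
            "passage": {"type": "text"},         # searched with BM25
            "embedding": {                        # searched with KNN
                "type": "knn_vector",
                "dimension": 768,                 # multilingual-e5-base size
            },
        }
    },
}

def e5_input(text, is_query):
    """Apply the E5 prefix convention before embedding a string."""
    return ("query: " if is_query else "passage: ") + text
```

The two rankings produced against this index (BM25 and KNN) can then be combined with RRF to form the hybrid retriever evaluated in Section 7.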
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">QA Systems Implementation</head><p>We mainly evaluate two agents: a naive RAG pipeline, where the agent first retrieves top-𝑘 passages that are then templated into a prompt, and the Infineon RAG-Fusion (RAGF) agent. Upon receiving a query, a naive RAG agent takes the following actions:</p><p>1. Retrieve the top k most relevant passages from the search system. 2. Perform a Chat Completions API call, prompting the LLM with instructions for generating an answer based on the five relevant passages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Process and output the Chat Completions response.</head><p>Meanwhile, the Infineon RAGF conversational assistant uses a similar framework and performs the following steps upon receiving a query:</p><p>1. Perform a Chat Completions API call to generate four new queries based on the original query using a prompt tailored to the agent's original goal. 2. Retrieve the top k most relevant passages for each query. </p></div>
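The control flow of the two agents can be sketched as follows. Here `retrieve`, `llm`, and `generate_queries` are stand-ins for the search system and the Chat Completions calls (hypothetical helpers, not the production implementation), and a small inline RRF fuses the per-query rankings:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal rank fusion over several best-first ranked lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

def rag_answer(query, retrieve, llm, k=5):
    """Naive RAG: retrieve top-k passages, then prompt the LLM with them."""
    passages = retrieve(query)[:k]
    return llm(query, passages)

def ragf_answer(query, retrieve, llm, generate_queries, k=5):
    """RAG-Fusion sketch: generate query variations, retrieve for each,
    fuse the rankings with RRF, then prompt the LLM with the fused top-k."""
    queries = [query] + generate_queries(query)   # original + variations
    rankings = [retrieve(q) for q in queries]
    fused = rrf(rankings)
    return llm(query, fused[:k])
```

The only structural difference between the agents is the extra query-generation call and the fusion step before the final prompt.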
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Comparing LLM-as-a-judge to expert annotators</head><p>While LLM-as-a-judge is a theoretically viable algorithm for rating RAG and RAGF answers, we must establish whether the results agree with the annotations of domain experts.</p><p>Figure <ref type="figure" target="#fig_5">4</ref> provides a Bland-Altman plot to visually represent the LLM and human judgments' agreement. The bias of approximately 0.12 indicates that, on average, LLM scores were slightly higher than human scores. The limits of agreement ranged from approximately -1.17 to 1.41, demonstrating substantial variability in the differences between LLM and human evaluators.</p><p>Next, we compared LLM-as-a-judge to expert annotators with Kendall's 𝜏. Kendall's 𝜏 is a nonparametric measure that quantifies the degree of association between two monotonic continuous or ordinal variables by calculating the proportion of concordance and discordance among pairwise ranks, offering valuable insight into their rank correlation <ref type="bibr" target="#b34">[35,</ref><ref type="bibr" target="#b35">36]</ref>. We used the SciPy Stats kendalltau function to calculate a tau-b score and a p-value for the combined ratings of all columns, flattened into a 1-D array with RAG and RAGF ratings combined <ref type="bibr">[37]</ref>. The tau-b value, a nonparametric measure of association, is calculated using the following formula <ref type="bibr" target="#b36">[38]</ref>:</p><formula xml:id="formula_1">𝜏_𝑏 = (𝑃 − 𝑄) / √((𝑃 + 𝑄 + 𝑇) ⋅ (𝑃 + 𝑄 + 𝑈))<label>(2)</label></formula><p>Here, P represents the number of concordant pairs, Q the number of discordant pairs, T the number of ties exclusive to x, and U the number of ties exclusive to y. This test returned 𝜏 ≈ 0.56, indicating a moderate, positive correlation <ref type="bibr" target="#b37">[39]</ref>, with a p-value against the null hypothesis of no association of 𝑝 &lt; 0.01. 
For comparison, in similar experiments judging human versus LLM judgments, Faggioli et al. found 𝜏 values of 𝜏 = 0.76 and 𝜏 = 0.86 <ref type="bibr" target="#b38">[40]</ref>.</p><p>Following the same methodology, we also calculated Spearman's 𝜌, a similar nonparametric correlation measure. This resulted in 𝜌 ≈ 0.59 with 𝑝 &lt; 0.01, demonstrating a statistically significant, moderate positive correlation <ref type="bibr" target="#b35">[36]</ref>.</p></div>
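Eq. (2) can also be computed directly; the study used SciPy's `kendalltau`, which implements the same tau-b statistic. A pure-Python sketch (pairs tied in both variables are excluded from all counts, matching the tau-b definition):

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b per Eq. (2): (P - Q) / sqrt((P+Q+T)(P+Q+U))."""
    P = Q = T = U = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue              # tied in both: counted in neither
            elif dx == 0:
                T += 1                # tie exclusive to x
            elif dy == 0:
                U += 1                # tie exclusive to y
            elif dx * dy > 0:
                P += 1                # concordant pair
            else:
                Q += 1                # discordant pair
    return (P - Q) / math.sqrt((P + Q + T) * (P + Q + U))
```

Applied to the flattened LLM and expert rating arrays, this yields the 𝜏 ≈ 0.56 reported above (SciPy additionally returns the p-value).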
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">RAG vs RAGF</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.1.">Quality of retrieved documents</head><p>We assessed document retrieval quality using Mean Reciprocal Rank@5 (MRR@5), which averages the inverse ranks of the first relevant result within the top five positions across all queries. The formula is given by</p><formula xml:id="formula_2">𝑀𝑅𝑅@5 = (1/|𝑄|) ∑_{𝑖=1}^{|𝑄|} 1/rank_𝑖 ,<label>(3)</label></formula><p>where |𝑄| is the total number of queries and rank_𝑖 is counted only if it is within the top five; otherwise, the query contributes zero <ref type="bibr" target="#b39">[41]</ref>. MRR@5 scores were calculated for each agent and each retrieval method considering two categories:</p><p>1. MRR@5 score for documents deemed "somewhat relevant" or "very relevant." 2. MRR@5 score for documents deemed "very relevant."</p><p>The results can be seen below in Table <ref type="table">4</ref>.</p></div>
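Eq. (3) translates into a few lines of Python. The input format assumed here is one entry per query holding the rank of the first relevant document (`None` when nothing relevant was retrieved):

```python
def mrr_at_5(first_relevant_ranks):
    """MRR@5 per Eq. (3): mean of 1/rank_i over all queries.

    first_relevant_ranks: one rank per query (1-based) for the first
    relevant document, or None if no relevant document was retrieved.
    Queries whose first relevant document falls outside the top five
    contribute zero to the sum.
    """
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 5:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)
```

Running this once per agent and retrieval method, under each of the two relevance thresholds, reproduces the structure of Table 4.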
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Mean MRR@5 scores for RAG vs RAG-F. The retrieval method columns indicate if the retrieval component used was vector search only (KNN), keywords only (BM25) or hybrid (KNN and BM25, combined with RRF). </p></div>
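The hybrid retrieval method in Table 4 combines the BM25 and KNN rankings with Reciprocal Rank Fusion. A minimal RRF sketch (the smoothing constant k=60 follows Cormack et al.; the document IDs are hypothetical):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each input ranking contributes
    1/(k + rank) to a document's score; documents are returned
    sorted by their summed score, highest first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d1", "d2", "d3"]
knn_ranking = ["d2", "d3", "d1"]
# d2 is ranked 2nd and 1st, so it wins the fused ranking.
print(rrf_fuse([bm25_ranking, knn_ranking]))  # ['d2', 'd1', 'd3']
```

The same function also serves the RAGF side of the pipeline, where the fused inputs are the rankings induced by the generated query variants rather than by two retrieval methods.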
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.2.">Pairwise evaluation of answers</head><p>We then ran RAGElo games to evaluate the end-to-end answer quality of RAG vs RAGF with different base retriever configurations, a task that cannot rely on standard Information Retrieval metrics. The RAGElo results show more victories for RAGF than for RAG; for example, when using BM25 as a base retriever, RAGF won 49% of the games, RAG won 14.5%, and the two were tied 36.5% of the time. The resulting Elo scores for all six variants are shown in Table <ref type="table" target="#tab_3">6</ref>, which gives a robust ranking of the systems without reliance on a gold standard. It is interesting to see that for both RAGF and RAG, BM25 is a strong baseline that is not surpassed by generic embeddings in these experiments. Next, we compared the RAGElo outcome to the preferences of our Infineon human annotators. We performed two-tailed paired t-tests to compare RAG against RAGF on each category of the Infineon representatives' human evaluations with 𝛼 = .05. As expected, due to the larger variety of results it retrieves, RAGF significantly outperforms RAG in completeness at the 95% confidence level with 𝑝 ≈ 0.01. However, on the precision of answers, RAG significantly outperformed RAGF at the 95% confidence level with 𝑝 ≈ 0.04. </p></div>
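The Elo scores in Table 6 are derived from these pairwise game outcomes. A minimal sketch of a single Elo update; the K-factor of 32 and the starting ratings here are assumptions for illustration, and RAGElo's exact tournament parameters may differ:

```python
def update_elo(rating_a, rating_b, score_a, k=32):
    """One Elo update after a game between A and B.

    score_a is 1.0 for a win by A, 0.5 for a tie, 0.0 for a loss.
    Rating points gained by A equal those lost by B, so the total
    rating mass is conserved across a tournament.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated agents; A wins, gaining 16 points from B.
print(update_elo(500.0, 500.0, 1.0))  # (516.0, 484.0)
```

Averaging the final ratings over many randomized tournaments, as done for Table 6, smooths out the dependence of Elo on game order.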
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Discussion</head><p>As observed above, we found statistically significant, moderate positive correlations between LLM ratings and human annotations. This indicates a consistent association between the ratings from LLM-as-a-judge and those by Infineon experts. We find that, on average, LLM scores are slightly higher than those of human annotators. This means that while relevance judgements on individual queries cannot be considered fully reliable, and IR metrics derived from LLM-as-a-judge should not be equated with regular relevance scores without further calibration, we can still make good use of this approach to rank-order systems. These findings collectively support the validity of our LLM evaluation method, which assesses conversational system outputs based on a combination of relevance, accuracy, completeness, and recall.</p><p>The style of evaluation and the different dimensions it takes into account are specified in the prompts given to the LLM in the RAGElo evaluation, which are provided in Appendix A. Specifically, while the initial LLM-as-a-judge is given specific criteria focusing on only four categories, we instructed RAGElo's impartial judge LLM to value more than the initial four categories: Your evaluation should consider factors such as comprehensiveness, correctness, helpfulness, completeness, accuracy, depth, and level of detail of their responses.</p><p>Since RAGF significantly outperformed RAG in the completeness category, the RAGElo judge LLM likely weighed completeness higher than precision. In addition, based on manual observation of a small random sample of answers, RAGF produced more comprehensive answers with greater depth and level of detail due to the multiple query generation. However, games where RAG won were most likely decided by RAG producing a noticeably more precise answer than RAGF. 
While RAGF values comprehensive answers that offer multiple perspectives to the user, RAG produces shorter answers that address the original query only. Since completeness is defined as the extent to which a user's question was answered, RAGF's longer and more comprehensive answers may tend to be more complete. And since precision relates to the agent mentioning the correct product or product family, RAGF's longer answers have more room to mention other products or product families, leading to reduced answer precision. While the human annotation was done by Infineon experts, different humans may rate answers differently, even when following the same set of criteria.</p><p>A larger number of documents or a database of non-technical documents may have led to a different outcome. RAGF can be applied not only to Infineon documents but to any document database. This includes not only enterprise uses but also uses in education, such as mathematics and language learning. The algorithm can be tuned to different use cases by tweaking the internal LLM prompt. For example, the Infineon RAGF bot was prompted to "think like an engineer," whereas an educator RAGF bot could be prompted to "think like a teacher." Future work includes exploring other applications of RAGF, especially in education. In addition, we will experiment with different prompts for both LLM-as-a-judge and RAGElo while using different quantities and types of documents with the same retrieval algorithms.</p><p>Based on the calculated MRR@5 scores, we found that the RAGF agent mostly outperforms the RAG agent in ranking both highly relevant and somewhat relevant retrieved documents. This suggests that searching on multiple query variants produced, on average, slightly more higher-ranked relevant documents than using only the original user query. 
We also see that using vector search with embeddings is not a silver bullet: for our test queries, BM25 substantially outperforms it. Since retrieval quality is highly dependent on the quality of the embeddings and their fit to the domain, this outcome would likely change after fine-tuning the embeddings or adding intelligent re-rankers, which we leave for future work, as the evaluation framework would remain the same.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Conclusion</head><p>Overall, we found that the evaluation framework proposed by RAGElo aligns positively with the preferences of human annotators for RAG and RAGF, though with due caution given the moderate correlation and the variability of scoring. We found that the RAGF approach leads to better answers most of the time, according to the RAGElo evaluation. According to expert scoring, RAGF significantly outperforms RAG in completeness but significantly underperforms it in precision. Based on these results, we cannot confidently assert that RAGF's approach leads to better answers in general. However, the results do support that RAGF's approach leads to more complete answers and a higher proportion of better answers under evaluation by RAGElo.</p><p>Since RAGElo is generally applicable to all retrieval-augmented algorithms, in future work we also intend to test agents other than RAG and RAGF, including those with different re-ranking algorithms, different embedding models, and different LLMs. In addition, due to RAGF's underperformance in document relevance, we may also leverage CRAG to reduce this gap. We will also investigate the reflection of human sensitivity in expert ratings, especially whether LLMs should or can reflect human sensitivities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. RAGElo's prompts and configurations</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Retrieval Evaluator</head><p>We used RAGElo's default ReasonerEvaluator, which has the following system prompt:</p><p>You are an expert document annotator. Your job is to evaluate whether a document contains relevant information to answer a user's question.</p><p>Please act as an impartial relevance annotator for a search engine. Your goal is to evaluate the relevancy of the documents given a user question.</p><p>You should write one sentence explaining why the document is relevant or not for the user question. A document can be: − Not relevant: The document is not on topic. − Somewhat relevant: The document is on topic but does not fully answer the user question. − Very relevant: The document is on topic and answers the user's question.</p><p>[user question] {query} [document content] {document}</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Answer evaluators</head><p>For the pointwise evaluator used in Section 5.1, we used the following prompt with</p><p>RAGElo's CustomPromptAnswerEvaluator:</p><p>You are an impartial judge for evaluating the quality of the responses provided by an AI assistant tasked to answer users' questions about the catalogue of IoT sensors produced by Infineon.</p><p>You will be given the user's question and the answer produced by the assistant. The agent's answer was generated based on a set of documents retrieved by a search engine. You will be provided with the relevant documents retrieved by the search engine. Your task is to evaluate the answer's quality based on the ## Steps to evaluate an answer: 1. **Understand the user's intent**: Explain in your own words what the user's intent is, given the question. 2. **Check if the answer is correct**: Think step-by-step whether the answer correctly answers the user's question. 3. **Evaluate the quality of the answer**: Evaluate the quality of the answer based on its relevance, accuracy, and completeness. 4. **Assign a score**: Produce a single line JSON object with the following keys, each with a single score between 0 and 2, where 2 is the highest score on that aspect: − "relevance" − 0: The answer is not relevant to the user's question. − 1: The answer is partially relevant to the user's question. 
− 2: The answer is fully relevant to the user's question. − "accuracy" − 0: The answer is factually incorrect. − 1: The answer is partially correct. the user multiple perspectives in addition to but still relevant to the intent of the original question.", )</p><p>This generates 15 random games between two agents per query (i.e., all possible unique games for 6 agents) and tells the evaluator to:</p><p>• Assume the answers do not include specific citations to any passage (has_citations=False)</p><p>• Include the full text of the retrieved passages in the evaluation prompt (include_raw_documents=True) • Inject the output of the retrieval evaluator into the prompt (include_annotations=True) • Ignore any passage with a relevance score below 2 (document_relevance_threshold=2) • Consider the listed factors when selecting the best answer (factors=…)</p><p>These parameters produce the following final prompt used for evaluating the answers:</p><p>Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants tasked to answer the question below based on a set of documents retrieved by a search engine.</p><p>You should choose the assistant that best answers the user question based on a set of reference documents that may or may not be relevant.</p><p>For each reference document, you will be provided with the text of the document as well as reasons why the document is or is not relevant. 
Answers are comprehensive if they show the user multiple perspectives in addition to but still relevant to the intent of the original question. Details are only useful if they answer the user's question. If an answer contains non-relevant details, it should not be preferred over one that only uses relevant information.</p><p>Begin your evaluation by explaining why each answer correctly answers the user's quest After providing your explanation, output your final verdict by strictly following this format: " </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: A traditional Retrieval-Augmented Generation pipeline compared to a RAG-Fusion pipeline. While a traditional RAG agent submits only the original query to the search system, a RAGF agent first generates variations of the user query and combines the rankings induced by these queries into a final ranking using RRF. The resulting top-k passages are fed into the LLM for generating the answer to the user's query.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>What security features does the OPTIGA Trust M provide for IoT devices (…)Are there any self-diagnosis features available in the KP23x analog sensor?For applications with fast switching technologies like SiC, which (…) Which TLE4971 current sensor models are available in the TISON-8-6 package?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Process for creating synthetic queries. We prompt multiple LLMs to generate queries based on existing documents. We include some existing user queries in the prompt as few-shot examples.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3:The RAGElo evaluation pipeline. First, documents retrieved by the agents are evaluated pointwise according to their relevance to the user's question. Then, the agents' answers are evaluated pairwise, using the retrieved relevant documents from both agents as reference.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>3. Using RRF, combine the top-𝑘 passages induced by all queries into a final ranking. 4. Perform a Chat Completions API call prompting the LLM with carefully worded instructions for generating an answer based on the top-𝑘 fused passages. 5. Process and output the Chat Completions response.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Bland-Altman plot to visualize the comparison between LLM-as-a-judge and expert answers.</figDesc><graphic coords="10,202.64,211.18,187.51,140.63" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>response's relevance, accuracy, and completeness. ## Rules for evaluating an answer: − **Relevance**: Does the answer address the user's question? − **Accuracy**: Is the answer factually correct, based on the documents provided? − **Completeness**: Does the answer provide all the information needed to answer the user's question? − **Precision**: If the user's question is about a specific product, does the answer provide the answer for that specific product?</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>− 2:</head><label>2</label><figDesc>The answer is fully correct. − "completeness" − 0: The answer does not provide enough information to answer the user's question. − 1: The answer only answers some aspects of the user's question. − 2: The answer fully answers the user's question. − "precision" − 0: The answer does not mention the same product or product line as the user's question. − 1: The answer mentions a similar product or product line, but not the same as the user's question. − 2: The answer mentions the exact same product or product line as the user's question. The last line of your answer must be a SINGLE LINE JSON object with the keys "relevance", "accuracy", "completeness", and "precision", each with a single score between 0 and 2. [DOCUMENTS RETRIEVED] {documents} [User Query] {query} [Agent answer] {answer} For the pairwise evaluation between agents used for the results in Tables 5 and 6, we used RAGElo's PairwiseAnswerEvaluator with the following parameters: pairwise_evaluator_config = PairwiseEvaluatorConfig( n_games_per_query=15, has_citations=False, include_raw_documents=True, include_annotations=True, document_relevance_threshold=2, factors="the comprehensiveness, correctness, helpfulness, completeness, accuracy, depth, and level of detail of their responses. Answers are comprehensive if they show</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head></head><label></label><figDesc>Your evaluation should consider factors such as comprehensiveness, correctness, helpfulness, completeness, accuracy, depth, and level of detail of their responses.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head></head><label></label><figDesc>ion. Then, you should compare the two responses and provide a short explanation of their differences. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Be as objective as possible.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Sample of synthetic queries for evaluating Infineon's RAG assistant. GPT4 refers to OpenAI's gpt-4-turbo-2024-04-09 model. Opus, Sonnet and Haiku refer to Anthropic's Claude 3 models opus-20240229, sonnet-20240229 and haiku-20240307, respectively.</figDesc><table><row><cell>model</cell><cell cols="2">Query</cell><cell></cell></row><row><cell cols="4">GPT4 What are some typical consumer applications for TLV496x-xTA/B sensors?</cell></row><row><cell cols="3">GPT4 What specific ISO 26262 readiness is available for the KP253 sensor?</cell><cell></cell></row><row><cell cols="4">Opus How small of a form factor can I achieve for a battery-powered air quality device using</cell></row><row><cell cols="2">Infineon's PAS CO2 sensor?</cell><cell></cell><cell></cell></row><row><cell cols="4">Sonnet Can Infineon's sensors support bus configurations or daisy-chaining for simplified wiring and</cell></row><row><cell cols="2">reduced complexity in IoT systems?</cell><cell></cell><cell></cell></row><row><cell cols="4">Haiku Which TLE4971 current sensor models are available in the TISON-8-6 package?</cell></row><row><cell>How to cross-sell a MEMS microphone and a XENSIV sensor to customers?</cell><cell></cell><cell cols="2">To cross-sell a MEMS microphone and a XENSIV sensor to customers, you can follow these steps:</cell></row><row><cell></cell><cell></cell><cell cols="2">1. 
Identify customer's requirements (…)</cell></row><row><cell>Original query</cell><cell>Dense KNN search</cell><cell>Top-k retrieved documents</cell><cell>Agent generated answer</cell></row><row><cell></cell><cell></cell><cell></cell><cell>Traditional RAG Agent</cell></row><row><cell>What are the key features of Infineon's MEMS</cell><cell></cell><cell></cell><cell></cell></row><row><cell>microphones and XENSIV sensors (…)</cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell></cell><cell cols="2">Here are the key features, benefits, and</cell></row><row><cell>How can Infineon's MEMS microphones and XENSIV sensors be integrated for enhanced audio (…)</cell><cell></cell><cell cols="2">applications of Infineon's MEMS microphone and XENSIV sensor products, along with successful cross-selling strategies: (…)</cell></row><row><cell>What are the most suitable applications and industries for Infineon's MEMS microphones (…)</cell><cell>Dense KNN search</cell><cell>Fused top-k retrieved documents</cell><cell>Agent generated answer</cell></row><row><cell>Query variations generated by agent</cell><cell>Per query top-k</cell><cell>(RRF)</cell><cell></cell></row><row><cell></cell><cell>retrieved documents</cell><cell></cell><cell>RAG Fusion Agent</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 5</head><label>5</label><figDesc>RAG vs RAGF Win percentage between pairwise comparison of the agent's answers using GPT-4o as a judge with RAGElo.</figDesc><table><row><cell cols="2">Agent</cell><cell cols="2">BM25 RAG</cell><cell>RAGF</cell><cell>RAG</cell><cell>KNN</cell><cell>RAGF</cell><cell cols="2">Hybrid RAG RAGF</cell><cell>AVG</cell></row><row><cell>BM25</cell><cell>RAG RAGF</cell><cell>-49.0%</cell><cell cols="2">14.5% -</cell><cell>49.5% 58.5%</cell><cell></cell><cell>52.5% 51.5%</cell><cell>29.0% 53.5%</cell><cell>28.5% 30.5%</cell><cell>34.8% 48.6%</cell></row><row><cell>KNN</cell><cell>RAG RAGF</cell><cell>33.0% 34.5%</cell><cell cols="2">27.0% 30.0%</cell><cell>-37.0%</cell><cell></cell><cell>20.0% -</cell><cell>26.0% 30.5%</cell><cell>31.0% 32.0%</cell><cell>27.4% 32.8%</cell></row><row><cell>Hybrid</cell><cell>RAG RAGF</cell><cell>41.5% 46.0%</cell><cell cols="2">21.0% 35.0%</cell><cell>51.5% 49.0%</cell><cell></cell><cell>48.0% 45.5%</cell><cell>-43.5%</cell><cell>20.5% -</cell><cell>36.0% 44.3%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 6</head><label>6</label><figDesc>Elo Ranking for all agents averaged over 500 tournaments.</figDesc><table><row><cell>Agent</cell><cell>Retrieval</cell><cell>Elo score</cell></row><row><cell>RAGF</cell><cell>BM25</cell><cell>571.0</cell></row><row><cell>RAGF</cell><cell>Hybrid</cell><cell>550.0</cell></row><row><cell>RAG</cell><cell>Hybrid</cell><cell>497.0</cell></row><row><cell>RAG</cell><cell>BM25</cell><cell>487.0</cell></row><row><cell>RAGF</cell><cell>KNN</cell><cell>470.0</cell></row><row><cell>RAG</cell><cell>KNN</cell><cell>436.0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head></head><label></label><figDesc>"[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.</figDesc><table><row><cell>[User Question]</cell></row><row><cell>{query}</cell></row><row><cell>[Reference Documents]</cell></row><row><cell>{documents}</cell></row><row><cell>[The Start of Assistant A's Answer]</cell></row><row><cell>{answer_a}</cell></row><row><cell>[The End of Assistant A's Answer]</cell></row><row><cell>[The Start of Assistant B's Answer]</cell></row><row><cell>{answer_b}</cell></row><row><cell>[The End of Assistant B's Answer]</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/zetaalphavector/ragelo</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">https://huggingface.co/intfloat/multilingual-e5-base</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank Brooks Felton from Infineon for his support during this work. We also thank the Infineon sales team for providing valuable feedback.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Language Models are Few-Shot Learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2005.14165</idno>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">On Hallucination and Predictive Uncertainty in Conditional Language Generation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.eacl-main.236</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</title>
				<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2734" to="2744" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Survey of Hallucination in Natural Language Generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page">38</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
	<title level="a" type="main">Do large language models know what they don&apos;t know?</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X.-J</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL 2023</title>
				<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="8653" to="8665" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Amayuelas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2305.13712" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Reciprocal rank fusion outperforms condorcet and individual rank learning methods</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Buettcher</surname></persName>
		</author>
		<idno type="DOI">10.1145/1571941.1572114</idno>
		<ptr target="https://doi.org/10.1145/1571941.1572114" />
	</analytic>
	<monogr>
		<title level="m">SIGIR 2009, SIGIR &apos;09</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="758" to="759" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://doi.org/10.3115/1073083.1073135" />
	</analytic>
	<monogr>
		<title level="m">ACL 2002, ACL &apos;02</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">StatMT 2007</title>
				<meeting><address><addrLine>USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="228" to="231" />
		</imprint>
	</monogr>
	<note>StatMT &apos;07</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Pairwise crowd judgments: Preference, absolute, and ratio</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moffat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Turpin</surname></persName>
		</author>
		<idno type="DOI">10.1145/3291992.3291995</idno>
		<ptr target="https://doi.org/10.1145/3291992.3291995" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd Australasian Document Computing Symposium, ADCS &apos;18</title>
				<meeting>the 23rd Australasian Document Computing Symposium, ADCS &apos;18<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">A comparison of methods for evaluating generative IR</title>
		<author>
			<persName><forename type="first">N</forename><surname>Arabzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2404.04044" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval</title>
		<author>
			<persName><forename type="first">V</forename><surname>Jeronymo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bonifacio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abonizio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fadaee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lotufo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zavrel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2301.01820</idno>
		<idno type="arXiv">arXiv:2301.01820</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers</title>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhao</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2403.02839" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Judging llm-as-a-judge with mt-bench and chatbot arena</title>
		<author>
			<persName><forename type="first">L</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-L</forename><surname>Chiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2306.05685" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Katranidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Barany</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.03888</idno>
		<title level="m">FaaF: Facts as a function for the evaluation of generated text</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Manakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Liusie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J F</forename><surname>Gales</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08896</idno>
		<title level="m">SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-S</forename><surname>Chuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gaitskell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hartvigsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Meng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.03728</idno>
		<title level="m">Interpretable unified language checking</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Azaria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mitchell</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.13734</idno>
		<title level="m">The internal state of an llm knows when it&apos;s lying</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ghazvininejad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.13461</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">BARTScore: Evaluating generated text as text generation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liu</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Beygelzimer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Dauphin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="27263" to="27277" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Es</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>James</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espinosa-Anke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schockaert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.15217</idno>
		<title level="m">RAGAS: Automated evaluation of retrieval augmented generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Angelopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bates</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fannjiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">I</forename><surname>Jordan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zrnic</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.09633</idno>
		<title level="m">Prediction-powered inference</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">ARES: An automated evaluation framework for retrieval-augmented generation systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Saad-Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09476</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Eibich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Nagpal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fred-Ojala</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2404.01037</idno>
		<title level="m">ARAGOG: Advanced RAG output grading</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Benchmarking large language models in retrieval-augmented generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sun</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.01431</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">S.-Q</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-C</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z.-H</forename><surname>Ling</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.15884</idno>
		<title level="m">Corrective retrieval augmented generation</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Toward Optimising a Retrieval Augmented Generation Pipeline using Large Language Model</title>
		<author>
			<persName><forename type="first">G</forename><surname>Fazlija</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">RAG-Fusion: A new take on retrieval augmented generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Rackauckas</surname></persName>
		</author>
		<idno type="DOI">10.5121/ijnlc.2024.13103</idno>
		<ptr target="https://doi.org/10.5121/ijnlc.2024.13103" />
	</analytic>
	<monogr>
		<title level="j">International Journal on Natural Language Computing</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="37" to="47" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bakalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">B</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2209.11755</idno>
		<title level="m">Promptagator: Few-shot dense retrieval from 8 examples</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<title level="m">GPT-4 Turbo and GPT-4</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<title level="m" type="main">The Claude 3 model family: Opus, Sonnet, Haiku</title>
		<author>
			<persName><surname>Anthropic</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">Large language models can accurately predict searcher preferences</title>
		<author>
			<persName><forename type="first">P</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spielman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.10621</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.05672</idno>
		<title level="m">Multilingual E5 text embeddings: A technical report</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Okapi at TREC-3</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
		<ptr target="http://trec.nist.gov/pubs/trec3/papers/city.ps.gz" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of The Third Text REtrieval Conference, TREC 1994</title>
				<editor>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Harman</surname></persName>
		</editor>
		<meeting>The Third Text REtrieval Conference, TREC 1994<address><addrLine>Gaithersburg, Maryland, USA</addrLine></address></meeting>
		<imprint>
			<publisher>NIST Special Publication</publisher>
			<date type="published" when="1994">November 2-4, 1994</date>
			<biblScope unit="page" from="109" to="126" />
		</imprint>
		<respStmt>
			<orgName>National Institute of Standards and Technology (NIST)</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">D</forename><surname>Edwards</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Jong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Ferguson</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.08466</idno>
		<title level="m">Graphing methods for Kendall&apos;s 𝜏</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<title level="m" type="main">Efficient inference for Kendall&apos;s tau</title>
		<author>
			<persName><forename type="first">S</forename><surname>Perreault</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2206.04019</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">The treatment of ties in ranking problems</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Kendall</surname></persName>
		</author>
		<idno type="DOI">10.1093/biomet/33.3.239</idno>
		<ptr target="https://doi.org/10.1093/biomet/33.3.239" />
	</analytic>
	<monogr>
		<title level="j">Biometrika</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="239" to="251" />
			<date type="published" when="1945">1945</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Correlation coefficients: Appropriate use and interpretation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Schober</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Boer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">A</forename><surname>Schwarte</surname></persName>
		</author>
		<idno type="DOI">10.1213/ane.0000000000002864</idno>
		<ptr target="https://doi.org/10.1213/ane.0000000000002864" />
	</analytic>
	<monogr>
		<title level="j">Anesthesia &amp; Analgesia</title>
		<imprint>
			<biblScope unit="volume">126</biblScope>
			<biblScope unit="page" from="1763" to="1768" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Perspectives on large language models for relevance judgment</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L A</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Demartini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wachsmuth</surname></persName>
		</author>
		<idno type="DOI">10.1145/3578337.3605136</idno>
		<ptr target="http://dx.doi.org/10.1145/3578337.3605136" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR &apos;23</title>
				<meeting>the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR &apos;23</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m" type="main">A comprehensive survey of evaluation techniques for recommendation systems</title>
		<author>
			<persName><forename type="first">A</forename><surname>Jadon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Patil</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.16015</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
