<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Elo-based Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zackary Rackauckas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arthur Câmara</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jakub Zavrel</string-name>
          <email>zavrel@zeta-alpha.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Columbia University</institution>
          ,
          <addr-line>New York, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Zeta Alpha</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) question-answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold-standard benchmarks for company-internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAG significantly outperforms RAGF in answer precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Venue</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The text-generating capabilities of LLMs, together with their text understanding abilities,
have allowed conversational Question-Answering (QA) systems to experience a
considerable leap in performance, with near-human text quality and reasoning capabilities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
However, these systems can be prone to hallucinations [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], as they sometimes produce
seemingly plausible but factually incorrect answers.
      </p>
      <p>
        The general inability of such models to identify unanswerable questions [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] can
exacerbate hallucinations, especially in enterprise settings. In such scenarios, user
questions may require specific domain knowledge to be answered properly. This knowledge
is usually out-of-domain for most LLMs, but is present in private and confidential internal
documents from the company.
      </p>
      <p>2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>One such company is Infineon, a leading manufacturer of semiconductors. Given its
wide range of equipment, information about its products is spread across multiple, highly
technical documents, including datasheets and selection guides of hundreds of pages.
Therefore, an internal retrieval augmented conversational QA system was developed by
Infineon for internal users such as account managers, field application engineers, and sales
operations specialists. This system allows professionals to ask questions about products
from the whole catalog while in the field.</p>
      <p>
        One of the features of Infineon’s conversational agent is the usage of RAG-Fusion (RAGF),
a technique for increasing the quality of the generated answers by generating variations
of the user question and combining the rankings produced by these variations using
rank-fusion methods (i.e., reciprocal rank fusion (RRF) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) into a ranking that has both
more diverse and higher quality answers.
      </p>
      <p>
        However, evaluating these systems brings complications common to retrieval augmented
agents, especially in enterprise settings, stemming from the lack of comprehensive test
datasets. Ideally, such a test set would comprise a large set of real user questions from
a query log, paired with “golden answers” provided by experts. The lack of such a
test set leads to two main issues. First, evaluation of answers generated by LLMs by
traditional n-gram evaluation metrics such as ROUGE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BLEU [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and METEOR [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
is not possible, given the lack of ground truth answers. Second, and as a consequence,
evaluating the quality of the answers generated by the LLM systems would require
in-domain experts (potentially from within the company) in a process that is both slow
and costly [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        One approach for tackling the lack of an extensive test set is to use synthetic queries
generated by LLMs as a proxy of user queries [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, the lack of in-domain
knowledge of LLMs makes queries naively generated by these models unreliable and
prone to hallucinations, especially when generating queries about specific products and
their specifications (cf. Table 1 for examples of real users’ questions submitted to the
system).
      </p>
      <p>
        To solve this, we propose to use a process similar to InPars [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to create a set of
synthetic evaluation queries. We ask LLMs to generate queries based on portions of
existing documentation injected into the prompt. To increase similarity to real user
queries, we include existing user questions as few-shot examples to the prompt. With
this process, we are able to generate a large set of high-quality synthetic queries for
evaluating our systems. Figure 2 describes the process of generating synthetic queries
and the output of a search agent. Table 2 shows a sample of these queries.
      </p>
      <p>
        To tackle the second issue, a lack of ground truth “golden answers,” we leverage an
LLM-as-a-judge process, where a strong LLM is used to evaluate the quality of the
answers generated by the RAG agent’s LLM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We then follow the practice of judging
generated answers in a pairwise fashion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], prompting the judge LLM to select the
better answer between two candidates generated by different RAG pipelines (cf. Section
6 for details of our pipelines).
      </p>
      <p>Finally, to mitigate the lack of in-domain knowledge of the judging LLM, we also
annotate the relevance of the documents retrieved by the pipelines being evaluated and
inject the relevant documents in the context used by the judging LLM. This allows the
judging LLM to better assess for hallucinations and completeness and better align the
quality of the evaluations to those conducted by experts.</p>
      <p>This process is mediated by RAGElo1, a toolkit for evaluating RAG systems inspired
by the Elo ranking system. RAGElo provides an easy-to-use CLI and Python library
for using LLMs to evaluate retrieval results and answers produced by RAG pipelines.
By combining a retrieval evaluator, a pairwise answer annotator, and an Elo-inspired
tournament, RAGElo leverages powerful LLMs to agnostically annotate and rank different
RAG pipelines. We notice that, although noisy, the LLM annotations generated by
RAGElo are generally well aligned with experts’ judgments of relative system quality,
allowing for fast experimentation and comparisons between different RAG implementations
without the frequent intervention of experts as annotators.</p>
      <p>This paper evaluates multiple implementations of Infineon’s retrieval augmented
conversational agent using RAGElo: a traditional Retrieval-Augmented Generation and a
RAG-Fusion implementation. RAG-Fusion generates multiple variations of the user question
and combines the rankings produced by these queries into a more diverse set of documents.
The documents are then fed into the LLM. We also analyze these same agents under
a keyword-based retrieval regimen (i.e., the retriever uses BM25 to retrieve and rank
documents), a dense retriever, and a hybrid retriever that combines the ranking generated
by the BM25 and the dense retrievers using RRF. Our goal is to answer the following
questions:
• Does the evaluation framework proposed by RAGElo align with the preferences of
human annotators for answers generated by RAG-based conversational agents?
• Does the RAGF approach of submitting multiple variations of the user question and
combining their rankings lead to better answers?</p>
      <sec id="sec-2-1">
        <title>User-submitted queries</title>
        <p>What is the country of origin of IM72D128, and how does geopolitical exposure affect the market and
my SAM for the microphone?
What is the IP rating of mounted IM72D128?
Tell me microphones that have been released since January 2023 based on the datasheet revision
history.</p>
        <p>We need to confirm whether the IFX waterproof MIC has a sleeping mode and wake-up functions.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Several evaluation systems for RAG have been proposed to address flaws in current
evaluation methods. For instance, Facts as a Function (FaaF) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is an end-to-end
factual evaluation algorithm specially created for RAG pipelines. By creating functions
from ground truth facts, FaaF focuses on the quality of generation and retrieval by
calling LLMs. FaaF has substantially increased efficiency and cost-effectiveness, achieving
reduced error rates compared to traditional evaluation methods. The reliance on a set of
ground truths does not meet our goal of applying an automated evaluation toolkit to
our pipelines. Recently, researchers have moved to eliminate the need for ground truths.
This is especially important when automatically evaluating agents that retrieve highly
technical documents from a large database, such as the Infineon RAGF conversational
agent. RAGElo eliminates this reliance by using an LLM-as-a-judge, a method studied in
numerous recent works.
      </p>
      <sec id="sec-3-1">
        <title>Footnote 1: https://github.com/zetaalphavector/ragelo</title>
        <p>
          SelfCheckGPT demonstrates the ability to leverage LLMs to detect and rank factuality
with zero resources [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In addition, it has been demonstrated that GPT-3.5 Turbo
outperforms ground truth baselines in fact-checking with a "1/2-shot" method [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. A
model built to classify statements as true or false based on the activations of an LLM’s
hidden layers had up to 83% classification accuracy [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This evidence supports RAGElo’s
usage of LLM-as-a-judge.
        </p>
        <p>
          Automated evaluation metrics can also be applied to RAG-based agents. BARTScore,
an automated metric based on the BART architecture, has also outperformed most
metrics on categories including factuality [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ]. Besides automated evaluation metrics,
several automated evaluation frameworks have been created with a similar goal to RAGElo.
Focusing on faithfulness, answer relevance, and content relevance, RAGAS leverages
LLM prompting to focus on situations where ground truths and human annotations are
not present in a dataset [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Prediction-powered inference aims to decrease the number
of human annotations needed for machine learning prediction on a dataset of images
of galaxies with approximately 300,000 annotations [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. The ARES toolkit leverages
prediction-powered inference to evaluate RAG systems with fewer human annotations.
Like RAGElo, ARES automatically evaluates RAG systems using synthetically generated
data [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
        </p>
        <p>
          ARAGOG highlights Hypothetical Document Embedding (HyDE) and LLM reranking
as effective methods for enhancing retrieval precision while also exploring the effectiveness
of Sentence Window Retrieval and the potential of the Document Summary Index in
improving RAG systems [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ].
        </p>
        <p>
          While the aforementioned frameworks evaluate answers on relevance, faithfulness, and
correctness metrics, RAG can also be evaluated on noise and counterfactual robustness,
negative rejection, and information integration [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>
          In addition to answers, frameworks have also been created to evaluate documents.
Corrective Retrieval Augmented Generation (CRAG) builds on RAG by employing a
retrieval evaluator to ensure that only the optimal documents are fed into the LLM
prompt prior to the answer generation phase [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
        </p>
        <p>Due to its Elo-based ranking system for answers, its use of LLM-as-a-judge, and its
relevance evaluation of the intermediate retrieval steps in a RAG pipeline, RAGElo is a
unique evaluation toolkit. In this study, we use it to compare a simple RAG versus a more
sophisticated RAGF system on a knowledge-intensive industry-specific domain.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Retrieval Augmented QA with rank fusion</title>
      <p>
        While answers generated by traditional retrieval augmented systems are based on a
number of documents retrieved from a single query, RAGF introduces additional variation
into the retrieval process. Upon receiving a query from the user, a RAGF agent leverages
a large language model to generate a set of queries based on the original [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Table 3
shows examples of queries generated by the agent based on the query, “How to cross-sell
a MEMS microphone and a XENSIV sensor to customers?”.
      </p>
      <p>
        After generating the variations for the user query, the RAGF agent submits the original
and the generated queries to a retrieval system [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] that returns the top-k relevant
documents d_1, d_2, ..., d_k from the set of all documents D for each query. The rankings
induced by these queries are then combined using reciprocal rank fusion (RRF) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] into
a final, higher-quality set of passages. The intuition behind RAGF is that submitting
variations of the same query and combining the final rankings increases the likelihood of
relevant passages being injected into the LLM prompt. In contrast, non-relevant passages
retrieved by a single query are discarded. Figure 1 describes how RAG and RAGF differ.
RRF(d ∈ D) = ∑_{q ∈ Q} 1 / (k + rank_q(d)). (1)
      </p>
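      <p>The RRF combination of Eq. (1) can be sketched in a few lines of Python. This is an illustrative sketch rather than Infineon's implementation; the smoothing constant k = 60 follows the value commonly used with RRF and is an assumption here.</p>

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (lists of doc ids, best first) via RRF.

    score(d) = sum over rankings q of 1 / (k + rank_q(d)),
    where rank_q(d) is the 1-based position of d in ranking q.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return doc ids sorted by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

      <p>A document that appears near the top of several rankings accumulates score from each of them, so it outranks a document retrieved highly by only one query variation.</p>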
    </sec>
    <sec id="sec-5">
      <title>4. Development of a synthetic test set</title>
      <p>[Figure 2: Document passages and expert questions (few-shot examples) are fed to four query generators (claude-3-opus-20240229, claude-3-haiku-20240307, claude-3-sonnet-20240229, gpt-4-turbo-2024-04-09), which produce a pool of synthetic queries such as “What security features does the OPTIGA Trust M provide for IoT devices (…)” and “Are there any self-diagnosis features available in the KP23x analog sensor?”]</p>
      <p>
        As previously discussed, one of the main issues when evaluating the quality of a QA
system in an enterprise setting is that, frequently, companies do not have a large enough
existing collection of queries to evaluate such systems’ quality. Therefore, in this work,
we propose to adopt a strategy previously used by methods for generating synthetic
queries for training retrieval systems, such as InPars [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and Promptagator [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>Similar to these approaches, we randomly sample passages from documents within
our collection and prompt an LLM to generate questions that users may ask about
these portions. However, one diference in our approach to generating training queries
is the size of these passages. When generating queries for training a retrieval system,
we ideally want to keep the passages short to fit in the dense encoder’s relatively short
context windows. However, when generating queries for evaluating QA systems (including
retrieval augmented), we are not bound to the limit of the embedding model used for
retrieval. Rather, a longer passage may yield questions that require multiple shorter
passages to be answered. Therefore, we submit relatively long passages to the LLMs.
Specifically, each passage is extracted from up to ten pages of PDF documents (about
2000 tokens2).</p>
      <p>
        To keep the questions generated as diverse as possible, we prompt four different LLMs
to generate up to ten questions based on the same documents. Our test set collection
contains a mix of queries generated by OpenAI’s GPT-4 turbo [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] and Anthropic’s
Claude-3 [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] Opus, Sonnet, and Haiku models3. From a set of N = 840 queries, we
sampled 200 queries across all four models. Half of the queries are selected from GPT-4
generated queries, and the other half from Claude 3 queries. Among the Claude 3 queries,
to ensure the quality of the queries and their diversity, we again sample according to
each model size. Ultimately, our test set contains 100 queries from GPT-4-turbo, 50 from
Claude 3 Opus, 30 from Sonnet, and 20 from Haiku.
      </p>
      <p>Finally, to increase the quality of the generated queries, we asked an account
manager, a sales operations specialist, a marketing representative, and a business
development manager to create queries that they would submit to the conversational agent
from the perspective of their role. They were instructed to produce queries regarding
products from the XENSIV sensor product line, consisting of MEMS microphones, radar,
current, magnetic, pressure, and environmental sensors. We compiled a list of 23 of these
queries to use as a base for experimentation and used them as few-shot examples in
the query generation prompt. Figure 2 illustrates our method for generating synthetic
queries based on existing user queries and document passages.</p>
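      <p>The few-shot generation step can be sketched as a prompt builder. The wording below is illustrative, not the actual prompt used in this work, and any Chat Completions-style client could consume the resulting string.</p>

```python
def build_query_generation_prompt(passage, example_queries, n_questions=10):
    """Assemble an InPars-style few-shot prompt: real user queries serve as
    examples, followed by a long document passage to generate questions about."""
    examples = "\n".join(f"- {q}" for q in example_queries)
    return (
        "You write realistic questions that field staff might ask "
        "about product documentation.\n\n"
        f"Example user questions:\n{examples}\n\n"
        f"Document passage:\n{passage}\n\n"
        f"Write up to {n_questions} questions answerable from the passage."
    )
```

      <p>In the setup described above, the 23 expert-written queries would be passed as `example_queries` and each ~2000-token passage as `passage`, once per generator model.</p>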
    </sec>
    <sec id="sec-6">
      <title>5. LLM-as-a-Judge for RAG pipelines</title>
      <p>Even with a suitable set of synthetic questions for evaluating our RAG conversational
agent, assessing whether a given answer properly answers a question is not trivially done.
If a ground-truth “golden answer” is available, one can use traditional syntactic-based
Footnote 2: All LLMs used in our experiments had long context windows of 128k or 200k tokens.
Footnote 3: We did not use GPT-3.5 or open-source models due to their shorter context windows at the time of writing.</p>
      <p>[Figure 3: Search agents A and B answer the synthetic test query “Are there any self-diagnosis features available in the KP23x analog sensor?”. The RAGElo RetrievalEvaluator annotates the retrieved documents, a pairwise answer judgement compares the two generated answers against the evaluated documents, and a final verdict ([[B]]) is produced.]</p>
      <p>
        metrics such as BLEU, METEOR, or ROUGE [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">8, 9, 7</xref>
        ]. Without such reference answers,
one would require human annotators with a considerable understanding of the question’s
topic to manually assess the quality of the answers produced by each system. However,
this is a costly process.
      </p>
      <p>Alternatively, several LLM-as-a-Judge methods have been proposed, where another
LLM is asked to evaluate the quality of answers generated by other LLMs. Nevertheless,
in an enterprise setting, the answers usually require the LLM to access knowledge not
present in their training datasets but rather contained in documents internal to the
company. This is usually accomplished using a RAG pipeline like the one described above.
Therefore, the judging LLM also needs access to similar knowledge to accurately evaluate
the agent’s answers’ quality.</p>
      <p>
        Therefore, in this work, we rely on RAGElo, an open-source RAG evaluation toolkit that
evaluates the answers generated by each agent and the documents retrieved by them. By
injecting the annotation of retrieved documents, pooled by the agents being evaluated,
on the answer evaluation step, this method allows for the judging LLM to evaluate if
the generated answer was able to use all the information available about the question
properly and to check for any hallucinations. As the documents used for generating the
answers are included in the answer evaluating prompt, an agent that incorrectly cites
information from a source or refers to information not present in these documents is
likely hallucinating and should have its evaluation adjusted accordingly. As we explore
in Section 8, this two-step process results in a high correlation between human expert
annotators and the judging LLM, enabling higher reliability and trust when evaluating
different RAG pipelines. This process is also illustrated in Figure 3.
      </p>
      <sec id="sec-6-1">
        <title>5.1. Evaluation aspects</title>
        <p>
          While our main evaluation focuses on the pairwise comparison between the two agents,
RAGElo also allows us to evaluate answers pointwise. In this setting, similar to other
works [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], we prompt the judging LLM to evaluate the answers according to multiple
criteria:
• Relevance: Does the answer address the user’s question?
• Accuracy: Is the answer factually correct, based on the documents provided?
• Completeness: Does the answer provide all the information needed to answer the
user’s question?
• Precision: If the user’s question is about a specific product, does the answer provide
the answer for that specific product?
        </p>
      </sec>
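      <p>A pointwise judging step along these lines can be sketched as follows. The prompt wording is illustrative (RAGElo's own prompts, provided in Appendix A, differ), and `parse_scores` assumes the judge replies with one `criterion: score` line per criterion, which is an assumption for this example.</p>

```python
CRITERIA = ("relevance", "accuracy", "completeness", "precision")

def build_pointwise_prompt(question, answer, annotated_docs):
    """Ask the judge LLM for a score per criterion, grounded in the
    retrieved documents annotated in the retrieval-evaluation step."""
    docs = "\n".join(f"[{i}] {d}" for i, d in enumerate(annotated_docs))
    return (
        f"Documents:\n{docs}\n\nQuestion: {question}\nAnswer: {answer}\n\n"
        "Rate the answer from 0 to 2 on each criterion, one per line, "
        "as 'criterion: score': " + ", ".join(CRITERIA) + "."
    )

def parse_scores(judge_reply):
    """Parse 'criterion: score' lines from the judge's textual reply."""
    scores = {}
    for line in judge_reply.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() in CRITERIA:
            scores[name.strip().lower()] = int(value)
    return scores
```

      <p>Injecting the annotated documents into the prompt is what lets the judge flag citations of information absent from the retrieved passages as likely hallucinations.</p>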
    </sec>
    <sec id="sec-7">
      <title>6. Retrieval pipelines</title>
      <p>We not only experiment with different search agents (i.e., RAG and RAGF); we are also
interested in how different retrieval methods may impact the quality of the final answers
generated by these agents.
6.1. Retrieval methods
Our corpus consists of passages extracted from the Infineon XENSIV Product Selection
Guide, a 117-page document with detailed information on every product in the XENSIV
family. This document included technical information about all Infineon XENSIV sensors,
consumer and automotive sensor applications, guidance in selecting the correct sensor,
and other comprehensive and detailed information about the product line.</p>
      <p>
        The passages are embedded using multilingual-e5-base [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] 4 and indexed using
OpenSearch, allowing us to perform both KNN-based vector search, keyword-based
search with BM25 [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and RRF-based hybrids thereof.
6.2. QA Systems Implementation
We mainly evaluate two agents: a naive RAG pipeline, where the agent first retrieves the top-k
passages that are then templated into a prompt, and the Infineon RAG-Fusion (RAGF)
agent. Upon receiving a query, a naive RAG agent takes the following actions:
1. Retrieve the top-k most relevant passages from the search system.
2. Perform a Chat Completions API call, prompting the LLM with instructions for
generating an answer based on the five relevant passages.
3. Process and output the Chat Completions response.
      </p>
      <p>Meanwhile, the Infineon RAGF conversational assistant uses a similar framework and
performs the following steps upon receiving a query:
1. Perform a Chat Completions API call to generate four new queries based on the
original query using a prompt tailored to the agent’s original goal.
2. Retrieve the top-k most relevant passages for each query.
3. Using RRF, combine the top-k passages induced by all queries into a final ranking.
4. Perform a Chat Completions API call prompting the LLM with carefully worded
instructions for generating an answer based on the top-k fused passages.
5. Process and output the Chat Completions response.</p>
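      <p>The five RAGF steps above can be sketched end-to-end. Here `generate_queries`, `search`, and `answer_llm` are hypothetical stand-ins for the query-generation call, the OpenSearch retriever, and the final Chat Completions call, and the RRF constant 60 is an assumed default.</p>

```python
def ragf_answer(user_query, generate_queries, search, answer_llm, k=5):
    """RAG-Fusion sketch: expand the query, retrieve per query, fuse with RRF,
    then answer from the top-k fused passages."""
    # 1. Generate four query variations and keep the original.
    queries = [user_query] + generate_queries(user_query, n=4)
    # 2. Retrieve the top-k passages for each query.
    rankings = [search(q, k=k) for q in queries]
    # 3. Fuse the rankings with reciprocal rank fusion (constant 60 assumed).
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (60 + rank)
    fused = sorted(scores, key=scores.get, reverse=True)[:k]
    # 4-5. Prompt the LLM with the fused passages and return its answer.
    return answer_llm(question=user_query, passages=fused)
```
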
      <sec id="sec-7-1">
        <title>Footnote 4: https://huggingface.co/intfloat/multilingual-e5-base</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7. Experiments</title>
      <p>7.1. Comparing LLM-as-a-judge to expert annotators
While LLM-as-a-judge is a theoretically viable algorithm for rating RAG and RAGF answers,
we must establish whether the results agree with the annotations of domain experts.</p>
      <p>Figure 4 provides a Bland-Altman plot to visually represent the LLM and human
judgments’ agreement.</p>
      <p>The bias of approximately 0.12 indicates that, on average, LLM scores were slightly
higher than human scores. The limits of agreement ranged from approximately -1.17 to
1.41, demonstrating substantial variability in the difference between LLM and human
evaluators.</p>
      <p>
        Next, we compared LLM-as-a-judge to expert annotators with Kendall’s τ. Kendall’s
τ is a nonparametric measure that quantifies the degree of association between two
monotonic continuous or ordinal variables by calculating the proportion of concordance
and discordance among pairwise ranks, offering valuable insight into their rank correlation
[
        <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
        ]. We used the SciPy Stats kendalltau function to calculate a tau-b score and a
p-value for the combined ratings of all columns, flattened into a 1-D array with RAG and
RAGF ratings combined [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. The tau-b value, a nonparametric measure of association, is
calculated using the following formula [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]:
      </p>
      <p>( − )
√( +  +  ) ⋅ ( +  +  )
(2)
P represents the number of concordant pairs, Q represents the number of discordant
pairs, T represents the number of ties exclusive to x, and U represents the number of ties
exclusive to y.</p>
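      <p>Formula (2) can be implemented directly by counting the four pair types; the sketch below is a minimal pure-Python version for illustration (in practice one would call `scipy.stats.kendalltau`, which computes the same tau-b).</p>

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b per Eq. (2): (P - Q) / sqrt((P+Q+T) * (P+Q+U))."""
    assert len(x) == len(y)
    P = Q = T = U = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # pairs tied in both x and y are excluded
            elif dx == 0:
                T += 1            # tie exclusive to x
            elif dy == 0:
                U += 1            # tie exclusive to y
            elif dx * dy > 0:
                P += 1            # concordant pair
            else:
                Q += 1            # discordant pair
    return (P - Q) / sqrt((P + Q + T) * (P + Q + U))
```
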
      <p>
        This test returned τ ≈ 0.56, indicating a moderate, positive correlation [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] with a
p-value against a null hypothesis of no association of p &lt; 0.01 (99.99% confidence level).
For comparison, in similar experiments judging human versus LLM judgments, Faggioli
et al. found values of τ = 0.76 and τ = 0.86 [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ].
      </p>
      <p>
        Following the same methodology, we also calculated Spearman’s ρ, a similar
nonparametric correlation measure. This resulted in ρ ≈ 0.59 with p &lt; 0.01, demonstrating a
statistically significant, moderate positive correlation [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ].
7.2. RAG vs RAGF
      </p>
      <sec id="sec-8-1">
        <title>7.2.1. Quality of retrieved documents</title>
        <p>We assessed document retrieval quality using Mean Reciprocal Rank@5 (MRR@5), which
averages the inverse ranks of the first relevant result within the top five positions across
all queries. The formula is given by</p>
        <p>
          MRR@5 = (1/|Q|) ∑_{i=1}^{|Q|} 1/rank_i, (3)
where |Q| is the total number of queries and rank_i is considered only if it is within the top
five; otherwise the query contributes zero [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ].
        </p>
        <p>MRR@5 scores were calculated for each agent and each retrieval method considering
two categories:
1. MRR@5 score for documents deemed “somewhat relevant” or “very relevant.”
2. MRR@5 score for documents deemed “very relevant.”</p>
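        <p>Eq. (3) translates directly into code. The sketch below takes, per query, the 1-based rank of the first relevant document (or None when no relevant document was retrieved); the same function serves both relevance categories by changing which documents count as relevant.</p>

```python
def mrr_at_5(first_relevant_ranks):
    """MRR@5: average of 1/rank over queries, counting a query as zero
    when its first relevant document is absent or ranked below 5."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= 5:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)
```
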
        <p>The results can be seen below in Table 4.</p>
      </sec>
      <sec id="sec-8-2">
        <title>7.2.2. Pairwise evaluation of answers</title>
        <p>We then ran RAGElo games to evaluate the end-to-end answer quality of RAG vs. RAGF with
different base retriever configurations, a task that cannot rely on standard Information
Retrieval metrics. These RAGElo results show more victories for RAGF than RAG. For
example, when using BM25 as a base retriever, RAGF won 49% of the games, RAG won
14.5%, and RAG and RAGF were tied in 36.5% of the games. The resulting Elo scores for all six
variants are shown in Table 6, which gives a robust ranking of the systems, without reliance</p>
      </sec>
      <sec id="sec-8-3">
        <title>Results tables</title>
        <p>[Flattened results tables: RAG vs. RAGF under the BM25, KNN, and hybrid base retrievers, with pairwise RAG win percentages of 49.5%, 58.5%, 37.0%, 51.5%, and 49.0%.]</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>8. Discussion</title>
      <p>on a gold standard. It is interesting to see that, for both RAGF and RAG, BM25 is a
strong baseline that is not surpassed by generic embeddings in these experiments.</p>
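      <p>For reference, the rating update that drives this kind of pairwise Elo competition can be sketched as follows (a generic illustration using the conventional K-factor of 32 and 400-point scale; RAGElo's internal constants are not specified here):</p>

```python
def elo_update(rating_a, rating_b, score_a, k=32.0):
    """One Elo update after a game between agents A and B.

    `score_a` is 1.0 if A wins, 0.0 if A loses, and 0.5 for a tie.
    Returns the updated (rating_a, rating_b) pair.
    """
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two equally rated agents; A wins, so 16 points move from B to A.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```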
      <p>Next, we compared the RAGElo outcome to the preference of our Infineon human
annotator. We performed two-tailed paired t-tests to compare RAG against RAGF on
each category from the Infineon representatives' human evaluations with α = .05. As
expected, due to the larger variety of its retrieved results, RAGF significantly outperforms RAG
in completeness at the 95% confidence level with p ≈ 0.01. However, on the precision of
answers, RAG significantly outperformed RAGF at the 95% confidence level with p ≈ 0.04.
As observed above, we found statistically significant, moderate positive correlations
between LLM ratings and human annotations. This indicates a consistent association
between the ratings from LLM-as-a-judge and those by Infineon experts. We find that,
on average, LLM scores are slightly higher than those of human annotators. This means
that while relevance judgements on individual queries should not be fully relied upon, and
IR metrics derived from LLM-as-a-judge should not be equated with regular relevance
scores without further calibration, we can still make good use of this approach to
rank-order systems. These findings collectively support the validity of our LLM evaluation
method, which assesses conversational system outputs based on a combination of relevance,
accuracy, completeness, and precision.</p>
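      <p>The paired t-tests above can be reproduced with only the Python standard library; the scores below are hypothetical per-query ratings for illustration, not the study's data:</p>

```python
import math
import statistics

def paired_t(x, y):
    """Paired t statistic for equal-length samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(len(diffs)))

# Hypothetical completeness scores (0-2) for ten queries.
rag = [1, 1, 0, 2, 1, 1, 0, 1, 2, 1]
ragf = [2, 1, 1, 2, 2, 1, 1, 2, 2, 2]

t = paired_t(rag, ragf)
# With n = 10 (9 degrees of freedom), |t| > 2.262 rejects the null
# hypothesis at alpha = .05 in a two-tailed test.
print(round(t, 3), abs(t) > 2.262)  # -3.674 True
```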
      <p>The style of evaluation and the different dimensions it takes into account are specified
in the prompts given to the LLM in the RAGElo evaluation, which are provided in
Appendix A. Specifically, while the initial LLM-as-a-judge is given specific criteria to
focus on only four categories, we instructed RAGElo's impartial judge LLM to value more
than the initial four categories:</p>
      <p>Your evaluation should consider factors such as comprehensiveness,
correctness, helpfulness, completeness, accuracy, depth, and level of detail of their
responses.</p>
      <p>Since RAGF significantly outperformed RAG in the completeness category, the RAGElo
judge LLM likely weighed completeness higher than precision. In addition, based
on manual observation of a small random sample of answers, RAGF produced more
comprehensive answers and featured greater depth and level of detail due to the multiple-query
generation. However, games where RAG won were most likely influenced by a
significantly more precise answer than that of RAGF. While RAGF favors comprehensive
answers that offer multiple perspectives to the user, RAG produces shorter answers
that address the original query only. Since completeness is defined as the extent to
which a user's question was answered, it can be presumed that RAGF's longer and more
comprehensive answers tend to be more complete. And since precision relates to the
agent mentioning the correct product or product family, it can be presumed that RAGF's
longer answers have more room to consider other products or product families, leading
to reduced answer precision. While the human annotation was done by Infineon experts,
different humans may rate answers differently, even when following the same set of criteria.</p>
      <p>A larger number of documents or a database of non-technical documents may have led
to a different outcome. RAGF can be applied not only to Infineon documents but to any
document database. This includes not only enterprise uses but also uses in
education, such as mathematics and language learning. The algorithm can be tuned to
different use cases by tweaking the internal LLM prompt. For example, the Infineon RAGF
bot was prompted to "think like an engineer," whereas an educator RAGF bot could be
prompted to "think like a teacher." Future work includes exploring other applications of
RAGF, especially in education. In addition, we will experiment with different prompts for
both LLM-as-a-judge and RAGElo while using different quantities and types of documents
with the same retrieval algorithms.</p>
      <p>Based on the calculated MRR@5 scores, we found that the RAGF agent mostly
outperforms the RAG agent in ranking both highly relevant and somewhat relevant
retrieved documents. This indicates that searching on the multiple generated query variants produced,
on average, slightly more higher-ranked relevant documents than using only the original user query. We also
see that using vector search with embeddings is not a silver bullet: for our test queries,
BM25 clearly outperforms it. Since retrieval quality is highly dependent on the quality
of the embeddings and their fit to the domain, this outcome would likely change with
fine-tuning of the embeddings and the addition of intelligent re-rankers, which we leave
for future work, as the evaluation framework would remain the same.</p>
    </sec>
    <sec id="sec-10">
      <title>9. Conclusion</title>
      <p>Overall, we found that the evaluation framework proposed by RAGElo aligns positively
with the preferences of human annotators for RAG and RAGF, though due caution is warranted
given the moderate correlation and the variability of scoring. We found that the RAGF approach leads to
better answers most of the time, according to the RAGElo evaluation. According to expert
scoring, the RAGF approach significantly outperforms RAG in completeness
but significantly underperforms it in precision. Based on these results,
we cannot confidently assert that RAGF's approach leads to better answers in general.
However, the results do support that RAGF's approach leads to more complete answers
and a higher proportion of better answers under evaluation by RAGElo.</p>
      <p>Since RAGElo is generally applicable to all retrieval-augmented algorithms, in future
work we also intend to test agents other than RAG and RAGF, including those
with different reranking algorithms, different embedding models, and different LLMs. In
addition, due to RAGF's underperformance in document relevance, we may also leverage
CRAG to reduce this gap. We will also investigate the reflection of human sensitivity in
expert ratings, especially whether LLMs should or can reflect human sensitivities.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
      <p>We thank Brooks Felton from Infineon for his support during this work. We also thank
the Infineon sales team for providing valuable feedback.</p>
    </sec>
    <sec id="sec-12">
      <title>A. RAGElo’s prompts and configurations</title>
      <p>A.1. Retrieval Evaluator
We used RAGElo's default ReasonerEvaluator, which has the following system prompt:
You are an expert document annotator. Your job is to evaluate whether a document contains relevant information to answer a user's question.</p>
      <p>Please act as an impartial relevance annotator for a search engine. Your goal is to evaluate the relevancy of the documents given a user question.</p>
      <p>You should write one sentence explaining why the document is relevant or not for the user question. A document can be:
− Not relevant: The document is not on topic.
− Somewhat relevant: The document is on topic but does not fully answer the user question.
− Very relevant: The document is on topic and answers the user's question.
[user question]
{query}
[document content]
{document}</p>
      <p>A.2. Answer evaluators
For the pointwise evaluator used in Section 5.1, we used the following prompt with RAGElo's CustomPromptAnswerEvaluator:
You are an impartial judge for evaluating the quality of the responses provided by an AI assistant tasked to answer users' questions about the catalogue of IoT sensors produced by Infineon.</p>
      <p>You will be given the user's question and the answer produced by the assistant.</p>
      <p>The agent's answer was generated based on a set of documents retrieved by a search engine.</p>
      <p>You will be provided with the relevant documents retrieved by the search engine.</p>
      <p>Your task is to evaluate the answer's quality based on the response's relevance, accuracy, and completeness.
## Rules for evaluating an answer:
− **Relevance**: Does the answer address the user's question?
− **Accuracy**: Is the answer factually correct, based on the documents provided?
− **Completeness**: Does the answer provide all the information needed to answer the user's question?
− **Precision**: If the user's question is about a specific product, does the answer provide the answer for that specific product?
## Steps to evaluate an answer:
1. **Understand the user's intent**: Explain in your own words what the user's intent is, given the question.
2. **Check if the answer is correct**: Think step-by-step whether the answer correctly answers the user's question.
3. **Evaluate the quality of the answer**: Evaluate the quality of the answer based on its relevance, accuracy, and completeness.
4. **Assign a score**: Produce a single line JSON object with the following keys, each with a single score between 0 and 2, where 2 is the highest score on that aspect:
− "relevance"
− 0: The answer is not relevant to the user's question.
− 1: The answer is partially relevant to the user's question.
− 2: The answer is fully relevant to the user's question.
− "accuracy"
− 0: The answer is factually incorrect.
− 1: The answer is partially correct.
− 2: The answer is fully correct.
− "completeness"
− 0: The answer does not provide enough information to answer the user's question.
− 1: The answer only answers some aspects of the user's question.
− 2: The answer fully answers the user's question.
− "precision"
− 0: The answer does not mention the same product or product line as the user's question.
− 1: The answer mentions a similar product or product line, but not the same as the user's question.
− 2: The answer mentions the exact same product or product line as the user's question.</p>
      <p>The last line of your answer must be a SINGLE LINE JSON object with the keys "relevance", "accuracy", "completeness", and "precision", each with a single score between 0 and 2.
[DOCUMENTS RETRIEVED]
{documents}
[User Query]
{query}
[Agent answer]
{answer}</p>
      <p>For the pairwise evaluation between agents used for the results in Tables 5 and 6, we
used RAGElo's PairwiseAnswerEvaluator with the following parameters:
pairwise_evaluator_config = PairwiseEvaluatorConfig(
    n_games_per_query=15,
    has_citations=False,
    include_raw_documents=True,
    include_annotations=True,
    document_relevance_threshold=2,
    factors="the comprehensiveness, correctness, helpfulness,
        completeness, accuracy, depth, and level of detail of
        their responses. Answers are comprehensive if they show
        the user multiple perspectives in addition to but still
        relevant to the intent of the original question.",
)
This generates 15 random games between two agents per query (i.e., all possible unique
games for 6 agents) and tells the evaluator that:
• The answers do not include specific citations to any passage (has_citations=False)
• The full text of the retrieved passages is included in the evaluation prompt (include_raw_documents=True)
• The output of the retrieval evaluator is injected into the prompt (include_annotations=True)
• Any passage with a relevance score below 2 is ignored (document_relevance_threshold=2)
• These factors are considered when selecting the best answer (factors=…)</p>
      <p>These parameters produce the following final prompt used for evaluating the answers:
Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants tasked to answer the question below based on a set of documents retrieved by a search engine.</p>
      <p>You should choose the assistant that best answers the user question based on a set of reference documents that may or may not be relevant.</p>
      <p>For each reference document, you will be provided with the text of the document as well as reasons why the document is or is not relevant.</p>
      <p>Your evaluation should consider factors such as comprehensiveness, correctness, helpfulness, completeness, accuracy, depth, and level of detail of their responses. Answers are comprehensive if they show the user multiple perspectives in addition to but still relevant to the intent of the original question.</p>
      <p>Details are only useful if they answer the user's question. If an answer contains non-relevant details, it should not be preferred over one that only uses relevant information.
Begin your evaluation by explaining why each answer correctly answers the user's question. Then, you should compare the two responses and provide a short explanation of their differences. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Be as objective as possible.</p>
      <p>After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better, and "[[C]]" for a tie.
[User Question]
{query}
[Reference Documents]
{documents}
[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]
[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Herbert-Voss</surname>
            , G. Krueger,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Henighan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            , E. Sigler,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          , Language Models are
          <string-name>
            <surname>Few-Shot Learners</surname>
          </string-name>
          ,
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2005</year>
          .
          <volume>14165</volume>
          . arXiv:
          <year>2005</year>
          .14165.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>On Hallucination and Predictive Uncertainty in Conditional Language Generation, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2734</fpage>
          -
          <lpage>2744</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          . eacl-main.
          <volume>236</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frieske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ishii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Bang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <article-title>Survey of Hallucination in Natural Language Generation</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <volume>248</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>248</lpage>
          :
          <fpage>38</fpage>
          . doi:
          <volume>10</volume>
          .1145/3571730.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Do large language models know what they dont know?</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>8653</fpage>
          -
          <lpage>8665</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Amayuelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv. org/abs/2305.13712.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Buettcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms condorcet and individual rank learning methods</article-title>
          ,
          <source>in: SIGIR</source>
          <year>2019</year>
          , SIGIR '09,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2009</year>
          , p.
          <fpage>758759</fpage>
          . URL: https://doi.org/10.1145/1571941.1572114. doi:
          <volume>10</volume>
          .1145/1571941.1572114.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>ROUGE:</surname>
          </string-name>
          <article-title>A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics</article-title>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: ACL</source>
          <year>2002</year>
          ,
          <article-title>ACL '02, Association for Computational Linguistics</article-title>
          , USA,
          <year>2002</year>
          , p.
          <fpage>311318</fpage>
          . URL: https://doi.org/10.3115/1073083.1073135. doi:
          <volume>10</volume>
          .3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments</article-title>
          ,
          <source>in: StatMT</source>
          <year>2007</year>
          , StatMT '07,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computational Linguistics, USA,
          <year>2007</year>
          , p.
          <fpage>228231</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turpin</surname>
          </string-name>
          ,
          <article-title>Pairwise crowd judgments: Preference, absolute, and ratio</article-title>
          ,
          <source>in: Proceedings of the 23rd Australasian Document Computing Symposium</source>
          , ADCS '18,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          . URL: https://doi.org/10.1145/3291992.3291995. doi:
          <volume>10</volume>
          .1145/3291992.3291995.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Arabzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <article-title>A comparison of methods for evaluating generative ir</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.04044.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Jeronymo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bonifacio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abonizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fadaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lotufo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zavrel</surname>
          </string-name>
          , R. Nogueira, InPars-v2:
          <article-title>Large Language Models as Eficient Dataset Generators for Information Retrieval</article-title>
          ,
          <year>2023</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2301.
          <year>01820</year>
          . arXiv:
          <fpage>2301</fpage>
          .
          <year>01820</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.02839.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Judging llm-as-a-judge with mt-bench and chatbot arena</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.05685.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Katranidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Barany</surname>
          </string-name>
          ,
          <article-title>FaaF: Facts as a function for the evaluation of generated text</article-title>
          ,
          <year>2024</year>
          . arXiv:2403.03888.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liusie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. F.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <article-title>SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.08896.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gaitskell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hartvigsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <source>Interpretable unified language checking</source>
          ,
          <year>2023</year>
          . arXiv:2304.03728.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Azaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <article-title>The internal state of an llm knows when it's lying</article-title>
          ,
          <year>2023</year>
          . arXiv:2304.13734.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <year>2019</year>
          . arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>BARTScore: Evaluating generated text as text generation</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Ranzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beygelzimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          , Curran Associates, Inc.,
          <year>2021</year>
          , pp.
          <fpage>27263</fpage>
          -
          <lpage>27277</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/e4d2b6e6fdeca3e60e0f1a62fee3d9dd-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Es</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schockaert</surname>
          </string-name>
          ,
          <article-title>RAGAs: Automated evaluation of retrieval augmented generation</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.15217.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Angelopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fannjiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zrnic</surname>
          </string-name>
          ,
          <article-title>Prediction-powered inference</article-title>
          ,
          <year>2023</year>
          . arXiv:2301.09633.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Saad-Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ARES: An automated evaluation framework for retrieval-augmented generation systems</article-title>
          ,
          <year>2024</year>
          . arXiv:2311.09476.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eibich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nagpal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fred-Ojala</surname>
          </string-name>
          ,
          <article-title>ARAGOG: Advanced RAG output grading</article-title>
          ,
          <year>2024</year>
          . arXiv:2404.01037.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Benchmarking large language models in retrieval-augmented generation</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.01431.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.-Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <article-title>Corrective retrieval augmented generation</article-title>
          ,
          <year>2024</year>
          . arXiv:2401.15884.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fazlija</surname>
          </string-name>
          ,
          <article-title>Toward Optimising a Retrieval Augmented Generation Pipeline using Large Language Models</article-title>
          ,
          <source>Master's thesis</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rackauckas</surname>
          </string-name>
          ,
          <article-title>Rag-fusion: A new take on retrieval augmented generation</article-title>
          ,
          <source>International Journal on Natural Language Computing</source>
          <volume>13</volume>
          (
          <year>2024</year>
          )
          <fpage>37</fpage>
          -
          <lpage>47</lpage>
          . URL: http://dx.doi.org/10.5121/ijnlc.2024.13103. doi:10.5121/ijnlc.2024.13103.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Promptagator: Few-shot dense retrieval from 8 examples</article-title>
          ,
          <year>2022</year>
          . arXiv:2209.11755.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4 Turbo and GPT-4</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <article-title>The Claude 3 model family: Opus, Sonnet, Haiku</article-title>
          ,
          <year>2024</year>
          . Model card.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spielman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Large language models can accurately predict searcher preferences</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.10621.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <source>Multilingual e5 text embeddings: A technical report</source>
          ,
          <year>2024</year>
          . arXiv:2402.05672.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          , Okapi at TREC-3, in: D. K. Harman (Ed.),
          <source>Proceedings of The Third Text REtrieval Conference</source>
          , TREC 1994, Gaithersburg, Maryland, USA, November 2-4,
          <year>1994</year>
          , volume
          <volume>500</volume>
          -225 of NIST Special Publication,
          <source>National Institute of Standards and Technology (NIST)</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          . URL: http://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>de Jong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Ferguson</surname>
          </string-name>
          ,
          <article-title>Graphing methods for Kendall's tau</article-title>
          ,
          <year>2023</year>
          . arXiv:2308.08466.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>S.</given-names>
            <surname>Perreault</surname>
          </string-name>
          ,
          <article-title>Efficient inference for Kendall's tau</article-title>
          ,
          <year>2022</year>
          . arXiv:2206.04019.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37] SciPy Reference Guide, scipy.stats.kendalltau. URL: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kendalltau.html#r4cd1899fa369-2.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <article-title>The treatment of ties in ranking problems</article-title>
          ,
          <source>Biometrika</source>
          <volume>33</volume>
          (
          <year>1945</year>
          )
          <fpage>239</fpage>
          -
          <lpage>251</lpage>
          . doi:10.1093/biomet/33.3.239.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>P.</given-names>
            <surname>Schober</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Boer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Schwarte</surname>
          </string-name>
          ,
          <article-title>Correlation coefficients: Appropriate use and interpretation</article-title>
          ,
          <source>Anesthesia &amp; Analgesia</source>
          <volume>126</volume>
          (
          <year>2018</year>
          )
          <fpage>1763</fpage>
          -
          <lpage>1768</lpage>
          . doi:10.1213/ane.0000000000002864.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dietz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <article-title>Perspectives on large language models for relevance judgment</article-title>
          ,
          in:
          <source>Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '23, ACM</source>
          ,
          <year>2023</year>
          . URL: http://dx.doi.org/10.1145/3578337.3605136. doi:10.1145/3578337.3605136.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jadon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey of evaluation techniques for recommendation systems</article-title>
          ,
          <year>2024</year>
          . arXiv:2312.16015.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>