<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1</article-id>
      <title-group>
        <article-title>Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ermelinda Oro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Maria Granata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Lanza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amir Bachir</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca De Grandis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Rufolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Altilia srl, TechNest Start-up Incubator of University of Calabria</institution>
          ,
          <addr-line>Piazza Vermicelli, Rende (CS), 87036</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Research Council, Institute for High Performance Computing and Networking</institution>
          ,
          <addr-line>via P. Bucci 8/9C, Rende (CS), 87036</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>15217</volume>
      <fpage>2383</fpage>
      <lpage>2392</lpage>
      <abstract>
        <p>We present a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems designed for question-answering tasks using large language models (LLMs). The proposed framework integrates document ingestion, information retrieval, answer generation, and evaluation phases. Both ground truth-based and reference-free evaluation metrics are implemented to provide a multi-faceted assessment approach. Through experiments across diverse datasets, including NarrativeQA and a proprietary financial dataset (FinAM-it), the reliability of existing metrics is investigated by comparing them against rigorous human evaluations. The results demonstrate that ground truth-based metrics such as BEM and RAGAS Answer Correctness exhibit a moderately strong correlation with human judgments. However, reference-free metrics still struggle to accurately capture nuances in answer quality without predefined correct responses. An in-depth analysis of Spearman correlation coefficients sheds light on the interrelationships and relative effectiveness of various evaluation approaches across multiple domains. While highlighting the current limitations of reference-free methodologies, the study underscores the need for more sophisticated techniques to better approximate human perception of answer relevance and correctness. Overall, this research contributes to ongoing efforts in developing reliable evaluation frameworks for RAG systems, paving the way for advancements in natural language processing and the realization of highly accurate and human-like AI systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval Augmented Generation (RAG)</kwd>
        <kwd>Question Answering (QA)</kwd>
        <kwd>Retrieval</kwd>
        <kwd>Large Language Model (LLM)</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our RAG system is designed to enhance answer accuracy and relevance. The process is segmented into four main phases. Ingestion: input documents are processed into manageable chunks, leveraging techniques like document layout analysis for PDFs; the chunks are embedded into high-dimensional vectors capturing their semantic essence and ingested into a vector store for efficient similarity search. Retrieval: upon receiving a query, its vector form undergoes similarity search in the vector store to identify the most relevant chunks, narrowing the information down to the chunks most pertinent for answer generation. Generation: a Large Language Model (LLM) synthesizes information from the retrieved chunks to construct a coherent and natural-sounding answer to the query. Evaluation: a two-sided approach employs both ground-truth dependent and independent metrics; ground-truth dependent metrics assess correctness against predefined answers, while ground-truth independent metrics evaluate answer relevance without a predefined answer set. This dual approach enables a comprehensive assessment of performance, correctness, and overall text quality. The system can also receive human evaluations of question-answer pairs to assess metric reliability and alignment with expectations.</p>
      <p>Figure 1: The simplified figure of the implemented RAG System.</p>
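      <p>For concreteness, the ingestion and retrieval phases can be sketched as follows. This is a minimal illustration, not the system's actual implementation: it assumes OpenAI's embedding API (as used later in the Settings), a fixed character-based chunker, and a brute-force in-memory cosine search; all function names and the input file are hypothetical.</p>
      <preformat>
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()
EMBED_MODEL = "text-embedding-ada-002"

def chunk_text(text: str, size: int = 1024) -> list[str]:
    """Ingestion: split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed chunks (or a query) into high-dimensional vectors."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 10) -> list[str]:
    """Retrieval: cosine-similarity search over the ingested chunk vectors."""
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# Usage: ingest a document once, then answer queries against it.
document = open("fund_prospectus.txt").read()   # hypothetical input file
chunks = chunk_text(document)
chunk_vecs = embed(chunks)
context = retrieve("What is the fund's risk profile?", chunks, chunk_vecs)
      </preformat>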
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>RAG systems have been implemented in various forms [1, 2, 3, 4, 5], incorporating advanced strategies like document splitting, chunking, retrieval, and diverse models for embedding and language generation, including proprietary and open-source models from platforms like HuggingFace (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). We have also explored different variants of RAG systems; however, this paper's primary focus is not to introduce a novel RAG system or methodology but to comprehensively evaluate the effectiveness of Large Language Model (LLM)-derived metrics, with an emphasis on reference-free approaches.</p>
        <p>Several prior works have proposed frameworks and novel metrics that leverage the capabilities of LLMs [6, 7, 8, 9, 10, 11]. Unlike these existing solutions, which aim to score different RAG systems or propose new evaluation methods, metrics, or datasets, our research specifically targets the potential satisfaction of the end-user customers who receive the evaluation scores generated by such systems.</p>
        <p>By concentrating on the practical utility and interpretability of evaluation metrics from the perspective of end-users, our study diverges from the conventional approach of optimizing technical performance alone. Instead, we strive to bridge the gap between state-of-the-art evaluation techniques and the real-world expectations of customers who rely on these systems for decision-making and information retrieval.</p>
      </sec>
      <sec id="sec-1-1a">
        <title>3.1. Framework for RAG and evaluation</title>
        <p>This paper introduces a framework for running and evaluating a RAG system that efficiently processes and responds to natural language queries. The system integrates state-of-the-art technologies to enhance answer accuracy and relevance (Figure 1); its four phases are described in the Introduction.</p>
      </sec>
      <sec id="sec-1-1b">
        <title>3.2. Evaluation Strategies</title>
        <p>In our RAG system, we implemented and tested a wide range of evaluation metrics. Specifically, our system incorporates metrics for assessing individual RAG components, such as Information Retrieval (IR) and Answer Generation, as well as the overall pipeline. For IR, we used classical metrics such as Recall@K, Precision@K, mAP, MRR, and nDCG. For answer generation, the implemented metrics were divided into two categories. Syntactic metrics evaluate formal response aspects, including BLEU [12], ROUGE [13], Precision, Recall, F1, and Exact Match [14]; these focus on text properties rather than semantic meaning. Semantic metrics evaluate response meaning, including the BERT score [15] and the BEM score [16]; BEM is preferred over BERT due to its reported correlation with human evaluations and our own empirical findings.</p>
        <p>LLM-derived metrics: we implemented in our framework the RAG triad of metrics for the three main steps of a RAG's execution [6]: (i) Context relevance, which assesses whether the returned passage is relevant for answering the given query; (ii) Groundedness, which assesses whether the generated answer is faithful to the retrieved passage or contains hallucinated or extrapolated statements beyond it; (iii) Answer relevance, which assesses whether the generated answer is relevant given the query and the retrieved passage. In addition, we implemented Answer correctness, which exploits LLMs and gold answers to measure the factual correctness of an answer. In this paper, only a subset of these metrics is considered and compared for assessing the quality of the answers (see Section 4.2).</p>
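        <p>As an illustration of the classical IR and syntactic answer metrics listed above, the following sketch shows one possible way to compute Recall@K, Precision@K, MRR, Exact Match, and a token-level F1. The function names and the normalization choices are assumptions made for this example, not the framework's actual code.</p>
        <preformat>
from collections import Counter

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks found among the top-k retrieved ones."""
    top_k = set(retrieved_ids[:k])
    return len(top_k.intersection(relevant_ids)) / max(len(relevant_ids), 1)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    rel = set(relevant_ids)
    return sum(1 for c in retrieved_ids[:k] if c in rel) / max(k, 1)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0 if none is retrieved)."""
    for rank, c in enumerate(retrieved_ids, start=1):
        if c in relevant_ids:
            return 1.0 / rank
    return 0.0

def exact_match(prediction, reference):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    ref_counts = Counter(ref)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(pred).items())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(recall_at_k(["c3", "c7", "c1"], {"c1", "c9"}, k=3))  # 0.5
print(token_f1("the fund invests in bonds", "the fund mainly invests in bonds"))
        </preformat>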
      </sec>
      <sec id="sec-1-2">
        <title>Manual evaluation</title>
        <p>To verify the reliability of automated evaluation metrics, we implemented a rigorous manual evaluation process to assess the relevance, accuracy, and coherence of the answers generated by our RAG system. This manual evaluation was conducted by three independent human annotators, each with expertise in the domain of the questions posed to the system. For each evaluation session, the annotators were presented with the question, the corresponding answer generated by the RAG system, and the ground truth provided by the original dataset or by the customer. The primary task for each annotator was to assess the quality of the generated answer in relation to the posed question, employing a discrete 5-point Likert scale. The criteria for scoring were as follows: 1. Very Poor: the generated answer is totally incorrect or irrelevant to the question, indicating a failure of the system to comprehend the query or retrieve pertinent information. 2. Poor: the generated answer is predominantly incorrect but with glimpses of relevance, suggesting some level of understanding or appropriate retrieval. 3. Neither: the generated answer mixes relevant and irrelevant information almost equally, showcasing the system's partial success in addressing the query. 4. Good: the generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding of and response to the question. 5. Very Good: reserved for answers that are completely correct and fully relevant, reflecting an ideal outcome where the system accurately understood and responded to the query.</p>
        <p>The annotators conducted their assessments independently to ensure unbiased evaluations. Upon completion, the scores for each question-answer pair were collected and compared. In cases of discrepancy, a consensus discussion was initiated among the annotators to agree on the most accurate score. This consensus process mitigated individual bias and allowed different perspectives to be considered in evaluating the quality of the generated answers. The manual evaluation is particularly helpful for assessing the reliability and validity of our system's automated evaluation metrics: by comparing the human-generated scores against the results produced by these automated measures, we can determine the extent to which the automatic metrics accurately reflect human judgment and perception of answer quality.</p>
      </sec>
      <sec id="sec-1-2a">
        <title>4. Experiments</title>
        <p>Considering different domains (Section 4.1), we investigate the reliability of a subset of existing metrics (Section 4.2) for evaluating a RAG system (Section 3.1). We also explore the feasibility of adopting reference-free metrics and the correlation between them and the human evaluation (Section 3.2).</p>
      </sec>
      <sec id="sec-1-2b">
        <title>4.1. Datasets</title>
        <p>NarrativeQA - English. A subsample of the NarrativeQA dataset [17] was used, with 50 book-related and 50 movie script-related questions (1% of the test set), spanning 41 unique books and 42 unique movie scripts. This allowed evaluating the RAG system's performance across two distinct narrative content types.</p>
        <p>Financial Asset Management - Italian. The FinAM-it dataset, created by Altilia, consists of 50 question-answer pairs from Italian asset management documents on topics like investment strategies, risk management, and regulatory compliance. The questions are complex and diverse, often requiring information from multiple paragraphs, with detailed, conversational-style answers.</p>
      </sec>
      <sec id="sec-1-2c">
        <title>4.2. Metrics</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Naming and classification of the metrics used in the experimental evaluation.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Acronym</th><th>Name - Framework</th><th>Type</th></tr>
            </thead>
            <tbody>
              <tr><td>BEM</td><td>BEM score - TensorFlow</td><td>GT-based</td></tr>
              <tr><td>AR TruLens</td><td>Answer Relevance - TruLens</td><td>GT-free</td></tr>
              <tr><td>AR RAGAS</td><td>Answer Relevance - RAGAS</td><td>GT-free</td></tr>
              <tr><td>AC RAGAS</td><td>Answer Correctness - RAGAS</td><td>GT-based</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In this paper, we focus on evaluating the quality of the answers generated by the entire pipeline. In our analysis, we considered the BEM score (BERT matching score) [16], which in our experiments was the most satisfactory among the classic metrics. It uses a BERT model [18] trained to solve an answer equivalence task: a classifier is trained to decide whether two given answers are equivalent and returns an equivalence score. We use the variation of the BEM score that takes the two answers and the question as model input, which has been reported to perform better [16].</p>
        <p>In addition, we considered novel LLM-derived metrics developed in the RAGAS [6] and TruLens (https://www.trulens.org/) systems. These metrics offer both ground truth-based and reference-free evaluations. In particular, from RAGAS we used the two main metrics that focus on answers: Answer Correctness and Answer Relevance. More in detail: (i) Answer Correctness (https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html) measures the factual correctness of an answer and requires a ground truth. It employs an LLM to extract factual statements from both the predicted answer and the ground truth, labeling them as True Positives if they are present in both answers, False Negatives if they are present only in the ground truth, and False Positives if they are present only in the prediction. A final F1 score is then calculated; this score, in the range (0, 1), is the Answer Correctness.</p>
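        <p>The final scoring step of Answer Correctness can be illustrated as follows. The LLM-based extraction and matching of factual statements is not shown: the sketch starts from statement sets that are assumed to be already extracted and normalized, so it only demonstrates the F1 computation described above and is not RAGAS's implementation.</p>
        <preformat>
def answer_correctness_f1(pred_statements: set[str], truth_statements: set[str]) -> float:
    """F1 over factual statements; equal strings are assumed to denote
    semantically equivalent claims (the LLM extraction step is omitted)."""
    tp = len(pred_statements.intersection(truth_statements))  # in both answers
    fp = len(pred_statements.difference(truth_statements))    # only in the prediction
    fn = len(truth_statements.difference(pred_statements))    # only in the ground truth
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 2 TP, 1 FP, 1 FN -> precision 2/3, recall 2/3, F1 = 0.667
pred = {"the fund invests in government bonds", "fees are 1.5% per year", "the fund is domiciled in italy"}
truth = {"the fund invests in government bonds", "fees are 1.5% per year", "the benchmark is the msci world"}
print(round(answer_correctness_f1(pred, truth), 3))
        </preformat>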
        <p>(ii) Answer Relevance (https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html) measures how pertinent the generated answer is to the prompt given to the LLM in the generation step. It computes a score in the range (0, 1) as the mean of the cosine similarities between the original question and a set of artificial questions generated by an LLM on the basis of the predicted answer and the given context. The score is defined as: answer relevance = (1/n) · Σ_{i=1..n} sim(E_o, E_qi), where sim denotes the cosine similarity, E_o is the embedding of the original question, and E_qi is the embedding of the i-th generated question.</p>
        <p>From TruLens we used the implemented Answer Relevance metric, which prompts an LLM to evaluate the relevance of the answer with respect to an input prompt that includes the context and the question. The score that the LLM assigns to each answer is in the range (0, 1).</p>
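        <p>The RAGAS-style score above can be sketched as a mean of cosine similarities. The question-generation step is represented only by its output (a list of artificial questions), and the embedding helper mirrors the one sketched in the Introduction; this is an illustrative approximation, not the library's code.</p>
        <preformat>
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(question_vec: np.ndarray, generated_question_vecs: list[np.ndarray]) -> float:
    """Mean cosine similarity between the original question and the questions
    an LLM generated back from the predicted answer (and its context)."""
    sims = [cosine(question_vec, q) for q in generated_question_vecs]
    return float(np.mean(sims))

# Hypothetical usage with an embed() helper like the one sketched earlier:
# e_o = embed(["What is the fund's risk profile?"])[0]
# e_q = list(embed(["Which risk profile does the fund have?",
#                   "How risky is the fund?",
#                   "What asset classes does the fund hold?"]))
# print(answer_relevance(e_o, e_q))
        </preformat>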
        <p>To study the interrelationships and relative effectiveness of the various evaluation metrics, we exploit the Spearman correlation coefficient. The Spearman rank correlation [19] is a non-parametric measure that assesses the statistical dependence between the rankings of two variables: it tells how well the relationship between the variables can be described by a monotonic function. The measure is computed on ranked data, allowing the analysis of both ordinal variables and continuous variables that have been converted into ranks. The Spearman rank correlation coefficient is denoted by ρ, and its value ranges from −1 to 1 inclusive, where 1 indicates a perfect positive correlation, 0 indicates no correlation, and −1 indicates a perfect negative correlation.</p>
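        <p>Concretely, the comparison between human scores and an automated metric can be carried out with an off-the-shelf implementation such as scipy.stats.spearmanr; the score values below are invented purely for illustration.</p>
        <preformat>
from scipy.stats import spearmanr

# One entry per question-answer pair: 1-5 human Likert scores and the
# corresponding automated metric scores in (0, 1). Values are illustrative only.
human_scores = [5, 4, 2, 5, 1, 3, 4, 2]
metric_scores = [0.91, 0.74, 0.35, 0.88, 0.10, 0.52, 0.69, 0.41]

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
        </preformat>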
      </sec>
      <sec id="sec-1-3">
        <title>4.3. Settings</title>
        <p>For this implementation, we employed OpenAI models for the embedding, retrieval, and generation stages of the RAG system and to implement the evaluations with RAGAS and TruLens. The Ingestion step produced chunks of 1024 characters, balancing semantic integrity with the need to avoid irrelevant or redundant information: larger chunks may capture more context but increase noise, while smaller sizes may sacrifice contextual information. These chunks were embedded using OpenAI's text-embedding-ada-002 (https://openai.com/blog/new-and-improved-embedding-model), a state-of-the-art transformer model for generating high-quality text embeddings. For retrieval within the vector store, the system identified the 10 indexed chunks whose embeddings were most similar to the query embedding. During generation, we employed the GPT-4-Turbo model (https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) with the following prompt structure:</p>
        <preformat>
You are a chatbot having a conversation with a human.

Given the following extracted parts of a long document and a question, create a final answer.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {CONTEXT}
Chat history: {CHAT_HISTORY}
Human: {HUMAN_INPUT}
Chatbot:
        </preformat>
        <p>This prompt provided the model with instructions and context, and encouraged concise, truthful answers without fabrication.</p>
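        <p>Putting these settings together, the generation step can be sketched as follows. Only the prompt structure is taken from this paper; the helper name, the way retrieved chunks and chat history are flattened into the prompt, and the temperature setting are assumptions for this example.</p>
        <preformat>
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

PROMPT_TEMPLATE = """You are a chatbot having a conversation with a human.

Given the following extracted parts of a long document and a question, create a final answer.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {CONTEXT}
Chat history: {CHAT_HISTORY}
Human: {HUMAN_INPUT}
Chatbot:"""

def generate_answer(context_chunks: list[str], chat_history: str, question: str) -> str:
    """Generation step: fill the prompt with the retrieved chunks and query GPT-4-Turbo."""
    prompt = PROMPT_TEMPLATE.format(
        CONTEXT="\n\n".join(context_chunks),
        CHAT_HISTORY=chat_history,
        HUMAN_INPUT=question,
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# answer = generate_answer(context, "", "What is the fund's risk profile?")
        </preformat>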
      </sec>
      <sec id="sec-1-5">
        <title>4.4. Results</title>
        <p>For both the books and movies subsamples of the NarrativeQA dataset, as can be seen in Table 2 and Table 3, human judgment shows a moderately strong Spearman correlation with BEM (0.735 and 0.704) and with the AC RAGAS scores for both GPT-3.5-turbo (0.718 and 0.792) and GPT-4-turbo (0.67 and 0.781). This indicates that these ground truth-based metrics are more aligned with human perception of answer quality. Reference-free metrics show poor correlation with human judgment, especially AR RAGAS (0.234 and 0.483), highlighting that evaluating an answer without a ground truth is still a challenging problem for Large Language Models. The analysis of the FinAM-it dataset, as can be seen in Table 4, shows generally lower correlations across all metrics, with the highest correlation observed between human judgment and AC RAGAS with gpt-4-turbo (0.531). This could be related to the fact that the FinAM-it dataset presents more challenging and diverse content that is more difficult to evaluate. Extending the analysis to all the datasets at once, it can be seen that all the metrics still have difficulties approximating human evaluation in a robust and reliable way.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <sec id="sec-2-1">
        <title>Our exploration into evaluating Retrieval Augmented</title>
        <p>Generation (RAG) systems via ground truth-based and
reference-free metrics was driven by the need for reliable
evaluation frameworks, particularly for scenarios lacking
ground truth data. Our evaluation framework’s
implementation has demonstrated its potential for facilitating
a more comprehensive understanding of these systems’
capabilities in such situations. Through rigorous
experimentation across diferent domains and datasets,
including NarrativeQA and a specialized industrial dataset, we
compared various evaluation methodologies against hu- turing nuanced aspects of human judgment, suggesting
man judgment. While ground truth-based metrics like an urgent need for further refinement of reference-free
BEM and AC RAGAS showed moderate to strong correla- evaluation methods. The Spearman correlation analysis
tion with human judgments across diferent domains and reveals that while some metrics align more closely with
models, reference-free metrics still face significant chal- human assessments, there is still significant room for
imlenges in achieving similar correlation levels. This high- provement, especially for more challenging and diverse
lights the current limitations of automated metrics in cap- content like the FinAM-it dataset. These findings
under</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>