<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.18653/v1</article-id>
      <title-group>
        <article-title>Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ermelinda Oro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Maria Granata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Lanza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amir Bachir</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca De Grandis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimo Rufolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Altilia srl, TechNest Start-up Incubator of University of Calabria</institution>
          ,
          <addr-line>Piazza Vermicelli, Rende (CS), 87036</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Research Council, Institute for High Performance Computing and Networking</institution>
          ,
          <addr-line>via P. Bucci 8/9C, Rende (CS), 87036</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>15217</volume>
      <fpage>2383</fpage>
      <lpage>2392</lpage>
      <abstract>
        <p>We present a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems designed for question-answering tasks using large language models (LLMs). The proposed framework integrates document ingestion, information retrieval, answer generation, and evaluation phases. Both ground truth-based and reference-free evaluation metrics are implemented to provide a multi-faceted assessment approach. Through experiments across diverse datasets, including NarrativeQA and a proprietary financial dataset (FinAM-it), the reliability of existing metrics is investigated by comparing them against rigorous human evaluations. The results demonstrate that ground truth-based metrics such as BEM and RAGAS Answer Correctness exhibit a moderately strong correlation with human judgments. However, reference-free metrics still struggle to accurately capture nuances in answer quality without predefined correct responses. An in-depth analysis of Spearman correlation coefficients sheds light on the interrelationships and relative effectiveness of various evaluation approaches across multiple domains. While highlighting the current limitations of reference-free methodologies, the study underscores the need for more sophisticated techniques to better approximate human perception of answer relevance and correctness. Overall, this research contributes to ongoing efforts in developing reliable evaluation frameworks for RAG systems, paving the way for advancements in natural language processing and the realization of highly accurate and human-like AI systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Retrieval Augmented Generation (RAG)</kwd>
        <kwd>Question Answering (QA)</kwd>
        <kwd>Retrieval</kwd>
        <kwd>Large Language Model (LLM)</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Our RAG system is designed to enhance answer accuracy and relevance. The process is segmented into four main phases. Ingestion: input documents are processed into manageable chunks, leveraging techniques like document layout analysis for PDFs; the chunks are embedded into high-dimensional vectors capturing their semantic essence and ingested into a vector store for efficient similarity search. Retrieval: upon receiving a query, its vector form undergoes similarity search in the vector store to identify the most relevant chunks, narrowing the information down to the chunks most pertinent for answer generation. Generation: a Large Language Model (LLM) synthesizes information from the retrieved chunks to construct a coherent and natural-sounding answer to the query. Evaluation: a two-sided approach employs both ground-truth dependent and independent metrics; ground-truth dependent metrics assess correctness against predefined answers, while ground-truth independent metrics evaluate answer relevance without a predefined answer set. This dual approach enables a comprehensive assessment of performance, correctness, and overall text quality. The system can also receive human evaluations of question-answer pairs to assess metric reliability and alignment with expectations.</p>
      <p>Figure 1: The simplified figure of the implemented RAG System.</p>
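      <p>For concreteness, the ingestion and retrieval phases can be sketched as follows. This is a minimal illustration, not the system's actual implementation: it assumes OpenAI's embedding API (as used later in the Settings), a fixed character-based chunker, and a brute-force in-memory cosine search; all function names and the input file are hypothetical.</p>
      <preformat>
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()
EMBED_MODEL = "text-embedding-ada-002"

def chunk_text(text: str, size: int = 1024) -> list[str]:
    """Ingestion: split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed chunks (or a query) into high-dimensional vectors."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 10) -> list[str]:
    """Retrieval: cosine-similarity search over the ingested chunk vectors."""
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# Usage: ingest a document once, then answer queries against it.
document = open("fund_prospectus.txt").read()   # hypothetical input file
chunks = chunk_text(document)
chunk_vecs = embed(chunks)
context = retrieve("What is the fund's risk profile?", chunks, chunk_vecs)
      </preformat>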
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>RAG systems have been implemented in various forms [1, 2, 3, 4, 5], incorporating advanced strategies like document splitting, chunking, retrieval, and diverse models for embedding and language generation, including proprietary and open-source models from platforms like HuggingFace (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). We have also explored different variants of RAG systems; however, this paper's primary focus is not to introduce a novel RAG system or methodology but to comprehensively evaluate the effectiveness of Large Language Model (LLM)-derived metrics, with an emphasis on reference-free approaches.</p>
        <p>Several prior works have proposed frameworks and novel metrics that leverage the capabilities of LLMs [6, 7, 8, 9, 10, 11]. Unlike these existing solutions, which aim to score different RAG systems or propose new evaluation methods, metrics, or datasets, our research specifically targets the potential satisfaction of the end-user customers who receive the evaluation scores generated by such systems.</p>
        <p>By concentrating on the practical utility and interpretability of evaluation metrics from the perspective of end-users, our study diverges from the conventional approach of optimizing technical performance alone. Instead, we strive to bridge the gap between state-of-the-art evaluation techniques and the real-world expectations of customers who rely on these systems for decision-making and information retrieval.</p>
      </sec>
      <sec id="sec-1-1a">
        <title>3.1. Framework for RAG and evaluation</title>
        <p>This paper introduces a framework for running and evaluating a RAG system that efficiently processes and responds to natural language queries. The system integrates state-of-the-art technologies to enhance answer accuracy and relevance (Figure 1); its four phases are described in the Introduction.</p>
      </sec>
      <sec id="sec-1-1b">
        <title>3.2. Evaluation Strategies</title>
        <p>In our RAG system, we implemented and tested a wide range of evaluation metrics. Specifically, our system incorporates metrics for assessing individual RAG components, such as Information Retrieval (IR) and Answer Generation, as well as the overall pipeline. For IR, we used classical metrics such as Recall@K, Precision@K, mAP, MRR, and nDCG. For answer generation, the implemented metrics were divided into two categories. Syntactic metrics evaluate formal response aspects, including BLEU [12], ROUGE [13], Precision, Recall, F1, and Exact Match [14]; these focus on text properties rather than semantic meaning. Semantic metrics evaluate response meaning, including the BERT score [15] and the BEM score [16]; BEM is preferred over BERT due to its reported correlation with human evaluations and our own empirical findings.</p>
        <p>LLM-derived metrics: we implemented in our framework the RAG triad of metrics for the three main steps of a RAG's execution [6]: (i) Context relevance, which assesses whether the returned passage is relevant for answering the given query; (ii) Groundedness, which assesses whether the generated answer is faithful to the retrieved passage or contains hallucinated or extrapolated statements beyond it; (iii) Answer relevance, which assesses whether the generated answer is relevant given the query and the retrieved passage. In addition, we implemented Answer correctness, which exploits LLMs and gold answers to measure the factual correctness of an answer. In this paper, only a subset of these metrics is considered and compared for assessing the quality of the answers (see Section 4.2).</p>
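        <p>As an illustration of the classical IR and syntactic answer metrics listed above, the following sketch shows one possible way to compute Recall@K, Precision@K, MRR, Exact Match, and a token-level F1. The function names and the normalization choices are assumptions made for this example, not the framework's actual code.</p>
        <preformat>
from collections import Counter

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks found among the top-k retrieved ones."""
    top_k = set(retrieved_ids[:k])
    return len(top_k.intersection(relevant_ids)) / max(len(relevant_ids), 1)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    rel = set(relevant_ids)
    return sum(1 for c in retrieved_ids[:k] if c in rel) / max(k, 1)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk (0 if none is retrieved)."""
    for rank, c in enumerate(retrieved_ids, start=1):
        if c in relevant_ids:
            return 1.0 / rank
    return 0.0

def exact_match(prediction, reference):
    """1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    ref_counts = Counter(ref)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(pred).items())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(recall_at_k(["c3", "c7", "c1"], {"c1", "c9"}, k=3))  # 0.5
print(token_f1("the fund invests in bonds", "the fund mainly invests in bonds"))
        </preformat>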
      </sec>
      <sec id="sec-1-2">
        <title>Manual evaluation</title>
        <p>To verify the reliability of automated evaluation metrics, we implemented a rigorous manual evaluation process to assess the relevance, accuracy, and coherence of the answers generated by our RAG system. This manual evaluation was conducted by three independent human annotators, each with expertise in the domain of the questions posed to the system. For each evaluation session, the annotators were presented with the question, the corresponding answer generated by the RAG system, and the ground truth provided by the original dataset or by the customer. The primary task for each annotator was to assess the quality of the generated answer in relation to the posed question, employing a discrete 5-point Likert scale. The criteria for scoring were as follows: 1. Very Poor: the generated answer is totally incorrect or irrelevant to the question, indicating a failure of the system to comprehend the query or retrieve pertinent information. 2. Poor: the generated answer is predominantly incorrect but with glimpses of relevance, suggesting some level of understanding or appropriate retrieval. 3. Neither: the generated answer mixes relevant and irrelevant information almost equally, showcasing the system's partial success in addressing the query. 4. Good: the generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding of and response to the question. 5. Very Good: reserved for answers that are completely correct and fully relevant, reflecting an ideal outcome where the system accurately understood and responded to the query.</p>
        <p>The annotators conducted their assessments independently to ensure unbiased evaluations. Upon completion, the scores for each question-answer pair were collected and compared. In cases of discrepancy, a consensus discussion was initiated among the annotators to agree on the most accurate score. This consensus process mitigated individual bias and allowed different perspectives to be considered in evaluating the quality of the generated answers. The manual evaluation is particularly helpful for assessing the reliability and validity of our system's automated evaluation metrics: by comparing the human-generated scores against the results produced by these automated measures, we can determine the extent to which the automatic metrics accurately reflect human judgment and perception of answer quality.</p>
      </sec>
      <sec id="sec-1-2a">
        <title>4. Experiments</title>
        <p>Considering different domains (Section 4.1), we investigate the reliability of a subset of existing metrics (Section 4.2) for evaluating a RAG system (Section 3.1). We also explore the feasibility of adopting reference-free metrics and the correlation between them and the human evaluation (Section 3.2).</p>
      </sec>
      <sec id="sec-1-2b">
        <title>4.1. Datasets</title>
        <p>NarrativeQA - English. A subsample of the NarrativeQA dataset [17] was used, with 50 book-related and 50 movie script-related questions (1% of the test set), spanning 41 unique books and 42 unique movie scripts. This allowed evaluating the RAG system's performance across two distinct narrative content types.</p>
        <p>Financial Asset Management - Italian. The FinAM-it dataset, created by Altilia, consists of 50 question-answer pairs from Italian asset management documents on topics like investment strategies, risk management, and regulatory compliance. The questions are complex and diverse, often requiring information from multiple paragraphs, with detailed, conversational-style answers.</p>
      </sec>
      <sec id="sec-1-2c">
        <title>4.2. Metrics</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Naming and classification of the metrics used in the experimental evaluation.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Acronym</th><th>Name - Framework</th><th>Type</th></tr>
            </thead>
            <tbody>
              <tr><td>BEM</td><td>BEM score - TensorFlow</td><td>GT-based</td></tr>
              <tr><td>AR TruLens</td><td>Answer Relevance - TruLens</td><td>GT-free</td></tr>
              <tr><td>AR RAGAS</td><td>Answer Relevance - RAGAS</td><td>GT-free</td></tr>
              <tr><td>AC RAGAS</td><td>Answer Correctness - RAGAS</td><td>GT-based</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In this paper, we focus on evaluating the quality of the answers generated by the entire pipeline. In our analysis, we considered the BEM score (BERT matching score) [16], which in our experiments was the most satisfactory among the classic metrics. It uses a BERT model [18] trained to solve an answer equivalence task: a classifier is trained to decide whether two given answers are equivalent and returns an equivalence score. We use the variation of the BEM score that takes the two answers and the question as model input, which has been reported to perform better [16].</p>
        <p>In addition, we considered novel LLM-derived metrics developed in the RAGAS [6] and TruLens (https://www.trulens.org/) systems. These metrics offer both ground truth-based and reference-free evaluations. In particular, from RAGAS we used the two main metrics that focus on answers: Answer Correctness and Answer Relevance. More in detail: (i) Answer Correctness (https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html) measures the factual correctness of an answer and requires a ground truth. It employs an LLM to extract factual statements from both the predicted answer and the ground truth, labeling them as True Positives if they are present in both answers, False Negatives if they are present only in the ground truth, and False Positives if they are present only in the prediction. A final F1 score is then calculated; this score, in the range (0, 1), is the Answer Correctness.</p>
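        <p>The final scoring step of Answer Correctness can be illustrated as follows. The LLM-based extraction and matching of factual statements is not shown: the sketch starts from statement sets that are assumed to be already extracted and normalized, so it only demonstrates the F1 computation described above and is not RAGAS's implementation.</p>
        <preformat>
def answer_correctness_f1(pred_statements: set[str], truth_statements: set[str]) -> float:
    """F1 over factual statements; equal strings are assumed to denote
    semantically equivalent claims (the LLM extraction step is omitted)."""
    tp = len(pred_statements.intersection(truth_statements))  # in both answers
    fp = len(pred_statements.difference(truth_statements))    # only in the prediction
    fn = len(truth_statements.difference(pred_statements))    # only in the ground truth
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 2 TP, 1 FP, 1 FN -> precision 2/3, recall 2/3, F1 = 0.667
pred = {"the fund invests in government bonds", "fees are 1.5% per year", "the fund is domiciled in italy"}
truth = {"the fund invests in government bonds", "fees are 1.5% per year", "the benchmark is the msci world"}
print(round(answer_correctness_f1(pred, truth), 3))
        </preformat>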
        <p>(ii) Answer Relevance (https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html) measures how pertinent the generated answer is to the prompt given to the LLM in the generation step. It computes a score in the range (0, 1) as the mean of the cosine similarities between the original question and a set of artificial questions generated by an LLM on the basis of the predicted answer and the given context. The score is defined as: answer relevance = (1/n) · Σ_{i=1..n} sim(E_o, E_qi), where sim denotes the cosine similarity, E_o is the embedding of the original question, and E_qi is the embedding of the i-th generated question.</p>
        <p>From TruLens we used the implemented Answer Relevance metric, which prompts an LLM to evaluate the relevance of the answer with respect to an input prompt that includes the context and the question. The score that the LLM assigns to each answer is in the range (0, 1).</p>
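        <p>The RAGAS-style score above can be sketched as a mean of cosine similarities. The question-generation step is represented only by its output (a list of artificial questions), and the embedding helper mirrors the one sketched in the Introduction; this is an illustrative approximation, not the library's code.</p>
        <preformat>
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(question_vec: np.ndarray, generated_question_vecs: list[np.ndarray]) -> float:
    """Mean cosine similarity between the original question and the questions
    an LLM generated back from the predicted answer (and its context)."""
    sims = [cosine(question_vec, q) for q in generated_question_vecs]
    return float(np.mean(sims))

# Hypothetical usage with an embed() helper like the one sketched earlier:
# e_o = embed(["What is the fund's risk profile?"])[0]
# e_q = list(embed(["Which risk profile does the fund have?",
#                   "How risky is the fund?",
#                   "What asset classes does the fund hold?"]))
# print(answer_relevance(e_o, e_q))
        </preformat>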
        <p>To study the interrelationships and relative effectiveness of the various evaluation metrics, we exploit the Spearman correlation coefficient. The Spearman rank correlation [19] is a non-parametric measure that assesses the statistical dependence between the rankings of two variables: it tells how well the relationship between the variables can be described by a monotonic function. The measure is computed on ranked data, allowing the analysis of both ordinal variables and continuous variables that have been converted into ranks. The Spearman rank correlation coefficient is denoted by ρ, and its value ranges from −1 to 1 inclusive, where 1 indicates a perfect positive correlation, 0 indicates no correlation, and −1 indicates a perfect negative correlation.</p>
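        <p>Concretely, the comparison between human scores and an automated metric can be carried out with an off-the-shelf implementation such as scipy.stats.spearmanr; the score values below are invented purely for illustration.</p>
        <preformat>
from scipy.stats import spearmanr

# One entry per question-answer pair: 1-5 human Likert scores and the
# corresponding automated metric scores in (0, 1). Values are illustrative only.
human_scores = [5, 4, 2, 5, 1, 3, 4, 2]
metric_scores = [0.91, 0.74, 0.35, 0.88, 0.10, 0.52, 0.69, 0.41]

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
        </preformat>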
      </sec>
      <sec id="sec-1-3">
        <title>4.3. Settings</title>
        <p>For this implementation, we employed OpenAI models for the embedding, retrieval, and generation stages of the RAG system and to implement the evaluations with RAGAS and TruLens. The Ingestion step produced chunks of 1024 characters, balancing semantic integrity with the need to avoid irrelevant or redundant information: larger chunks may capture more context but increase noise, while smaller sizes may sacrifice contextual information. These chunks were embedded using OpenAI's text-embedding-ada-002 (https://openai.com/blog/new-and-improved-embedding-model), a state-of-the-art transformer model for generating high-quality text embeddings. For retrieval within the vector store, the system identified the 10 indexed chunks whose embeddings were most similar to the query embedding. During generation, we employed the GPT-4-Turbo model (https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) with the following prompt structure:</p>
        <preformat>
You are a chatbot having a conversation with a human.

Given the following extracted parts of a long document and a question, create a final answer.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {CONTEXT}
Chat history: {CHAT_HISTORY}
Human: {HUMAN_INPUT}
Chatbot:
        </preformat>
        <p>This prompt provided the model with instructions and context, and encouraged concise, truthful answers without fabrication.</p>
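        <p>Putting these settings together, the generation step can be sketched as follows. Only the prompt structure is taken from this paper; the helper name, the way retrieved chunks and chat history are flattened into the prompt, and the temperature setting are assumptions for this example.</p>
        <preformat>
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

PROMPT_TEMPLATE = """You are a chatbot having a conversation with a human.

Given the following extracted parts of a long document and a question, create a final answer.

If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {CONTEXT}
Chat history: {CHAT_HISTORY}
Human: {HUMAN_INPUT}
Chatbot:"""

def generate_answer(context_chunks: list[str], chat_history: str, question: str) -> str:
    """Generation step: fill the prompt with the retrieved chunks and query GPT-4-Turbo."""
    prompt = PROMPT_TEMPLATE.format(
        CONTEXT="\n\n".join(context_chunks),
        CHAT_HISTORY=chat_history,
        HUMAN_INPUT=question,
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

# answer = generate_answer(context, "", "What is the fund's risk profile?")
        </preformat>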
      </sec>
      <sec id="sec-1-5">
        <title>4.4. Results</title>
        <p>For both the books and movies subsamples of the NarrativeQA dataset, as can be seen in Table 2 and Table 3, human judgment shows a moderately strong Spearman correlation with BEM (0.735 and 0.704) and with the AC RAGAS scores for both GPT-3.5-turbo (0.718 and 0.792) and GPT-4-turbo (0.67 and 0.781). This indicates that these ground truth-based metrics are more aligned with human perception of answer quality. Reference-free metrics show poor correlation with human judgment, especially AR RAGAS (0.234 and 0.483), highlighting that evaluating an answer without a ground truth is still a challenging problem for Large Language Models. The analysis of the FinAM-it dataset, as can be seen in Table 4, shows generally lower correlations across all metrics, with the highest correlation observed between human judgment and AC RAGAS with gpt-4-turbo (0.531). This could be related to the fact that the FinAM-it dataset presents more challenging and diverse content that is more difficult to evaluate. Extending the analysis to all the datasets at once, it can be seen that all the metrics still have difficulties approximating human evaluation in a robust and reliable way.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <sec id="sec-2-1">
        <title>Our exploration into evaluating Retrieval Augmented</title>
        <p>Generation (RAG) systems via ground truth-based and
reference-free metrics was driven by the need for reliable
evaluation frameworks, particularly for scenarios lacking
ground truth data. Our evaluation framework’s
implementation has demonstrated its potential for facilitating
a more comprehensive understanding of these systems’
capabilities in such situations. Through rigorous
experimentation across diferent domains and datasets,
including NarrativeQA and a specialized industrial dataset, we
compared various evaluation methodologies against hu- turing nuanced aspects of human judgment, suggesting
man judgment. While ground truth-based metrics like an urgent need for further refinement of reference-free
BEM and AC RAGAS showed moderate to strong correla- evaluation methods. The Spearman correlation analysis
tion with human judgments across diferent domains and reveals that while some metrics align more closely with
models, reference-free metrics still face significant chal- human assessments, there is still significant room for
imlenges in achieving similar correlation levels. This high- provement, especially for more challenging and diverse
lights the current limitations of automated metrics in cap- content like the FinAM-it dataset. These findings
under</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>