=Paper=
{{Paper
|id=Vol-3762/495
|storemode=property
|title=Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models
|pdfUrl=https://ceur-ws.org/Vol-3762/495.pdf
|volume=Vol-3762
|authors=Ermelinda Oro,Francesco Granata,Antonio Lanza,Amir Bachir,Luca De Grandis,Massimo Ruffolo
|dblpUrl=https://dblp.org/rec/conf/ital-ia/OroGLBGR24
}}
==Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models==
Ermelinda Oro1,2,* , Francesco Maria Granata2 , Antonio Lanza2 , Amir Bachir2 ,
Luca De Grandis2 and Massimo Ruffolo1,2
1 National Research Council, Institute for High Performance Computing and Networking, via P. Bucci 8/9C, Rende (CS), 87036, Italy
2 Altilia srl, TechNest Start-up Incubator of University of Calabria, Piazza Vermicelli, Rende (CS), 87036, Italy
Abstract
We present a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems designed for question-
answering tasks using large language models (LLMs). The proposed framework integrates document ingestion, information
retrieval, answer generation, and evaluation phases. Both ground truth-based and reference-free evaluation metrics are
implemented to provide a multi-faceted assessment approach. Through experiments across diverse datasets like NarrativeQA
and a proprietary financial dataset (FinAM-it), the reliability of existing metrics is investigated by comparing them against
rigorous human evaluations. The results demonstrate that ground truth-based metrics such as BEM and RAGAS Answer
Correctness exhibit a moderately strong correlation with human judgments. However, reference-free metrics still struggle
to accurately capture nuances in answer quality without predefined correct responses. An in-depth analysis of Spearman
correlation coefficients sheds light on the interrelationships and relative effectiveness of various evaluation approaches across
multiple domains. While highlighting the current limitations of reference-free methodologies, the study underscores the need
for more sophisticated techniques to better approximate human perception of answer relevance and correctness. Overall, this
research contributes to ongoing efforts in developing reliable evaluation frameworks for RAG systems, paving the way for
advancements in natural language processing and the realization of highly accurate and human-like AI systems.
Keywords
Retrieval Augmented Generation (RAG), Question Answering (QA), Retrieval, Large Language Model (LLM), Evaluation
1. Introduction

Retrieval-Augmented Generation (RAG) systems, which integrate information retrieval with natural language generation, have shown promise in enhancing language models' capabilities. However, evaluating their performance remains challenging, particularly when ground truth data is unavailable, impeding accurate assessments of system utility. To address this challenge, we present a comprehensive framework designed to facilitate the rigorous evaluation of RAG systems for question-answering tasks. Our framework integrates document ingestion, retrieval, generation, and evaluation phases, leveraging state-of-the-art technologies to optimize accuracy and relevance. We implement both ground truth-based and reference-free evaluation metrics, providing a multi-faceted approach to assessing system outputs. Through an extensive series of experiments spanning diverse domains and datasets, we investigate the reliability and validity of existing evaluation methodologies. Specifically, we examine the correlation between various metrics and rigorous human evaluations, shedding light on their strengths, limitations, and potential for improvement. Our findings reveal that while ground truth-based metrics like BEM and RAGAS Answer Correctness exhibit moderate alignment with human judgments, reference-free metrics still struggle to accurately capture answer quality nuances without predefined correct responses. By analyzing Spearman correlation coefficients, we elucidate the interrelationships and relative effectiveness of different evaluation approaches across multiple domains.

This research makes the following key contributions: (i) presenting a comprehensive framework for evaluating RAG systems with state-of-the-art components, (ii) implementing and comparing diverse ground truth-based and reference-free evaluation metrics, (iii) conducting rigorous experiments across multiple datasets to assess metric reliability against human judgments, and (iv) analyzing the strengths and limitations of existing metrics, highlighting the need for advanced reference-free evaluation techniques that better approximate human perception.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 presents the method. Section 4 shows the experimental evaluation, and Section 5 concludes the work.

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
ermelinda.oro@icar.cnr.it (E. Oro); francesco.granata@altiliagroup.com (F. M. Granata); antonio.lanza@altiliagroup.com (A. Lanza); amir.bachir@altiliagroup.com (A. Bachir); luca.degrandis@altiliagroup.com (L. De Grandis); massimo.ruffolo@altiliagroup.com (M. Ruffolo)
ORCID: 0000-0002-5529-1007 (E. Oro); 0000-0003-4425-753X (F. M. Granata); 0000-0002-2875-4133 (L. De Grandis); 0000-0002-4094-4810 (M. Ruffolo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
2. Related Work

RAG systems have been implemented in various forms [1, 2, 3, 4, 5], incorporating advanced strategies like document splitting, chunking, retrieval, and diverse models for embedding and language generation, including proprietary and open-source models from platforms like HuggingFace (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). We have also explored different variants of RAG systems; however, this paper's primary focus is not to introduce a novel RAG system or methodology but to comprehensively evaluate the effectiveness of Large Language Model (LLM)-derived metrics, emphasizing reference-free approaches.

Several prior works have proposed frameworks and novel metrics that leverage the capabilities of LLMs [6, 7, 8, 9, 10, 11]. Unlike these existing solutions, which aim to score different RAG systems or propose new evaluation methods, metrics, or datasets, our research is specifically targeted at evaluating the potential satisfaction of end-user customers who receive the evaluation scores generated by such systems.

By concentrating on the practical utility and interpretability of evaluation metrics from the perspective of end-users, our study diverges from the conventional approach of optimizing technical performance alone. Instead, we strive to bridge the gap between state-of-the-art evaluation techniques and the real-world expectations of customers who rely on these systems for decision-making and information retrieval.

3. Method

3.1. Framework for RAG and evaluation

This paper introduces a framework for running and evaluating a RAG system that efficiently processes and responds to natural language queries. The system integrates state-of-the-art technologies to enhance answer accuracy and relevance. The process is segmented into four main phases. Ingestion: input documents are processed into manageable chunks, leveraging techniques like document layout analysis for PDFs. The chunks are embedded into high-dimensional vectors capturing their semantic essence and ingested into a vector store for efficient similarity search. Retrieval: upon receiving a query, its vector form undergoes similarity search in the vector store to identify the k most relevant chunks. This narrows down the information to the chunks most pertinent for answer generation. Generation: a Large Language Model (LLM) synthesizes information from the retrieved chunks to construct a coherent and natural-sounding answer to the query. Evaluation: a two-sided approach employs both ground-truth dependent and independent metrics. Ground-truth dependent metrics assess correctness against predefined answers, while ground-truth independent metrics evaluate answer relevance without a predefined set. This dual approach enables a comprehensive assessment of performance, correctness, and overall text quality. The system can also receive human evaluations of question-answer pairs to evaluate metric reliability and alignment with expectations.

Figure 1: A simplified view of the implemented RAG system.

3.2. Evaluation Strategies

In our RAG system, we implemented and tested a wide range of evaluation metrics. Specifically, our system incorporates metrics for assessing individual RAG components, like Information Retrieval (IR) and Answer Generation, as well as the overall pipeline. For IR, we used classical metrics such as Recall@K, Precision@K, mAP, MRR, and nDCG. For answer generation, the implemented metrics were divided into two categories. Syntactic metrics evaluate formal response aspects, including BLEU [12], ROUGE [13], Precision, Recall, F1, and Exact Match [14]; these focus on text properties rather than semantic meaning. Semantic metrics evaluate response meaning, including the BERT score [15] and the BEM score [16]; BEM is preferred over BERT due to its reported correlation with human evaluations and our empirical findings. LLM-derived metrics: we implemented in our framework the RAG triad of metrics for the three main steps of a RAG's execution [6]: (i) Context relevance assesses whether the returned passage is relevant for answering the given query. (ii) Groundedness assesses whether the generated answer is faithful to the retrieved passage or contains hallucinated or extrapolated statements beyond the passage. (iii) Answer relevance assesses whether the generated answer is relevant given the query and retrieved passage. In addition, we implemented Answer correctness, which exploits LLMs and gold answers to measure the factual correctness of an answer. In this paper, only a subset of metrics is considered and compared for assessing the quality of the answers (see Section 4.2).
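To make the pipeline of Section 3.1 concrete, the four phases can be sketched in a few lines of Python. This is a simplified, hypothetical illustration: the bag-of-words `embed` function and the `generate` stub stand in for the production embedding model and LLM, and the fixed-size chunker stands in for layout-aware splitting.

```python
from collections import Counter
from math import sqrt

# Toy bag-of-words "embedding"; a stand-in for the neural embedding
# model used in the real ingestion phase (illustration only).
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

# Cosine similarity between two sparse bag-of-words vectors.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Phase 1 - Ingestion: split a document into chunks and index their embeddings.
def ingest(document: str, chunk_size: int = 64):
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [(c, embed(c)) for c in chunks]

# Phase 2 - Retrieval: return the k chunks most similar to the query.
def retrieve(store, query: str, k: int = 2):
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Phase 3 - Generation: in the real system an LLM synthesizes the answer
# from the retrieved chunks; here a stub concatenates them for illustration.
def generate(query: str, context) -> str:
    return f"Answer to '{query}' based on: " + " | ".join(context)

store = ingest("RAG systems retrieve relevant chunks from a vector store. "
               "A large language model then generates an answer from the retrieved context.")
answer = generate("How does retrieval work?", retrieve(store, "retrieve relevant chunks"))
```

Phase 4 (evaluation) then scores such answers with the ground truth-based and reference-free metrics described below.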
Manual evaluation. To verify the reliability of automated evaluation metrics, we implemented a rigorous manual evaluation process to assess the relevance, accuracy, and coherence of the answers generated by our RAG system. This manual evaluation was conducted by three independent human annotators, each with expertise in the domain of the questions posed to the system. For each evaluation session, the annotators were presented with the question, the corresponding answer generated by the RAG system, and the ground truth provided by the original dataset or by the customer answers. The primary task for each annotator was to assess the quality of the generated answer in relation to the posed question, employing a discrete 5-point Likert scale. The criteria for scoring were as follows:
1. Very Poor: the generated answer is totally incorrect or irrelevant to the question, indicating a failure of the system to comprehend the query or retrieve pertinent information.
2. Poor: the generated answer is predominantly incorrect but shows glimpses of relevance, suggesting some level of understanding or appropriate retrieval.
3. Neither: the generated answer mixes relevant and irrelevant information almost equally, showcasing the system's partial success in addressing the query.
4. Good: the generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding of and response to the question.
5. Very Good: reserved for answers that are completely correct and fully relevant, reflecting an ideal outcome where the system accurately understood and responded to the query.
The annotators conducted their assessments independently to ensure unbiased evaluations. Upon completion, the scores for each question-answer pair were collected and compared. In cases of discrepancy, a consensus discussion was initiated among the annotators to agree on the most accurate score. This consensus process mitigated individual bias and allowed different perspectives to be considered in evaluating the quality of the generated answers. This manual evaluation process helps particularly in assessing the reliability and validity of our system's automated evaluation metrics. By comparing the human-generated scores against the results produced by these automated measures, we can determine the extent to which the automatic metrics accurately reflect human judgment and perception of answer quality.

4. Experiments

Considering different domains (Section 4.1), we investigate the reliability of a subset of existing metrics (Section 4.2) for evaluating a RAG system (Section 3.1). We explore the feasibility of adopting reference-free metrics and their correlation with the human evaluation (Section 3.2).

4.1. Datasets

NarrativeQA - English. A subsample of the NarrativeQA dataset [17] was used, with 50 book-related and 50 movie script-related questions (1% of the test set), spanning 41 unique books and 42 unique movie scripts. This allowed evaluating the RAG system's performance across two distinct narrative content types.

Financial Asset Management - Italian. The FinAM-it dataset, created by Altilia, consists of 50 question-answer pairs from Italian asset management documents on topics like investment strategies, risk management, and regulatory compliance. The questions are complex and diverse, often requiring information from multiple paragraphs, with detailed, conversational-style answers.

4.2. Metrics

Table 1
Naming and classification of metrics shown in the experimental evaluation

Acronym | Name - Framework | Type
BEM | BEM score - TensorFlow | GT-based
AR TruLens | Answer Relevance - TruLens | GT-free
AR RAGAS | Answer Relevance - RAGAS | GT-free
AC RAGAS | Answer Correctness - RAGAS | GT-based

In this paper, we focus on evaluating the quality of the answers generated by the entire pipeline.

In our analysis, we considered the BEM score (BERT matching score) [16], which in our experiments proved the most satisfactory of the classic metrics. It uses a BERT model [18] trained to solve an answer equivalence task: a classifier determines whether two given answers are equivalent and returns an equivalence score. We use the BEM variation, Answers and questions, that exploits the two answers and the question as model input; this variation is reported to perform better [16].

In addition, we considered novel LLM-derived metrics developed in the RAGAS [6] and TruLens (https://www.trulens.org/) systems. These metrics offer both ground truth-based and reference-free evaluations. In particular, from RAGAS we used the two main metrics that focus on answers: Answer Correctness and Answer Relevance. More in detail: (i) Answer Correctness (https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html): this metric measures the factual correctness of an answer and needs the presence of a ground truth. It employs an LLM to extract factual statements from both the predicted answer and the ground truth, labeling them as True Positives if they are present in both answers, False Negatives if they are present only in the ground
truth, and False Positives if they are present only in the prediction. A final F1 score is then calculated; this score, in the range (0, 1), is the Answer Correctness. (ii) Answer Relevance (https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html): this metric measures how pertinent the generated answer is to the prompt given to the LLM in the generation step. It computes a score in the range (0, 1) as the mean of the cosine similarities between the original question and a set of artificial questions generated by an LLM on the basis of the predicted answer and the given context:

AnswerRelevance = (1/N) Σ_{i=1}^{N} cos_sim(E_q, E_{q_i})

where E_q is the embedding of the original question and E_{q_i} is the embedding of the i-th generated question. From TruLens we used the implemented Answer Relevance metric, which prompts an LLM to evaluate the relevance of the answer with respect to an input prompt that includes the context and the question. The score that the LLM assigns to each answer is in the range (0, 1).

To study the interrelationships and relative effectiveness among various evaluation metrics, we exploit the Spearman correlation coefficient. The Spearman Rank Correlation [19] is a non-parametric measure that assesses the statistical dependence between the rankings of two variables. It tells how well the relationship between these variables can be described using a monotonic function. This measure is computed on ranked data, allowing for the analysis of both ordinal variables and continuous variables that have been converted into ranks. The Spearman Rank Correlation coefficient is denoted by ρ, and its value ranges from -1 to 1 inclusive, where 1 indicates perfect positive correlation, 0 indicates no correlation, and -1 indicates perfect negative correlation.

4.3. Settings

For this implementation, we employed OpenAI models for the embedding, retrieval, and generation stages of the RAG and to implement evaluations with RAGAS and TruLens. The Ingestion step produced chunks of 1024 characters, balancing semantic integrity with avoiding irrelevant or redundant information: larger chunks may capture more context but increase noise, while smaller sizes may sacrifice contextual information. These chunks were embedded using OpenAI's text-embedding-ada-002 (https://openai.com/blog/new-and-improved-embedding-model), a state-of-the-art transformer model for generating high-quality text embeddings. For retrieval, the system identified within the vector store the 10 previously indexed chunks whose embeddings are most similar to the query embedding. During generation, we employed the GPT-4-Turbo model (https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo) with the following prompt structure:

You are a chatbot having a conversation with a human.
Given the following extracted parts of a long document and a question, create a final answer.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {CONTEXT}
Chat history: {CHAT_HISTORY}
Human: {HUMAN_INPUT}
Chatbot:

This prompt provided the model with instructions and context, and encouraged concise, truthful answers without fabrication.

4.4. Results

For both the books and movies subsamples from the NarrativeQA dataset, as can be seen in Table 2 and Table 3, human judgment shows a moderately strong Spearman correlation with BEM (0.735 and 0.704) and with the AC RAGAS scores for both GPT-3.5-turbo (0.718 and 0.792) and GPT-4-turbo (0.670 and 0.781). This indicates that these ground truth-based metrics are more aligned with human perception of answer quality. Reference-free metrics show poor correlation with human judgment, especially AR RAGAS (0.234 and 0.483), highlighting the fact that evaluating an answer without ground truth is still a challenging problem for Large Language Models. The analysis of the FinAM-it dataset, shown in Table 4, reveals generally lower correlations across all metrics, with the highest correlation observed between human judgment and AC RAGAS gpt-4-turbo (0.531). This could be related to the fact that the FinAM-it dataset presents more challenging and diverse content that is more difficult to evaluate. Extending the analysis to all the datasets at once (Table 5), it can be seen that all the metrics still have difficulty approximating the human evaluation in a robust and reliable way.

5. Conclusion

Our exploration into evaluating Retrieval Augmented Generation (RAG) systems via ground truth-based and reference-free metrics was driven by the need for reliable evaluation frameworks, particularly for scenarios lacking ground truth data. Our evaluation framework's implementation has demonstrated its potential for facilitating a more comprehensive understanding of these systems' capabilities in such situations. Through rigorous experimentation across different domains and datasets, including NarrativeQA and a specialized industrial dataset, we
Table 2
Spearman correlations on NarrativeQA books subsample
Metrics | Human Judgement | BEM | AR TruLens gpt-3.5-turbo | AR RAGAS gpt-3.5-turbo | AC RAGAS gpt-3.5-turbo | AR TruLens gpt-4-turbo | AR RAGAS gpt-4-turbo | AC RAGAS gpt-4-turbo
Human Judgement 1.000 0.735 0.436 0.234 0.718 0.420 0.150 0.670
BEM 0.735 1.000 0.185 0.224 0.740 0.405 -0.026 0.713
AR TruLens gpt-3.5-turbo 0.436 0.185 1.000 0.197 0.274 0.477 0.178 0.224
AR RAGAS gpt-3.5-turbo 0.234 0.224 0.197 1.000 0.129 0.156 0.633 0.121
AC RAGAS gpt-3.5-turbo 0.718 0.740 0.274 0.129 1.000 0.238 0.093 0.854
AR TruLens gpt-4-turbo 0.420 0.405 0.477 0.156 0.238 1.000 0.122 0.108
AR RAGAS gpt-4-turbo 0.150 -0.026 0.178 0.633 0.093 0.122 1.000 0.097
AC RAGAS gpt-4-turbo 0.670 0.713 0.224 0.121 0.854 0.108 0.097 1.000
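Coefficients like those in Table 2 follow the standard Spearman definition recalled in Section 4.2: rank both score lists (averaging ranks for ties), then apply the Pearson formula to the ranks. A minimal from-scratch sketch is shown below; in practice a library routine such as scipy.stats.spearmanr would be used instead.

```python
# Assign 1-based ranks, averaging the ranks of tied values.
def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

# Spearman rho = Pearson correlation computed on the ranks.
def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Monotonically increasing metric scores correlate perfectly with the
# human Likert scores; reversing the metric flips the sign.
human = [1, 2, 3, 4, 5]                # hypothetical Likert judgments
metric = [0.1, 0.35, 0.4, 0.8, 0.9]    # hypothetical metric scores
rho = spearman(human, metric)
```

Because only ranks matter, the coefficient is insensitive to the different scales of the Likert judgments and the (0, 1) metric scores.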
Table 3
Spearman correlations on NarrativeQA movies subsample
Metrics | Human Judgement | BEM | AR TruLens gpt-3.5-turbo | AR RAGAS gpt-3.5-turbo | AC RAGAS gpt-3.5-turbo | AR TruLens gpt-4-turbo | AR RAGAS gpt-4-turbo | AC RAGAS gpt-4-turbo
Human Judgement 1.000 0.704 0.565 0.483 0.792 0.213 0.411 0.781
BEM 0.704 1.000 0.522 0.428 0.752 0.235 0.358 0.746
AR TruLens gpt-3.5-turbo 0.565 0.522 1.000 0.390 0.476 0.270 0.422 0.473
AR RAGAS gpt-3.5-turbo 0.483 0.428 0.390 1.000 0.403 0.406 0.738 0.421
AC RAGAS gpt-3.5-turbo 0.792 0.752 0.476 0.403 1.000 0.228 0.358 0.977
AR TruLens gpt-4-turbo 0.213 0.235 0.270 0.406 0.228 1.000 0.456 0.200
AR RAGAS gpt-4-turbo 0.411 0.358 0.422 0.738 0.358 0.456 1.000 0.379
AC RAGAS gpt-4-turbo 0.781 0.746 0.473 0.421 0.977 0.200 0.379 1.000
compared various evaluation methodologies against human judgment. While ground truth-based metrics like BEM and AC RAGAS showed moderate to strong correlation with human judgments across different domains and models, reference-free metrics still face significant challenges in achieving similar correlation levels. This highlights the current limitations of automated metrics in capturing nuanced aspects of human judgment, suggesting an urgent need for further refinement of reference-free evaluation methods. The Spearman correlation analysis reveals that while some metrics align more closely with human assessments, there is still significant room for improvement, especially for more challenging and diverse content like the FinAM-it dataset. These findings under-
Table 4
Spearman correlations on FinAM-it dataset
Metrics | Human Judgement | BEM | AR TruLens gpt-3.5-turbo | AR RAGAS gpt-3.5-turbo | AC RAGAS gpt-3.5-turbo | AR TruLens gpt-4-turbo | AR RAGAS gpt-4-turbo | AC RAGAS gpt-4-turbo
Human Judgement 1.000 0.208 0.178 0.153 0.053 0.280 0.230 0.531
BEM 0.208 1.000 0.214 0.209 0.276 0.001 0.203 0.278
AR TruLens gpt-3.5-turbo 0.178 0.214 1.000 0.412 0.433 0.181 0.446 0.300
AR RAGAS gpt-3.5-turbo 0.153 0.209 0.412 1.000 0.463 -0.191 0.608 0.130
AC RAGAS gpt-3.5-turbo 0.053 0.276 0.433 0.463 1.000 -0.099 0.243 0.255
AR TruLens gpt-4-turbo 0.280 0.001 0.181 -0.191 -0.099 1.000 -0.009 0.245
AR RAGAS gpt-4-turbo 0.230 0.203 0.446 0.608 0.243 -0.009 1.000 0.157
AC RAGAS gpt-4-turbo 0.531 0.278 0.300 0.130 0.255 0.245 0.157 1.000
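The RAGAS Answer Relevance scores compared in these tables follow the mean-cosine formula of Section 4.2. A toy sketch of that computation is given below; the bag-of-words `embed` helper is a hypothetical stand-in for the LLM embedding model, and the generated questions are supplied by hand rather than by an LLM.

```python
from collections import Counter
from math import sqrt

def embed(text):  # toy bag-of-words stand-in for the embedding model
    return Counter(text.lower().split())

def cos_sim(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_relevance(original_question, generated_questions):
    """Mean cosine similarity between the original question and the
    questions generated back from the predicted answer (RAGAS-style)."""
    e_q = embed(original_question)
    sims = [cos_sim(e_q, embed(g)) for g in generated_questions]
    return sum(sims) / len(sims)

score = answer_relevance(
    "what is the capital of france",
    ["what is the capital of france",        # faithful reconstruction
     "which city is the capital of france"]  # paraphrase
)
```

The closer the regenerated questions are to the original, the closer the score is to 1, which is why an off-topic answer (yielding unrelated questions) is penalized.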
Table 5
Spearman correlations on all datasets
Metrics | Human Judgement | BEM | AR TruLens gpt-3.5-turbo | AR RAGAS gpt-3.5-turbo | AC RAGAS gpt-3.5-turbo | AR TruLens gpt-4-turbo | AR RAGAS gpt-4-turbo | AC RAGAS gpt-4-turbo
Human Judgement 1.000 0.627 0.423 0.323 0.536 0.314 0.287 0.653
BEM 0.627 1.000 0.310 0.266 0.654 0.249 0.155 0.711
AR TruLens gpt-3.5-turbo 0.423 0.310 1.000 0.346 0.303 0.302 0.375 0.302
AR RAGAS gpt-3.5-turbo 0.323 0.266 0.346 1.000 0.213 0.201 0.682 0.198
AC RAGAS gpt-3.5-turbo 0.536 0.654 0.303 0.213 1.000 0.208 0.139 0.813
AR TruLens gpt-4-turbo 0.314 0.249 0.302 0.201 0.208 1.000 0.250 0.187
AR RAGAS gpt-4-turbo 0.287 0.155 0.375 0.682 0.139 0.250 1.000 0.169
AC RAGAS gpt-4-turbo 0.653 0.711 0.302 0.198 0.813 0.187 0.169 1.000
score the complexity of accurately evaluating RAG systems and the importance of considering domain-specific factors in metric development and selection. The observed limitations can have practical consequences, such as inaccurate system performance assessments, leading to suboptimal deployment decisions and reduced user satisfaction. Looking forward, our study emphasizes developing more nuanced and sophisticated evaluation frameworks that can better approximate human judgment. This entails improving existing metrics' accuracy and reliability and exploring new methodologies to effectively capture qualitative aspects of generated answers.

While our evaluation framework provides valuable insights, we acknowledge several limitations: (i) Current reference-free metrics still struggle to match human judgment, necessitating further refinement. (ii) Metric performance suffers for challenging, domain-specific datasets, highlighting the need for domain-aware or adaptive approaches. (iii) Our analysis covered a subset of available metrics; exploring a wider range, including leveraging advanced LLMs and additional context, is needed. (iv) Results should be validated across different RAG configurations and domains for broader applicability. (v) Despite rigorous human evaluation, inherent subjectivity and potential biases may have impacted findings. We view these limitations as opportunities to contribute to developing more reliable, accurate, and human-like evaluation frameworks that can drive advancements in natural language processing capabilities and the realization of highly effective RAG systems across diverse domains.

References

[1] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented language model pre-training, in: ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3929-3938. URL: http://proceedings.mlr.press/v119/guu20a.html.
[2] O. Khattab, C. Potts, M. Zaharia, Relevance-guided supervision for OpenQA with ColBERT, 2021. arXiv:2007.00814.
[3] K. Shuster, S. Poff, M. Chen, D. Kiela, J. Weston, Retrieval augmentation reduces hallucination in conversation, 2021. arXiv:2104.07567.
[4] S. Huo, N. Arabzadeh, C. Clarke, Retrieving supporting evidence for generative question answering, in: SIGIR-AP, ACM, 2023, pp. 11-20. URL: http://dx.doi.org/10.1145/3624918.3625336. doi:10.1145/3624918.3625336.
[5] T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, J. E. Gonzalez, RAFT: Adapting language model to domain specific RAG, 2024. arXiv:2403.10131.
[6] S. Es, J. James, L. Espinosa-Anke, S. Schockaert, RAGAS: Automated evaluation of retrieval augmented generation, 2023. arXiv:2309.15217.
[7] Y. Tang, Y. Yang, MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries, 2024. arXiv:2401.15391.
[8] M. Gao, X. Hu, J. Ruan, X. Pu, X. Wan, LLM-based NLG evaluation: Current status and challenges, 2024. arXiv:2402.01383.
[9] Z. Zhang, M. Fang, L. Chen, RetrievalQA: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering, 2024. arXiv:2402.16457.
[10] V. Katranidis, G. Barany, FaaF: Facts as a function for the evaluation of RAG systems, 2024. arXiv:2403.03888.
[11] J. Saad-Falcon, O. Khattab, C. Potts, M. Zaharia, ARES: An automated evaluation framework for retrieval-augmented generation systems, 2024. arXiv:2311.09476.
[12] C.-Y. Lin, E. Hovy, Automatic evaluation of summaries using n-gram co-occurrence statistics, in: Human Language Technology Conference of the North American Chapter of the ACL, 2003, pp. 150-157. URL: https://aclanthology.org/N03-1020.
[13] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, ACL, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, K. Duh, X. Carreras (Eds.), EMNLP, ACL, Austin, Texas, 2016, pp. 2383-2392. URL: https://aclanthology.org/D16-1264. doi:10.18653/v1/D16-1264.
[15] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, 2020. arXiv:1904.09675.
[16] J. Bulian, C. Buck, W. Gajewski, B. Boerschinger, T. Schuster, Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation, 2022. arXiv:2202.07654.
[17] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, 2017. arXiv:1712.07040.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[19] K. F. Weaver, V. Morales, S. L. Dunn, K. Godde, P. F. Weaver, Pearson's and Spearman's Correlation, John Wiley and Sons, Ltd, 2017, pp. 435-471. doi:10.1002/9781119454205.ch10.