Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models

Ermelinda Oro1,2,*, Francesco Maria Granata2, Antonio Lanza2, Amir Bachir2, Luca De Grandis2 and Massimo Ruffolo1,2

1 National Research Council, Institute for High Performance Computing and Networking, via P. Bucci 8/9C, Rende (CS), 87036, Italy
2 Altilia srl, TechNest Start-up Incubator of University of Calabria, Piazza Vermicelli, Rende (CS), 87036, Italy

Abstract
We present a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems designed for question-answering tasks using large language models (LLMs). The proposed framework integrates document ingestion, information retrieval, answer generation, and evaluation phases. Both ground truth-based and reference-free evaluation metrics are implemented to provide a multi-faceted assessment approach. Through experiments across diverse datasets, including NarrativeQA and a proprietary financial dataset (FinAM-it), the reliability of existing metrics is investigated by comparing them against rigorous human evaluations. The results demonstrate that ground truth-based metrics such as BEM and RAGAS Answer Correctness exhibit a moderately strong correlation with human judgments. However, reference-free metrics still struggle to accurately capture nuances in answer quality without predefined correct responses. An in-depth analysis of Spearman correlation coefficients sheds light on the interrelationships and relative effectiveness of various evaluation approaches across multiple domains. While highlighting the current limitations of reference-free methodologies, the study underscores the need for more sophisticated techniques to better approximate human perception of answer relevance and correctness.
Overall, this research contributes to ongoing efforts in developing reliable evaluation frameworks for RAG systems, paving the way for advancements in natural language processing and the realization of highly accurate and human-like AI systems.

Keywords
Retrieval Augmented Generation (RAG), Question Answering (QA), Retrieval, Large Language Model (LLM), Evaluation

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
$ ermelinda.oro@icar.cnr.it (E. Oro); francesco.granata@altiliagroup.com (F. M. Granata); antonio.lanza@altiliagroup.com (A. Lanza); amir.bachir@altiliagroup.com (A. Bachir); luca.degrandis@altiliagroup.com (L. D. Grandis); massimo.ruffolo@altiliagroup.com (M. Ruffolo)
ORCID: 0000-0002-5529-1007 (E. Oro); 0000-0003-4425-753X (F. M. Granata); 0000-0002-2875-4133 (L. D. Grandis); 0000-0002-4094-4810 (M. Ruffolo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Retrieval-Augmented Generation (RAG) systems, which integrate information retrieval with natural language generation, have shown promise in enhancing language models' capabilities. However, evaluating their performance remains challenging, particularly when ground truth data is unavailable, impeding accurate assessments of system utility. To address this challenge, we present a comprehensive framework designed to facilitate the rigorous evaluation of RAG systems for question-answering tasks. Our framework integrates document ingestion, retrieval, generation, and evaluation phases, leveraging state-of-the-art technologies to optimize accuracy and relevance. We implement both ground truth-based and reference-free evaluation metrics, providing a multi-faceted approach to assessing system outputs. Through an extensive series of experiments spanning diverse domains and datasets, we investigate the reliability and validity of existing evaluation methodologies. Specifically, we examine the correlation between various metrics and rigorous human evaluations, shedding light on their strengths, limitations, and potential for improvement. Our findings reveal that while ground truth-based metrics like BEM and RAGAS Answer Correctness exhibit moderate alignment with human judgments, reference-free metrics still struggle to accurately capture answer quality nuances without predefined correct responses. By analyzing Spearman correlation coefficients, we elucidate the interrelationships and relative effectiveness of different evaluation approaches across multiple domains.

This research makes the following key contributions: (i) presenting a comprehensive framework for evaluating RAG systems with state-of-the-art components, (ii) implementing and comparing diverse ground truth-based and reference-free evaluation metrics, (iii) conducting rigorous experiments across multiple datasets to assess metric reliability against human judgments, and (iv) analyzing the strengths and limitations of existing metrics, highlighting the need for advanced reference-free evaluation techniques that better approximate human perception.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 presents the method. Section 4 shows the experimental evaluation and Section 5 concludes the work.
2. Related Work

RAG systems have been implemented in various forms [1, 2, 3, 4, 5], incorporating advanced strategies like document splitting, chunking, retrieval, and diverse models for embedding and language generation, including proprietary and open-source models from platforms like HuggingFace¹. We have also explored different variants of RAG systems; however, this paper's primary focus is not to introduce a novel RAG system or methodology but to comprehensively evaluate the effectiveness of Large Language Model (LLM)-derived metrics, emphasizing reference-free approaches.

Several prior works have proposed frameworks and novel metrics that leverage the capabilities of LLMs [6, 7, 8, 9, 10, 11]. Unlike these existing solutions, which aim to score different RAG systems or propose new evaluation methods, metrics, or datasets, our research is specifically targeted at evaluating the potential satisfaction of end-user customers who receive the evaluation scores generated by such systems.

By concentrating on the practical utility and interpretability of evaluation metrics from the perspective of end-users, our study diverges from the conventional approach of optimizing technical performance alone. Instead, we strive to bridge the gap between state-of-the-art evaluation techniques and the real-world expectations of customers who rely on these systems for decision-making and information retrieval.

1 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

3. Method

3.1. Framework for RAG and evaluation

This paper introduces a framework for running and evaluating a RAG system that efficiently processes and responds to natural language queries. The system integrates state-of-the-art technologies to enhance answer accuracy and relevance. The process is segmented into four main phases. Ingestion: Input documents are processed into manageable chunks, leveraging techniques like document layout analysis for PDFs. The chunks are embedded into high-dimensional vectors capturing their semantic essence and ingested into a vector store for efficient similarity search. Retrieval: Upon receiving a query, its vector form undergoes similarity search in the vector store to identify the k most relevant chunks. This narrows down the information to the most pertinent chunks for answer generation. Generation: A Large Language Model (LLM) synthesizes information from the retrieved chunks to construct a coherent and natural-sounding answer to the query. Evaluation: A two-sided approach employs both ground-truth dependent and independent metrics. Ground-truth dependent metrics assess correctness against predefined answers, while ground-truth independent metrics evaluate answer relevance without a predefined set. This dual approach enables a comprehensive assessment of performance, correctness, and overall text quality. The system can receive human evaluations of question-answer pairs to evaluate metric reliability and alignment with expectations.

Figure 1: A simplified view of the implemented RAG system.
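The four phases of the pipeline can be sketched end to end. The following is a minimal illustration, not the paper's implementation: a toy bag-of-words embedder and an in-memory list stand in for the production embedding model and vector store, and all function names are ours.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 200) -> list[str]:
    # Ingestion: split a document into fixed-size character chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call a
    # transformer embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, store: list[tuple[Counter, str]], k: int = 3) -> list[str]:
    # Retrieval: rank indexed chunks by similarity to the query vector.
    q = embed(query)
    ranked = sorted(store, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Ingestion: index a document.
doc = ("RAG systems retrieve relevant chunks. Chunks are embedded as vectors. "
       "An LLM generates the final answer from retrieved context.")
store = [(embed(c), c) for c in chunk(doc, size=45)]

# Retrieval, followed by a (stubbed) Generation prompt for the LLM.
context = retrieve("how is the answer generated?", store, k=2)
prompt = f"Context: {' '.join(context)}\nQuestion: how is the answer generated?"
```

The parameter k controls the retrieval/noise trade-off mentioned above: a larger k passes more context to the generator but risks diluting the relevant passages.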
3.2. Evaluation Strategies

In our RAG system, we implemented and tested a wide range of evaluation metrics. Specifically, our system incorporates metrics for assessing individual RAG components like Information Retrieval (IR) and Answer Generation, as well as the overall pipeline. For IR, we used classical metrics such as Recall@K, Precision@K, mAP, MRR, and nDCG. For answer generation, the implemented metrics were divided into two categories. Syntactic metrics evaluate formal response aspects, including BLEU [12], ROUGE [13], Precision, Recall, F1, and Exact Match [14]; these focus on text properties rather than semantic meaning. Semantic metrics evaluate response meaning, including the BERT score [15] and the BEM score [16]; BEM is preferred over the BERT score due to its reported correlation with human evaluations and our empirical findings. LLM-derived metrics: We implemented in our framework the RAG triad of metrics for the three main steps of a RAG execution [6]: (i) Context relevance, which assesses whether the returned passage is relevant for answering the given query. (ii) Groundedness, which assesses whether the generated answer is faithful to the retrieved passage or contains hallucinated or extrapolated statements beyond the passage. (iii) Answer relevance, which assesses whether the generated answer is relevant given the query and retrieved passage. In addition, we implemented Answer correctness, which exploits LLMs and gold answers to measure the factual correctness of an answer. In this paper, only a subset of metrics is considered and compared for assessing the quality of the answers (see Section 4.2).

Manual evaluation. To verify the reliability of automated evaluation metrics, we implemented a rigorous manual evaluation process to assess the relevance, accuracy, and coherence of the answers generated by our RAG system. This manual evaluation was conducted by three independent human annotators, each with expertise in the domain of the questions posed to the system. For each evaluation session, the annotators were presented with the question, the corresponding answer generated by the RAG system, and the ground truth provided by the original dataset or the customer answers. The primary task for each annotator was to assess the quality of the generated answer in relation to the posed question, employing a discrete 5-point Likert scale. The criteria for scoring were as follows: 1. Very Poor: the generated answer is totally incorrect or irrelevant to the question, indicating a failure of the system to comprehend the query or retrieve pertinent information. 2. Poor: the generated answer is predominantly incorrect but shows glimpses of relevance, suggesting some level of understanding or appropriate retrieval. 3. Neither: the generated answer mixes relevant and irrelevant information almost equally, showcasing the system's partial success in addressing the query. 4. Good: the generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding of and response to the question. 5. Very Good: reserved for answers that are completely correct and fully relevant, reflecting an ideal outcome where the system accurately understood and responded to the query. The annotators conducted their assessments independently to ensure unbiased evaluations. Upon completion, the scores for each question-answer pair were collected and compared. In cases of discrepancy, a consensus discussion was initiated among the annotators to agree on the most accurate score. This consensus process mitigated individual bias and brought different perspectives into the evaluation of the generated answers. The manual evaluation particularly helps in assessing the reliability and validity of our system's automated evaluation metrics: by comparing the human-generated scores against the results produced by these automated measures, we can determine the extent to which the automatic metrics accurately reflect human judgment and perception of answer quality.
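The collection step of such an annotation protocol is mechanical even though the consensus discussion itself is human. As a small illustrative sketch (the function name and data layout are ours, not the paper's tooling), one can flag the question-answer pairs whose three Likert scores disagree and therefore need discussion:

```python
def flag_discrepancies(scores: dict[str, tuple[int, int, int]]) -> list[str]:
    """Return the ids of QA pairs whose three annotator scores
    (on the 1-5 Likert scale) are not unanimous and therefore
    require a consensus discussion."""
    for qa_id, triple in scores.items():
        for s in triple:
            assert 1 <= s <= 5, f"invalid Likert score for {qa_id}"
    return [qa_id for qa_id, triple in scores.items() if len(set(triple)) > 1]

annotations = {
    "q1": (5, 5, 5),   # unanimous: kept as-is
    "q2": (4, 3, 4),   # disagreement: resolved by discussion
    "q3": (1, 2, 1),
}
to_discuss = flag_discrepancies(annotations)
# → ["q2", "q3"]
```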
4. Experiments

Considering different domains (Section 4.1), we investigate the reliability of a subset of existing metrics (Section 4.2) for evaluating a RAG system (Section 3.1). We explore the feasibility of adopting reference-free metrics and the correlation among them and the human evaluation (Section 3.2).

4.1. Datasets

NarrativeQA - English. A subsample of the NarrativeQA dataset [17] was used, with 50 book-related and 50 movie script-related questions (1% of the test set), spanning 41 unique books and 42 unique movie scripts. This allowed evaluating the RAG system's performance across two distinct narrative content types.

Financial Asset Management - Italian. The FinAM-it dataset, created by Altilia, consists of 50 question-answer pairs from Italian asset management documents on topics like investment strategies, risk management, and regulatory compliance. The questions are complex and diverse, often requiring information from multiple paragraphs, with detailed, conversational-style answers.

4.2. Metrics

Table 1
Naming and classification of metrics shown in the experimental evaluation

Acronym      Name - Framework                 Type
BEM          BEM score - TensorFlow           GT-based
AR TruLens   Answer Relevance - TruLens       GT-free
AR RAGAS     Answer Relevance - RAGAS         GT-free
AC RAGAS     Answer Correctness - RAGAS       GT-based

In this paper we focus on evaluating the quality of the answers generated by the entire pipeline.

In our analysis, we considered the BEM score (BERT matching score) [16], which we found to be the most satisfactory among the classical metrics. It uses a BERT model [18] trained to solve an answer equivalence task: a classifier is trained to decide whether two given answers are equivalent and to return an equivalence score. We use the Answers and questions variation of the BEM score, which takes the two answers and the question as model input; this variation has been reported to perform better [16].

In addition, we considered novel LLM-derived metrics developed in the RAGAS [6] and TruLens² systems. These metrics offer both ground truth-based and reference-free evaluations. In particular, from RAGAS we used the two main metrics that focus on answers: Answer Correctness and Answer Relevance. More in detail: (i) Answer Correctness³: this metric measures the factual correctness of an answer and requires the presence of a ground truth. It employs an LLM to extract factual statements from both the predicted answer and the ground truth, labeling them as True Positives if present in both answers, False Negatives if present only in the ground truth, and False Positives if present only in the prediction. A final F1 score is then calculated; this score, in the range (0, 1), is the Answer Correctness. (ii) Answer Relevance⁴: this metric measures how pertinent the generated answer is to the prompt given to the LLM in the generation step. It computes a score in the range (0, 1) as the mean of the cosine similarities between the original question and a set of artificial questions generated by an LLM on the basis of the predicted answer and the given context:

    AnswerRelevance = (1/N) Σ_{i=1}^{N} cosine(E_o, E_{g_i})

where E_o is the embedding of the original question and E_{g_i} is the embedding of the i-th generated question. From TruLens we used the implemented Answer Relevance metric, which prompts an LLM to evaluate the relevance of the answer with respect to an input prompt that includes context and question. The score that the LLM assigns to each answer is in the range (0, 1).

2 https://www.trulens.org/
3 https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html
4 https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html
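Both RAGAS-style scores reduce to simple arithmetic once the LLM has produced its intermediate outputs. The sketch below assumes the statement extraction and question generation have already happened; the statement sets and embeddings are toy stand-ins, and the function names are ours rather than the RAGAS API.

```python
import math

def answer_correctness(pred_statements: set[str], truth_statements: set[str]) -> float:
    # F1 over factual statements: TP appear in both, FP only in the
    # prediction, FN only in the ground truth.
    tp = len(pred_statements & truth_statements)
    fp = len(pred_statements - truth_statements)
    fn = len(truth_statements - pred_statements)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer_relevance(e_original: list[float], e_generated: list[list[float]]) -> float:
    # Mean cosine similarity between the original question embedding
    # and the embeddings of the N LLM-generated questions.
    return sum(cosine(e_original, e) for e in e_generated) / len(e_generated)

# Toy example: 2 shared statements, 1 extra on each side -> F1 = 2/3.
ac = answer_correctness({"s1", "s2", "s3"}, {"s1", "s2", "s4"})
```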
To study the interrelationships and relative effectiveness of the various evaluation metrics, we exploit the Spearman correlation coefficient. The Spearman rank correlation [19] is a non-parametric measure that assesses the statistical dependence between the rankings of two variables: it tells how well the relationship between these variables can be described using a monotonic function. This measure is computed on ranked data, allowing for the analysis of both ordinal variables and continuous variables that have been converted into ranks. The Spearman rank correlation coefficient is denoted by ρ, and its value ranges from −1 to 1 inclusive, where 1 indicates perfect positive correlation, 0 indicates no correlation, and −1 indicates perfect negative correlation.

4.3. Settings

For this implementation, we employed OpenAI models for the embedding, retrieval, and generation stages of the RAG and to implement the evaluations with RAGAS and TruLens. The Ingestion step produced chunks of 1024 characters, balancing semantic integrity with avoiding irrelevant or redundant information: larger chunks may capture more context but increase noise, while smaller sizes may sacrifice contextual information. These chunks were embedded using OpenAI's text-embedding-ada-002⁵, a state-of-the-art transformer model for generating high-quality text embeddings. For retrieval within the vector store, the system identified the 10 embeddings most similar to the query among the previously indexed chunks. During generation, we employed the GPT-4-Turbo model⁶ with the following prompt structure:

    You are a chatbot having a conversation with a human.
    Given the following extracted parts of a long document and a question, create a final answer.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Context: {CONTEXT}
    Chat history: {CHAT_HISTORY}
    Human: {HUMAN_INPUT}
    Chatbot:

This prompt provided the model with instructions and context, and encouraged concise, truthful answers without fabrication.

5 https://openai.com/blog/new-and-improved-embedding-model
6 https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo

4.4. Results

For both the books and movies subsamples of the NarrativeQA dataset, as can be seen in Table 2 and Table 3, human judgment shows a moderately strong Spearman correlation with BEM (0.735 and 0.704) and with the AC RAGAS scores for both GPT-3.5-turbo (0.718 and 0.792) and GPT-4-turbo (0.670 and 0.781). This indicates that these ground truth-based metrics are more aligned with human perception of answer quality. Reference-free metrics show poor correlation with human judgment, especially AR RAGAS (0.234 and 0.483), highlighting the fact that evaluating an answer without ground truth is still a challenging problem for Large Language Models. The analysis of the FinAM-it dataset, as can be seen in Table 4, shows generally lower correlations across all metrics, with the highest correlation observed between human judgment and AC RAGAS gpt-4-turbo (0.531). This could be related to the fact that the FinAM-it dataset presents more challenging and diverse content that is more difficult to evaluate. Extending the analysis to all the datasets at once (Table 5), it can be seen that all the metrics still have difficulties approximating the human evaluation in a robust and reliable way.
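The correlation analysis behind these tables is straightforward to reproduce: rank both score series and compute the Pearson correlation of the ranks. The sketch below hand-rolls this (equivalent to scipy.stats.spearmanr) on toy score vectors; the numbers are illustrative, not the paper's data.

```python
def rankdata(xs):
    # Rank values 1..n, averaging ranks over ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank vectors.
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy per-question scores: human Likert judgments vs. one automated metric.
human  = [5, 4, 2, 5, 1, 3, 4, 2]
metric = [0.9, 0.7, 0.3, 0.8, 0.1, 0.5, 0.6, 0.4]
rho = spearman(human, metric)  # high positive: the metric ranks answers like the humans
```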
Table 2
Spearman correlations on NarrativeQA books subsample (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.735    0.436    0.234    0.718    0.420    0.150    0.670
BEM               0.735    1.000    0.185    0.224    0.740    0.405    -0.026   0.713
AR TruLens 3.5    0.436    0.185    1.000    0.197    0.274    0.477    0.178    0.224
AR RAGAS 3.5      0.234    0.224    0.197    1.000    0.129    0.156    0.633    0.121
AC RAGAS 3.5      0.718    0.740    0.274    0.129    1.000    0.238    0.093    0.854
AR TruLens 4      0.420    0.405    0.477    0.156    0.238    1.000    0.122    0.108
AR RAGAS 4        0.150    -0.026   0.178    0.633    0.093    0.122    1.000    0.097
AC RAGAS 4        0.670    0.713    0.224    0.121    0.854    0.108    0.097    1.000

Table 3
Spearman correlations on NarrativeQA movies subsample (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.704    0.565    0.483    0.792    0.213    0.411    0.781
BEM               0.704    1.000    0.522    0.428    0.752    0.235    0.358    0.746
AR TruLens 3.5    0.565    0.522    1.000    0.390    0.476    0.270    0.422    0.473
AR RAGAS 3.5      0.483    0.428    0.390    1.000    0.403    0.406    0.738    0.421
AC RAGAS 3.5      0.792    0.752    0.476    0.403    1.000    0.228    0.358    0.977
AR TruLens 4      0.213    0.235    0.270    0.406    0.228    1.000    0.456    0.200
AR RAGAS 4        0.411    0.358    0.422    0.738    0.358    0.456    1.000    0.379
AC RAGAS 4        0.781    0.746    0.473    0.421    0.977    0.200    0.379    1.000

Table 4
Spearman correlations on FinAM-it dataset (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.208    0.178    0.153    0.053    0.280    0.230    0.531
BEM               0.208    1.000    0.214    0.209    0.276    0.001    0.203    0.278
AR TruLens 3.5    0.178    0.214    1.000    0.412    0.433    0.181    0.446    0.300
AR RAGAS 3.5      0.153    0.209    0.412    1.000    0.463    -0.191   0.608    0.130
AC RAGAS 3.5      0.053    0.276    0.433    0.463    1.000    -0.099   0.243    0.255
AR TruLens 4      0.280    0.001    0.181    -0.191   -0.099   1.000    -0.009   0.245
AR RAGAS 4        0.230    0.203    0.446    0.608    0.243    -0.009   1.000    0.157
AC RAGAS 4        0.531    0.278    0.300    0.130    0.255    0.245    0.157    1.000

Table 5
Spearman correlations on all datasets (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.627    0.423    0.323    0.536    0.314    0.287    0.653
BEM               0.627    1.000    0.310    0.266    0.654    0.249    0.155    0.711
AR TruLens 3.5    0.423    0.310    1.000    0.346    0.303    0.302    0.375    0.302
AR RAGAS 3.5      0.323    0.266    0.346    1.000    0.213    0.201    0.682    0.198
AC RAGAS 3.5      0.536    0.654    0.303    0.213    1.000    0.208    0.139    0.813
AR TruLens 4      0.314    0.249    0.302    0.201    0.208    1.000    0.250    0.187
AR RAGAS 4        0.287    0.155    0.375    0.682    0.139    0.250    1.000    0.169
AC RAGAS 4        0.653    0.711    0.302    0.198    0.813    0.187    0.169    1.000

5. Conclusion

Our exploration into evaluating Retrieval Augmented Generation (RAG) systems via ground truth-based and reference-free metrics was driven by the need for reliable evaluation frameworks, particularly for scenarios lacking ground truth data. Our evaluation framework's implementation has demonstrated its potential for facilitating a more comprehensive understanding of these systems' capabilities in such situations. Through rigorous experimentation across different domains and datasets, including NarrativeQA and a specialized industrial dataset, we compared various evaluation methodologies against human judgment. While ground truth-based metrics like BEM and AC RAGAS showed moderate to strong correlation with human judgments across different domains and models, reference-free metrics still face significant challenges in achieving similar correlation levels. This highlights the current limitations of automated metrics in capturing nuanced aspects of human judgment, suggesting an urgent need for further refinement of reference-free evaluation methods. The Spearman correlation analysis reveals that while some metrics align more closely with human assessments, there is still significant room for improvement, especially for more challenging and diverse content like the FinAM-it dataset. These findings underscore the complexity of accurately evaluating RAG systems and the importance of considering domain-specific factors in metric development and selection. The observed limitations can have practical consequences, such as inaccurate system performance assessments, leading to suboptimal deployment decisions and reduced user satisfaction. Looking forward, our study emphasizes developing more nuanced and sophisticated evaluation frameworks that can better approximate human judgment. This entails improving existing metrics' accuracy and reliability and exploring new methodologies to effectively capture qualitative aspects of generated answers.

While our evaluation framework provides valuable insights, we acknowledge several limitations: (i) current reference-free metrics still struggle to match human judgment, necessitating further refinement. (ii) Metric performance suffers for challenging, domain-specific datasets, highlighting the need for domain-aware or adaptive approaches. (iii) Our analysis covered a subset of available metrics; exploring a wider range, including leveraging advanced LLMs and additional context, is needed. (iv) Results should be validated across different RAG configurations and domains for broader applicability. (v) Despite rigorous human evaluation, inherent subjectivity and potential biases may have impacted the findings. We view these limitations as opportunities to contribute to developing more reliable, accurate, and human-like evaluation frameworks that can drive advancements in natural language processing capabilities and the realization of highly effective RAG systems across diverse domains.

References

[1] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented language model pre-training, in: ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3929-3938. URL: http://proceedings.mlr.press/v119/guu20a.html.
[2] O. Khattab, C. Potts, M. Zaharia, Relevance-guided supervision for OpenQA with ColBERT, 2021. arXiv:2007.00814.
[3] K. Shuster, S. Poff, M. Chen, D. Kiela, J. Weston, Retrieval augmentation reduces hallucination in conversation, 2021. arXiv:2104.07567.
[4] S. Huo, N. Arabzadeh, C. Clarke, Retrieving supporting evidence for generative question answering, in: SIGIR-AP, ACM, 2023, pp. 11-20. doi:10.1145/3624918.3625336.
[5] T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, J. E. Gonzalez, RAFT: Adapting language model to domain specific RAG, 2024. arXiv:2403.10131.
[6] S. Es, J. James, L. Espinosa-Anke, S. Schockaert, RAGAS: Automated evaluation of retrieval augmented generation, 2023. arXiv:2309.15217.
[7] Y. Tang, Y. Yang, MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries, 2024. arXiv:2401.15391.
[8] M. Gao, X. Hu, J. Ruan, X. Pu, X. Wan, LLM-based NLG evaluation: Current status and challenges, 2024. arXiv:2402.01383.
[9] Z. Zhang, M. Fang, L. Chen, RetrievalQA: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering, 2024. arXiv:2402.16457.
[10] V. Katranidis, G. Barany, FaaF: Facts as a function for the evaluation of RAG systems, 2024. arXiv:2403.03888.
[11] J. Saad-Falcon, O. Khattab, C. Potts, M. Zaharia, ARES: An automated evaluation framework for retrieval-augmented generation systems, 2024. arXiv:2311.09476.
[12] C.-Y. Lin, E. Hovy, Automatic evaluation of summaries using n-gram co-occurrence statistics, in: Human Language Technology Conference of the North American Chapter of the ACL, 2003, pp. 150-157. URL: https://aclanthology.org/N03-1020.
[13] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, ACL, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, K. Duh, X. Carreras (Eds.), EMNLP, ACL, Austin, Texas, 2016, pp. 2383-2392. URL: https://aclanthology.org/D16-1264. doi:10.18653/v1/D16-1264.
[15] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, 2020. arXiv:1904.09675.
[16] J. Bulian, C. Buck, W. Gajewski, B. Boerschinger, T. Schuster, Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation, 2022. arXiv:2202.07654.
[17] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, 2017. arXiv:1712.07040.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[19] K. F. Weaver, V. Morales, S. L. Dunn, K. Godde, P. F. Weaver, Pearson's and Spearman's Correlation, John Wiley and Sons, Ltd, 2017, pp. 435-471. doi:10.1002/9781119454205.ch10.