Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models

Ermelinda Oro1,2,*, Francesco Maria Granata2, Antonio Lanza2, Amir Bachir2, Luca De Grandis2 and Massimo Ruffolo1,2

1 National Research Council, Institute for High Performance Computing and Networking, via P. Bucci 8/9C, Rende (CS), 87036, Italy
2 Altilia srl, TechNest Start-up Incubator of University of Calabria, Piazza Vermicelli, Rende (CS), 87036, Italy

Abstract
We present a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems designed for question-answering tasks using large language models (LLMs). The proposed framework integrates document ingestion, information retrieval, answer generation, and evaluation phases. Both ground truth-based and reference-free evaluation metrics are implemented to provide a multi-faceted assessment approach. Through experiments across diverse datasets, including NarrativeQA and a proprietary financial dataset (FinAM-it), the reliability of existing metrics is investigated by comparing them against rigorous human evaluations. The results demonstrate that ground truth-based metrics such as BEM and RAGAS Answer Correctness exhibit a moderately strong correlation with human judgments. However, reference-free metrics still struggle to accurately capture nuances in answer quality without predefined correct responses. An in-depth analysis of Spearman correlation coefficients sheds light on the interrelationships and relative effectiveness of various evaluation approaches across multiple domains. While highlighting the current limitations of reference-free methodologies, the study underscores the need for more sophisticated techniques to better approximate human perception of answer relevance and correctness.
Overall, this research contributes to ongoing efforts in developing reliable evaluation frameworks for RAG systems, paving the way for advancements in natural language processing and the realization of highly accurate and human-like AI systems.

Keywords
Retrieval Augmented Generation (RAG), Question Answering (QA), Retrieval, Large Language Model (LLM), Evaluation

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
* Corresponding author.
$ ermelinda.oro@icar.cnr.it (E. Oro); francesco.granata@altiliagroup.com (F. M. Granata); antonio.lanza@altiliagroup.com (A. Lanza); amir.bachir@altiliagroup.com (A. Bachir); luca.degrandis@altiliagroup.com (L. D. Grandis); massimo.ruffolo@altiliagroup.com (M. Ruffolo)
ORCID: 0000-0002-5529-1007 (E. Oro); 0000-0003-4425-753X (F. M. Granata); 0000-0002-2875-4133 (L. D. Grandis); 0000-0002-4094-4810 (M. Ruffolo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Retrieval-Augmented Generation (RAG) systems, which integrate information retrieval with natural language generation, have shown promise in enhancing language models' capabilities. However, evaluating their performance remains challenging, particularly when ground truth data is unavailable, impeding accurate assessments of system utility. To address this challenge, we present a comprehensive framework designed to facilitate the rigorous evaluation of RAG systems for question-answering tasks. Our framework integrates document ingestion, retrieval, generation, and evaluation phases, leveraging state-of-the-art technologies to optimize accuracy and relevance. We implement both ground truth-based and reference-free evaluation metrics, providing a multi-faceted approach to assessing system outputs. Through an extensive series of experiments spanning diverse domains and datasets, we investigate the reliability and validity of existing evaluation methodologies. Specifically, we examine the correlation between various metrics and rigorous human evaluations, shedding light on their strengths, limitations, and potential for improvement. Our findings reveal that while ground truth-based metrics like BEM and RAGAS Answer Correctness exhibit moderate alignment with human judgments, reference-free metrics still struggle to accurately capture answer quality nuances without predefined correct responses. By analyzing Spearman correlation coefficients, we elucidate the interrelationships and relative effectiveness of different evaluation approaches across multiple domains.

This research makes the following key contributions: (i) presenting a comprehensive framework for evaluating RAG systems with state-of-the-art components, (ii) implementing and comparing diverse ground truth-based and reference-free evaluation metrics, (iii) conducting rigorous experiments across multiple datasets to assess metric reliability against human judgments, and (iv) analyzing the strengths and limitations of existing metrics, highlighting the need for advanced reference-free evaluation techniques that better approximate human perception.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 presents the method. Section 4 shows the experimental evaluation and Section 5 concludes the work.
2. Related Work

RAG systems have been implemented in various forms [1, 2, 3, 4, 5], incorporating advanced strategies like document splitting, chunking, retrieval, and diverse models for embedding and language generation, including proprietary and open-source models from platforms like HuggingFace¹. We have also explored different variants of RAG systems; however, this paper's primary focus is not to introduce a novel RAG system or methodology but to comprehensively evaluate the effectiveness of Large Language Model (LLM)-derived metrics, emphasizing reference-free approaches.

Several prior works have proposed frameworks and novel metrics that leverage the capabilities of LLMs [6, 7, 8, 9, 10, 11]. Unlike these existing solutions, which aim to score different RAG systems or propose new evaluation methods, metrics, or datasets, our research is specifically targeted at evaluating the potential satisfaction of end-user customers who receive the evaluation scores generated by such systems.

By concentrating on the practical utility and interpretability of evaluation metrics from the perspective of end-users, our study diverges from the conventional approach of optimizing technical performance alone. Instead, we strive to bridge the gap between state-of-the-art evaluation techniques and the real-world expectations of customers who rely on these systems for decision-making and information retrieval.

1 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

3. Method

3.1. Framework for RAG and evaluation

This paper introduces a framework for running and evaluating a RAG system that efficiently processes and responds to natural language queries. The system integrates state-of-the-art technologies to enhance answer accuracy and relevance. The process is segmented into four main phases. Ingestion: Input documents are processed into manageable chunks, leveraging techniques like document layout analysis for PDFs. The chunks are embedded into high-dimensional vectors capturing their semantic essence and ingested into a vector store for efficient similarity search. Retrieval: Upon receiving a query, its vector form undergoes similarity search in the vector store to identify the k most relevant chunks. This narrows down the information to the most pertinent chunks for answer generation. Generation: A Large Language Model (LLM) synthesizes information from the retrieved chunks to construct a coherent and natural-sounding answer to the query. Evaluation: A two-sided approach employs both ground-truth dependent and independent metrics. Ground-truth dependent metrics assess correctness against predefined answers, while ground-truth independent metrics evaluate answer relevance without a predefined set. This dual approach enables a comprehensive assessment of performance, correctness, and overall text quality. The system can receive human evaluations of question-answer pairs to evaluate metric reliability and alignment with expectations.

Figure 1: A simplified view of the implemented RAG system.
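The four phases of the pipeline can be sketched end to end. The following is a minimal illustration, not the paper's implementation: a toy bag-of-words embedder and an in-memory list stand in for the production embedding model and vector store, and all function names are ours.

```python
import math
from collections import Counter

def chunk(text: str, size: int = 200) -> list[str]:
    # Ingestion: split a document into fixed-size character chunks.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call a
    # transformer embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, store: list[tuple[Counter, str]], k: int = 3) -> list[str]:
    # Retrieval: rank indexed chunks by similarity to the query vector.
    q = embed(query)
    ranked = sorted(store, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Ingestion: index a document.
doc = ("RAG systems retrieve relevant chunks. Chunks are embedded as vectors. "
       "An LLM generates the final answer from retrieved context.")
store = [(embed(c), c) for c in chunk(doc, size=45)]

# Retrieval, followed by a (stubbed) Generation prompt for the LLM.
context = retrieve("how is the answer generated?", store, k=2)
prompt = f"Context: {' '.join(context)}\nQuestion: how is the answer generated?"
```

The parameter k controls the retrieval/noise trade-off mentioned above: a larger k passes more context to the generator but risks diluting the relevant passages.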
3.2. Evaluation Strategies

In our RAG system, we implemented and tested a wide range of evaluation metrics. Specifically, our system incorporates metrics for assessing individual RAG components like Information Retrieval (IR) and Answer Generation, as well as the overall pipeline. For IR, we used classical metrics such as Recall@K, Precision@K, mAP, MRR, and nDCG. For answer generation, the implemented metrics were divided into two categories. Syntactic metrics evaluate formal response aspects, including BLEU [12], ROUGE [13], Precision, Recall, F1, and Exact Match [14]; these focus on text properties rather than semantic meaning. Semantic metrics evaluate response meaning, including the BERT score [15] and the BEM score [16]; BEM is preferred over the BERT score due to its reported correlation with human evaluations and our empirical findings. LLM-derived metrics: We implemented in our framework the RAG triad of metrics for the three main steps of a RAG execution [6]: (i) Context relevance, which assesses whether the returned passage is relevant for answering the given query. (ii) Groundedness, which assesses whether the generated answer is faithful to the retrieved passage or contains hallucinated or extrapolated statements beyond the passage. (iii) Answer relevance, which assesses whether the generated answer is relevant given the query and retrieved passage. In addition, we implemented Answer correctness, which exploits LLMs and gold answers to measure the factual correctness of an answer. In this paper, only a subset of metrics is considered and compared for assessing the quality of the answers (see Section 4.2).

Manual evaluation. To verify the reliability of automated evaluation metrics, we implemented a rigorous manual evaluation process to assess the relevance, accuracy, and coherence of the answers generated by our RAG system. This manual evaluation was conducted by three independent human annotators, each with expertise in the domain of the questions posed to the system. For each evaluation session, the annotators were presented with the question, the corresponding answer generated by the RAG system, and the ground truth provided by the original dataset or the customer answers. The primary task for each annotator was to assess the quality of the generated answer in relation to the posed question, employing a discrete 5-point Likert scale. The criteria for scoring were as follows: 1. Very Poor: the generated answer is totally incorrect or irrelevant to the question, indicating a failure of the system to comprehend the query or retrieve pertinent information. 2. Poor: the generated answer is predominantly incorrect but shows glimpses of relevance, suggesting some level of understanding or appropriate retrieval. 3. Neither: the generated answer mixes relevant and irrelevant information almost equally, showcasing the system's partial success in addressing the query. 4. Good: the generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding of and response to the question. 5. Very Good: reserved for answers that are completely correct and fully relevant, reflecting an ideal outcome where the system accurately understood and responded to the query. The annotators conducted their assessments independently to ensure unbiased evaluations. Upon completion, the scores for each question-answer pair were collected and compared. In cases of discrepancy, a consensus discussion was initiated among the annotators to agree on the most accurate score. This consensus process mitigated individual bias and brought different perspectives into the evaluation of the generated answers. The manual evaluation particularly helps in assessing the reliability and validity of our system's automated evaluation metrics: by comparing the human-generated scores against the results produced by these automated measures, we can determine the extent to which the automatic metrics accurately reflect human judgment and perception of answer quality.
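The collection step of such an annotation protocol is mechanical even though the consensus discussion itself is human. As a small illustrative sketch (the function name and data layout are ours, not the paper's tooling), one can flag the question-answer pairs whose three Likert scores disagree and therefore need discussion:

```python
def flag_discrepancies(scores: dict[str, tuple[int, int, int]]) -> list[str]:
    """Return the ids of QA pairs whose three annotator scores
    (on the 1-5 Likert scale) are not unanimous and therefore
    require a consensus discussion."""
    for qa_id, triple in scores.items():
        for s in triple:
            assert 1 <= s <= 5, f"invalid Likert score for {qa_id}"
    return [qa_id for qa_id, triple in scores.items() if len(set(triple)) > 1]

annotations = {
    "q1": (5, 5, 5),   # unanimous: kept as-is
    "q2": (4, 3, 4),   # disagreement: resolved by discussion
    "q3": (1, 2, 1),
}
to_discuss = flag_discrepancies(annotations)
# → ["q2", "q3"]
```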
4. Experiments

Considering different domains (Section 4.1), we investigate the reliability of a subset of existing metrics (Section 4.2) for evaluating a RAG system (Section 3.1). We explore the feasibility of adopting reference-free metrics and the correlation among them and the human evaluation (Section 3.2).

4.1. Datasets

NarrativeQA - English. A subsample of the NarrativeQA dataset [17] was used, with 50 book-related and 50 movie script-related questions (1% of the test set), spanning 41 unique books and 42 unique movie scripts. This allowed evaluating the RAG system's performance across two distinct narrative content types.

Financial Asset Management - Italian. The FinAM-it dataset, created by Altilia, consists of 50 question-answer pairs from Italian asset management documents on topics like investment strategies, risk management, and regulatory compliance. The questions are complex and diverse, often requiring information from multiple paragraphs, with detailed, conversational-style answers.

4.2. Metrics

Table 1
Naming and classification of metrics shown in the experimental evaluation

Acronym      Name - Framework                 Type
BEM          BEM score - TensorFlow           GT-based
AR TruLens   Answer Relevance - TruLens       GT-free
AR RAGAS     Answer Relevance - RAGAS         GT-free
AC RAGAS     Answer Correctness - RAGAS       GT-based

In this paper we focus on evaluating the quality of the answers generated by the entire pipeline.

In our analysis, we considered the BEM score (BERT matching score) [16], which we found to be the most satisfactory among the classical metrics. It uses a BERT model [18] trained to solve an answer equivalence task: a classifier is trained to decide whether two given answers are equivalent and to return an equivalence score. We use the Answers and questions variation of the BEM score, which takes the two answers and the question as model input; this variation has been reported to perform better [16].

In addition, we considered novel LLM-derived metrics developed in the RAGAS [6] and TruLens² systems. These metrics offer both ground truth-based and reference-free evaluations. In particular, from RAGAS we used the two main metrics that focus on answers: Answer Correctness and Answer Relevance. More in detail: (i) Answer Correctness³: this metric measures the factual correctness of an answer and requires the presence of a ground truth. It employs an LLM to extract factual statements from both the predicted answer and the ground truth, labeling them as True Positives if present in both answers, False Negatives if present only in the ground truth, and False Positives if present only in the prediction. A final F1 score is then calculated; this score, in the range (0, 1), is the Answer Correctness. (ii) Answer Relevance⁴: this metric measures how pertinent the generated answer is to the prompt given to the LLM in the generation step. It computes a score in the range (0, 1) as the mean of the cosine similarities between the original question and a set of artificial questions generated by an LLM on the basis of the predicted answer and the given context:

    AnswerRelevance = (1/N) Σ_{i=1}^{N} cosine(E_o, E_{g_i})

where E_o is the embedding of the original question and E_{g_i} is the embedding of the i-th generated question. From TruLens we used the implemented Answer Relevance metric, which prompts an LLM to evaluate the relevance of the answer with respect to an input prompt that includes context and question. The score that the LLM assigns to each answer is in the range (0, 1).

2 https://www.trulens.org/
3 https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html
4 https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html
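Both RAGAS-style scores reduce to simple arithmetic once the LLM has produced its intermediate outputs. The sketch below assumes the statement extraction and question generation have already happened; the statement sets and embeddings are toy stand-ins, and the function names are ours rather than the RAGAS API.

```python
import math

def answer_correctness(pred_statements: set[str], truth_statements: set[str]) -> float:
    # F1 over factual statements: TP appear in both, FP only in the
    # prediction, FN only in the ground truth.
    tp = len(pred_statements & truth_statements)
    fp = len(pred_statements - truth_statements)
    fn = len(truth_statements - pred_statements)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer_relevance(e_original: list[float], e_generated: list[list[float]]) -> float:
    # Mean cosine similarity between the original question embedding
    # and the embeddings of the N LLM-generated questions.
    return sum(cosine(e_original, e) for e in e_generated) / len(e_generated)

# Toy example: 2 shared statements, 1 extra on each side -> F1 = 2/3.
ac = answer_correctness({"s1", "s2", "s3"}, {"s1", "s2", "s4"})
```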
To study the interrelationships and relative effectiveness of the various evaluation metrics, we exploit the Spearman correlation coefficient. The Spearman rank correlation [19] is a non-parametric measure that assesses the statistical dependence between the rankings of two variables: it tells how well the relationship between these variables can be described using a monotonic function. This measure is computed on ranked data, allowing for the analysis of both ordinal variables and continuous variables that have been converted into ranks. The Spearman rank correlation coefficient is denoted by ρ, and its value ranges from −1 to 1 inclusive, where 1 indicates perfect positive correlation, 0 indicates no correlation, and −1 indicates perfect negative correlation.

4.3. Settings

For this implementation, we employed OpenAI models for the embedding, retrieval, and generation stages of the RAG and to implement the evaluations with RAGAS and TruLens. The Ingestion step produced chunks of 1024 characters, balancing semantic integrity with avoiding irrelevant or redundant information: larger chunks may capture more context but increase noise, while smaller sizes may sacrifice contextual information. These chunks were embedded using OpenAI's text-embedding-ada-002⁵, a state-of-the-art transformer model for generating high-quality text embeddings. For retrieval within the vector store, the system identified the 10 embeddings most similar to the query among the previously indexed chunks. During generation, we employed the GPT-4-Turbo model⁶ with the following prompt structure:

    You are a chatbot having a conversation with a human.
    Given the following extracted parts of a long document and a question, create a final answer.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.
    Context: {CONTEXT}
    Chat history: {CHAT_HISTORY}
    Human: {HUMAN_INPUT}
    Chatbot:

This prompt provided the model with instructions and context, and encouraged concise, truthful answers without fabrication.

5 https://openai.com/blog/new-and-improved-embedding-model
6 https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo

4.4. Results

For both the books and movies subsamples of the NarrativeQA dataset, as can be seen in Table 2 and Table 3, human judgment shows a moderately strong Spearman correlation with BEM (0.735 and 0.704) and with the AC RAGAS scores for both GPT-3.5-turbo (0.718 and 0.792) and GPT-4-turbo (0.670 and 0.781). This indicates that these ground truth-based metrics are more aligned with human perception of answer quality. Reference-free metrics show poor correlation with human judgment, especially AR RAGAS (0.234 and 0.483), highlighting the fact that evaluating an answer without ground truth is still a challenging problem for Large Language Models. The analysis of the FinAM-it dataset, as can be seen in Table 4, shows generally lower correlations across all metrics, with the highest correlation observed between human judgment and AC RAGAS gpt-4-turbo (0.531). This could be related to the fact that the FinAM-it dataset presents more challenging and diverse content that is more difficult to evaluate. Extending the analysis to all the datasets at once (Table 5), it can be seen that all the metrics still have difficulties approximating the human evaluation in a robust and reliable way.
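The correlation analysis behind these tables is straightforward to reproduce: rank both score series and compute the Pearson correlation of the ranks. The sketch below hand-rolls this (equivalent to scipy.stats.spearmanr) on toy score vectors; the numbers are illustrative, not the paper's data.

```python
def rankdata(xs):
    # Rank values 1..n, averaging ranks over ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank vectors.
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy per-question scores: human Likert judgments vs. one automated metric.
human  = [5, 4, 2, 5, 1, 3, 4, 2]
metric = [0.9, 0.7, 0.3, 0.8, 0.1, 0.5, 0.6, 0.4]
rho = spearman(human, metric)  # high positive: the metric ranks answers like the humans
```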
Table 2
Spearman correlations on NarrativeQA books subsample (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.735    0.436    0.234    0.718    0.420    0.150    0.670
BEM               0.735    1.000    0.185    0.224    0.740    0.405    -0.026   0.713
AR TruLens 3.5    0.436    0.185    1.000    0.197    0.274    0.477    0.178    0.224
AR RAGAS 3.5      0.234    0.224    0.197    1.000    0.129    0.156    0.633    0.121
AC RAGAS 3.5      0.718    0.740    0.274    0.129    1.000    0.238    0.093    0.854
AR TruLens 4      0.420    0.405    0.477    0.156    0.238    1.000    0.122    0.108
AR RAGAS 4        0.150    -0.026   0.178    0.633    0.093    0.122    1.000    0.097
AC RAGAS 4        0.670    0.713    0.224    0.121    0.854    0.108    0.097    1.000

Table 3
Spearman correlations on NarrativeQA movies subsample (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.704    0.565    0.483    0.792    0.213    0.411    0.781
BEM               0.704    1.000    0.522    0.428    0.752    0.235    0.358    0.746
AR TruLens 3.5    0.565    0.522    1.000    0.390    0.476    0.270    0.422    0.473
AR RAGAS 3.5      0.483    0.428    0.390    1.000    0.403    0.406    0.738    0.421
AC RAGAS 3.5      0.792    0.752    0.476    0.403    1.000    0.228    0.358    0.977
AR TruLens 4      0.213    0.235    0.270    0.406    0.228    1.000    0.456    0.200
AR RAGAS 4        0.411    0.358    0.422    0.738    0.358    0.456    1.000    0.379
AC RAGAS 4        0.781    0.746    0.473    0.421    0.977    0.200    0.379    1.000

Table 4
Spearman correlations on FinAM-it dataset (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.208    0.178    0.153    0.053    0.280    0.230    0.531
BEM               0.208    1.000    0.214    0.209    0.276    0.001    0.203    0.278
AR TruLens 3.5    0.178    0.214    1.000    0.412    0.433    0.181    0.446    0.300
AR RAGAS 3.5      0.153    0.209    0.412    1.000    0.463    -0.191   0.608    0.130
AC RAGAS 3.5      0.053    0.276    0.433    0.463    1.000    -0.099   0.243    0.255
AR TruLens 4      0.280    0.001    0.181    -0.191   -0.099   1.000    -0.009   0.245
AR RAGAS 4        0.230    0.203    0.446    0.608    0.243    -0.009   1.000    0.157
AC RAGAS 4        0.531    0.278    0.300    0.130    0.255    0.245    0.157    1.000

Table 5
Spearman correlations on all datasets (3.5 = gpt-3.5-turbo, 4 = gpt-4-turbo)

Metrics           Human    BEM      AR TruL  AR RAG   AC RAG   AR TruL  AR RAG   AC RAG
                  Judg.             3.5      3.5      3.5      4        4        4
Human Judgement   1.000    0.627    0.423    0.323    0.536    0.314    0.287    0.653
BEM               0.627    1.000    0.310    0.266    0.654    0.249    0.155    0.711
AR TruLens 3.5    0.423    0.310    1.000    0.346    0.303    0.302    0.375    0.302
AR RAGAS 3.5      0.323    0.266    0.346    1.000    0.213    0.201    0.682    0.198
AC RAGAS 3.5      0.536    0.654    0.303    0.213    1.000    0.208    0.139    0.813
AR TruLens 4      0.314    0.249    0.302    0.201    0.208    1.000    0.250    0.187
AR RAGAS 4        0.287    0.155    0.375    0.682    0.139    0.250    1.000    0.169
AC RAGAS 4        0.653    0.711    0.302    0.198    0.813    0.187    0.169    1.000

5. Conclusion

Our exploration into evaluating Retrieval Augmented Generation (RAG) systems via ground truth-based and reference-free metrics was driven by the need for reliable evaluation frameworks, particularly for scenarios lacking ground truth data. Our evaluation framework's implementation has demonstrated its potential for facilitating a more comprehensive understanding of these systems' capabilities in such situations. Through rigorous experimentation across different domains and datasets, including NarrativeQA and a specialized industrial dataset, we compared various evaluation methodologies against human judgment. While ground truth-based metrics like BEM and AC RAGAS showed moderate to strong correlation with human judgments across different domains and models, reference-free metrics still face significant challenges in achieving similar correlation levels. This highlights the current limitations of automated metrics in capturing nuanced aspects of human judgment, suggesting an urgent need for further refinement of reference-free evaluation methods. The Spearman correlation analysis reveals that while some metrics align more closely with human assessments, there is still significant room for improvement, especially for more challenging and diverse content like the FinAM-it dataset. These findings underscore the complexity of accurately evaluating RAG systems and the importance of considering domain-specific factors in metric development and selection. The observed limitations can have practical consequences, such as inaccurate system performance assessments, leading to suboptimal deployment decisions and reduced user satisfaction. Looking forward, our study emphasizes developing more nuanced and sophisticated evaluation frameworks that can better approximate human judgment. This entails improving existing metrics' accuracy and reliability and exploring new methodologies to effectively capture qualitative aspects of generated answers.

While our evaluation framework provides valuable insights, we acknowledge several limitations: (i) current reference-free metrics still struggle to match human judgment, necessitating further refinement. (ii) Metric performance suffers for challenging, domain-specific datasets, highlighting the need for domain-aware or adaptive approaches. (iii) Our analysis covered a subset of available metrics; exploring a wider range, including leveraging advanced LLMs and additional context, is needed. (iv) Results should be validated across different RAG configurations and domains for broader applicability. (v) Despite rigorous human evaluation, inherent subjectivity and potential biases may have impacted the findings. We view these limitations as opportunities to contribute to developing more reliable, accurate, and human-like evaluation frameworks that can drive advancements in natural language processing capabilities and the realization of highly effective RAG systems across diverse domains.

References

[1] K. Guu, K. Lee, Z. Tung, P. Pasupat, M. Chang, Retrieval augmented language model pre-training, in: ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 3929-3938. URL: http://proceedings.mlr.press/v119/guu20a.html.
[2] O. Khattab, C. Potts, M. Zaharia, Relevance-guided supervision for OpenQA with ColBERT, 2021. arXiv:2007.00814.
[3] K. Shuster, S. Poff, M. Chen, D. Kiela, J. Weston, Retrieval augmentation reduces hallucination in conversation, 2021. arXiv:2104.07567.
[4] S. Huo, N. Arabzadeh, C. Clarke, Retrieving supporting evidence for generative question answering, in: SIGIR-AP, ACM, 2023, pp. 11-20. doi:10.1145/3624918.3625336.
[5] T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, J. E. Gonzalez, RAFT: Adapting language model to domain specific RAG, 2024. arXiv:2403.10131.
[6] S. Es, J. James, L. Espinosa-Anke, S. Schockaert, RAGAS: Automated evaluation of retrieval augmented generation, 2023. arXiv:2309.15217.
[7] Y. Tang, Y. Yang, MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries, 2024. arXiv:2401.15391.
[8] M. Gao, X. Hu, J. Ruan, X. Pu, X. Wan, LLM-based NLG evaluation: Current status and challenges, 2024. arXiv:2402.01383.
[9] Z. Zhang, M. Fang, L. Chen, RetrievalQA: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering, 2024. arXiv:2402.16457.
[10] V. Katranidis, G. Barany, FaaF: Facts as a function for the evaluation of RAG systems, 2024. arXiv:2403.03888.
[11] J. Saad-Falcon, O. Khattab, C. Potts, M. Zaharia, ARES: An automated evaluation framework for retrieval-augmented generation systems, 2024. arXiv:2311.09476.
[12] C.-Y. Lin, E. Hovy, Automatic evaluation of summaries using n-gram co-occurrence statistics, in: Human Language Technology Conference of the North American Chapter of the ACL, 2003, pp. 150-157. URL: https://aclanthology.org/N03-1020.
[13] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, ACL, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013.
[14] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, K. Duh, X. Carreras (Eds.), EMNLP, ACL, Austin, Texas, 2016, pp. 2383-2392. URL: https://aclanthology.org/D16-1264. doi:10.18653/v1/D16-1264.
[15] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, 2020. arXiv:1904.09675.
[16] J. Bulian, C. Buck, W. Gajewski, B. Boerschinger, T. Schuster, Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation, 2022. arXiv:2202.07654.
[17] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, E. Grefenstette, The NarrativeQA reading comprehension challenge, 2017. arXiv:1712.07040.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[19] K. F. Weaver, V. Morales, S. L. Dunn, K. Godde, P. F. Weaver, Pearson's and Spearman's Correlation, John Wiley and Sons, Ltd, 2017, pp. 435-471. doi:10.1002/9781119454205.ch10.