<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ermelinda</forename><surname>Oro</surname></persName>
							<email>ermelinda.oro@icar.cnr.it</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">National Research Council</orgName>
								<orgName type="department" key="dep2">Institute for High Performance Computing and Networking</orgName>
								<address>
									<addrLine>via P. Bucci 8/9C, (CS)</addrLine>
									<postCode>87036</postCode>
									<settlement>Rende</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Altilia srl</orgName>
								<orgName type="department" key="dep2">TechNest Start-up Incubator</orgName>
								<orgName type="institution">University of Calabria</orgName>
								<address>
									<addrLine>Piazza Vermicelli, Rende (CS)</addrLine>
									<postCode>87036</postCode>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Francesco</forename><forename type="middle">Maria</forename><surname>Granata</surname></persName>
							<email>francesco.granata@altiliagroup.com</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Altilia srl</orgName>
								<orgName type="department" key="dep2">TechNest Start-up Incubator</orgName>
								<orgName type="institution">University of Calabria</orgName>
								<address>
									<addrLine>Piazza Vermicelli, Rende (CS)</addrLine>
									<postCode>87036</postCode>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Antonio</forename><surname>Lanza</surname></persName>
							<email>antonio.lanza@altiliagroup.com</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Altilia srl</orgName>
								<orgName type="department" key="dep2">TechNest Start-up Incubator</orgName>
								<orgName type="institution">University of Calabria</orgName>
								<address>
									<addrLine>Piazza Vermicelli, Rende (CS)</addrLine>
									<postCode>87036</postCode>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Amir</forename><surname>Bachir</surname></persName>
							<email>amir.bachir@altiliagroup.com</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Altilia srl</orgName>
								<orgName type="department" key="dep2">TechNest Start-up Incubator</orgName>
								<orgName type="institution">University of Calabria</orgName>
								<address>
									<addrLine>Piazza Vermicelli, Rende (CS)</addrLine>
									<postCode>87036</postCode>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Luca</forename><surname>De Grandis</surname></persName>
							<email>luca.degrandis@altiliagroup.com</email>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Altilia srl</orgName>
								<orgName type="department" key="dep2">TechNest Start-up Incubator</orgName>
								<orgName type="institution">University of Calabria</orgName>
								<address>
									<addrLine>Piazza Vermicelli, Rende (CS)</addrLine>
									<postCode>87036</postCode>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Massimo</forename><surname>Ruffolo</surname></persName>
							<email>massimo.ruffolo@altiliagroup.com</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">National Research Council</orgName>
								<orgName type="department" key="dep2">Institute for High Performance Computing and Networking</orgName>
								<address>
									<addrLine>via P. Bucci 8/9C, (CS)</addrLine>
									<postCode>87036</postCode>
									<settlement>Rende</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">Altilia srl</orgName>
								<orgName type="department" key="dep2">TechNest Start-up Incubator</orgName>
								<orgName type="institution">University of Calabria</orgName>
								<address>
									<addrLine>Piazza Vermicelli, Rende (CS)</addrLine>
									<postCode>87036</postCode>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Retrieval-Augmented Generation for Question Answering with Large Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9E5D2F1F25DD3EA6A75A5BF24579E3B9</idno>
					<idno type="arXiv">arXiv:2402.01383</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Retrieval Augmented Generation (RAG), Question Answering (QA), Retrieval, Large Language Model (LLM), Evaluation</term>
					<term>0000-0002-5529-1007 (E. Oro)</term>
					<term>0000-0003-4425-753X (F. M. Granata)</term>
					<term>0000-0002-2875-4133 (L. D. Grandis)</term>
					<term>0000-0002-4094-4810 (M. Ruffolo)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems designed for question-answering tasks using large language models (LLMs). The proposed framework integrates document ingestion, information retrieval, answer generation, and evaluation phases. Both ground truth-based and reference-free evaluation metrics are implemented to provide a multi-faceted assessment approach. Through experiments across diverse datasets like NarrativeQA and a proprietary financial dataset (FinAM-it), the reliability of existing metrics is investigated by comparing them against rigorous human evaluations. The results demonstrate that ground truth-based metrics such as BEM and RAGAS Answer Correctness exhibit a moderately strong correlation with human judgments. However, reference-free metrics still struggle to accurately capture nuances in answer quality without predefined correct responses. An in-depth analysis of Spearman correlation coefficients sheds light on the interrelationships and relative effectiveness of various evaluation approaches across multiple domains. While highlighting the current limitations of reference-free methodologies, the study underscores the need for more sophisticated techniques to better approximate human perception of answer relevance and correctness. Overall, this research contributes to ongoing efforts in developing reliable evaluation frameworks for RAG systems, paving the way for advancements in natural language processing and the realization of highly accurate and human-like AI systems.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Retrieval-Augmented Generation (RAG) systems, which integrate information retrieval with natural language generation, have shown promise in enhancing language models' capabilities. However, evaluating their performance remains challenging, particularly when ground truth data is unavailable, impeding accurate assessments of system utility. To address this challenge, we present a comprehensive framework designed to facilitate the rigorous evaluation of RAG systems for question-answering tasks. Our framework integrates document ingestion, retrieval, generation, and evaluation phases, leveraging state-of-the-art technologies to optimize accuracy and relevance. We implement both ground truth-based and reference-free evaluation metrics, providing a multifaceted approach to assessing system outputs. Through an extensive series of experiments spanning diverse domains and datasets, we investigate the reliability and validity of existing evaluation methodologies. Specifically, we examine the correlation between various metrics and rigorous human evaluations, shedding light on their strengths, limitations, and potential for improvement. Our findings reveal that while ground truth-based metrics like BEM and RAGAS Answer Correctness exhibit moderate alignment with human judgments, reference-free metrics still struggle to accurately capture answer quality nuances without predefined correct responses.
By analyzing Spearman correlation coefficients, we elucidate the interrelationships and relative effectiveness of different evaluation approaches across multiple domains.</p><p>This research makes the following key contributions: (i) presenting a comprehensive framework for evaluating RAG systems with state-of-the-art components, (ii) implementing and comparing diverse ground truth-based and reference-free evaluation metrics, (iii) conducting rigorous experiments across multiple datasets to assess metric reliability against human judgments, and (iv) analyzing the strengths and limitations of existing metrics, highlighting the need for advanced reference-free evaluation techniques that better approximate human perception.</p><p>The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 presents the method. Section 4 shows the experimental evaluation and Section 5 concludes the work. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>RAG systems have been implemented in various forms <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5]</ref>, incorporating advanced strategies like document splitting, chunking, retrieval, and diverse models for embedding and language generation, including proprietary and open-source models from platforms like HuggingFace<ref type="foot" target="#foot_0">1</ref> . We have also explored different variants of RAG systems, however, this paper's primary focus is not to introduce a novel RAG system or methodology but to comprehensively evaluate the effectiveness of Large Language Model (LLM)-derived metrics, emphasizing reference-free approaches.</p><p>Several prior works have proposed frameworks and novel metrics that leverage the capabilities of LLMs <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref>. Unlike these existing solutions, which aim to score different RAG systems or propose new evaluation methods, metrics, or datasets, our research is specifically targeted at evaluating the potential satisfaction of end-user customers who receive the evaluation scores generated by such systems.</p><p>By concentrating on the practical utility and interpretability of evaluation metrics from the perspective of end-users, our study diverges from the conventional approach of optimizing technical performance alone. Instead, we strive to bridge the gap between state-of-the-art evaluation techniques and the real-world expectations of customers who rely on these systems for decision-making and information retrieval.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Framework for RAG and evaluation</head><p>This paper introduces a framework for running and evaluating a RAG system for efficiently processing and responding to natural language queries. The system integrates state-of-the-art technologies to enhance answer accuracy and relevance. The process is segmented into four main phases: Ingestion: Input documents are processed into manageable chunks, leveraging techniques like document layout analysis for PDFs. The chunks are embedded into high-dimensional vectors capturing their semantic essence and ingested into a vector store for efficient similarity search. Retrieval: Upon receiving a query, its vector form undergoes similarity search in the vector store to identify the 𝑘 most relevant chunks. This narrows down the information to the most pertinent chunks for answer generation. Generation: A Large Language Model (LLM) synthesizes information from the retrieved chunks to construct a coherent and natural-sounding answer to the query. Evaluation: A two-sided approach employs both ground-truth dependent and independent metrics. Ground-truth dependent metrics assess correctness against predefined answers, while ground-truth independent metrics evaluate answer relevance without a predefined set. This dual approach enables a comprehensive assessment of performance, correctness, and overall text quality. The system can receive human evaluations of question-answer pairs to evaluate metric reliability and alignment with expectations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Evaluation Strategies</head><p>In our RAG system, we implemented and tested a wide range of evaluation metrics. Specifically, our system incorporates metrics for assessing individual RAG components like Information Retrieval (IR) and Answer Generation, as well as the overall pipeline. For IR, we used classical metrics such as Recall@K, Precision@K, mAP, MRR, and nDCG. For answer generation, the implemented metrics were divided into two categories: Syntactic metrics evaluate formal response aspects, including BLEU <ref type="bibr" target="#b11">[12]</ref>, ROUGE <ref type="bibr" target="#b12">[13]</ref>, Precision, Recall, F1, and Exact Match <ref type="bibr" target="#b13">[14]</ref>. These focus on text properties rather than semantic meaning. Semantic metrics evaluate response meaning, including BERT score <ref type="bibr" target="#b14">[15]</ref> and BEM score <ref type="bibr" target="#b15">[16]</ref>. BEM is preferred over BERT due to reported correlation with human evaluations and our empirical findings. LLM-derived Metrics: We implemented in our framework the RAG triad of metrics for the three main steps of a RAG's execution <ref type="bibr" target="#b5">[6]</ref>: (i) Context relevance that assesses if the passage returned is relevant for answering the given query. (ii) Groundedness that assesses if the generated answer is faithful to the retrieved passage or if it contains hallucinated or extrapolated statements beyond the passage. (iii) Answer relevance that assesses if the generated answer is relevant given the query and retrieved passage. In addition, we implemented the Answer correctness that exploits LLMs and gold answers to measure the factual correctness of an answer. In this paper, only a subset of metrics are considered and compared for assessing the quality of the answers (see Section 4.2).</p><p>Manual evaluation.
To verify the reliability of automated evaluation metrics, we implemented a rigorous manual evaluation process to assess the relevance, accuracy, and coherence of the answers generated by our RAG system. This manual evaluation was conducted by three independent human annotators, each with expertise in the domain of the questions posed to the system. For each evaluation session, the annotators were presented with the question, the corresponding answer generated by the RAG system, and the ground truth provided by the original dataset or the customer answers. The primary task for each annotator was to assess the quality of the generated answer in relation to the posed question, employing a discrete scoring 5-point Likert scale. The criteria for scoring were as follows: 1. Very Poor: The generated answer is totally incorrect or irrelevant to the question. This case indicates a failure of the system to comprehend the query or retrieve pertinent information. 2. Poor: The generated answer is predominantly incorrect but with glimpses of relevance suggesting some level of understanding or appropriate retrieval. 3. Neither: The generated answer mixes relevant and irrelevant information almost equally, showcasing the system's partial success in addressing the query. 4. Good: The generated answer is largely correct but includes minor inaccuracies or irrelevant details, demonstrating a strong understanding and response to the question. 5. Very Good: Reserved for answers that are completely correct and fully relevant, reflecting an ideal outcome where the system accurately understood and responded to the query. The annotators conducted their assessments independently to ensure unbiased evaluations. Upon completion, the scores for each question-answer pair were collected and compared. In cases of discrepancy, a consensus discussion was initiated among the annotators to agree on the most accurate score.
This consensus process allowed for mitigating individual bias and considering different perspectives in evaluating the quality of the generated answers. This manual evaluation process helps particularly in assessing the reliability and validity of our system's automated evaluation metrics. By comparing the human-generated scores against the results produced by these automated measures, we can determine the extent to which the automatic metrics accurately reflect human judgment and perception of answer quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>Considering different domains (Section 4.1), we investigate the reliability of a subset of existing metrics (Section 4.2) for evaluating a RAG system (Section 3.1). We explore the feasibility of adopting reference-free metrics and the correlation among them and the human evaluation (Section 3.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets</head><p>NarrativeQA -English. A subsample of the NarrativeQA dataset <ref type="bibr" target="#b16">[17]</ref> was used, with 50 book-related and 50 movie script-related questions (1% of the test set), spanning 41 unique books and 42 unique movie scripts. This allowed evaluating the RAG system's performance across two distinct narrative content types.</p><p>Financial Asset Management -Italian. The FinAM-it dataset, created by Altilia, consists of 50 question-answer pairs from Italian asset management documents on topics like investment strategies, risk management, and regulatory compliance. The questions are complex and diverse, often requiring information from multiple paragraphs, with detailed, conversational-style answers. In this paper we focus on evaluating the generated answer's quality of the entire pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Metrics</head><p>In our analysis, we considered the BEM score (BERT matching score) <ref type="bibr" target="#b15">[16]</ref>, which we found experimentally to be the most satisfying among the classic metrics. It is a metric that uses a BERT model <ref type="bibr" target="#b17">[18]</ref> trained to solve an answer equivalence task; this task is solved by training a classifier that tells if two given answers are equivalent and returns the equivalence score. We use the variation of the BEM score, Answers and questions, that exploits the two answers and the question as model input. This variation results in better performance <ref type="bibr" target="#b15">[16]</ref>.</p><p>In addition, we considered novel LLM-derived metrics developed in the RAGAS <ref type="bibr" target="#b5">[6]</ref> and TruLens 2 systems. These metrics offer evaluations both ground truth-based and reference-free. In particular, from RAGAS we used the two main metrics that focus on answers: Answer Correctness and Answer Relevance. More in detail: (i) Answer Correctness 3 : This metric measures the factual correctness of an answer and needs the presence of a ground truth. It employs an LLM to extract factual statements from both the predicted answer and the ground truth, labeling them as True Positives if they are present in both the answers, False Negatives if they are present only in the ground truth, and False Positives if they are present only in the prediction. Then a final F1 score is calculated; this score, in the range (0, 1), is the Answer Correctness. (ii) Answer Relevance<ref type="foot" target="#foot_3">4</ref> : This metric measures how pertinent the generated answer is to the prompt given to the LLM in the generation step. It computes a score in the range (0, 1) as the mean of the cosine similarities between the original question and a set of artificial questions generated by an LLM on the basis of the predicted answer and the given context.
The formula of the score is the following: $\mathrm{AnswerRelevance} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{cosine}(E_o, E_{g_i})$, where $E_o$ is the embedding of the original generated answer and $E_{g_i}$ is the embedding of the $i$-th generated question. From TruLens we used the implemented Answer Relevance metric that prompts an LLM to evaluate the relevance of the answer with respect to the input prompt that includes context and question. The score that the LLM assigns to each answer is in the range (0, 1).</p><p>To study the interrelationships and relative effectiveness among various evaluation metrics, we exploit the Spearman correlation coefficient. The Spearman Rank Correlation <ref type="bibr" target="#b18">[19]</ref> is a non-parametric measure that assesses the statistical dependence between the rankings of two variables. It tells how well the relationship between these variables can be described using a monotonic function. This measure is computed on ranked data, allowing for the analysis of both ordinal variables and continuous variables that have been converted into ranks. The Spearman Rank Correlation coefficient is denoted by 𝜌, and its value ranges from −1 to 1 inclusive, where 1 indicates perfect positive correlation, 0 indicates no correlation, and −1 indicates perfect negative correlation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Settings</head><p>For this implementation, we employed OpenAI models for the embedding, retrieval, and generation stages of the RAG and to implement evaluations with RAGAS and TruLens. The Ingestion step produced chunks of 1024 characters, balancing semantic integrity with avoiding irrelevant or redundant information. Larger chunks may capture more context but increase noise, while smaller sizes may sacrifice contextual information. These chunks were embedded using OpenAI's text-embedding-ada-002 <ref type="foot" target="#foot_4">5</ref> , a state-of-the-art transformer model for generating highquality text embeddings. For retrieval within the vector store, the system identified the 10 most similar embeddings to previously indexed chunks. During generation, we employed the GPT-4-Turbo model<ref type="foot" target="#foot_5">6</ref> with the following prompt structure:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>You are a chatbot having a conversation with a human. Given the following extracted parts</head><p>of a long document and a question, create a final answer. If you don't know the answer, just say that you don't know, don't try to make up an answer. Context: { CONTEXT } Chat history: { CHAT_HISTORY } Human: { HUMAN_INPUT } Chatbot:</p><p>This prompt provided the model with instructions, context, and encouraged concise, truthful answers without fabrication.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Results</head><p>For both books and movies subsamples from the NarrativeQA dataset, as can be seen in Table 2 and Table 3, human judgment shows a moderately strong Spearman correlation with BEM (0.735 and 0.704) and AC RAGAS scores across both GPT-3.5-turbo (0.718, 0.792), and GPT-4-turbo models (0.67 and 0.781). This indicates that these ground truth-based metrics are more aligned with human perception of answer quality. Reference-free metrics show poor correlation with human judgment, especially AR RAGAS (0.234 and 0.483), highlighting the fact that evaluating an answer without ground truth is still a challenging problem for Large Language Models. The analysis of the FinAM-it dataset, as can be seen in Table <ref type="table" target="#tab_3">4</ref>, shows generally lower correlations across all metrics, with the highest correlation being observed between human judgment and AC RAGAS gpt-4-turbo (0.531). This could be related to the fact that the FinAM-it dataset presents more challenging and diverse content that is more difficult to evaluate. Extending the analysis on all the datasets at once, it can be seen that all the metrics still have difficulties approximating the human evaluation in a robust and reliable way.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>Our exploration into evaluating Retrieval Augmented Generation (RAG) systems via ground truth-based and reference-free metrics was driven by the need for reliable evaluation frameworks, particularly for scenarios lacking ground truth data. Our evaluation framework's implementation has demonstrated its potential for facilitating a more comprehensive understanding of these systems' capabilities in such situations. Through rigorous experimentation across different domains and datasets, including NarrativeQA and a specialized industrial dataset, we compared various evaluation methodologies against human judgment. While ground truth-based metrics like BEM and AC RAGAS showed moderate to strong correlation with human judgments across different domains and models, reference-free metrics still face significant challenges in achieving similar correlation levels. This highlights the current limitations of automated metrics in capturing nuanced aspects of human judgment, suggesting an urgent need for further refinement of reference-free evaluation methods. The Spearman correlation analysis reveals that while some metrics align more closely with human assessments, there is still significant room for improvement, especially for more challenging and diverse content like the FinAM-it dataset. These findings underscore the complexity of accurately evaluating RAG systems and the importance of considering domain-specific factors in metric development and selection. The observed limitations can have practical consequences, such as inaccurate system performance assessments, leading to suboptimal deployment decisions and reduced user satisfaction. Looking forward, our study emphasizes developing more nuanced and sophisticated evaluation frameworks that can better approximate human judgment.
This entails improving existing metrics' accuracy and reliability and exploring new methodologies to effectively capture qualitative aspects of generated answers. While our evaluation framework provides valuable insights, we acknowledge several limitations: (i) Current reference-free metrics still struggle to match human judgment, necessitating further refinement. (ii) Metric performance suffers for challenging, domain-specific datasets, highlighting the need for domain-aware or adaptive approaches. (iii) Our analysis covered a subset of available metrics; exploring a wider range, including leveraging advanced LLMs and additional context, is needed. (iv) Results should be validated across different RAG configurations and domains for broader applicability. (v) Despite rigorous human evaluation, inherent subjectivity and potential biases may have impacted findings. We view these limitations as opportunities to contribute to developing more reliable, accurate, and human-like evaluation frameworks that can drive advancements in natural language processing capabilities and the realization of highly effective RAG systems across diverse domains.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: The simplified figure of the implemented RAG System.</figDesc><graphic coords="2,89.29,84.19,203.35,105.07" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Naming and classification of metrics shown in the experimental evaluation</figDesc><table><row><cell>Acronym</cell><cell>Name -Framework</cell><cell>Type</cell></row><row><cell>BEM</cell><cell cols="2">BEM score -TensorFlow GT-based</cell></row><row><cell>AR TruLens</cell><cell>Answer Relevance -TruLens</cell><cell>GT-free</cell></row><row><cell>AR RAGAS</cell><cell>Answer Relevance -RAGAS</cell><cell>GT-free</cell></row><row><cell>AC RAGAS</cell><cell cols="2">Answer Correctness -RAGAS GT-based</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Spearman correlations on NarrativeQA books subsample</figDesc><table><row><cell>Metrics</cell><cell>Human Judgement</cell><cell>BEM</cell><cell>AR TruLens gpt-3.5-turbo</cell><cell>AR RAGAS gpt-3.5-turbo</cell><cell>AC RAGAS gpt-3.5-turbo</cell><cell>AR TruLens gpt-4-turbo</cell><cell>AR RAGAS gpt-4-turbo</cell><cell>AC RAGAS gpt-4-turbo</cell></row><row><cell>Human Judgement</cell><cell>1.000</cell><cell>0.735</cell><cell>0.436</cell><cell>0.234</cell><cell>0.718</cell><cell>0.420</cell><cell>0.150</cell><cell>0.670</cell></row><row><cell>BEM</cell><cell>0.735</cell><cell>1.000</cell><cell>0.185</cell><cell>0.224</cell><cell>0.740</cell><cell>0.405</cell><cell>-0.026</cell><cell>0.713</cell></row><row><cell>AR TruLens gpt-3.5-turbo</cell><cell>0.436</cell><cell>0.185</cell><cell>1.000</cell><cell>0.197</cell><cell>0.274</cell><cell>0.477</cell><cell>0.178</cell><cell>0.224</cell></row><row><cell>AR RAGAS gpt-3.5-turbo</cell><cell>0.234</cell><cell>0.224</cell><cell>0.197</cell><cell>1.000</cell><cell>0.129</cell><cell>0.156</cell><cell>0.633</cell><cell>0.121</cell></row><row><cell>AC RAGAS gpt-3.5-turbo</cell><cell>0.718</cell><cell>0.740</cell><cell>0.274</cell><cell>0.129</cell><cell>1.000</cell><cell>0.238</cell><cell>0.093</cell><cell>0.854</cell></row><row><cell>AR TruLens gpt-4-turbo</cell><cell>0.420</cell><cell>0.405</cell><cell>0.477</cell><cell>0.156</cell><cell>0.238</cell><cell>1.000</cell><cell>0.122</cell><cell>0.108</cell></row><row><cell>AR RAGAS gpt-4-turbo</cell><cell cols="2">0.150 -0.026</cell><cell>0.178</cell><cell>0.633</cell><cell>0.093</cell><cell>0.122</cell><cell>1.000</cell><cell>0.097</cell></row><row><cell>AC RAGAS gpt-4-turbo</cell><cell>0.670</cell><cell>0.713</cell><cell>0.224</cell><cell>0.121</cell><cell>0.854</cell><cell>0.108</cell><cell>0.097</cell><cell>1.000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Spearman correlations on NarrativeQA movies subsample</figDesc><table><row><cell>Metrics</cell><cell>Human Judgement</cell><cell>BEM</cell><cell>AR TruLens gpt-3.5-turbo</cell><cell>AR RAGAS gpt-3.5-turbo</cell><cell>AC RAGAS gpt-3.5-turbo</cell><cell>AR TruLens gpt-4-turbo</cell><cell>AR RAGAS gpt-4-turbo</cell><cell>AC RAGAS gpt-4-turbo</cell></row><row><cell>Human Judgement</cell><cell>1.000</cell><cell>0.704</cell><cell>0.565</cell><cell>0.483</cell><cell>0.792</cell><cell>0.213</cell><cell>0.411</cell><cell>0.781</cell></row><row><cell>BEM</cell><cell>0.704</cell><cell>1.000</cell><cell>0.522</cell><cell>0.428</cell><cell>0.752</cell><cell>0.235</cell><cell>0.358</cell><cell>0.746</cell></row><row><cell>AR TruLens gpt-3.5-turbo</cell><cell>0.565</cell><cell>0.522</cell><cell>1.000</cell><cell>0.390</cell><cell>0.476</cell><cell>0.270</cell><cell>0.422</cell><cell>0.473</cell></row><row><cell>AR RAGAS gpt-3.5-turbo</cell><cell>0.483</cell><cell>0.428</cell><cell>0.390</cell><cell>1.000</cell><cell>0.403</cell><cell>0.406</cell><cell>0.738</cell><cell>0.421</cell></row><row><cell>AC RAGAS gpt-3.5-turbo</cell><cell>0.792</cell><cell>0.752</cell><cell>0.476</cell><cell>0.403</cell><cell>1.000</cell><cell>0.228</cell><cell>0.358</cell><cell>0.977</cell></row><row><cell>AR TruLens gpt-4-turbo</cell><cell>0.213</cell><cell>0.235</cell><cell>0.270</cell><cell>0.406</cell><cell>0.228</cell><cell>1.000</cell><cell>0.456</cell><cell>0.200</cell></row><row><cell>AR RAGAS gpt-4-turbo</cell><cell>0.411</cell><cell>0.358</cell><cell>0.422</cell><cell>0.738</cell><cell>0.358</cell><cell>0.456</cell><cell>1.000</cell><cell>0.379</cell></row><row><cell>AC RAGAS gpt-4-turbo</cell><cell>0.781</cell><cell>0.746</cell><cell>0.473</cell><cell>0.421</cell><cell>0.977</cell><cell>0.200</cell><cell>0.379</cell><cell>1.000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Spearman correlations on FinAM-it dataset</figDesc><table><row><cell>Metrics</cell><cell>Human Judgement</cell><cell>BEM</cell><cell>AR TruLens gpt-3.5-turbo</cell><cell>AR RAGAS gpt-3.5-turbo</cell><cell>AC RAGAS gpt-3.5-turbo</cell><cell>AR TruLens gpt-4-turbo</cell><cell>AR RAGAS gpt-4-turbo</cell><cell>AC RAGAS gpt-4-turbo</cell></row><row><cell>Human Judgement</cell><cell>1.000</cell><cell>0.208</cell><cell>0.178</cell><cell>0.153</cell><cell>0.053</cell><cell>0.280</cell><cell>0.230</cell><cell>0.531</cell></row><row><cell>BEM</cell><cell>0.208</cell><cell>1.000</cell><cell>0.214</cell><cell>0.209</cell><cell>0.276</cell><cell>0.001</cell><cell>0.203</cell><cell>0.278</cell></row><row><cell>AR TruLens gpt-3.5-turbo</cell><cell>0.178</cell><cell>0.214</cell><cell>1.000</cell><cell>0.412</cell><cell>0.433</cell><cell>0.181</cell><cell>0.446</cell><cell>0.300</cell></row><row><cell>AR RAGAS gpt-3.5-turbo</cell><cell>0.153</cell><cell>0.209</cell><cell>0.412</cell><cell>1.000</cell><cell>0.463</cell><cell>-0.191</cell><cell>0.608</cell><cell>0.130</cell></row><row><cell>AC RAGAS gpt-3.5-turbo</cell><cell>0.053</cell><cell>0.276</cell><cell>0.433</cell><cell>0.463</cell><cell>1.000</cell><cell>-0.099</cell><cell>0.243</cell><cell>0.255</cell></row><row><cell>AR TruLens gpt-4-turbo</cell><cell>0.280</cell><cell>0.001</cell><cell>0.181</cell><cell>-0.191</cell><cell>-0.099</cell><cell>1.000</cell><cell>-0.009</cell><cell>0.245</cell></row><row><cell>AR RAGAS gpt-4-turbo</cell><cell>0.230</cell><cell>0.203</cell><cell>0.446</cell><cell>0.608</cell><cell>0.243</cell><cell>-0.009</cell><cell>1.000</cell><cell>0.157</cell></row><row><cell>AC RAGAS gpt-4-turbo</cell><cell>0.531</cell><cell>0.278</cell><cell>0.300</cell><cell>0.130</cell><cell>0.255</cell><cell>0.245</cell><cell>0.157</cell><cell>1.000</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Spearman correlations on all datasets</figDesc><table><row><cell>Metrics</cell><cell>Human Judgement</cell><cell>BEM</cell><cell>AR TruLens gpt-3.5-turbo</cell><cell>AR RAGAS gpt-3.5-turbo</cell><cell>AC RAGAS gpt-3.5-turbo</cell><cell>AR TruLens gpt-4-turbo</cell><cell>AR RAGAS gpt-4-turbo</cell><cell>AC RAGAS gpt-4-turbo</cell></row><row><cell>Human Judgement</cell><cell>1.000</cell><cell>0.627</cell><cell>0.423</cell><cell>0.323</cell><cell>0.536</cell><cell>0.314</cell><cell>0.287</cell><cell>0.653</cell></row><row><cell>BEM</cell><cell>0.627</cell><cell>1.000</cell><cell>0.310</cell><cell>0.266</cell><cell>0.654</cell><cell>0.249</cell><cell>0.155</cell><cell>0.711</cell></row><row><cell>AR TruLens gpt-3.5-turbo</cell><cell>0.423</cell><cell>0.310</cell><cell>1.000</cell><cell>0.346</cell><cell>0.303</cell><cell>0.302</cell><cell>0.375</cell><cell>0.302</cell></row><row><cell>AR RAGAS gpt-3.5-turbo</cell><cell>0.323</cell><cell>0.266</cell><cell>0.346</cell><cell>1.000</cell><cell>0.213</cell><cell>0.201</cell><cell>0.682</cell><cell>0.198</cell></row><row><cell>AC RAGAS gpt-3.5-turbo</cell><cell>0.536</cell><cell>0.654</cell><cell>0.303</cell><cell>0.213</cell><cell>1.000</cell><cell>0.208</cell><cell>0.139</cell><cell>0.813</cell></row><row><cell>AR TruLens gpt-4-turbo</cell><cell>0.314</cell><cell>0.249</cell><cell>0.302</cell><cell>0.201</cell><cell>0.208</cell><cell>1.000</cell><cell>0.250</cell><cell>0.187</cell></row><row><cell>AR RAGAS gpt-4-turbo</cell><cell>0.287</cell><cell>0.155</cell><cell>0.375</cell><cell>0.682</cell><cell>0.139</cell><cell>0.250</cell><cell>1.000</cell><cell>0.169</cell></row><row><cell>AC RAGAS gpt-4-turbo</cell><cell>0.653</cell><cell>0.711</cell><cell>0.302</cell><cell>0.198</cell><cell>0.813</cell><cell>0.187</cell><cell>0.169</cell><cell>1.000</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://www.trulens.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://docs.ragas.io/en/latest/concepts/metrics/answer_correctness.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">https://openai.com/blog/new-and-improved-embedding-model</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Retrieval augmented language model pre-training</title>
		<author>
			<persName><forename type="first">K</forename><surname>Guu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pasupat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<ptr target="http://proceedings.mlr.press/v119/guu20a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of Machine Learning Research</title>
				<meeting>Machine Learning Research<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020-07-18">13-18 July 2020. 2020</date>
			<biblScope unit="volume">119</biblScope>
			<biblScope unit="page" from="3929" to="3938" />
		</imprint>
	</monogr>
	<note>Virtual Event</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2007.00814</idno>
		<title level="m">Relevance-guided supervision for openqa with colbert</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.07567</idno>
		<title level="m">Retrieval augmentation reduces hallucination in conversation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Retrieving supporting evidence for generative question answering</title>
		<author>
			<persName><forename type="first">S</forename><surname>Huo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Arabzadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clarke</surname></persName>
		</author>
		<idno type="DOI">10.1145/3624918.3625336</idno>
		<idno>doi:10.1145/3624918.3625336</idno>
		<ptr target="http://dx.doi.org/10.1145/3624918.3625336" />
	</analytic>
	<monogr>
		<title level="m">SIGIR-AP</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="11" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Patil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Stoica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">E</forename><surname>Gonzalez</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.10131</idno>
		<title level="m">Raft: Adapting language model to domain specific rag</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Es</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>James</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espinosa-Anke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Schockaert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.15217</idno>
		<title level="m">Ragas: Automated evaluation of retrieval augmented generation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2401.15391</idno>
		<title level="m">Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ruan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<title level="m">Llm-based nlg evaluation: Current status and challenges</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.16457</idno>
		<title level="m">Retrievalqa: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Katranidis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Barany</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2403.03888</idno>
		<title level="m">Faaf: Facts as a function for the evaluation of rag systems</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Ares: An automated evaluation framework for retrieval-augmented generation systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Saad-Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khattab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Potts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zaharia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09476</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Automatic evaluation of summaries using n-gram co-occurrence statistics</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/N03-1020" />
	</analytic>
	<monogr>
		<title level="m">Human Language Technology Conference of the North American Chapter of the ACL</title>
				<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="150" to="157" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, ACL</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">SQuAD: 100,000+ questions for machine comprehension of text</title>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D16-1264</idno>
		<ptr target="https://aclanthology.org/D16-1264" />
	</analytic>
	<monogr>
		<title level="m">EMNLP, ACL</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Su</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Duh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Carreras</surname></persName>
		</editor>
		<meeting><address><addrLine>Austin, Texas</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2383" to="2392" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.09675</idno>
		<title level="m">Bertscore: Evaluating text generation with bert</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Tomayto, tomahto. Beyond token-level answer equivalence for question answering evaluation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bulian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Buck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Gajewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Boerschinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schuster</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.07654</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">The narrativeqa reading comprehension challenge</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kočiský</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schwarz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Blunsom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">M</forename><surname>Hermann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Melis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1712.07040</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">Bert: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Pearson&apos;s and Spearman&apos;s Correlation</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">F</forename><surname>Weaver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Morales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">L</forename><surname>Dunn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Godde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Weaver</surname></persName>
		</author>
		<idno type="DOI">10.1002/9781119454205.ch10</idno>
		<idno>doi:</idno>
		<ptr target="https://doi.org/10.1002/9781119454205.ch10" />
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>John Wiley and Sons, Ltd</publisher>
			<biblScope unit="page" from="435" to="471" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
