=Paper=
{{Paper
|id=Vol-2950/paper-16
|storemode=property
|title=EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want
|pdfUrl=https://ceur-ws.org/Vol-2950/paper-16.pdf
|volume=Vol-2950
|authors=David P. Sander,Laura Dietz
|dblpUrl=https://dblp.org/rec/conf/desires/SanderD21
}}
==EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want==
David P. Sander¹, Laura Dietz²
¹ Bottomline Technologies, 325 Corporate Dr, Portsmouth, NH 03801, United States of America
² University of New Hampshire, Durham, NH 03824, United States of America

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
Email: dpsander42@gmail.com (D. P. Sander); dietz@cs.unh.edu (L. Dietz)
ORCID: 0000-0001-8508-6357 (D. P. Sander); 0000-0003-1624-3907 (L. Dietz)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Our long-term goal is to develop systems that are forthcoming with information. To be effective, such systems should be allowed to combine retrieval with language generation. To alleviate challenges such systems pose for today's IR evaluation paradigms, we propose EXAM, an evaluation paradigm that uses held-out exam questions and an automated question-answering system to evaluate how well generated responses can answer follow-up questions—without knowing the exam questions in advance.

Keywords
Information Retrieval, Natural Language Generation, Evaluation, Conscious Information Needs

1. Introduction
Often users want to learn about a topic they know very little about. Taylor and Belkin [1, 2] call this a "conscious information need" originating from an "anomalous state of knowledge" where the user knows too little about the topic to ask precise questions. As a result, web search and conversational search systems do not provide a satisfying user experience. Instead, users often turn to Wikipedia. However, depending on the topic, articles may be out-of-date, incomplete, or missing. If this is the case, today's users embark on a journey of exploratory search where they are required to manually compile relevant information from multiple search requests.

Figure 1: Evaluating articles (left) through EXAM questions.

As a remedy, research on interactive information retrieval is developing novel search interfaces [3]. We consider a complementary avenue by aiming to provide the best possible response in a single interaction turn, by compiling an overview for a topic of the user's choice. Within the TREC Complex Answer Retrieval track [4], we aspire to retrieve-and-generate overview articles as found on Wikipedia. The objective of the third year of the track (CAR Y3) is to respond to the query with an article that is composed of existing paragraphs.

We offer a new evaluation based on whether these articles answer obvious follow-up questions. Examples are available in the online appendix (https://www.cs.unh.edu/~dietz/appendix/exam/).

1.1. Vision: Retrieve-and-Generate Systems
To compose a comprehensive overview article, our long-term goal is to develop retrieve-and-generate systems that automatically read the web and organize relevant information. An ideal overview article would educate the user about different aspects of the topic, with the goal of enabling the user to formulate precise questions or search queries. To not waste the user's time, the article should be forthcoming with relevant information that immediately answers obvious follow-up questions without being explicitly asked.

We envision such systems to perform retrieval, content planning, and natural language generation—all while inferring which pieces of information are relevant and how they fit together. We refer to such systems as retrieve-and-generate systems to indicate that retrieval is only the first step in the pipeline, and sources will be further processed—possibly using abstractive summarization or language generation—for presentation to the user.
GPT-3 [5], T5 [6], and other natural language generation (NLG) models offer a promising avenue for generating relevant text in combination with information retrieval (IR) systems. Recent models achieve great results with respect to grammar, flow of arguments, and readability. However, there are concerns whether these generated articles contain the most relevant information [7]. Integrating NLG with retrieval will help to instill the user's trust in the faithfulness of the provided information. The development of retrieve-and-generate systems is hindered by the lack of an accepted evaluation paradigm for a fair comparison.

1.2. Evaluation Challenges
Typical IR evaluation paradigms are based on relevance assessments for texts from a known corpus in order to quantify the relevance of a ranking. The evaluation paradigm is directly applicable to predefined passages or "legal spans" [8, 9]. However, when arbitrary text spans can be retrieved or when retrieved text is modified, this evaluation paradigm needs to be adjusted. A common approach is to predict whether retrieved text is sufficiently similar to assessed text, as in character-MAP [10, 11], passage-ROUGE [12], or BERTScore [13]. In a preliminary study we found BERTScore to be successful when the linguistic style is similar, e.g. both texts are Wikipedia paragraphs. However, when gold articles are linguistically different, BERTScore was found to be less reliable.

An open question is how to develop evaluation methods that can identify relevant information without being affected by the linguistic style in which the information is presented.

1.3. An Alternative Evaluation Approach
In this work, we discuss an alternative evaluation paradigm that directly assesses the usefulness of the retrieved information—instead of documents [14]. In particular, the evaluation paradigm is directly in service of our design goals: to educate the user about a topic they are not familiar with, while preemptively being forthcoming with answers.

We achieve this with a mostly automatic metric (fully automatic once a benchmark is created), called the EXam Answerability Metric (EXAM). EXAM determines the quality of generated text by conducting an exam that assesses the article's suitability for correctly answering a set of query-relevant follow-up questions, as depicted in Figure 1. Like an exam in school, the retrieve-and-generate systems must identify relevant information without knowing the exam questions beforehand. An external Q/A system will attempt to answer the follow-up questions; the more questions that can be answered correctly with the article, the higher the system's quality.

We suggest using exam questions that are relatively obvious follow-up questions to the user's request. For example, when a user provides a query such as "Darwin's Theory of Evolution", the generated comprehensive article should directly answer some reasonable follow-up questions such as "What species of bird did Darwin observe on the Galapagos Islands?" and "Which scientists influenced Darwin's work?" The suggested evaluation paradigm assesses the system's ability to generate query-relevant articles which offer comprehensive information. The goal is to preempt the user with answers to potential follow-up questions, thus alleviating the user from the burden of asking obvious questions that could have been anticipated.

EXAM does not rely on relevance assessments or a fixed corpus. Once a sufficient question bank is created, it can be reused to evaluate future systems without any manual involvement. EXAM can compare retrieval-only systems as well as retrieve-and-generate systems. Since EXAM only assesses the information content, not the information source or document, it is a corpus-independent metric that even allows the comparison of systems that use the open web as a corpus or neural NLG systems.

1.4. The Chicken-and-Egg Problem
The development of novel retrieve-and-generate systems and the development of suitable evaluation paradigms form a chicken-and-egg problem: new systems (the "egg") cannot be studied without an established evaluation, while novel evaluation paradigms (the "chicken") cannot be tested without established retrieve-and-generate systems. With this work we provide the "chicken" by studying the efficacy of the EXAM evaluation metric on retrieval-only systems with respect to an established IR benchmark.

The efficacy study uses systems submitted to the TREC Complex Answer Retrieval track in Year 3, for which a question bank is available through the TQA collection created by the Allen Institute for AI (AI2). We demonstrate that the leaderboard ranking of systems under EXAM correlates highly with the official track evaluation measures based on manual assessments created by the National Institute of Standards and Technology (NIST). In contrast, using a collection of gold articles, we show that the system ranking under ROUGE does not correlate with the manual assessments, despite the fact that the corresponding gold articles contain the right information and obtain a high EXAM score.
Contributions of this paper are as follows.

• Start a discussion on how to evaluate IR systems that further process retrieved raw text.
• Suggest EXAM, an alternative evaluation paradigm to complement existing evaluation strategies.
• Provide a study on TREC CAR Y3, by reusing exam questions from the related TQA data set. We demonstrate a high correlation with traditional IR metrics, even in cases where the linguistic style is too different for ROUGE to work.
• While our motivation arises from conscious information needs, the EXAM paradigm is applicable to many areas of IR, including ad hoc document retrieval and conversational search.

Outline. Section 2 provides an overview of related evaluation approaches. Section 3 introduces our EXAM evaluation paradigm. Section 4 outlines the experimental evaluation and discusses our results, before concluding in Section 5.

Figure 2: Pipeline of retrieve-and-generate systems and evaluation with our proposed evaluation paradigm.

2. Related Work

2.1. Text Summarization
ROUGE [15] is one of the most popular evaluation metrics for text summarization, because the only human involvement is to create a reference summary. ROUGE and related metrics like METEOR [16] use n-gram overlap to quantify the similarity of phrases and vocabulary between two texts—one being the reference. Though ROUGE is commonly used, it has some drawbacks, as suggested by Scialom et al. [17] and Deutsch et al. [18]. Differing word choice between summaries results in lower ROUGE scores. Because of this, two articles by different authors, both about the same topic, could have very low ROUGE scores due to dissimilar word choice. We use ROUGE-1 as a baseline for our evaluation paradigm in Section 4.

Alternatively, some automated metrics use a trained similarity to detect text that is classified as relevant, such as ReVal [19], BERTScore [13], and NUBIA [20].
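To make ROUGE's sensitivity to word choice concrete, the following is a minimal sketch of a unigram-overlap ROUGE-1 F1; it omits the lemmatization, stopword handling, and other preprocessing of full ROUGE implementations and is not the implementation used in this paper.

<pre>
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate text and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Two on-topic sentences with different wording share few unigrams
# beyond "Darwin", "the", and "Galapagos", so the score stays low.
print(rouge1_f1("Darwin studied finches on the Galapagos Islands",
                "The birds of the Galapagos archipelago were observed by Darwin"))
</pre>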
2.2. Summarization Evaluation with Q/A
Eyal et al. [21] compare the quality of generated summaries to a reference summary, using a Q/A system and questions generated from entities in the reference summary. Deutsch et al. [18] focus on automatic question generation from a reference summary. Huang et al. [22] develop a CLOZE-style training signal that automatically derives multiple-choice questions from reference summaries. However, since not all phrases in a reference summary are equally important, such approaches cannot guarantee that the evaluation will test for relevant information.

Doddington [23] suggests evaluating machine translation by conducting an exam with questions. In 2002, these were answered by human annotators (not a Q/A system); however, inconsistencies between annotators led to issues in the evaluation, which was then discontinued. We hypothesize that automatic Q/A systems (albeit not perfect) offer a fair comparison across systems.

2.3. Information Retrieval Evaluation
Information retrieval is commonly evaluated with a pool-based Cranfield-style paradigm, where the top k documents are pooled and manually assessed for relevance [24]. Dietz and Dalton [8] automate manual IR assessment by deriving queries and a passage corpus from existing articles, then assessing passages as relevant when they originated from the corresponding article and/or section. This method does not allow information to deviate from predefined passages.

To evaluate systems that retrieve passages of variable length, Keikha et al. [12] use a ROUGE metric to measure the n-gram overlap with ground-truth passages. An alternative approach is to use a character-wise MAP to award credit for shared character sequences [10, 11].

In this work we discuss an alternative evaluation paradigm that focuses on retrieving information.

3. Approach: EXAM Evaluation
We propose the automatic EXam Answerability Metric (EXAM) for evaluating the usefulness of systems which retrieve and generate relevant articles in response to topical queries. These articles are evaluated based on how many exam questions an automated Q/A system can answer correctly. Our metric does not use relevance judgments or reference summaries. Instead, a benchmark for EXAM consists of a set of queries with follow-up questions that a relevant article should answer. We use it to measure the relevance and completeness of comprehensive articles.

We first introduce our general evaluation approach, then explain the customizations used for evaluating CAR Y3.

3.1. EXAM Evaluation Paradigm
While motivated by conscious information needs, the evaluation paradigm can be applied to most topical IR tasks. Only a suitable bank of exam questions and a Q/A system need to be available.
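As an illustration of the per-query resources the paradigm needs (detailed in Section 3.1.1 below), a benchmark entry could be represented as follows; this is a sketch with assumed class and field names, not a data format defined by the authors.

<pre>
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExamQuestion:
    """A held-out follow-up question with an automatically verifiable answer key."""
    question: str
    choices: List[str]       # e.g. multiple-choice answer options
    correct_choice: int      # index into choices, used for answer verification

@dataclass
class ExamBenchmarkEntry:
    """Everything EXAM needs for one query: the query itself and its question bank."""
    query: str                                        # e.g. "Darwin's Theory of Evolution"
    questions: List[ExamQuestion] = field(default_factory=list)
    gold_article: str = ""                            # optional, only needed for n-EXAM
</pre>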
3.1.1. Resources Required for Evaluation
We hold out the exam questions and do not allow the retrieve-and-generate systems under study to access them. The retrieve-and-generate systems are given:

Queries: Given a free-text user query, such as "Darwin's Theory of Evolution", systems must generate a comprehensive response. Queries can be a simple keyword query or a more complex expression of information needs, such as a conversation prompt, desired sub-topics, or usage contexts.

The systems can access any corpus of their choice. EXAM does not require a predetermined corpus, unlike pool-based evaluations, such as the official TREC assessments [4]. Even corpus-free systems like GPT-3 can be evaluated.

Solely for the evaluation, we require the following resources to be available on a per-query basis:

Exam Questions and Answer Verification: A set of reasonably obvious follow-up questions about the query's topic. Any question style (e.g. multiple-choice, free text, etc.) can be used as long as the underlying Q/A system is trained to answer them and the answer can be automatically verified.

Q/A System: A high-quality Q/A system that is trained to answer exam questions. To be suitable, the Q/A system must use the given article to identify evidence for the question. All systems must be evaluated using the same Q/A system for EXAM scores to be comparable.

The evaluation process will use the above resources as depicted in Figure 2. We suggest using a multiple-choice Q/A system and an answer key to verify correctness. However, many Q/A systems can be used in our paradigm, such as the systems of Choi et al. [25], Nie et al. [26], or Perez et al. [27]. To be suitable, some Q/A systems would need to be customized, e.g. restricting the sentence selector of Min et al. [28].

3.1.2. EXAM Evaluation Scores
Given the queries and corpus, each system will generate one article per query. The exam questions are only used during evaluation: the Q/A system attempts to answer all exam questions based on the content of the generated article. For a query q, we measure the EXAM evaluation score of a generated article d_q from system S over the question bank for the query as

  EXAM(d_q | S) = (number of correct answers in d_q) / (number of exam questions for q)

The EXAM score of each retrieve-and-generate system is computed for each query (and hence generated article), then macro-averaged over multiple queries. Skipped queries are counted as zero score. EXAM awards no credit for unanswered or incorrectly answered questions, as these suggest that the generated article does not contain the right information. (Our focus is not on evaluating the Q/A system, but the usefulness of the generated article to a reader.)

Similar to other proposals, e.g., of nugget-recall [29], EXAM is a recall-oriented evaluation measure. To penalize large amounts of non-relevant information, the article length can be restricted as in TREC CAR Y3, NTCIR One-click [30], or composite retrieval [31].

3.1.3. Normalizing EXAM with Gold Articles
As introduced above, our EXAM score enables relative quality comparison among systems and baselines. However, questions which are too difficult for the Q/A system to answer, or that are irrelevant to the query, could result in an artificially lowered score. To correct for this, we propose a relative-normalized EXAM score that uses human-edited gold articles d*_q which are written to address the query and exam questions. This allows retrieve-and-generate systems to be scored in the context of an expected best-case scenario.

  n-EXAM(S) = Σ_q EXAM(d_q | S) / Σ_q EXAM(d*_q)

Note that if the gold articles contain less information than the generated articles, or are written obtusely, the retrieve-and-generate articles could earn a higher EXAM score than the gold articles, especially if they express the information more clearly. This would result in an n-EXAM score above one. For example, a Q/A system would have difficulty extracting answers from a college textbook used as a gold article, as college-level reading material requires reader inference or significant logical deduction for full comprehension. This is one way that human-written gold articles could receive lower EXAM scores than generated articles sourced from more plainly written corpora.
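A minimal sketch of the scoring defined above, reusing the dataclasses from the earlier sketch and assuming a qa_answers_correctly(article, question) predicate that wraps the Q/A system and answer verification; skipped queries score zero, and n-EXAM divides the summed per-query scores by those of the gold articles.

<pre>
from typing import Callable, Dict, List

def exam_score(article: str,
               questions: List[ExamQuestion],
               qa_answers_correctly: Callable[[str, ExamQuestion], bool]) -> float:
    """EXAM(d_q | S): fraction of exam questions answered correctly from the article."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if qa_answers_correctly(article, q))
    return correct / len(questions)

def exam_macro_average(articles_by_query: Dict[str, str],
                       benchmark: List[ExamBenchmarkEntry],
                       qa_answers_correctly) -> float:
    """Macro-average over queries; queries the system skipped count as zero."""
    per_query = [
        exam_score(articles_by_query.get(entry.query, ""), entry.questions, qa_answers_correctly)
        for entry in benchmark
    ]
    return sum(per_query) / len(per_query)

def n_exam(system_articles: Dict[str, str],
           gold_articles: Dict[str, str],
           benchmark: List[ExamBenchmarkEntry],
           qa_answers_correctly) -> float:
    """n-EXAM(S): summed per-query EXAM of the system divided by that of the gold articles."""
    num = sum(exam_score(system_articles.get(e.query, ""), e.questions, qa_answers_correctly)
              for e in benchmark)
    den = sum(exam_score(gold_articles.get(e.query, ""), e.questions, qa_answers_correctly)
              for e in benchmark)
    return num / den if den > 0 else 0.0
</pre>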
3.2. EXAM Evaluation for CAR Y3
The purpose of the CAR Y3 track is to study retrieval algorithms that respond to complex information needs by synthesizing longer answers that collate retrieved information, mimicking the style of Wikipedia articles. For the shared task to yield a reusable benchmark, participant systems were restricted to use a corpus of five million predefined passages without modifications.

The Textbook Question-Answering (TQA) [32] dataset provides textbook chapters and multiple-choice questions designed for middle school students. In CAR Y3, the queries were taken from titles of textbook chapters, and sub-topics were derived from headings. Sub-topics are used as "nuggets" in the official assessment and were provided to participants. For each query, participants were asked to produce relevant articles by selecting and arranging 20 paragraphs from the provided corpus of Wikipedia paragraphs.

For the example query "Darwin's Theory of Evolution" (tqa2:L_0432), an excerpt of a textbook chapter is depicted in Figure 3. Neither the questions nor the textbook content were available to the CAR Y3 participant systems. The figure depicts an example of how parts of the TQA dataset textbook chapters are used in the CAR Y3 benchmark versus held out for the EXAM evaluation. The connections between CAR Y3, retrieval systems under study, and our proposed evaluation paradigm are depicted in Figure 2.

Figure 3: An example from the TQA dataset used to derive the TREC CAR benchmark as well as data for our EXAM evaluation measure (marked with *). The example is an excerpt from TQA entry L_0432/NDQ_009501.

3.2.1. Reference: Official CAR Y3 Evaluation
For selected queries (see Table 3), NIST assessors provided relevance annotations for all paragraphs in all submitted articles with respect to the query and sub-topics. Participants were encouraged to additionally submit paragraph rankings for each sub-topic. The official CAR Y3 evaluation was based on these rankings and the relevance assessments. (Only manual assessments are available for CAR Y3; the automatic evaluation paradigm was only applicable to CAR Y1.)

3.2.2. Proposed Alternative: EXAM Evaluation
To evaluate the articles of participating systems with EXAM, we require a question bank of exam questions with an answer key. Queries are derived from titles of TQA textbook chapters, which come with multiple-choice questions designed by the book author to test (human) students. We use these multiple-choice questions as a question bank for the EXAM metric to assess the generated articles of participating systems. In particular, we use all provided non-diagram (i.e., not dependent on a picture) questions.

Gold Articles: We use the textbook content from the TQA textbook chapters as gold articles for the queries (also used by the ROUGE baseline as reference summaries). While the EXAM metric does not require gold articles, we report the EXAM score achieved by the gold article for reference and include n-EXAM scores as well. As gold articles and questions were designed for middle school students, many answers are stated in an obtuse way and cannot be answered by simple text matches.

Used Question-Answering System: As a high-quality Q/A system we use the Decomposable Attention Q/A system provided by the organizers of the TQA challenge. The system is trained on the AI2 Reasoning Challenge dataset (ARC) [33]. The model is adapted from Parikh et al. [34], which performs the best on the SNLI [35] dataset, which contains questions similar to TQA questions.

As inputs, the Q/A system requires a text and a set of questions. The Decomposable Attention model searches the text for passages relevant to each question, then extracts answers by constructing an assertion per question and answer choice. Assertions without text support are eliminated, and the most likely assertion under the attention model is returned as the answer. If all assertions are rejected, the question is not answered. Both unanswered and incorrectly answered questions result in a reduced EXAM score.
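The paper relies on the Decomposable Attention Q/A system provided by the TQA organizers; the wrapper below is only a hypothetical stand-in showing how any multiple-choice Q/A model could be adapted to the qa_answers_correctly predicate used in the earlier sketch, with abstention (no supported answer choice) counted as incorrect, as specified above.

<pre>
from typing import List, Optional

class MultipleChoiceQA:
    """Hypothetical wrapper around a multiple-choice Q/A model (for example, a
    Decomposable Attention model trained on ARC). predict() returns the index of
    the chosen answer option, or None if no answer choice is supported by the text."""

    def predict(self, article: str, question: str, choices: List[str]) -> Optional[int]:
        raise NotImplementedError("plug in the actual Q/A model here")

def make_correctness_predicate(qa_model: MultipleChoiceQA):
    """Adapt a multiple-choice Q/A model to the qa_answers_correctly(article, question)
    predicate used by exam_score(). Unanswered questions count as incorrect."""
    def qa_answers_correctly(article: str, question: ExamQuestion) -> bool:
        predicted = qa_model.predict(article, question.question, question.choices)
        return predicted is not None and predicted == question.correct_choice
    return qa_answers_correctly
</pre>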
4. Experimental Evaluation
We empirically evaluate EXAM as described in Section 3.2 using articles generated by the CAR Y3 [4] participant systems.

4.1. Experiment Setup
Due to the chicken-and-egg problem, no established retrieve-and-generate benchmarks exist and no established retrieve-and-generate systems are available. We base the experimental evaluation on sixteen retrieval systems submitted to CAR Y3. We use 131 queries that have a total of 2320 questions in the TQA dataset. We use each query's textbook chapter in the TQA dataset as a gold article. Dataset statistics are summarized in Table 3. Since these systems are not part of our work, we refer to the participants' descriptions of their systems in the TREC Proceedings (https://trec.nist.gov/pubs/trec28/trec2019.html) and the CAR Y3 overview [4].

Table 3: Dataset statistics.
  132   Queries with generated articles
   20   Paragraphs per article per system
  131   Queries with exam questions
 2320   Exam questions
   55   Queries with official TREC CAR assessments
  303   Subtopics with official TREC CAR assessments
 2790   Positively assessed paragraphs

4.1.1. Evaluating the Evaluation Measure
Our goal is to find an alternative evaluation metric that—while mostly automatic—offers the same high quality as a manual assessment conducted by NIST. Hence, our measure of success is to produce a system ranking (i.e., leaderboard) that is highly correlated with the official CAR Y3 leaderboard. Low or anti-correlation suggests that an evaluation measure would not agree with a user's sense of relevance. Correlation of leaderboard rankings is measured in:

Spearman's Rank: High when each system S (of n) has a similar rank position under both leaderboards A, B:
  ρ = 1 − (6 Σ_S (rank_A(S) − rank_B(S))²) / (n (n² − 1))

Kendall's Tau: High when the rank order of many system pairs S1, S2 is preserved (P⁺) versus swapped (P⁻):
  τ = (P⁺ − P⁻) / (P⁺ + P⁻)

Under any evaluation metric some systems obtain a similar evaluation score within standard error. As this is unlikely to indicate a significant difference, we define such system pairs as tied, and thus attribute any score difference to random chance. Therefore, we randomly break ties, which is necessary for Spearman's rank, to produce the leaderboard and compute the rank correlation, repeating the process ten times. Results are presented in Tables 2a and 2b.

Table 2: Rank correlation between the leaderboards of different evaluation measures. Standard errors are below 0.02. Range: −1 to +1, higher is better.

(a) Spearman's rank correlation coefficient.
          EXAM   Prec@R   MAP    nDCG20
ROUGE    -0.09   -0.01   -0.07   -0.01
nDCG20    0.74    0.94    0.95
MAP       0.75    0.94
Prec@R    0.74

(b) Kendall's tau rank correlation coefficient.
          EXAM   Prec@R   MAP    nDCG20
ROUGE    -0.07    0.00   -0.05    0.00
nDCG20    0.57    0.86    0.88
MAP       0.57    0.86
Prec@R    0.56
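A sketch of the leaderboard comparison under stated assumptions: scores within a tie threshold (standing in for the standard error) are treated as interchangeable, near-ties are dissolved by small random jitter as one simple way to break ties at random, and the correlation is averaged over ten repetitions, mirroring the procedure described above. It uses scipy.stats for Spearman's rho and Kendall's tau; the threshold value is an assumed parameter.

<pre>
import random
from scipy.stats import spearmanr, kendalltau

def leaderboard_ranks(scores: dict, tie_threshold: float, rng: random.Random) -> dict:
    """Order systems by score, breaking near-ties (within tie_threshold) randomly."""
    jittered = {system: score + rng.uniform(-tie_threshold, tie_threshold)
                for system, score in scores.items()}
    ordered = sorted(jittered, key=jittered.get, reverse=True)
    return {system: rank for rank, system in enumerate(ordered, start=1)}

def rank_correlation(scores_a: dict, scores_b: dict,
                     tie_threshold: float = 0.01, repetitions: int = 10, seed: int = 0):
    """Average Spearman's rho and Kendall's tau between two leaderboards over
    several random tie-breaks, as in the EXAM vs. official CAR Y3 comparison."""
    rng = random.Random(seed)
    systems = sorted(set(scores_a) & set(scores_b))
    rhos, taus = [], []
    for _ in range(repetitions):
        ranks_a = leaderboard_ranks({s: scores_a[s] for s in systems}, tie_threshold, rng)
        ranks_b = leaderboard_ranks({s: scores_b[s] for s in systems}, tie_threshold, rng)
        a = [ranks_a[s] for s in systems]
        b = [ranks_b[s] for s in systems]
        rhos.append(spearmanr(a, b).correlation)
        taus.append(kendalltau(a, b).correlation)
    return sum(rhos) / len(rhos), sum(taus) / len(taus)
</pre>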
4.1.2. Metrics for System Quality
We study the leaderboard of systems under the following evaluation measures.

EXAM (ours): Our proposed evaluation measure which uses a Q/A system to evaluate generated articles (see Section 3.2).

n-EXAM (ours): A relative-normalized version of EXAM that uses a set of gold articles to contextualize the EXAM score.

Official CAR Y3 Evaluation (reference): Systems in CAR Y3 are evaluated using Precision at R (Prec@R), Mean Average Precision (MAP), and Normalized Discounted Cumulated Gain at rank 20 (nDCG20) as implemented in trec_eval (https://github.com/usnistgov/trec_eval).

ROUGE-1 F1 (baseline): ROUGE evaluates via the similarity between a generated article and a gold article. ROUGE-1 F1 combines precision and recall of predicting words in the summary. Words are lowercased and lemmatized, with punctuation and stopwords removed. We include ROUGE as a baseline evaluation paradigm, because it is fully automated and widely used in NLG.

Table 4: Quality of 16 participating systems submitted to TREC CAR Y3 as measured by our proposed EXAM, official TREC CAR metrics, and ROUGE. Systems are ordered by EXAM score; ranks under other metrics are given in column "#". Standard errors are about 0.01 or less. Systems whose performance is comparable to the gold articles are marked with "⋆".

System              EXAM  n-EXAM  #       Prec@R  MAP   nDCG20  #    ROUGE  #
rerank2-bert        0.17  1.03    1   ⋆   0.22    0.18  0.31    3    0.42   12
dangnt-nlp          0.17  1.02    2   ⋆   0.28    0.25  0.38    1    0.41   13
bert-cknrm-50       0.16  0.99    3   ⋆   0.14    0.11  0.22    12   0.47   2
irit-run2           0.16  0.94    4   ⋆   0.19    0.16  0.27    4    0.45   5
rerank3-bert        0.16  0.94    5   ⋆   0.23    0.20  0.34    2    0.44   8
ict-b-convk         0.16  0.94    6   ⋆   0.19    0.15  0.27    8    0.39   15
irit-run1           0.16  0.93    7   ⋆   0.19    0.16  0.27    4    0.44   7
bm25-populated      0.15  0.93    8       0.18    0.14  0.25    9    0.43   10
unh-tfidf-ptsim     0.15  0.92    9       0.17    0.13  0.23    10   0.43   11
irit-run3           0.15  0.92    10      0.19    0.16  0.27    4    0.44   6
unh-bm25-ecmpsg     0.15  0.88    11      0.17    0.13  0.23    10   0.43   9
ecnu-bm25-1         0.14  0.87    12      0.19    0.15  0.27    7    0.49   1
ict-b-drmmtks       0.13  0.80    13      0.01    0.01  0.01    16   0.24   16
uvabottomupch.      0.09  0.57    14      0.04    0.03  0.06    14   0.45   4
uvabm25rm3          0.09  0.54    15      0.04    0.03  0.06    13   0.45   3
uvabottomup2        0.09  0.53    16      0.03    0.02  0.04    15   0.40   14
Gold articles (⋆)   0.17  1.00    -       -       -     -       -    -      -
4.2. Results
EXAM: Table 4 displays the leaderboard of tested retrieve-and-generate systems ordered by EXAM score. The trend is clear: systems that rank high on the official TREC CAR leaderboard also rank high on the EXAM leaderboard, and systems that rank low on the official leaderboard also rank low on the EXAM leaderboard. For example, the rerank2-bert and dangnt-nlp participant systems are ranked as the top two systems by both the official leaderboard and the EXAM leaderboard. Some systems achieved similar EXAM scores, as they likely produce the same passages ordered differently, which affects the official TREC CAR evaluation but not EXAM. Additionally, due to the setup described in Section 3.2.2, the best systems even slightly surpass the gold articles (see the discussion in Section 4.3). Many systems perform within standard error of the gold articles (marked with ⋆)—these are all similar systems based on the BERT neural language model.

Tables 2a and 2b display the Spearman's and Kendall's rank correlations of the EXAM and the official TREC CAR evaluation. Notably, EXAM achieves a correlation of 0.74 in terms of Spearman's rank correlation and 0.56 in terms of Kendall's Tau; both values can range from -1 to 1. These averages illustrate just how strongly EXAM correlates with official assessments, despite the much different evaluation paradigm. Prec@R, MAP, and nDCG20 use the same relevance assessments and evaluation paradigm. Hence it is not surprising that the correlation within official measures is higher than between EXAM and official measures.

ROUGE: By contrast, the leaderboard according to ROUGE-1 F1 is uncorrelated to the official TREC CAR leaderboard. Ecnu-bm25-1, which has the best ROUGE-1 F1 score, is not in the top five of the official leaderboard. Tables 2a and 2b demonstrate that the rank correlation of ROUGE is near zero across all metrics, which is equivalent to a random ordering.

4.3. Discussion
We discuss advantages and limitations of the paradigm.

Resilience to Q/A system errors: Any real-world Q/A system will make mistakes, most likely causing correct answers contained within the generated article to be missed. Indeed, the Q/A system is unable to correctly answer many questions with the gold article, in part because the article and questions were designed to be a challenge for middle school students. However, despite these Q/A errors, the study demonstrates that EXAM reveals significant quality differences between systems. If our experiment had not been successful, we would not observe any correlation between EXAM and the official CAR Y3 assessments.

Overcoming linguistic differences: We found that when using the gold article with ROUGE-F1, the system ranking does not agree with manual assessments. The issue originates from a difference in linguistic style, as generated articles are constrained to use Wikipedia paragraphs, but the gold articles are sourced from TQA. Hence, it is unlikely that gold articles would use the same phrases as the generated articles—despite both covering the same, relevant topics.
In a previous (unpublished) study on CAR Y2 data we found that ROUGE-F1 [12] obtains a reasonable correlation (Kendall's tau of 0.67, Spearman's rank of 0.67) when using manually assessed relevant paragraphs instead of gold articles. We conclude that ROUGE struggles to overcome the linguistic differences between generated and gold articles.

In contrast, when evaluated under the EXAM measure, the gold article obtains the same score as the best participant system. Given the positive results, we conclude that EXAM is able to overcome the linguistic differences. We believe this is an important finding, as the same issue is likely to arise when a retrieve-and-generate system uses external sources or a fully generative model.

Interpretation of n-EXAM: The n-EXAM metric can exceed 1.0 when the gold article is written obtusely, but the generated article explains relevant facts in accessible language. In our study, gold articles are designed for (human) students to carefully read the text and think about the answer, which is challenging for the Q/A system. In contrast, the submitted retrieval systems were allowed to select content from Wikipedia passages, which are likely to clearly state answers to obvious follow-up questions. Hence, we suggest considering EXAM scores on gold articles as guidance, rather than a gold standard. Similar dataset biases are known from work on multi-hop question answering [36]. We remark that this issue also affects the ROUGE evaluation.

Universality of quality: Our evaluation paradigm is very different from pool-based Cranfield-style evaluations practiced in IR today [24]. Initial concerns that these paradigms evaluate different measures of quality have been ameliorated, as the experimental evaluation demonstrates a high agreement between EXAM and the official CAR Y3 assessments.

Reduced manual effort: Cranfield-style evaluations involve a non-trivial amount of human labor. By contrast, EXAM's human assessors only develop a bank of questions that evaluate the information content of articles. EXAM question banks can be reused in a fully automatic manner, as the Q/A system conducts the exam. While exam questions cannot test all possible useful follow-up questions, we demonstrate that the available question bank is large enough to measure significant differences between systems. To identify how little effort would still yield good results, we spent one hour to manually create ten questions. While error bars are larger, the results still correlate with the official leaderboard. (Study omitted due to space constraints.)

Benchmark reusability and comparability: Many IR benchmarks mandate the use of unmodified elements from a fixed corpus. By contrast, EXAM uses a corpus-independent evaluation, and thus can be applied across different corpora and sources, including the open web or NLG algorithms. Systems using different sources can all be evaluated and compared with each other using the EXAM evaluation.

5. Conclusions and Future Directions
We discuss an evaluation paradigm for retrieve-and-generate systems, which are systems that modify retrieved raw data before presentation. This poses a challenge for today's IR evaluation paradigms. To facilitate empirical research on retrieve-and-generate systems, we discuss an alternative evaluation paradigm, the EXam Answerability Metric (EXAM), that tests whether the system provides relevant information rather than the right documents. EXAM uses a Q/A system and query-specific question banks to evaluate whether the system response is capable of answering some obvious follow-up questions, even without being explicitly asked to do so. We verify that leaderboards under the EXAM evaluation and the manual TREC CAR evaluation agree, with a Spearman's rank correlation of 0.74 and Kendall's tau of 0.56.

EXAM has two benefits over the traditional IR evaluation paradigm: it avoids the need for manual relevance assessments, and it can compare systems that use different (or no) corpora for retrieval. While gold articles and assessments can be used within the EXAM paradigm, at a minimum EXAM only requires humans to curate queries and question banks—the rest of the evaluation is fully automatic. EXAM also has an advantage over the text summarization metric ROUGE [15], as EXAM evaluates documents by the relevance of the information provided, rather than exact wording. This conclusion is in line with findings of Deutsch et al. [18].
While not studied in this work, EXAM could also be used to construct a training signal, as long as the exam questions are not available as inputs to the retrieve-and-generate system.

Our long-term goal is to develop systems to support users who do not (yet) know what exactly they are looking for. We envision a system that synthesizes a comprehensive topical overview by collating retrieved text with post-processing steps like natural language generation. Permitting different linguistic styles and encouraging comprehensiveness renders traditional IR evaluation paradigms very costly. These goals also pose challenges regarding benchmark reuse for a fair comparison across systems. The EXAM evaluation paradigm provides a new avenue for retrieve-and-generate research to evaluate systems by information content.

However, EXAM can also evaluate many other information retrieval tasks: EXAM allows the comparison of ad hoc retrieval from fixed corpora with open-web retrieval. EXAM offers an alternative way to assess redundancy for search result diversification. EXAM can evaluate the information content of each turn of a conversational search system as well as the provided information content over multiple turns. We believe that, in general, evaluation paradigms that, like EXAM, penalize avoidable conversation turns will encourage information systems that are forthcoming with answers.

Acknowledgments
We thank Peter Clark, Ashish Sabharwal, and Tushar Khot from the Allen Institute for AI for their help with the TQA dataset and the provision of the Q/A system, and the UNH TREMA group for their guidance in doing this research.
References
[1] R. S. Taylor, Question-negotiation and information seeking in libraries, College & Research Libraries 29 (1968) 178–194.
[2] N. J. Belkin, Anomalous states of knowledge as a basis for information retrieval, Canadian Journal of Information Science 5 (1980) 133–143.
[3] T. Ruotsalo, J. Peltonen, M. J. Eugster, D. Głowacka, P. Floréen, P. Myllymäki, G. Jacucci, S. Kaski, Interactive intent modeling for exploratory search, ACM Transactions on Information Systems (TOIS) 36 (2018) 1–46.
[4] L. Dietz, J. Foley, TREC CAR Y3: Complex answer retrieval overview, in: Proceedings of the Text REtrieval Conference (TREC), 2019.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[6] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[7] D. Ippolito, D. Duckworth, C. Callison-Burch, D. Eck, Automatic detection of generated text is easiest when humans are fooled, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1808–1822.
[8] L. Dietz, J. Dalton, Humans optional? Automatic large-scale test collections for entity, passage, and entity-passage retrieval, Datenbank-Spektrum (2020) 1–12.
[9] W. R. Hersh, A. M. Cohen, P. M. Roberts, H. K. Rekapalli, TREC 2006 Genomics Track overview, in: TREC, volume 7, 2006, pp. 500–274.
[10] J. Kamps, M. Lalmas, J. Pehcevski, Evaluating relevant in context: Document retrieval with a twist, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2007, pp. 749–750.
[11] C. Wade, J. Allan, Passage retrieval and evaluation, Technical Report, University of Massachusetts Amherst, Center for Intelligent Information Retrieval, 2005.
[12] M. Keikha, J. H. Park, W. B. Croft, Evaluating answer passages using summarization measures, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2014, pp. 963–966.
[13] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675 (2019).
[14] J. Allan, B. Croft, A. Moffat, M. Sanderson, Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012, the second strategic workshop on information retrieval in Lorne, in: ACM SIGIR Forum, volume 46, ACM, New York, NY, USA, 2012, pp. 2–32.
[15] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.
[16] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[17] T. Scialom, S. Lamprier, B. Piwowarski, J. Staiano, Answers unite! Unsupervised metrics for reinforced summarization models, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3246–3256. URL: https://www.aclweb.org/anthology/D19-1320. doi:10.18653/v1/D19-1320.
[18] D. Deutsch, T. Bedrax-Weiss, D. Roth, Towards question-answering as an automatic metric for evaluating the content quality of a summary, arXiv preprint arXiv:2010.00490 (2020).
[19] R. Gupta, C. Orasan, J. van Genabith, ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1066–1072.
[20] H. Kane, M. Y. Kocyigit, A. Abdalla, P. Ajanoh, M. Coulibali, NUBIA: Neural based interchangeability assessor for text generation, arXiv preprint arXiv:2004.14667 (2020).
[21] M. Eyal, T. Baumel, M. Elhadad, Question answering as an automatic evaluation metric for news article summarization, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3938–3948. URL: https://www.aclweb.org/anthology/N19-1395. doi:10.18653/v1/N19-1395.
[22] L. Huang, L. Wu, L. Wang, Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. URL: http://dx.doi.org/10.18653/v1/2020.acl-main.457. doi:10.18653/v1/2020.acl-main.457.
[23] G. Doddington, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, in: Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002, pp. 138–145.
[24] E. M. Voorhees, The evolution of Cranfield, in: Information Retrieval Evaluation in a Changing World, Springer, 2019, pp. 45–69.
[25] E. Choi, D. Hewlett, J. Uszkoreit, I. Polosukhin, A. Lacoste, J. Berant, Coarse-to-fine question answering for long documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 209–220.
[26] P. Nie, Y. Zhang, X. Geng, A. Ramamurthy, L. Song, D. Jiang, DC-BERT: Decoupling question and document for efficient contextual encoding, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020. URL: http://dx.doi.org/10.1145/3397271.3401271. doi:10.1145/3397271.3401271.
[27] E. Perez, P. Lewis, W.-t. Yih, K. Cho, D. Kiela, Unsupervised question decomposition for question answering, arXiv preprint arXiv:2002.09758 (2020).
[28] S. Min, V. Zhong, R. Socher, C. Xiong, Efficient and robust question answering from minimal context over documents, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1725–1735.
[29] J. Lin, Is question answering better than information retrieval? Towards a task-based evaluation framework for question series, in: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, 2007, pp. 212–219.
[30] T. Sakai, M. P. Kato, Y.-I. Song, Overview of NTCIR-9, in: Proceedings of the 9th NTCIR Workshop Meeting, 2011, pp. 1–7.
[31] H. Bota, K. Zhou, J. M. Jose, M. Lalmas, Composite retrieval of heterogeneous web search, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 119–130.
[32] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, H. Hajishirzi, Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5376–5384.
[33] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, arXiv abs/1803.05457 (2018).
[34] A. Parikh, O. Täckström, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2249–2255. URL: https://www.aclweb.org/anthology/D16-1244. doi:10.18653/v1/D16-1244.
[35] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, arXiv preprint arXiv:1508.05326 (2015).
[36] J. Chen, G. Durrett, Understanding dataset design choices for multi-hop reasoning, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4026–4032.