<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">EXAM++: LLM-based Answerability Metrics for IR Evaluation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Naghmeh</forename><surname>Farzi</surname></persName>
							<email>naghmeh.farzi@unh.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">University of New Hampshire</orgName>
								<address>
									<addrLine>33 Academic Way</addrLine>
									<settlement>Durham</settlement>
									<region>NH</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory" key="lab1">LLM4Eval</orgName>
								<orgName type="laboratory" key="lab2">The First Workshop on Large Language Models for Evaluation in Information Retrieval</orgName>
								<orgName type="institution">TREMA-UNH</orgName>
								<address>
									<addrLine>18 July 2024</addrLine>
<settlement>Washington</settlement>
									<region>DC</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Laura</forename><surname>Dietz</surname></persName>
							<email>dietz@cs.unh.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">University of New Hampshire</orgName>
								<address>
									<addrLine>33 Academic Way</addrLine>
									<settlement>Durham</settlement>
									<region>NH</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="laboratory" key="lab1">LLM4Eval</orgName>
								<orgName type="laboratory" key="lab2">The First Workshop on Large Language Models for Evaluation in Information Retrieval</orgName>
								<orgName type="institution">TREMA-UNH</orgName>
								<address>
									<addrLine>18 July 2024</addrLine>
<settlement>Washington</settlement>
									<region>DC</region>
									<country key="US">United States</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">EXAM++: LLM-based Answerability Metrics for IR Evaluation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">70115F04BC91B63FDB68D86AE04C8AAD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>Information Retrieval Evaluation, Large Language Models</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large language models provide an opportunity for reliable and efficient information retrieval evaluation methods. However, current evaluation metrics fall short in accurately assessing the information content of systems' responses-without resorting to expensive human judgments.</p><p>In contrast, the EXAM++ Answerability Metric leverages a bank of query-related exam questions to quantify relevant information content that is covered in the systems' responses. The process involves (1) decomposing the query into detailed questions, (2) checking each for answerability with passages in the system response, and (3) devising evaluation metrics based on this information. Using the TREC Complex Answer Retrieval benchmark, we demonstrate that our LLM-based EXAM++ approach works successfully, outperforming several established baselines. In particular, we take a deep dive into different approaches to determine the answerability of questions in a given passage, including the use of question answering systems with answer verification and self-rated answerability determination.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) can generate and/or retrieve responses for search queries, resulting in many systems that combine traditional retrieval with neural ranking and natural language generation. Ideally, the systems' responses cover relevant information content while being concise and complete. However, there is a need for convincing evaluation metrics to assess the accuracy and completeness of the information content in responses. This should be accomplished in a repeatable and reusable manner and without resorting to expensive human judgments.</p><p>To address this scenario, Sander and Dietz <ref type="bibr" target="#b0">[1]</ref> proposed the EXAM Answerability Metric, which evaluates retrieval/generation systems based on whether they retrieve passages that answer a set of query-specific exam questions. Given a test bank of exam questions, they automate the work-intensive part of scanning each passage for answers using an automated question answering system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Naghmeh Farzi et al. CEUR Workshop Proceedings</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With EXAM++, we significantly expand on Sander's idea by • supporting the development of exam question banks with prompt-based generation, • modernizing the question answering system with the recently released FLAN-T5 family <ref type="bibr" target="#b1">[2]</ref>,</p><p>• exploiting abilities of modern LLMs to determine the answerability of questions, • offering relevance labels that are inter-operable with commonly used evaluation tools (e.g. trec_eval).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A strength of EXAM++ is that, in contrast to other work on LLM-based relevance grading, we can readily integrate humans into the evaluation by having them manage the design of the test bank of exam questions. The test questions should be designed to cover all relevant facets of a query, so that the more questions are addressed, the more relevant a passage is. Based on the long history of classroom education and exam design, we argue it is more natural for human judges to control the design of exam questions than to directly provide relevance judgments.</p><p>By virtue of automating the grading of system responses, human judges are never required to perform passage-level relevance assessments. At the same time, humans are fully in control of defining which information content is relevant via the exam question bank. <ref type="foot" target="#foot_0">1</ref> The evaluation approach yields reusable test collections that can be expanded by modifying the question bank at any point in the evaluation process, as the remaining pipeline is fully automated. The impact of a question bank modification can be directly observed by listing passages whose relevance grade would change.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions.</head><p>In this paper, we provide an in-depth study analyzing different choices of the EXAM++ approach: automatic vs. manual test banks, predicted relevance labels with traditional evaluation metrics vs. coverage-based measures, impacts of fine-tuning vs. prompt engineering, and grading via self-ratings vs. via question answering systems with different answer verification approaches. While EXAM++ is identical to the question-based RUBRIC evaluation method <ref type="bibr" target="#b3">[4]</ref>, in this paper we provide an in-depth comparison of different question-answering approaches. Additionally, we compare to the original EXAM method <ref type="bibr" target="#b0">[1]</ref> and several direct grading prompts <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>We focus on an approach that does not require passage-level relevance judgments or source texts. Our work is unique in this regard, but aspects relate to many active branches of research, which we detail below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">LLM-based Relevance Label Predictors</head><p>In contrast to our approach, several LLM-based evaluation approaches attempt to directly imitate the relevance judgment process.</p><p>Sun et al. <ref type="bibr" target="#b4">[5]</ref> rerank passages using a simple LLM prompt "does the passage answer the query?" Faggioli et al. <ref type="bibr" target="#b5">[6]</ref> conduct an early evaluation experiment by asking an LLM to judge the relevance of a passage. They design a simple prompt and a more elaborate multi-relevance few-shot prompt developed for the TREC Deep Learning track. Thomas et al. <ref type="bibr" target="#b7">[8]</ref> compare the ability of LLMs to perform document-level relevance judgments with that of different groups of human annotators. In their study they use a detailed prompt that instructs the LLM to respond with a multi-level relevance grade. We include several of these prompts in our empirical evaluation. <ref type="foot" target="#foot_1">2</ref> In 1SL, MacAvaney and Soldaini <ref type="bibr" target="#b8">[9]</ref> focus on evaluating passages with a DuoPrompt that instructs an LLM to indicate which of two passages is more relevant for a query. However, several critiques have been raised about using LLMs for producing relevance labels in general. Faggioli et al. <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b9">10]</ref> elaborate a wide range of theoretical concerns, centered on questions of trustworthiness and reliability of LLMs now and in the future. Liu et al. <ref type="bibr" target="#b10">[11]</ref> demonstrate that evaluator-LLMs assign higher scores to systems that use the same LLM model. Wang et al. <ref type="bibr" target="#b11">[12]</ref> empirically demonstrate that LLMs exhibit unfair positional bias towards candidates displayed for evaluation. 
Fok and Weld <ref type="bibr" target="#b12">[13]</ref> study general issues of human over-reliance and under-reliance on LLMs. They elaborate on why rationales produced by LLMs for human verification do not generally lead to improvements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Evaluation with Test Questions</head><p>The idea of anchoring an evaluation on a bank of test questions has been widely discussed in the literature on summarization <ref type="bibr" target="#b13">[14]</ref>, recently with automated question answering methods. Eyal et al. <ref type="bibr" target="#b14">[15]</ref> suggest a system evaluation score that is based on the number of questions that a question answering system can correctly answer using the system response-a principle that both the original EXAM method and our approach follow.</p><p>Many approaches use a Cloze-style approach to generate questions from a given gold summary or source text. Questions can be in the form of multiple-choice questions <ref type="bibr" target="#b15">[16]</ref>, free-text questions with exact-match answer verification <ref type="bibr" target="#b17">[17]</ref>, or be derived from extracted entities and relations <ref type="bibr" target="#b18">[18,</ref><ref type="bibr" target="#b14">15]</ref>.</p><p>As it pertains to information retrieval evaluation, the problem with generating questions from a given source text or gold summary is that (1) such a gold standard is usually not available and (2) it is unclear which of these questions relate to relevant information in the gold summary (or source text).</p><p>The original EXAM method avoids this problem altogether by asking a human to design questions that address the search query. In contrast, we propose to automatically generate questions directly from the query, building on the world-knowledge of GPT <ref type="bibr" target="#b19">[19]</ref>-with the intention of employing manual labor to verify or weight the question set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background: Original EXAM</head><p>The original EXAM method <ref type="bibr" target="#b0">[1]</ref> uses a query-specific test bank of exam questions harvested from school textbooks in the Textbook Question Answering (TQA) dataset <ref type="bibr" target="#b20">[20]</ref>-a dataset from which topics for the TREC CAR Y3 evaluation were derived. Furthermore, they use a custom question answering system that is optimized to answer multiple-choice questions in the style of TQA questions.</p><p>Their approach considers each passage retrieved by a system submitted to TREC CAR Y3 and uses the automated question answering system to extract answers for all test questions. Each of these answers is verified against the answer key for each exam question-tracking correctly answered questions. The system's evaluation score is based on the set of questions that is correctly addressed with any of the top 20 passages-averaged across all queries. The more questions can be answered, the higher the EXAM evaluation score for that system.</p><p>The original EXAM method relies solely on humans to design the exam, with the intent that only a human could identify the core questions that would need to be addressed in a relevant answer. This is in contrast to approaches that generate questions from a gold summary (detailed in Section 2.2), which might lead to questions derived from non-relevant aspects mentioned in relevant text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Approach</head><p>In this work we explore a modernized version of Sander's EXAM Answerability metric, which we call EXAM++. <ref type="foot" target="#foot_2">3</ref> Akin to Sander's method, we use a bank of exam questions to grade systems based on the set of questions that can be correctly answered with information in the system's response. The more questions can be answered with the system's response, and the more passages answer questions well, the higher the EXAM evaluation score of the system. By automating the component that determines the answerability of passages, the evaluation paradigm becomes repeatable and reusable at a reasonable cost. As a result, it can be applied to systems that retrieve passages from a corpus as well as systems that generate content with LLMs.</p><p>Our EXAM++ evaluation system assumes the following inputs: 1. A set of queries, optionally with query subtopics. 2. A set of system responses, which can come in the form of a passage ranking or a set of generated passages. Note that the exam questions are intended to be kept secret from the retrieval/generation system, only to be used for evaluation.</p><p>The EXAM++ evaluation approach is structured into the following three phases that we detail in the remainder of the section and depict in Figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Obtaining an exam question bank:</head><p>A process of creating a test bank of query-specific exam questions. 2. Grading system responses: All passages in system responses are graded using an automated LLM-based system to determine which questions are answerable with the passage content. For each passage, the set of answerable questions is tracked along with grades that represent how relevant, complete, and accurate the provided answer is.</p><p>Figure 1: EXAM++ approach. Left: The system under evaluation retrieves or generates passages 𝑝 ∈ 𝑃 in response to queries 𝑞 (blue). The system does not have access to exam questions. Passages from all systems are pooled for assessment, and additional passages can be added later as new systems are developed.</p><p>Right: The EXAM++ evaluation system uses three phases detailed in Section 4. For each query 𝑞, an exam question bank 𝑅 𝑞 is developed, which can be modified later in an iterative fashion (purple). All passages from the system response (e.g., p1, p2, p3) are graded based on which questions (r1, r2, ..., r5) can be correctly answered with the passage text (red). We support two modes: one where answers are verified against an answer key (depicted as check marks), and one where an LLM self-rates the answerability on a scale from 0 to 5. The EXAM++ evaluation scores are derived from these grades (green).</p><p>The EXAM-Cover score is based on how many questions are covered, via binary verification or a minimum self-rating level. For EXAM-Qrels, a relevance file for trec_eval is derived, based on the coverage or best self-rating obtained by each passage in isolation.</p><p>A human-in-the-loop ensures that grade annotations correlate with relevant passages, and improves the test bank and adjusts the grading system where necessary (cyan). We provide a worked example in Section 5.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXAM evaluation scoring:</head><p>We derive multiple evaluation scores. The more exam questions can be answered well with passages of a system's response, the higher the system's EXAM-Cover score. The more passages address any of the exam questions well, the higher the system's precision-oriented EXAM score. By exporting passage-level relevance labels, any traditional evaluation metric can be incorporated (we refer to this evaluation score as EXAM-Qrels).</p><p>Our contribution differs from the original EXAM method in several important ways:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Obtaining an exam question bank:</head><p>To obtain exam questions,</p><p>• The original EXAM method is based on manually created multiple-choice exam questions. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1. TREC CAR Y3 Question Bank Prompt</head><p>Explore the connection between '{query_title}' with a specific focus on the subtopic '{query_subtopic}'. Generate insightful questions that delve into advanced aspects of '{query_subtopic}', showcasing a deep understanding of the subject matter. Avoid basic or introductory-level inquiries. Give the question set in the following JSON format:</p><formula xml:id="formula_0">```json
{"questions":[question_text_1, question_text_2,...]}
```</formula><p>• We propose to semi-automatically generate free-text questions for each query, as described in Section 4.1.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Grading system responses:</head><p>To grade each passage via the answerability of exam questions,</p><p>• The original EXAM method uses a pre-neural multiple-choice question answering system with answer verification. • First, we modernize the question answering system with an LLM-based approach (Section 4.2.1). • Second, we explore the ability of LLMs to self-rate the answerability of a question with given context, without directly verifying the correctness of answer (Section 4.2.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. EXAM-Cover evaluation:</head><p>To evaluate each IR system,</p><p>• With EXAM-Cover, we follow the original EXAM method by evaluating systems according to the number of answerable exam questions (Section 4.3.1). • To improve adoption, we add a variant "EXAM-Qrels" that implements a related idea so that it is inter-operable with the popular evaluation tool trec_eval (Section 4.3.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Phase 1: EXAM++ Question Banks</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1.">Generating Question Banks</head><p>We use a generative LLM, specifically ChatGPT, to automate the creation of free-text questions <ref type="foot" target="#foot_3">4</ref> that are directly tailored to the needs of information retrieval (IR) tasks and specific domain requirements. This approach allows a larger information need to be broken down into insightful and relevant questions that probe deeply into the nuances of each query, enhancing the depth and quality of the question banks. With application to TREC CAR Y3, a set of open-ended questions 𝑅 𝑞 is generated for each subtopic, via a zero-shot prompt as detailed in Table <ref type="table" target="#tab_0">1</ref>.</p><p>The goal during topic development is to have a human judge ensure that essential information about the query is covered by the question bank, and (if necessary) modify the questions accordingly.</p></div>
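<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a concrete sketch, the JSON-formatted question set requested by the prompt in Table 1 might be parsed as follows. The helper name and the fence-stripping step are our assumptions, not part of the published pipeline:</p><formula>
```python
import json

def parse_question_bank(llm_output: str) -> list:
    """Parse the JSON question set requested by the Table 1 prompt.

    The prompt asks for {"questions": [question_text_1, ...]}; LLM output
    sometimes arrives wrapped in a Markdown code fence, which we strip
    before parsing. (Hypothetical helper, for illustration only.)
    """
    text = llm_output.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening fence line
        text = text.rsplit("```", 1)[0]  # drop the closing fence
    data = json.loads(text)
    return [q.strip() for q in data["questions"] if q.strip()]

# Example LLM reply in the prompt's requested format:
raw = '```json\n{"questions": ["What drives the water cycle?", "How does evaporation work?"]}\n```'
questions = parse_question_bank(raw)
```
</formula><p>A human judge can then review, delete, or reweight the parsed questions during topic development.</p></div>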
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Phase 2. Grading prompts. The question answering prompt extracts an answer to an exam question from the passage, to be verified with word matching or the verification prompt. Alternatively, the answerability can be self-rated by the LLM without explicitly extracting the answer.</p><p>Question Answering Prompt: provide a complete and concise answer to the question based on the context. Question: {question} Context: {context} Optional: Answer Verification Prompt: For the question "{question}" the correct answer is "{correct_answer}". Is "{answer}" an equally correct response to this question? Answer yes or no.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Self-rating Answerability Prompt</head><p>Can the question be answered based on the available context? choose one: -5: The answer is highly relevant, complete, and accurate. -4: The answer is mostly relevant and complete but may have minor gaps or inaccuracies. -3: The answer is partially relevant and complete, with noticeable gaps or inaccuracies.</p><p>-2: The answer has limited relevance and completeness, with significant gaps or inaccuracies.</p><p>-1: The answer is minimally relevant or complete, with substantial shortcomings. -0: The answer is not relevant or complete at all. Question: {question} Context: {context}</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2.">Manual Question Banks</head><p>Alternatively, query-specific question banks 𝑅 𝑞 can be manually constructed from scratch. Optionally this can include a gold answer key for verification, as described in the original EXAM method, which uses such a test bank from the TQA dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Phase 2: Automated EXAM++ Grading</head><p>The grading process leverages a state-of-the-art LLM, such as the FLAN-T5-large <ref type="bibr" target="#b1">[2]</ref> model, chosen to trade off processing speed against the ability to understand complex queries and context. Prompts in Table <ref type="table">2</ref> have been designed for reliable exam grading-especially so that the LLM focuses solely on the provided context rather than relying on its pre-trained knowledge. The LLM is queried separately for each passage to prevent positional biases, ensuring that each answer is contextually derived from the passage to which it corresponds.</p><p>Pre-processing system responses. Before grading, a judgment pool of all retrieved passages is created for efficient processing. Longer system responses are segmented into paragraph-sized passages 𝑃. Each passage is given a unique identifier (passage_id) to ensure that every part of the response can be individually traced throughout the grading process.</p></div>
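<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of how a passage-question pair might be graded with the Table 2 question answering prompt, assuming a HuggingFace text2text-generation pipeline; the function names and the max_new_tokens value are our assumptions:</p><formula>
```python
def format_qa_prompt(question: str, context: str) -> str:
    """Fill the Table 2 question answering prompt for one passage-question pair."""
    return ("provide a complete and concise answer to the question based on "
            f"the context. Question: {question} Context: {context}")

def extract_answer(qa_pipeline, question: str, context: str) -> str:
    """Query the LLM separately for each passage to prevent positional biases.

    `qa_pipeline` is assumed to be a HuggingFace pipeline, e.g.
    transformers.pipeline("text2text-generation", model="google/flan-t5-large").
    """
    out = qa_pipeline(format_qa_prompt(question, context), max_new_tokens=64)
    return out[0]["generated_text"].strip()
```
</formula><p>Invoking the pipeline once per passage-question pair keeps each extracted answer grounded in its own passage, matching the design goal stated above.</p></div>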
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">LLM-based Question Answering with Answer Checking</head><p>For every passage-question pair (𝑝, 𝑟), we ask the LLM to extract a best-effort answer from the passage. We use the prompt in Table <ref type="table">2</ref> (top) with a text-to-text generation pipeline.</p><p>Once answers are extracted, they are verified for correctness against the answer key. The verification process normalizes the correct and predicted answers through lower-casing, stopword removal, and stemming. We then apply a heuristic matching function where a match is considered valid if the edit distance between normalized answers is less than 20% of the length of the longer string.</p><p>Occasionally the LLM will respond with an expression indicating that the question is unanswerable with the provided context. We count the question as incorrectly answered (grade 0) when we encounter an ill-formed answer (such as "a. " or "(iii)") or one of the following expressions: "unanswerable", "no", "no answer", "not enough information", "unknown", "it is not possible to tell", "it does not say", or "no relevant information".</p><p>Variation: SQuAD2 fine-tuning. We study the impacts of fine-tuning the question answering system on the SQuAD2 dataset <ref type="bibr" target="#b21">[21]</ref>. SQuAD2 comprises questions in a similar style to TQA, to be answered in the context of a provided passage. SQuAD2 also includes many training examples where questions are unanswerable with the given context, which is essential for determining the answerability of questions for EXAM++.</p><p>Variation: Answer verification with LLMs. The implementation of the answer verification remains a technical challenge. Noticing that many correct answers are missed because they are phrased differently, we additionally explore asking the LLM to verify the answer match. 
We verify with the prompt in Table <ref type="table">2</ref> (middle) by providing the extracted answer, the gold answer, and the question.</p><p>We manually analyzed the accuracy of this verification step, based on extracted and correct answers. To give an example from TQA exam question L_0016/NDQ_000615 "During very wet times, the water table will... " for which the correct answer is "rise", this LLM-based process identifies additional answers including "increase", "rising", "be higher", "increase substantially" as well as answers that restate the question such as "During very wet times, the water table will rise. "</p></div>
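<div xmlns="http://www.tei-c.org/ns/1.0"><p>The word-based matching heuristic described above can be sketched as follows; the stopword list is a tiny illustrative stand-in, and stemming is omitted for brevity:</p><formula>
```python
STOPWORDS = {"the", "a", "an", "of", "to", "is", "in"}  # tiny illustrative list

def normalize(text: str) -> str:
    """Lower-case and drop stopwords; real stemming (e.g. Porter) is omitted here."""
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def answers_match(predicted: str, gold: str) -> bool:
    """Valid match: edit distance below 20% of the longer normalized string."""
    p, g = normalize(predicted), normalize(gold)
    longer = max(len(p), len(g))
    return longer > 0 and 0.2 * longer > edit_distance(p, g)
```
</formula><p>Under this heuristic, "The water table will rise" matches the gold answer "water table will rise" after normalization, while an unanswerability reply such as "no answer" does not.</p></div>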
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">Grading by Self-rating Answerability</head><p>Given the technical challenges of answer verification, we explore an easier alternative. We use an answerability system introduced as RUBRIC in Farzi and Dietz <ref type="bibr" target="#b3">[4]</ref>, which self-rates whether the passage 𝑝 answers the question 𝑟 ∈ 𝑅 𝑞 , without first extracting the answer.</p><p>Given each passage-question pair, the LLM rates the answerability on a scale from 0 (worst) to 5 (best) using the prompt provided in Table <ref type="table">2</ref>, bottom. In cases where the LLM does not provide a numerical rating, we default to a rating of 1 for answered questions-with the exception of answers that denote unanswerability (as in Section 4.2.1), for which we assign a grade of 0.</p><p>This method enables an autonomous assessment of answerability and relevance, avoiding technical issues of answer verification when there are different ways to phrase a correct answer or when different answers are equally correct. Moreover, this supports the use of open-ended questions for evaluation.</p></div>
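<div xmlns="http://www.tei-c.org/ns/1.0"><p>The self-rating reply can be mapped onto a 0-5 grade along these lines; the regular expression and the exact-string membership test are our assumptions about how the fallback rules might be implemented:</p><formula>
```python
import re

UNANSWERABLE = {"unanswerable", "no", "no answer", "not enough information",
                "unknown", "it is not possible to tell", "it does not say",
                "no relevant information"}

def parse_self_rating(llm_output: str) -> int:
    """Map the LLM's self-rating reply onto a 0-5 grade (Section 4.2.2).

    Replies without a numerical rating default to grade 1, except replies
    that denote unanswerability, which receive grade 0.
    """
    text = llm_output.strip().lower()
    match = re.search(r"\b([0-5])\b", text)
    if match:
        return int(match.group(1))
    return 0 if text in UNANSWERABLE else 1
```
</formula></div>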
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The output of the grading phase is, for each passage-question pair (𝑝, 𝑟), a grade that represents the relevance, completeness, and accuracy with which the question is addressed. The grade is 0 if the passage does not address the question. For question answering with answer verification, the grade is either 1 (if correct) or 0 otherwise. In addition, we track the extracted answer to support manual verification via human judges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Phase 3: EXAM++ Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">EXAM-Cover Evaluation</head><p>We incorporate a coverage-style evaluation metric as suggested by Sander et al. It quantifies the set of exam questions 𝑟 ∈ 𝑅 𝑞 for the query 𝑞 that are covered in retrieved passages 𝑝 ∈ 𝑃 with a minimum grade level 𝜏 , as defined by:</p><formula xml:id="formula_1">EXAM-Cover 𝜏 (𝑃) = (1/|𝑅 𝑞 |) · |⋃ 𝑝∈𝑃 {𝑟 ∈ 𝑅 𝑞 | grade(𝑝, 𝑟) ≥ 𝜏 }|<label>(1)</label></formula><p>To avoid gaming the evaluation metric with a very long system response, the size of the passage set 𝑃 is limited to a fixed budget, e.g. 𝑘 = 20 passages.</p></div>
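<div xmlns="http://www.tei-c.org/ns/1.0"><p>Equation 1 translates directly into code; a sketch, assuming grades are stored as a mapping from (passage_id, question_id) pairs to 0-5 grades (the data layout is our assumption):</p><formula>
```python
def exam_cover(grades: dict, passages: list, question_bank: list,
               tau: int = 5, k: int = 20) -> float:
    """Equation 1: fraction of exam questions answered with grade >= tau
    by any of the top-k passages. `grades` maps (passage_id, question_id)
    to a 0-5 grade.
    """
    top = passages[:k]  # fixed budget to avoid gaming with very long responses
    covered = {r for r in question_bank
               if any(grades.get((p, r), 0) >= tau for p in top)}
    return len(covered) / len(question_bank)

# Worked toy example: p1 answers r1 with grade 5, p2 answers r2 with grade 3.
example_grades = {("p1", "r1"): 5, ("p2", "r2"): 3}
coverage_strict = exam_cover(example_grades, ["p1", "p2"], ["r1", "r2", "r3"], tau=5)
coverage_lenient = exam_cover(example_grades, ["p1", "p2"], ["r1", "r2", "r3"], tau=3)
```
</formula><p>Lowering 𝜏 makes the coverage count more lenient, mirroring the minimum grade level in Equation 1.</p></div>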
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2.">EXAM-Qrels Evaluation</head><p>Alternatively, we provide relevance labels for each passage, facilitating compatibility with traditional IR evaluation metrics, such as those implemented in the trec_eval tool. Passage-level relevance labels are obtained by mapping grades to a binary or multi-graded relevance label,</p><formula xml:id="formula_2">EXAM-Label(𝑝) = max 𝑟∈𝑅 𝑞 grade(𝑝, 𝑟)<label>(2)</label></formula><p>The EXAM-Label makes it possible to use established IR evaluation metrics that incorporate multi-graded relevance labels (such as NDCG), or to choose a minimum grade 𝜏 indicating relevance in order to control the leniency of the evaluation.</p><p>Like all relevance-label-based approaches, the pool of graded passages may impact the evaluation results-therefore, as systems reveal unjudged passages, these should be graded to update the qrel files.</p><p>The downside of this EXAM-Qrels approach is that once a relevance label is determined, the evaluation metric is unaware of which exam questions were covered. To preserve this information, future work should explore integrating EXAM++ with intent-aware evaluation measures such as 𝛼-NDCG <ref type="bibr" target="#b22">[22,</ref><ref type="bibr" target="#b23">23]</ref>.</p><p>Whether EXAM-Cover or EXAM-Qrels is the more appropriate evaluation measure depends on the goals of the information retrieval application. When users are expected to stop after the first relevant passage, we suggest evaluating EXAM-Qrels with mean reciprocal rank. When recall is a priority, we suggest using EXAM-Qrels with R-precision or (mean) average precision (MAP). When the emphasis is on covering diverse facets of relevance, we suggest using EXAM-Cover.</p></div>
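<div xmlns="http://www.tei-c.org/ns/1.0"><p>A sketch of Equation 2 together with the corresponding trec_eval export; the line layout follows the standard qrels convention (query-id, literal 0, passage-id, label), and the function names are ours:</p><formula>
```python
def exam_label(grades: dict, passage: str, question_bank: list) -> int:
    """Equation 2: a passage's relevance label is its best grade
    over all exam questions, considered in isolation."""
    return max(grades.get((passage, r), 0) for r in question_bank)

def qrels_lines(grades: dict, query_id: str, passages: list,
                question_bank: list) -> list:
    """One trec_eval qrels line per passage: 'query_id 0 passage_id label'."""
    return [f"{query_id} 0 {p} {exam_label(grades, p, question_bank)}"
            for p in passages]

# Toy example: passage p1 answers r1 with grade 5 and r2 with grade 2,
# so its exported multi-graded relevance label is 5.
example_qrels = qrels_lines({("p1", "r1"): 5, ("p1", "r2"): 2},
                            "q1", ["p1"], ["r1", "r2"])
```
</formula><p>Writing these lines to a file yields a qrels file that trec_eval can combine with any run file and traditional measure.</p></div>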
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental Evaluation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental Setup</head><p>We experimentally compare variations of our EXAM++ system to the original EXAM method <ref type="bibr" target="#b0">[1]</ref>. The evaluation uses queries, manual TREC judgments, and submitted systems from the third year of the TREC Complex Answer Retrieval track (TREC CAR Y3) <ref type="bibr" target="#b24">[24]</ref>, <ref type="foot" target="#foot_4">5</ref> as these align with manual test questions and results from Sander and Dietz <ref type="bibr" target="#b0">[1]</ref>. Empirical results on other datasets are available in Farzi and Dietz <ref type="bibr" target="#b3">[4]</ref>.</p><p>In experiments with generated question banks, we follow "Phase 1" to obtain ten questions for each of the 721 query-subtopics across 131 queries in CAR Y3. In experiments that use the manual TQA question bank, we use all non-diagram questions with gold answer keys. In preparation for grading (Phase 2), we build a judgment pool of all passages in official judgments and the top 20 of all run submissions-a total of 85,329 passages.</p><p>For question generation we use gpt-3.5-turbo-instruct; for question answering, answer verification, and self-rating we use the FLAN-T5-large model with the text2text-generation pipeline from HuggingFace. <ref type="foot" target="#foot_5">6</ref> We also explore fine-tuning the FLAN-T5-large on the SQuAD2 dataset. The fine-tuned model is available on HuggingFace as sjrhuschlee/flan-t5-large-squad2, to be used with the extractive question-answering pipeline.</p><p>We compare the following variations of our approach. EXAM++: Using our generated question banks and grading with self-ratings (Sections 4.1.1 and 4.2.2). Manual EXAM++: As previous but using manual question banks from the TQA dataset <ref type="bibr" target="#b20">[20]</ref> (Sections 4.1.2 and 4.2.2). 
Manual-EXAM-QA: As previous but grading via question answering using prompts from Table <ref type="table">2</ref> (top), with word-based answer checking (Sections 4.1.2 and 4.2.1). Manual-EXAM-Squad2: As previous but fine-tuning the grading LLM on SQuAD2 and using the question-answering pipeline instead of a prompt. LLM-verified Manual-EXAM-QA &amp; Manual-EXAM-Squad2: Like the two previous but the extracted answers are verified with the FLAN-T5-large LLM using the answer verification prompt from Table <ref type="table">2</ref> (middle). For all these methods we compare both the EXAM-Qrels and the EXAM-Cover evaluation approach. For EXAM-Qrels, we export passage-level EXAM++ relevance labels to be used with trec_eval on traditional evaluation measures. In this experiment we use the measures from the official TREC CAR Y3 evaluation, such as average precision (MAP), normalized discounted cumulative gain (NDCG@20), and R-precision (Rprec).</p><p>We compare to the following reference baselines.</p><p>Original EXAM: Using the results provided by Sander and Dietz <ref type="bibr" target="#b0">[1]</ref>. Significance testing. We perform a standard-error bar overlap test for Figure <ref type="figure" target="#fig_1">2</ref> and only describe significant differences in the text. For leaderboard correlation results in Table <ref type="table">3</ref>, we consider results within ±0.05 as equally good.</p></div>
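The judgment pool described above, the union of all officially judged passages and the top 20 of every submitted run, can be sketched as follows. The data shapes are assumptions for illustration.

```python
# Minimal sketch of Phase-2 pool construction: officially judged
# passages plus the top-k of each submitted run, per query.

def build_pool(judged, runs, k=20):
    """judged: {query_id: set of passage ids with official judgments};
    runs: {run_name: {query_id: ranked list of passage ids}}.
    Returns {query_id: set of passage ids to grade}."""
    pool = {q: set(pids) for q, pids in judged.items()}
    for ranking in runs.values():
        for q, pids in ranking.items():
            pool.setdefault(q, set()).update(pids[:k])
    return pool
```

Because the pool is a union of sets, passages retrieved by several runs are graded only once.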
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Overall Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>EXAM++.</head><p>Each evaluation method gives rise to a leaderboard of systems. Table <ref type="table">3</ref> compares how well each leaderboard correlates with the official TREC leaderboard. Our proposed EXAM++ with minimum grade 𝜏 = 5 obtains the overall best results for EXAM-Qrels. In many cases this approach obtains near-perfect rank correlations above 0.9. For reference, rank correlation statistics range from -1 to +1, with 0 indicating no correlation.</p><p>Table <ref type="table" target="#tab_1">4</ref> presents the inter-annotator agreement between manual TREC judgments and predicted relevance labels. We see that especially the high self-rating grades obtain good agreement (Cohen's kappa of 0.38).</p><p>For leaderboards based on the EXAM-Cover metric, we also obtain strong results with EXAM++, but observe even better results using the manually created question bank (Manual EXAM++). We attribute this to the manual control in question bank design, which selects only vetted questions that represent relevance. We find that some of the generated questions are too broad, promoting systems that provide information that is not sufficiently specific. Future work should focus on adjusting the question bank generation prompt (Table <ref type="table" target="#tab_0">1</ref>) to obtain more focused questions.</p><p>QA + answer verification. Next, we turn to EXAM++ approaches that determine relevance by verifying extracted answers from passages against gold answer keys. We find that verifying extracted answers (Manual-EXAM-QA) obtains results comparable to the self-rated answerability approach (Manual EXAM++) when used to obtain relevance labels. However, it is slightly worse when used with coverage-based metrics.</p><p>In all cases, our proposed approaches outperform the original EXAM method [1] by using a strong LLM-based question answering method instead of a pre-neural question answering system. 
LLM-based relevance label predictors. None of the direct grading prompts described in Section 2.1 work well on the TREC CAR Y3 dataset, in many cases obtaining weak rank correlations below 0.5 (marked in grey). This is in contrast to findings on the TREC DL test collections <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8]</ref>, where these direct grading prompts perform extremely well (both when using GPT <ref type="bibr" target="#b5">[6]</ref> and FLAN-T5-large <ref type="bibr" target="#b3">[4]</ref>). We suspect that these prompts are designed for the unambiguous, narrowly specified question-style queries found in the TREC DL collection ("When did rock'n'roll begin?") but struggle with the broad information needs in the TREC CAR Y3 collection (e.g., "the integumentary system").</p><p>Furthermore, the few-shot examples designed for the TREC DL domain (used in FaggioliB_few, Sun_few) do not generalize to the broad information needs of the TREC CAR Y3 domain. We hope that future research will analyze which of the findings on the DL collection generalize to other information retrieval use cases.</p></div>
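These direct grading prompts are plain templates filled with the query title and passage text; the template wording below follows the FaggioliB prompt listed in Appendix A, while the helper function is a sketch.

```python
# Instantiating the FaggioliB direct grading prompt (Appendix A).
# Template text as quoted in the paper; the helper is illustrative.

FAGGIOLIB_TEMPLATE = (
    "Instruction: Indicate if the passage is relevant for the question. "
    "Respond with 'Yes' or 'No'.\n"
    "Question: {query_title}\n"
    "Passage: {context}\n"
    "Answer:"
)

def fill_prompt(query_title, context):
    """Fill the template for one query/passage pair."""
    return FAGGIOLIB_TEMPLATE.format(query_title=query_title, context=context)
```

For a broad CAR query such as "the integumentary system", the resulting yes/no decision is far less constrained than for a narrow DL question, which may contribute to the weak correlations observed here.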
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Obtained System Leaderboards</head><p>Figure <ref type="figure" target="#fig_1">2</ref> presents the impact of different evaluation methods on how systems are ranked on the leaderboard. We choose three of the best performing evaluation methods, spanning our different options (marked in blue in Table <ref type="table">3</ref>), namely: EXAM++ MAP (grade&gt;=5): Generated question bank, self-rated answerability EXAM++ with EXAM-Qrels, trec_eval using (mean) average precision, relevant grade ≥ 5. Manual EXAM++ Cover (grade&gt;=1): Manual question bank, self-rated answerability EXAM-Cover, relevant grade ≥ 1. Manual-EXAM-Squad2: Manual question bank, using the question answering approach with answer verification on a fine-tuned LLM model, EXAM-Qrels, trec_eval using R-precision. Orig EXAM <ref type="bibr" target="#b0">[1]</ref>: As reported in Sander and Dietz <ref type="bibr" target="#b0">[1]</ref> as "unnormalized", which is akin to EXAM-Cover. Official leaderboard (MAP): Manual TREC judgments, (mean) average precision, relevant grade ≥ 1, as reported in the TREC CAR Y3 overview paper <ref type="bibr" target="#b24">[24]</ref>. To make the system ranking behavior more visible, all systems' evaluation scores are renormalized so that the highest score maps to 1.0 and the lowest to 0.0. Several systems use a similar approach, leading to near-identical scores on all leaderboards (including the official CAR leaderboard).</p><p>We find that all evaluation methods track the official leaderboard. Self-rating-based EXAM++ follows its shape best. However, the higher grade cutoff of 𝜏 = 5 leads to much larger error bars than 𝜏 = 4. We find that the question answering-based method Manual-EXAM-Squad2 is too unspecific, assigning the same high score to two-thirds of all systems.</p><p>We find that coverage-based evaluation with Manual EXAM++ and original EXAM promote some of the low-ranking systems. 
With our experiment it is impossible to say whether this is due to a bias in the official leaderboard (which does not acknowledge coverage) or an issue with the coverage-based evaluation metric. However, the fact that two independent coverage-based implementations agree on assigning ICT-DRMMTKS a higher score suggests that this system might indeed provide good coverage (albeit at lower precision).</p></div>
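The min/max renormalization applied to all leaderboard scores before plotting can be sketched as follows (names are illustrative):

```python
# Min/max renormalization of system scores, as applied for Figure 2:
# the best system maps to 1.0, the worst to 0.0.

def minmax_normalize(scores):
    """Rescale a list of evaluation scores to [0, 1].
    Constant score lists map to all zeros to avoid division by zero."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```

Because the transformation is monotonic, it changes the plotted scale but never the system ranking.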
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Impact of Grade Cutoffs</head><p>While for generated banks of open-ended questions a higher grade cutoff of 𝜏 = 5 obtains stronger results, we observe the opposite for manual question banks taken from the TQA dataset, where a grade cutoff of 𝜏 = 1 produces the best results.</p><p>In general, we remark that the appropriate self-rating levels depend on the difficulty of the question bank. Sander and Dietz remark that questions of the TQA collection are often phrased in an obtuse way, as they are designed to encourage (human) students to closely read the text. As a result, too few passages obtain a high grade for most questions, which results in evaluation scores that do not distinguish between systems.</p><p>For the open-ended questions from our generated question bank, it is generally easier to obtain a high self-rating grade, especially since multiple answers can be considered reasonably relevant.</p><p>Nevertheless, while for EXAM++ the grading cutoff of 𝜏 = 5 obtains slightly better correlations than a cutoff of 𝜏 = 4, the large error bars for cutoff 𝜏 = 5 (cf. Figure <ref type="figure" target="#fig_1">2</ref>) suggest that a lower cutoff might yield a more useful evaluation measure.</p></div>
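The interplay between question difficulty and the grade cutoff can be illustrated with a sketch of EXAM-Cover, taking it as the fraction of questions answered by at least one passage with grade ≥ 𝜏 (cf. Figure 1; names and data shapes are illustrative):

```python
# Sketch of EXAM-Cover with a minimum self-rating cutoff tau.

def exam_cover(best_grades, tau):
    """best_grades: {question_id: best self-rating any passage in the
    system response obtained for that question}. A question counts as
    covered when its best grade reaches tau; the score is the fraction
    of covered questions."""
    if not best_grades:
        return 0.0
    covered = sum(1 for g in best_grades.values() if g >= tau)
    return covered / len(best_grades)
```

With obtuse TQA questions, most best-grades stay low, so a high 𝜏 collapses all systems toward zero, whereas 𝜏 = 1 still separates them; for easier generated questions the opposite holds.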
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Self-rating vs. Answer Verification</head><p>We analyze the set of evaluation approaches that use the manual benchmark, i.e., Manual EXAM++ and the methods under QA + Answer Verification. The best correlation is achieved with self-rating methods on EXAM-Cover, obtaining a Spearman's rank correlation coefficient of 0.959. However, when it is desired to integrate the evaluation into trec_eval, we find that answer verification approaches are strong contenders. In particular, fine-tuning the FLAN-T5-large model on SQuAD2 obtains slightly better results than other methods.</p><p>Given that many correct answers are missed due to a different phrasing, we further explore LLM-based answer verification. However, this adaptation has strong negative effects on leaderboard correlation, in several cases obtaining a rank correlation of less than 0.5. We suspect that LLM-based verification assigns a relevant grade to too many non-relevant passages, resulting in a degradation of the leaderboard.</p></div>
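The phrasing sensitivity of word-based answer checking can be illustrated with a minimal sketch; the paper's exact matching rule may differ.

```python
# Illustrative word-based answer check: the extracted answer counts
# as correct when every (lowercased) token of the gold answer key
# appears among its tokens.

def word_match(gold_answer, extracted_answer):
    """Lenient token-containment check between gold key and extraction."""
    gold = gold_answer.lower().split()
    extracted = set(extracted_answer.lower().split())
    return bool(gold) and all(tok in extracted for tok in gold)
```

Here word_match("epidermis", "the epidermis is the outer layer") succeeds, but the correct paraphrase "outer layer of the skin" fails, which is exactly the gap that motivates LLM-based answer verification.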
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">A Worked Example</head><p>We illustrate our EXAM++ method on an example from the TREC CAR Y3 dataset for query tqa2:L_0384. The passage presented below was retrieved at rank 1 by the dangnt-nlp system and was assessed by TREC judges as 'MUST be mentioned'.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Query title: The Integumentary System</head><p>Query subtopic: Structure of the Skin. Passage ID: b95bf325b7fdacac183b1daf7c118be407f52a3a</p><p>The skin is the largest organ in the human body. Skin is made up of three layers, the epidermis, dermis and the fat layer, also called the hypodermis. The epidermis is the outer layer of skin that keeps vital fluids in and harmful bacteria out of the body. The dermis is the inner layer of skin that contains blood vessels, nerves, hair follicles, oil, and sweat glands. Severe damage to large areas of skin exposes the human organism to dehydration and infections that can result in death. TREC judgment: 3 (MUST be mentioned)</p><p>The TQA question NDQ_007535 "Outer layer of the skin?" was correctly answered as "epidermis" by this passage (highlighted in text). Under the self-rating prompt, FLAN-T5 indicates that this question can be answered in a mostly relevant way but may have minor gaps (self-rated answerability grade of 4).</p><p>A generated exam question, "What are the main components of the epidermis and how do they contribute to the structure of the skin?", was also graded with a self-rating of 4. The corresponding extracted answer is "keeps vital fluids in and harmful bacteria out of the body" (highlighted in text).</p><p>Other generated questions for this query are: 1. What are the different layers of the skin and their respective functions? 2. How does the structure of the skin contribute to its various functions? 3. What is the role of dermal papillae in the structure of the skin? 4. How does the structure of the hypodermis differ from the other layers of the skin? 5. What structural changes occur in the skin due to aging? 6. How does the skin's structure contribute to its role in temperature regulation? 7. What role does the extracellular matrix play in the structure and function of the skin? 8. 
How does the structure of the skin influence its ability to prevent water loss and maintain hydration? 9. What structural adaptations exist in the skin of different animals and how do they serve their specific needs?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>With EXAM++ we propose an alternative evaluation approach that does not merely outsource passage-level relevance determination to LLMs (or human judges). Instead, an exam question bank is created as part of topic development, envisioning that each question addresses an essential piece of information content for the query. As a result, whenever such questions are answerable with responses from a retrieval/generation system, we conclude that the system provides relevant information.</p><p>Using the TREC Complex Answer Retrieval dataset, we demonstrate that (1) our proposed approach can reproduce official TREC leaderboards nearly perfectly; and (2) we outperform several strong LLM-based relevance label predictors <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8]</ref> that were developed in the context of other retrieval benchmarks. In contrast, EXAM++ offers a clear path towards integrating a human-in-the-loop by supporting the refinement of the exam question banks as a means for humans to define relevance.</p><p>We believe that more research will improve question bank generation and LLM-based grading. Future work should study effects on the quality, cost, and satisfaction of human judges working with the EXAM++ approach in our Autograde software <ref type="bibr" target="#b2">[3]</ref>.</p><p>We hope that by integrating the EXAM++ evaluation metric with trec_eval, we offer a system that can be easily adopted by future IR evaluation tracks, offering organizers an avenue to reduce assessment costs and obtain reusable test collections for generative information systems. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: EXAM++ approach. 
Left: The system under evaluation retrieves or generates passages 𝑝 ∈ 𝑃 in response to queries 𝑞 (blue). The system does not have access to exam questions. Passages from all systems will be pooled for assessment, and additional passages can be added later as new systems are developed. Right: The EXAM++ evaluation system uses three phases detailed in Section 4. For each query 𝑞, an exam question bank 𝑅_𝑞 is developed, which can be modified later in an iterative fashion (purple). All passages from the system response (e.g., p1, p2, p3) are graded based on which questions (r1, r2, ..., r5) can be correctly answered with the passage text (red). We support two modes: one where answers are verified against an answer key (depicted as check marks), and one where an LLM self-rates the answerability on a scale from 0 to 5. The EXAM++ evaluation scores are derived from these grades (green). The EXAM-Cover score is based on how many questions are covered, as binary verification or via a minimum self-rating level. For EXAM-Qrels a relevance file for trec_eval is derived, which is based on the coverage or best self-rating obtained by this passage in isolation. A human-in-the-loop ensures that grade annotations correlate with relevant passages, improves the question bank in response, and adjusts the grading system where necessary (cyan). We provide a worked example in Section 5.6.</figDesc><graphic coords="5,93.46,84.17,412.52,237.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The goal is for leaderboards under different evaluation methods to track the official TREC CAR Y3 leaderboard. Selection of leaderboards of systems submitted to TREC CAR Y3 under three of the best correlating EXAM++ measures according to Table 3, with the official leaderboard and Sander's original EXAM method (as available in Table 4 of their paper <ref type="bibr" target="#b0">[1]</ref>). Systems are ordered by the official leaderboard (MAP). To make the general behavior more visible, we use a min/max re-normalization of all evaluation scores. Standard error bars are adjusted accordingly. We confirm that all methods roughly follow the official leaderboard; some submitted systems are very similar, leading to similar evaluation scores under both the official leaderboard and our evaluation methods. Some leaderboards (e.g., Manual-EXAM-Squad2 Rprec) are not able to detect differences in performance.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Phase 1. Question bank generation prompt.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 4</head><label>4</label><figDesc>Grade/judgment agreement for EXAM++: agreement between manual TREC judgments and our predicted relevance labels. We provide count statistics and Cohen's 𝜅 inter-annotator agreement statistics. Since Sander's work demonstrated that ROUGE metrics are uncorrelated with leaderboard rankings, we omit the comparison here.</figDesc><table><row><cell>Grade</cell><cell>Judgments 1-3</cell><cell>Judgments ≤0</cell><cell>Total</cell><cell>Cohen's 𝜅</cell></row><row><cell>4-5</cell><cell>1910</cell><cell>1117</cell><cell>3027</cell><cell>0.38</cell></row><row><cell>0-3</cell><cell>880</cell><cell>2445</cell><cell>3325</cell><cell>0.37</cell></row></table></figure>
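Cohen's 𝜅, as reported in Table 4, corrects the observed agreement between predicted labels and manual judgments for the agreement expected by chance. For a 2×2 agreement table it can be computed with this generic sketch (not tied to the paper's code):

```python
# Cohen's kappa for a 2x2 agreement table between two annotators
# (here: predicted relevance labels vs. manual judgments).

def cohens_kappa(both_rel, pred_only, judge_only, both_non):
    """both_rel/both_non: both annotators agree (relevant/non-relevant);
    pred_only/judge_only: the two kinds of disagreement."""
    n = both_rel + pred_only + judge_only + both_non
    p_obs = (both_rel + both_non) / n
    p_chance = ((both_rel + pred_only) * (both_rel + judge_only)
                + (judge_only + both_non) * (pred_only + both_non)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance)
```

Perfect agreement yields 𝜅 = 1, while agreement at chance level yields 𝜅 = 0; values near 0.4, as in Table 4, indicate moderate agreement.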
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head></head><label></label><figDesc>HELM [7]: Prompt designed for evaluating LLMs on information retrieval: Instruction: Does the passage answer the query? Respond with 'Yes' or 'No'. Question: {query_title} Passage: {context} Answer: Sun [5]: Prompt designed for question-style queries: Instruction: Given a passage and a query, predict whether the passage includes an answer to the query by producing either "Yes" or "No". Question: {query_title} Passage: {context} Answer: FaggioliB_few [6]: Prompt FaggioliB with additional few-shot examples from the TREC DL collection: Instruction: Indicate if the passage is relevant for the question. Respond with 'Yes' or 'No'. Passage: Its 25 drops per ml, you guys are all wrong. If it is water, the standard was changed 15 -20 years ago to make 20 drops = 1mL. The viscosity of most things is temperature dependent, so this would be at room temperature. Hope this helps. Question: how many eye drops per ml Answer: Yes Passage: RE: How many eyedrops are there in a 10 ml bottle of Cosopt? My Kaiser pharmacy insists that 2 bottles should last me 100 days but I run out way before that time when I am using 4 drops per day. In the past other pharmacies have given me 3 10-ml bottles for 100 days. E: How many eyedrops are there in a 10 ml bottle of Cosopt? My Kaiser pharmacy insists that 2 bottles should last me 100 days but I run out way before that time when I am using 4 drops per day. Question: how many eye drops per ml Answer: No Passage: You can transfer money to your checking account from other Wells Fargo. accounts through Wells Fargo Mobile Banking with the mobile app, online, at any. Wells Fargo ATM, or at a Wells Fargo branch. 1 Money indeposits. Question: can you open a wells fargo account online Answer: No Passage: You can open a Wells Fargo banking account from your home or even online. It is really easy to do, provided you have all of the appropriate documentation. Wells Fargo has so many bank account options that you will be sure to find one that works for you. They offer free checking accounts with free online banking. Question: can you open a wells fargo account online Answer: Yes Question: {query_title} Passage: {context} Answer: Sun_few [5]: Prompt Sun with the same few-shot examples as FaggioliB_few.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">We recently released a resource to support human judges in supervising this process<ref type="bibr" target="#b2">[3]</ref> https://github.com/TREMA-UNH/rubric-grading-workbench.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">All baseline prompts are provided in our online appendix.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">An implementation of EXAM++ is available in the Autograding Workbench<ref type="bibr" target="#b2">[3]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">This step is identical to generating question-based RUBRICs in Farzi and Dietz<ref type="bibr" target="#b3">[4]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The TREC CAR Y3 test set benchmarkY3test is available at http://trec-car.cs.unh.edu/datareleases/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://huggingface.co/google/flan-t5-large</note>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Rank correlations of each evaluation method, for different minimum grades 𝜏, with the official TREC CAR Y3 leaderboard. S: Spearman's rank correlation. K: Kendall's Tau correlation. Best evaluation method in bold-italics. Equally good methods within (±0.05) marked in bold. Poor methods (obtaining less than 0.5) marked in grey. Leaderboards of selected methods (marked in blue) are presented in Figure <ref type="figure">2</ref>. In one case, all systems obtained a perfect score (marked with †); therefore the rank correlation cannot be computed. FaggioliB, Sun, HELM, Thomas: Using the same FLAN-T5-large LLM as for EXAM++ but obtaining relevance labels by directly asking whether a passage is relevant for a query. We use a set of established prompts <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8]</ref>, listed in Appendix A. FaggioliB_few, Sun_few: As previous but using few-shot prompts suggested for the TREC Deep Learning track <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> to test their generalizability.</p><p>We measure the quality of our evaluation paradigm in two ways: Leaderboard rank-correlation: The leaderboard of systems under the EXAM-Cover and EXAM-Qrels metric should be similar to the official TREC CAR Y3 leaderboard. This similarity is evaluated with two rank correlation measures: Spearman's rank correlation coefficient, which measures differences of a system's rank on the leaderboard, and Kendall's 𝜏 rank correlation, which penalizes swaps of two systems on the leaderboard. Inter-annotator agreement: High passage-level agreement between official judgments and our predicted relevance labels.</p></div>
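The two rank correlation measures used to compare leaderboards can be sketched in a few lines, assuming no ties (production code would use a statistics library):

```python
# Sketch of Spearman's rank correlation and Kendall's tau between two
# leaderboards given as per-system score lists (no ties assumed).

def ranks(scores):
    """Rank positions (1 = best score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(x, y):
    """Spearman's rank correlation via the rank-difference formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = sum(
        (1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1)
        for i in range(n) for j in range(i + 1, n)
    )
    return s / (n * (n - 1) / 2)
```

Spearman penalizes large rank displacements quadratically, while Kendall's tau counts each pairwise swap equally, which is why both are reported in Table 3.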
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Appendix: Relevance Label Predictor Prompts</head><p>Thomas <ref type="bibr" target="#b7">[8]</ref>: As the full prompt exceeds the token limit, we use the following abridged prompt used in citing work: Instruction: You are a search quality rater evaluating the relevance of passages. Given a query and a passage, you must provide a score on an integer scale of 0 to 2 with the following meanings: 2 = highly relevant, very helpful for this query 1 = relevant, may be partly helpful but might contain other irrelevant content 0 = not relevant, should never be shown for this query Question: {query_title} Passage: {context} Answer: FaggioliB [6]: Prompt designed for TREC DL:</p><p>Instruction: Indicate if the passage is relevant for the question. Respond with 'Yes' or 'No'. Question: {query_title} Passage: {context} Answer:</p></div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">EXAM: How to evaluate retrieve-and-generate systems for users who do not (yet) know what they want</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">DESIRES</title>
		<imprint>
			<biblScope unit="page" from="136" to="146" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Vu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Webson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.13688</idno>
		<title level="m">The flan collection: Designing data and methods for effective instruction tuning</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A workbench for autograding retrieve/generate systems</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<idno type="DOI">10.1145/3626772.3657871</idno>
		<ptr target="https://doi.org/10.1145/3626772.3657871" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR &apos;24) -Resource and Reproducibility Papers</title>
				<meeting>the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR &apos;24) -Resource and Reproducibility Papers</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Pencils down! Automatic rubric-based evaluation of retrieve/generate systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Farzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on the Theory of Information Retrieval</title>
				<meeting>the International Conference on the Theory of Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ren</surname></persName>
		</author>
		<idno>arXiv-2304</idno>
		<title level="m">Is ChatGPT good at search? Investigating large language models as re-ranking agents</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Perspectives on large language models for relevance judgment</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Demartini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</title>
				<meeting>the 2023 ACM SIGIR International Conference on Theory of Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="39" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Tsipras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Soylu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2211.09110</idno>
		<title level="m">Holistic evaluation of language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Large language models can accurately predict searcher preferences</title>
		<author>
			<persName><forename type="first">P</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spielman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2309.10621</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>MacAvaney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Soldaini</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.11266</idno>
		<title level="m">One-shot labeling for automatic relevance estimation</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Who determines what is relevant? Humans or AI? Why not both? A spectrum of human-AI collaboration in assessing relevance</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Demartini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">S</forename><surname>Moosavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.09766</idno>
		<title level="m">LLMs as narcissistic evaluators: When ego inflates evaluation scores</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Sui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.17926</idno>
		<title level="m">Large language models are not fair evaluators</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Fok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.07722</idno>
		<title level="m">In search of verifiability: Explanations rarely enable complementary performance in AI-advised decision making</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Discourse constraints for document compression</title>
		<author>
			<persName><forename type="first">J</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Question answering as an automatic evaluation metric for news article summarization</title>
		<author>
			<persName><forename type="first">M</forename><surname>Eyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Baumel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Elhadad</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1395</idno>
		<ptr target="https://www.aclweb.org/anthology/N19-1395" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3938" to="3948" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward</title>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.457</idno>
		<ptr target="http://dx.doi.org/10.18653/v1/2020.acl-main.457" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>


<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Deutsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bedrax-Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.00490</idno>
		<title level="m">Towards question-answering as an automatic metric for evaluating the content quality of a summary</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Asking and answering questions to evaluate the factual consistency of summaries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.450</idno>
		<ptr target="https://www.aclweb.org/anthology/2020.acl-main.450" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5008" to="5020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kembhavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Seo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Farhadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5376" to="5384" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Rajpurkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lopyrev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<idno>CoRR abs/1606.05250</idno>
		<ptr target="http://arxiv.org/abs/1606.05250" />
		<title level="m">SQuAD: 100,000+ questions for machine comprehension of text</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Novelty and diversity in information retrieval evaluation</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Clarke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kolla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Cormack</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vechtomova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ashkan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Büttcher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>MacKinnon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 31st annual international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="659" to="666" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Overview of NTCIR-9</title>
		<author>
			<persName><forename type="first">T</forename><surname>Sakai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-I</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th NTCIR Workshop Meeting</title>
				<meeting>the 9th NTCIR Workshop Meeting</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1" to="7" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">TREC CAR Y3: Complex Answer Retrieval overview</title>
		<author>
			<persName><forename type="first">L</forename><surname>Dietz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Foley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of Text REtrieval Conference (TREC)</title>
				<meeting>Text REtrieval Conference (TREC)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
