=Paper=
{{Paper
|id=Vol-2950/paper-16
|storemode=property
|title=EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want
|pdfUrl=https://ceur-ws.org/Vol-2950/paper-16.pdf
|volume=Vol-2950
|authors=David P. Sander,Laura Dietz
|dblpUrl=https://dblp.org/rec/conf/desires/SanderD21
}}
==EXAM: How to Evaluate Retrieve-and-Generate Systems for Users Who Do Not (Yet) Know What They Want==
David P. Sander¹, Laura Dietz²
¹ Bottomline Technologies, 325 Corporate Dr, Portsmouth, NH 03801, United States of America
² University of New Hampshire, Durham, NH 03824, United States of America

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
Email: dpsander42@gmail.com (D. P. Sander); dietz@cs.unh.edu (L. Dietz)
ORCID: 0000-0001-8508-6357 (D. P. Sander); 0000-0003-1624-3907 (L. Dietz)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Our long-term goal is to develop systems that are forthcoming with information. To be effective, such systems should be allowed to combine retrieval with language generation. To alleviate challenges such systems pose for today's IR evaluation paradigms, we propose EXAM, an evaluation paradigm that uses held-out exam questions and an automated question-answering system to evaluate how well generated responses can answer follow-up questions—without knowing the exam questions in advance.

Keywords
Information Retrieval, Natural Language Generation, Evaluation, Conscious Information Needs

1. Introduction
Often users want to learn about a topic they know very little about. Taylor and Belkin [1, 2] call this a "conscious information need" originating from an "anomalous state of knowledge" where the user knows too little about the topic to ask precise questions. As a result, web search and conversational search systems do not provide a satisfying user experience. Instead, users often turn to Wikipedia. However, depending on the topic, articles may be out-of-date, incomplete, or missing. If this is the case, today's users embark on a journey of exploratory search where they are required to manually compile relevant information from multiple search requests.

Figure 1: Evaluating articles (left) through EXAM questions.

As a remedy, research on interactive information retrieval is developing novel search interfaces [3]. We consider a complementary avenue by aiming to provide the best possible response in a single interaction turn, by compiling an overview for a topic of the user's choice. Within the TREC Complex Answer Retrieval track [4], we aspire to retrieve-and-generate overview articles as found on Wikipedia. The objective of the third year of the track (CAR Y3) is to respond to the query with an article that is composed of existing paragraphs.

We offer a new evaluation based on whether these articles answer obvious follow-up questions. Examples are available in the online appendix (https://www.cs.unh.edu/~dietz/appendix/exam/).

1.1. Vision: Retrieve-and-Generate Systems
To compose a comprehensive overview article, our long-term goal is to develop retrieve-and-generate systems that automatically read the web and organize relevant information. An ideal overview article would educate the user about different aspects of the topic, with the goal of enabling the user to formulate precise questions or search queries. To not waste the user's time, the article should be forthcoming with relevant information that immediately answers obvious follow-up questions without being explicitly asked.

We envision such systems to perform retrieval, content planning, and natural language generation—all while inferring which pieces of information are relevant and how they fit together. We refer to such systems as retrieve-and-generate systems to indicate that retrieval is only the first step in the pipeline, and sources will be further processed—possibly using abstractive summarization or language generation—for presentation to the user.
GPT-3 [5], T5 [6], and other natural language generation (NLG) models offer a promising avenue for generating relevant text in combination with information retrieval (IR) systems. Recent models achieve great results with respect to grammar, flow of arguments, and readability. However, there are concerns whether these generated articles contain the most relevant information [7]. Integrating NLG with retrieval will help to instill the user's trust in the faithfulness of the provided information. The development of retrieve-and-generate systems is hindered by the lack of an accepted evaluation paradigm for a fair comparison.

1.2. Evaluation Challenges
Typical IR evaluation paradigms are based on relevance assessments for texts from a known corpus in order to quantify the relevance of a ranking. The evaluation paradigm is directly applicable to predefined passages or "legal spans" [8, 9]. However, when arbitrary text spans can be retrieved or when retrieved text is modified, this evaluation paradigm needs to be adjusted. A common approach is to predict whether retrieved text is sufficiently similar to assessed text, as in character-MAP [10, 11], passage-ROUGE [12], or BERTScore [13]. In a preliminary study we found BERTScore to be successful when the linguistic style is similar, e.g. both texts are Wikipedia paragraphs. However, when gold articles are linguistically different, BERTScore was found to be less reliable.

An open question is how to develop evaluation methods that can identify relevant information without being affected by the linguistic style in which the information is presented.

1.3. An Alternative Evaluation Approach
In this work, we discuss an alternative evaluation paradigm that directly assesses the usefulness of the retrieved information—instead of documents [14]. In particular, the evaluation paradigm is directly in service of our design goals: to educate the user about a topic they are not familiar with, while preemptively being forthcoming with answers.

We achieve this with a mostly automatic metric (fully automatic once a benchmark is created), called the EXam Answerability Metric (EXAM). EXAM determines the quality of generated text by conducting an exam that assesses the article's suitability for correctly answering a set of query-relevant follow-up questions, as depicted in Figure 1. Like an exam in school, the retrieve-and-generate systems must identify relevant information without knowing the exam questions beforehand. An external Q/A system will attempt to answer the follow-up questions; the more questions that can be answered correctly with the article, the higher the system's quality.

We suggest using exam questions that are relatively obvious follow-up questions to the user's request. For example, when a user provides a query such as "Darwin's Theory of Evolution", the generated comprehensive article should directly answer some reasonable follow-up questions such as "What species of bird did Darwin observe on the Galapagos Islands?" and "Which scientists influenced Darwin's work?" The suggested evaluation paradigm assesses the system's ability to generate query-relevant articles which offer comprehensive information. The goal is to preempt the user with answers to potential follow-up questions, thus alleviating the user from the burden of asking obvious questions that could have been anticipated.

EXAM does not rely on relevance assessments or a fixed corpus. Once a sufficient question bank is created, it can be reused to evaluate future systems without any manual involvement. EXAM can compare retrieval-only systems as well as retrieve-and-generate systems. Since EXAM only assesses the information content, not the information source or document, it is a corpus-independent metric that even allows the comparison of systems that use the open web as a corpus or neural NLG systems.

1.4. The Chicken-and-Egg Problem
The development of novel retrieve-and-generate systems and the development of suitable evaluation paradigms form a chicken-and-egg problem: new systems (the "egg") cannot be studied without an established evaluation, while novel evaluation paradigms (the "chicken") cannot be tested without established retrieve-and-generate systems. With this work we provide the "chicken" by studying the efficacy of the EXAM evaluation metric on retrieval-only systems with respect to an established IR benchmark.

The efficacy study uses systems submitted to the TREC Complex Answer Retrieval track in Year 3, for which a question bank is available through the TQA collection created by the Allen Institute for AI (AI2). We demonstrate that the leaderboard ranking of systems under EXAM correlates highly with the official track evaluation measures based on manual assessments created by the National Institute of Standards and Technology (NIST). In contrast, using a collection of gold articles, we show that the system ranking under ROUGE does not correlate with the manual assessments, despite the fact that the corresponding gold articles contain the right information and obtain a high EXAM score.
Contributions of this paper are as follows.

• Start a discussion on how to evaluate IR systems that further process retrieved raw text.
• Suggest EXAM, an alternative evaluation paradigm to complement existing evaluation strategies.
• Provide a study on TREC CAR Y3, by reusing exam questions from the related TQA data set. We demonstrate a high correlation with traditional IR metrics, even in cases where the linguistic style is too different for ROUGE to work.
• While our motivation arises from conscious information needs, the EXAM paradigm is applicable to many areas of IR, including ad hoc document retrieval and conversational search.

Outline. Section 2 provides an overview of related evaluation approaches. Section 3 introduces our EXAM evaluation paradigm. Section 4 outlines the experimental evaluation and discusses our results, before concluding in Section 5.

Figure 2: Pipeline of retrieve-and-generate systems and evaluation with our proposed evaluation paradigm.

2. Related Work

2.1. Text Summarization
ROUGE [15] is one of the most popular evaluation metrics for text summarization, because the only human involvement is to create a reference summary. ROUGE and related metrics like METEOR [16] use n-gram overlap to quantify the similarity of phrases and vocabulary between two texts—one being the reference. Though ROUGE is commonly used, it has some drawbacks, as suggested by Scialom et al. [17] and Deutsch et al. [18]. Differing word choice between summaries results in lower ROUGE scores. Because of this, two articles by different authors, both about the same topic, could have very low ROUGE scores due to dissimilar word choice. We use ROUGE-1 as a baseline for our evaluation paradigm in Section 4.

Alternatively, some automated metrics use a trained similarity to detect text that is classified as relevant, such as ReVal [19], BERTScore [13], and NUBIA [20].
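To make ROUGE's sensitivity to word choice concrete, the following is a minimal sketch of a unigram-overlap ROUGE-1 F1; it omits the lemmatization, stopword handling, and other preprocessing of full ROUGE implementations and is not the implementation used in this paper.

<pre>
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate text and a reference text."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Two on-topic sentences with different wording share few unigrams
# beyond "Darwin", "the", and "Galapagos", so the score stays low.
print(rouge1_f1("Darwin studied finches on the Galapagos Islands",
                "The birds of the Galapagos archipelago were observed by Darwin"))
</pre>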
2.2. Summarization Evaluation with Q/A
Eyal et al. [21] compare the quality of generated summaries to a reference summary, using a Q/A system and questions generated from entities in the reference summary. Deutsch et al. [18] focus on automatic question generation from a reference summary. Huang et al. [22] develop a CLOZE-style training signal that automatically derives multiple-choice questions from reference summaries. However, since not all phrases in a reference summary are equally important, such approaches cannot guarantee that the evaluation will test for relevant information.

Doddington [23] suggests evaluating machine translation by conducting an exam with questions. In 2002, these were answered by human annotators (not a Q/A system); however, inconsistencies between annotators led to issues in the evaluation, which was then discontinued. We hypothesize that automatic Q/A systems (albeit not perfect) offer a fair comparison across systems.

2.3. Information Retrieval Evaluation
Information retrieval is commonly evaluated with a pool-based Cranfield-style paradigm, where the top k documents are pooled and manually assessed for relevance [24]. Dietz and Dalton [8] automate manual IR assessment by deriving queries and a passage corpus from existing articles, then assessing passages as relevant when they originated from the corresponding article and/or section. This method does not allow information to deviate from predefined passages.

To evaluate systems that retrieve passages of variable length, Keikha et al. [12] use a ROUGE metric to measure the n-gram overlap with ground-truth passages. An alternative approach is to use a character-wise MAP to award credit for shared character sequences [10, 11].

In this work we discuss an alternative evaluation paradigm that focuses on retrieving information.

3. Approach: EXAM Evaluation
We propose the automatic EXam Answerability Metric (EXAM) for evaluating the usefulness of systems which retrieve and generate relevant articles in response to topical queries. These articles are evaluated based on how many exam questions an automated Q/A system can answer correctly. Our metric does not use relevance judgments or reference summaries. Instead, a benchmark for EXAM consists of a set of queries with follow-up questions that a relevant article should answer. We use it to measure the relevance and completeness of comprehensive articles.

We first introduce our general evaluation approach, then explain the customizations used for evaluating CAR Y3.

3.1. EXAM Evaluation Paradigm
While motivated by conscious information needs, the evaluation paradigm can be applied to most topical IR tasks. Only a suitable bank of exam questions and a Q/A system need to be available.
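As an illustration of the per-query resources the paradigm needs (detailed in Section 3.1.1 below), a benchmark entry could be represented as follows; this is a sketch with assumed class and field names, not a data format defined by the authors.

<pre>
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExamQuestion:
    """A held-out follow-up question with an automatically verifiable answer key."""
    question: str
    choices: List[str]       # e.g. multiple-choice answer options
    correct_choice: int      # index into choices, used for answer verification

@dataclass
class ExamBenchmarkEntry:
    """Everything EXAM needs for one query: the query itself and its question bank."""
    query: str                                        # e.g. "Darwin's Theory of Evolution"
    questions: List[ExamQuestion] = field(default_factory=list)
    gold_article: str = ""                            # optional, only needed for n-EXAM
</pre>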
3.1.1. Resources Required for Evaluation
We hold out the exam questions and do not allow the retrieve-and-generate systems under study to access them. The retrieve-and-generate systems are given:

Queries: Given a free-text user query, such as "Darwin's Theory of Evolution", systems must generate a comprehensive response. Queries can be a simple keyword query or a more complex expression of information needs, such as a conversation prompt, desired sub-topics, or usage contexts.

The systems can access any corpus of their choice. EXAM does not require a predetermined corpus, unlike pool-based evaluations, such as the official TREC assessments [4]. Even corpus-free systems like GPT-3 can be evaluated.

Solely for the evaluation, we require the following resources to be available on a per-query basis:

Exam Questions and Answer Verification: A set of reasonably obvious follow-up questions about the query's topic. Any question style (e.g. multiple-choice, free text, etc.) can be used as long as the underlying Q/A system is trained to answer them and the answer can be automatically verified.

Q/A System: A high-quality Q/A system that is trained to answer exam questions. To be suitable, the Q/A system must use the given article to identify evidence for the question. All systems must be evaluated using the same Q/A system for EXAM scores to be comparable.

The evaluation process will use the above resources as depicted in Figure 2. We suggest using a multiple-choice Q/A system and an answer key to verify correctness. However, many Q/A systems can be used in our paradigm, such as the systems of Choi et al. [25], Nie et al. [26], or Perez et al. [27]. To be suitable, some Q/A systems would need to be customized, e.g. restricting the sentence selector of Min et al. [28].

3.1.2. EXAM Evaluation Scores
Given the queries and corpus, each system will generate one article per query. The exam questions are only used during evaluation: the Q/A system attempts to answer all exam questions based on the content of the generated article. For a query q, we measure the EXAM evaluation score of a generated article d_q from system S over the question bank for the query as

  EXAM(d_q | S) = (number of correct answers in d_q) / (number of exam questions for q)

The EXAM score of each retrieve-and-generate system is computed for each query (and hence generated article), then macro-averaged over multiple queries. Skipped queries are counted as zero score. EXAM awards no credit for unanswered or incorrectly answered questions, as these suggest that the generated article does not contain the right information. (Our focus is not on evaluating the Q/A system, but the usefulness of the generated article to a reader.)

Similar to other proposals, e.g., of nugget-recall [29], EXAM is a recall-oriented evaluation measure. To penalize large amounts of non-relevant information, the article length can be restricted as in TREC CAR Y3, NTCIR One-click [30], or composite retrieval [31].

3.1.3. Normalizing EXAM with Gold Articles
As introduced above, our EXAM score enables relative quality comparison among systems and baselines. However, questions which are too difficult for the Q/A system to answer, or that are irrelevant to the query, could result in an artificially lowered score. To correct for this, we propose a relative-normalized EXAM score that uses human-edited gold articles d*_q which are written to address the query and exam questions. This allows retrieve-and-generate systems to be scored in the context of an expected best-case scenario.

  n-EXAM(S) = Σ_q EXAM(d_q | S) / Σ_q EXAM(d*_q)

Note that if the gold articles contain less information than the generated articles, or are written obtusely, the retrieve-and-generate articles could earn a higher EXAM score than the gold articles, especially if they express the information more clearly. This would result in an n-EXAM score above one. For example, a Q/A system would have difficulty extracting answers from a college textbook used as a gold article, as college-level reading material requires reader inference or significant logical deduction for full comprehension. This is one way that human-written gold articles could receive lower EXAM scores than generated articles sourced from more plainly written corpora.
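A minimal sketch of the scoring defined above, reusing the dataclasses from the earlier sketch and assuming a qa_answers_correctly(article, question) predicate that wraps the Q/A system and answer verification; skipped queries score zero, and n-EXAM divides the summed per-query scores by those of the gold articles.

<pre>
from typing import Callable, Dict, List

def exam_score(article: str,
               questions: List[ExamQuestion],
               qa_answers_correctly: Callable[[str, ExamQuestion], bool]) -> float:
    """EXAM(d_q | S): fraction of exam questions answered correctly from the article."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if qa_answers_correctly(article, q))
    return correct / len(questions)

def exam_macro_average(articles_by_query: Dict[str, str],
                       benchmark: List[ExamBenchmarkEntry],
                       qa_answers_correctly) -> float:
    """Macro-average over queries; queries the system skipped count as zero."""
    per_query = [
        exam_score(articles_by_query.get(entry.query, ""), entry.questions, qa_answers_correctly)
        for entry in benchmark
    ]
    return sum(per_query) / len(per_query)

def n_exam(system_articles: Dict[str, str],
           gold_articles: Dict[str, str],
           benchmark: List[ExamBenchmarkEntry],
           qa_answers_correctly) -> float:
    """n-EXAM(S): summed per-query EXAM of the system divided by that of the gold articles."""
    num = sum(exam_score(system_articles.get(e.query, ""), e.questions, qa_answers_correctly)
              for e in benchmark)
    den = sum(exam_score(gold_articles.get(e.query, ""), e.questions, qa_answers_correctly)
              for e in benchmark)
    return num / den if den > 0 else 0.0
</pre>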
3.2. EXAM Evaluation for CAR Y3
The purpose of the CAR Y3 track is to study retrieval algorithms that respond to complex information needs by synthesizing longer answers that collate retrieved information, mimicking the style of Wikipedia articles. For the shared task to yield a reusable benchmark, participant systems were restricted to use a corpus of five million predefined passages without modifications.

The Textbook Question-Answering (TQA) [32] dataset provides textbook chapters and multiple-choice questions designed for middle school students. In CAR Y3, the queries were taken from titles of textbook chapters, and sub-topics were derived from headings. Sub-topics are used as "nuggets" in the official assessment and were provided to participants. For each query, participants were asked to produce relevant articles by selecting and arranging 20 paragraphs from the provided corpus of Wikipedia paragraphs.

For the example query "Darwin's Theory of Evolution" (tqa2:L_0432), an excerpt of a textbook chapter is depicted in Figure 3. Neither the questions nor the textbook content were available to the CAR Y3 participant systems. The figure depicts an example of how parts of the TQA dataset textbook chapters are used in the CAR Y3 benchmark versus held out for the EXAM evaluation. The connections between CAR Y3, retrieval systems under study, and our proposed evaluation paradigm are depicted in Figure 2.

Figure 3: An example from the TQA dataset used to derive the TREC CAR benchmark as well as data for our EXAM evaluation measure (marked with *). The example is an excerpt from TQA entry L_0432/NDQ_009501.

3.2.1. Reference: Official CAR Y3 Evaluation
For selected queries (see Table 3), NIST assessors provided relevance annotations for all paragraphs in all submitted articles with respect to the query and sub-topics. Participants were encouraged to additionally submit paragraph rankings for each sub-topic. The official CAR Y3 evaluation was based on these rankings and the relevance assessments. (Only manual assessments are available for CAR Y3; the automatic evaluation paradigm was only applicable to CAR Y1.)

3.2.2. Proposed Alternative: EXAM Evaluation
To evaluate the articles of participating systems with EXAM, we require a question bank of exam questions with an answer key. Queries are derived from titles of TQA textbook chapters, which come with multiple-choice questions designed by the book author to test (human) students. We use these multiple-choice questions as a question bank for the EXAM metric to assess the generated articles of participating systems. In particular, we use all provided non-diagram (i.e., not dependent on a picture) questions.

Gold Articles: We use the textbook content from the TQA textbook chapters as gold articles for the queries (also used by the ROUGE baseline as reference summaries). While the EXAM metric does not require gold articles, we report the EXAM score achieved by the gold article for reference and include n-EXAM scores as well. As gold articles and questions were designed for middle school students, many answers are stated in an obtuse way and cannot be answered by simple text matches.

Used Question-Answering System: As a high-quality Q/A system we use the Decomposable Attention Q/A system provided by the organizers of the TQA challenge. The system is trained on the AI2 Reasoning Challenge dataset (ARC) [33]. The model is adapted from Parikh et al. [34], which performs the best on the SNLI [35] dataset, which contains questions similar to TQA questions.

As inputs, the Q/A system requires a text and a set of questions. The Decomposable Attention model searches the text for passages relevant to each question, then extracts answers by constructing an assertion per question and answer choice. Assertions without text support are eliminated, and the most likely assertion under the attention model is returned as the answer. If all assertions are rejected, the question is not answered. Both unanswered and incorrectly answered questions result in a reduced EXAM score.
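The paper relies on the Decomposable Attention Q/A system provided by the TQA organizers; the wrapper below is only a hypothetical stand-in showing how any multiple-choice Q/A model could be adapted to the qa_answers_correctly predicate used in the earlier sketch, with abstention (no supported answer choice) counted as incorrect, as specified above.

<pre>
from typing import List, Optional

class MultipleChoiceQA:
    """Hypothetical wrapper around a multiple-choice Q/A model (for example, a
    Decomposable Attention model trained on ARC). predict() returns the index of
    the chosen answer option, or None if no answer choice is supported by the text."""

    def predict(self, article: str, question: str, choices: List[str]) -> Optional[int]:
        raise NotImplementedError("plug in the actual Q/A model here")

def make_correctness_predicate(qa_model: MultipleChoiceQA):
    """Adapt a multiple-choice Q/A model to the qa_answers_correctly(article, question)
    predicate used by exam_score(). Unanswered questions count as incorrect."""
    def qa_answers_correctly(article: str, question: ExamQuestion) -> bool:
        predicted = qa_model.predict(article, question.question, question.choices)
        return predicted is not None and predicted == question.correct_choice
    return qa_answers_correctly
</pre>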
4. Experimental Evaluation
We empirically evaluate EXAM as described in Section 3.2 using articles generated by the CAR Y3 [4] participant systems.

4.1. Experiment Setup
Due to the chicken-and-egg problem, no established retrieve-and-generate benchmarks exist and no established retrieve-and-generate systems are available. We base the experimental evaluation on sixteen retrieval systems submitted to CAR Y3. We use 131 queries that have a total of 2320 questions in the TQA dataset. We use each query's textbook chapter in the TQA dataset as a gold article. Dataset statistics are summarized in Table 3. Since these systems are not part of our work, we refer to the participants' descriptions of their systems in the TREC Proceedings (https://trec.nist.gov/pubs/trec28/trec2019.html) and the CAR Y3 overview [4].

Table 3: Dataset statistics.
  132   Queries with generated articles
   20   Paragraphs per article per system
  131   Queries with exam questions
 2320   Exam questions
   55   Queries with official TREC CAR assessments
  303   Subtopics with official TREC CAR assessments
 2790   Positively assessed paragraphs

4.1.1. Evaluating the Evaluation Measure
Our goal is to find an alternative evaluation metric that—while mostly automatic—offers the same high quality as a manual assessment conducted by NIST. Hence, our measure of success is to produce a system ranking (i.e., leaderboard) that is highly correlated with the official CAR Y3 leaderboard. Low or anti-correlation suggests that an evaluation measure would not agree with a user's sense of relevance. Correlation of leaderboard rankings is measured in:

Spearman's Rank: High when each system S (of n) has a similar rank position under both leaderboards A, B:
  ρ = 1 − (6 Σ_S (rank_A(S) − rank_B(S))²) / (n (n² − 1))

Kendall's Tau: High when the rank order of many system pairs S1, S2 is preserved (P⁺) versus swapped (P⁻):
  τ = (P⁺ − P⁻) / (P⁺ + P⁻)

Under any evaluation metric some systems obtain a similar evaluation score within standard error. As this is unlikely to indicate a significant difference, we define such system pairs as tied, and thus attribute any score difference to random chance. Therefore, we randomly break ties, which is necessary for Spearman's rank, to produce the leaderboard and compute the rank correlation, repeating the process ten times. Results are presented in Tables 2a and 2b.

Table 2: Rank correlation between the leaderboards of different evaluation measures. Standard errors are below 0.02. Range: −1 to +1, higher is better.

(a) Spearman's rank correlation coefficient.
          EXAM   Prec@R   MAP    nDCG20
ROUGE    -0.09   -0.01   -0.07   -0.01
nDCG20    0.74    0.94    0.95
MAP       0.75    0.94
Prec@R    0.74

(b) Kendall's tau rank correlation coefficient.
          EXAM   Prec@R   MAP    nDCG20
ROUGE    -0.07    0.00   -0.05    0.00
nDCG20    0.57    0.86    0.88
MAP       0.57    0.86
Prec@R    0.56
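A sketch of the leaderboard comparison under stated assumptions: scores within a tie threshold (standing in for the standard error) are treated as interchangeable, near-ties are dissolved by small random jitter as one simple way to break ties at random, and the correlation is averaged over ten repetitions, mirroring the procedure described above. It uses scipy.stats for Spearman's rho and Kendall's tau; the threshold value is an assumed parameter.

<pre>
import random
from scipy.stats import spearmanr, kendalltau

def leaderboard_ranks(scores: dict, tie_threshold: float, rng: random.Random) -> dict:
    """Order systems by score, breaking near-ties (within tie_threshold) randomly."""
    jittered = {system: score + rng.uniform(-tie_threshold, tie_threshold)
                for system, score in scores.items()}
    ordered = sorted(jittered, key=jittered.get, reverse=True)
    return {system: rank for rank, system in enumerate(ordered, start=1)}

def rank_correlation(scores_a: dict, scores_b: dict,
                     tie_threshold: float = 0.01, repetitions: int = 10, seed: int = 0):
    """Average Spearman's rho and Kendall's tau between two leaderboards over
    several random tie-breaks, as in the EXAM vs. official CAR Y3 comparison."""
    rng = random.Random(seed)
    systems = sorted(set(scores_a) & set(scores_b))
    rhos, taus = [], []
    for _ in range(repetitions):
        ranks_a = leaderboard_ranks({s: scores_a[s] for s in systems}, tie_threshold, rng)
        ranks_b = leaderboard_ranks({s: scores_b[s] for s in systems}, tie_threshold, rng)
        a = [ranks_a[s] for s in systems]
        b = [ranks_b[s] for s in systems]
        rhos.append(spearmanr(a, b).correlation)
        taus.append(kendalltau(a, b).correlation)
    return sum(rhos) / len(rhos), sum(taus) / len(taus)
</pre>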
4.1.2. Metrics for System Quality
We study the leaderboard of systems under the following evaluation measures.

EXAM (ours): Our proposed evaluation measure which uses a Q/A system to evaluate generated articles (see Section 3.2).

n-EXAM (ours): A relative-normalized version of EXAM that uses a set of gold articles to contextualize the EXAM score.

Official CAR Y3 Evaluation (reference): Systems in CAR Y3 are evaluated using Precision at R (Prec@R), Mean Average Precision (MAP), and Normalized Discounted Cumulated Gain at rank 20 (nDCG20) as implemented in trec_eval (https://github.com/usnistgov/trec_eval).

ROUGE-1 F1 (baseline): ROUGE evaluates via the similarity between a generated article and a gold article. ROUGE-1 F1 combines precision and recall of predicting words in the summary. Words are lowercased and lemmatized, with punctuation and stopwords removed. We include ROUGE as a baseline evaluation paradigm, because it is fully automated and widely used in NLG.

Table 4: Quality of 16 participating systems submitted to TREC CAR Y3 as measured by our proposed EXAM, official TREC CAR metrics, and ROUGE. Systems are ordered by EXAM score; ranks under other metrics are given in column "#". Standard errors are about 0.01 or less. Systems whose performance is comparable to the gold articles are marked with "⋆".

System              EXAM  n-EXAM  #       Prec@R  MAP   nDCG20  #    ROUGE  #
rerank2-bert        0.17  1.03    1   ⋆   0.22    0.18  0.31    3    0.42   12
dangnt-nlp          0.17  1.02    2   ⋆   0.28    0.25  0.38    1    0.41   13
bert-cknrm-50       0.16  0.99    3   ⋆   0.14    0.11  0.22    12   0.47   2
irit-run2           0.16  0.94    4   ⋆   0.19    0.16  0.27    4    0.45   5
rerank3-bert        0.16  0.94    5   ⋆   0.23    0.20  0.34    2    0.44   8
ict-b-convk         0.16  0.94    6   ⋆   0.19    0.15  0.27    8    0.39   15
irit-run1           0.16  0.93    7   ⋆   0.19    0.16  0.27    4    0.44   7
bm25-populated      0.15  0.93    8       0.18    0.14  0.25    9    0.43   10
unh-tfidf-ptsim     0.15  0.92    9       0.17    0.13  0.23    10   0.43   11
irit-run3           0.15  0.92    10      0.19    0.16  0.27    4    0.44   6
unh-bm25-ecmpsg     0.15  0.88    11      0.17    0.13  0.23    10   0.43   9
ecnu-bm25-1         0.14  0.87    12      0.19    0.15  0.27    7    0.49   1
ict-b-drmmtks       0.13  0.80    13      0.01    0.01  0.01    16   0.24   16
uvabottomupch.      0.09  0.57    14      0.04    0.03  0.06    14   0.45   4
uvabm25rm3          0.09  0.54    15      0.04    0.03  0.06    13   0.45   3
uvabottomup2        0.09  0.53    16      0.03    0.02  0.04    15   0.40   14
Gold articles (⋆)   0.17  1.00    -       -       -     -       -    -      -
4.2. Results
EXAM: Table 4 displays the leaderboard of tested retrieve-and-generate systems ordered by EXAM score. The trend is clear: systems that rank high on the official TREC CAR leaderboard also rank high on the EXAM leaderboard, and systems that rank low on the official leaderboard also rank low on the EXAM leaderboard. For example, the rerank2-bert and dangnt-nlp participant systems are ranked as the top two systems by both the official leaderboard and the EXAM leaderboard. Some systems achieved similar EXAM scores, as they likely produce the same passages ordered differently, which affects the official TREC CAR evaluation but not EXAM. Additionally, due to the setup described in Section 3.2.2, the best systems even slightly surpass the gold articles (see the discussion in Section 4.3). Many systems perform within standard error of the gold articles (marked with ⋆)—these are all similar systems based on the BERT neural language model.

Tables 2a and 2b display the Spearman's and Kendall's rank correlations of the EXAM and the official TREC CAR evaluation. Notably, EXAM achieves a correlation of 0.74 in terms of Spearman's rank correlation and 0.56 in terms of Kendall's Tau; both values can range from -1 to 1. These averages illustrate just how strongly EXAM correlates with official assessments, despite the much different evaluation paradigm. Prec@R, MAP, and nDCG20 use the same relevance assessments and evaluation paradigm. Hence it is not surprising that the correlation within official measures is higher than between EXAM and official measures.

ROUGE: By contrast, the leaderboard according to ROUGE-1 F1 is uncorrelated to the official TREC CAR leaderboard. Ecnu-bm25-1, which has the best ROUGE-1 F1 score, is not in the top five of the official leaderboard. Tables 2a and 2b demonstrate that the rank correlation of ROUGE is near zero across all metrics, which is equivalent to a random ordering.

4.3. Discussion
We discuss advantages and limitations of the paradigm.

Resilience to Q/A system errors: Any real-world Q/A system will make mistakes, most likely causing correct answers contained within the generated article to be missed. Indeed, the Q/A system is unable to correctly answer many questions with the gold article, in part because the article and questions were designed to be a challenge for middle school students. However, despite these Q/A errors, the study demonstrates that EXAM reveals significant quality differences between systems. If our experiment had not been successful, we would not observe any correlation between EXAM and the official CAR Y3 assessments.

Overcoming linguistic differences: We found that when using the gold article with ROUGE-F1, the system ranking does not agree with manual assessments. The issue originates from a difference in linguistic style, as generated articles are constrained to use Wikipedia paragraphs, but the gold articles are sourced from TQA. Hence, it is unlikely that gold articles would use the same phrases as the generated articles—despite both covering the same, relevant topics.
In a previous (unpublished) study on CAR Y2 data we found that ROUGE-F1 [12] obtains a reasonable correlation (Kendall's tau of 0.67, Spearman's rank of 0.67) when using manually assessed relevant paragraphs instead of gold articles. We conclude that ROUGE struggles to overcome the linguistic differences between generated and gold articles.

In contrast, when evaluated under the EXAM measure, the gold article obtains the same score as the best participant system. Given the positive results, we conclude that EXAM is able to overcome the linguistic differences. We believe this is an important finding, as the same issue is likely to arise when a retrieve-and-generate system uses external sources or a fully generative model.

Interpretation of n-EXAM: The n-EXAM metric can exceed 1.0 when the gold article is written obtusely, but the generated article explains relevant facts in accessible language. In our study, gold articles are designed for (human) students to carefully read the text and think about the answer, which is challenging for the Q/A system. In contrast, the submitted retrieval systems were allowed to select content from Wikipedia passages, which are likely to clearly state answers to obvious follow-up questions. Hence, we suggest considering EXAM scores on gold articles as guidance, rather than a gold standard. Similar dataset biases are known from work on multi-hop question answering [36]. We remark that this issue also affects the ROUGE evaluation.

Universality of quality: Our evaluation paradigm is very different from pool-based Cranfield-style evaluations practiced in IR today [24]. Initial concerns that these paradigms evaluate different measures of quality have been ameliorated, as the experimental evaluation demonstrates a high agreement between EXAM and the official CAR Y3 assessments.

Reduced manual effort: Cranfield-style evaluations involve a non-trivial amount of human labor. By contrast, EXAM's human assessors only develop a bank of questions that evaluate the information content of articles. EXAM question banks can be reused in a fully automatic manner, as the Q/A system conducts the exam. While exam questions cannot test all possible useful follow-up questions, we demonstrate that the available question bank is large enough to measure significant differences between systems. To identify how little effort would still yield good results, we spent one hour to manually create ten questions. While error bars are larger, the results still correlate with the official leaderboard. (Study omitted due to space constraints.)

Benchmark reusability and comparability: Many IR benchmarks mandate the use of unmodified elements from a fixed corpus. By contrast, EXAM uses a corpus-independent evaluation, and thus can be applied across different corpora and sources, including the open web or NLG algorithms. Systems using different sources can all be evaluated and compared with each other using the EXAM evaluation.

5. Conclusions and Future Directions
We discuss an evaluation paradigm for retrieve-and-generate systems, which are systems that modify retrieved raw data before presentation. This poses a challenge for today's IR evaluation paradigms. To facilitate empirical research on retrieve-and-generate systems, we discuss an alternative evaluation paradigm, the EXam Answerability Metric (EXAM), that tests whether the system provides relevant information rather than the right documents. EXAM uses a Q/A system and query-specific question banks to evaluate whether the system response is capable of answering some obvious follow-up questions, even without being explicitly asked to do so. We verify that leaderboards under the EXAM evaluation and the manual TREC CAR evaluation agree, with a Spearman's rank correlation of 0.74 and Kendall's tau of 0.56.

EXAM has two benefits over the traditional IR evaluation paradigm: it avoids the need for manual relevance assessments, and it can compare systems that use different (or no) corpora for retrieval. While gold articles and assessments can be used within the EXAM paradigm, at a minimum EXAM only requires humans to curate queries and question banks—the rest of the evaluation is fully automatic. EXAM also has an advantage over the text summarization metric ROUGE [15], as EXAM evaluates documents by the relevance of the information provided, rather than exact wording. This conclusion is in line with findings of Deutsch et al. [18].
While not studied in this work, EXAM could also be used to construct a training signal, as long as the exam questions are not available as inputs to the retrieve-and-generate system.

Our long-term goal is to develop systems to support users who do not (yet) know what exactly they are looking for. We envision a system that synthesizes a comprehensive topical overview by collating retrieved text with post-processing steps like natural language generation. Permitting different linguistic styles and encouraging comprehensiveness renders traditional IR evaluation paradigms very costly. These goals also pose challenges regarding benchmark reuse for a fair comparison across systems. The EXAM evaluation paradigm provides a new avenue for retrieve-and-generate research to evaluate systems by information content.

However, EXAM can also evaluate many other information retrieval tasks: EXAM allows the comparison of ad hoc retrieval from fixed corpora with open-web retrieval. EXAM offers an alternative way to assess redundancy for search result diversification. EXAM can evaluate the information content of each turn of a conversational search system as well as the provided information content over multiple turns. We believe that, in general, evaluation paradigms that, like EXAM, penalize avoidable conversation turns will encourage information systems that are forthcoming with answers.

Acknowledgments
We thank Peter Clark, Ashish Sabharwal, and Tushar Khot from the Allen Institute for AI for their help with the TQA dataset and the provision of the Q/A system, and the UNH TREMA group for their guidance in doing this research.
References
[1] R. S. Taylor, Question-negotiation and information seeking in libraries, College & Research Libraries 29 (1968) 178–194.
[2] N. J. Belkin, Anomalous states of knowledge as a basis for information retrieval, Canadian Journal of Information Science 5 (1980) 133–143.
[3] T. Ruotsalo, J. Peltonen, M. J. Eugster, D. Głowacka, P. Floréen, P. Myllymäki, G. Jacucci, S. Kaski, Interactive intent modeling for exploratory search, ACM Transactions on Information Systems (TOIS) 36 (2018) 1–46.
[4] L. Dietz, J. Foley, TREC CAR Y3: Complex answer retrieval overview, in: Proceedings of the Text REtrieval Conference (TREC), 2019.
[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[6] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.
[7] D. Ippolito, D. Duckworth, C. Callison-Burch, D. Eck, Automatic detection of generated text is easiest when humans are fooled, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1808–1822.
[8] L. Dietz, J. Dalton, Humans optional? Automatic large-scale test collections for entity, passage, and entity-passage retrieval, Datenbank-Spektrum (2020) 1–12.
[9] W. R. Hersh, A. M. Cohen, P. M. Roberts, H. K. Rekapalli, TREC 2006 Genomics Track overview, in: TREC, volume 7, 2006, pp. 500–274.
[10] J. Kamps, M. Lalmas, J. Pehcevski, Evaluating relevant in context: Document retrieval with a twist, in: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2007, pp. 749–750.
[11] C. Wade, J. Allan, Passage retrieval and evaluation, Technical Report, University of Massachusetts Amherst, Center for Intelligent Information Retrieval, 2005.
[12] M. Keikha, J. H. Park, W. B. Croft, Evaluating answer passages using summarization measures, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, ACM, 2014, pp. 963–966.
[13] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, arXiv preprint arXiv:1904.09675 (2019).
[14] J. Allan, B. Croft, A. Moffat, M. Sanderson, Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012, the second strategic workshop on information retrieval in Lorne, in: ACM SIGIR Forum, volume 46, ACM, New York, NY, USA, 2012, pp. 2–32.
[15] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.
[16] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.
[17] T. Scialom, S. Lamprier, B. Piwowarski, J. Staiano, Answers unite! Unsupervised metrics for reinforced summarization models, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3246–3256. URL: https://www.aclweb.org/anthology/D19-1320. doi:10.18653/v1/D19-1320.
[18] D. Deutsch, T. Bedrax-Weiss, D. Roth, Towards question-answering as an automatic metric for evaluating the content quality of a summary, arXiv preprint arXiv:2010.00490 (2020).
[19] R. Gupta, C. Orasan, J. van Genabith, ReVal: A simple and effective machine translation evaluation metric based on recurrent neural networks, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1066–1072.
[20] H. Kane, M. Y. Kocyigit, A. Abdalla, P. Ajanoh, M. Coulibali, NUBIA: Neural based interchangeability assessor for text generation, arXiv preprint arXiv:2004.14667 (2020).
[21] M. Eyal, T. Baumel, M. Elhadad, Question answering as an automatic evaluation metric for news article summarization, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 3938–3948. URL: https://www.aclweb.org/anthology/N19-1395. doi:10.18653/v1/N19-1395.
[22] L. Huang, L. Wu, L. Wang, Knowledge graph-augmented abstractive summarization with semantic-driven cloze reward, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. URL: http://dx.doi.org/10.18653/v1/2020.acl-main.457. doi:10.18653/v1/2020.acl-main.457.
[23] G. Doddington, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, in: Proceedings of the Second International Conference on Human Language Technology Research, HLT '02, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002, pp. 138–145.
[24] E. M. Voorhees, The evolution of Cranfield, in: Information Retrieval Evaluation in a Changing World, Springer, 2019, pp. 45–69.
[25] E. Choi, D. Hewlett, J. Uszkoreit, I. Polosukhin, A. Lacoste, J. Berant, Coarse-to-fine question answering for long documents, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 209–220.
[26] P. Nie, Y. Zhang, X. Geng, A. Ramamurthy, L. Song, D. Jiang, DC-BERT: Decoupling question and document for efficient contextual encoding, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020. URL: http://dx.doi.org/10.1145/3397271.3401271. doi:10.1145/3397271.3401271.
[27] E. Perez, P. Lewis, W.-t. Yih, K. Cho, D. Kiela, Unsupervised question decomposition for question answering, arXiv preprint arXiv:2002.09758 (2020).
[28] S. Min, V. Zhong, R. Socher, C. Xiong, Efficient and robust question answering from minimal context over documents, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 1725–1735.
[29] J. Lin, Is question answering better than information retrieval? Towards a task-based evaluation framework for question series, in: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, 2007, pp. 212–219.
[30] T. Sakai, M. P. Kato, Y.-I. Song, Overview of NTCIR-9, in: Proceedings of the 9th NTCIR Workshop Meeting, 2011, pp. 1–7.
[31] H. Bota, K. Zhou, J. M. Jose, M. Lalmas, Composite retrieval of heterogeneous web search, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 119–130.
[32] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, H. Hajishirzi, Are you smarter than a sixth grader? Textbook question answering for multimodal machine comprehension, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 5376–5384.
[33] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge, arXiv abs/1803.05457 (2018).
[34] A. Parikh, O. Täckström, D. Das, J. Uszkoreit, A decomposable attention model for natural language inference, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Austin, Texas, 2016, pp. 2249–2255. URL: https://www.aclweb.org/anthology/D16-1244. doi:10.18653/v1/D16-1244.
[35] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning natural language inference, arXiv preprint arXiv:1508.05326 (2015).
[36] J. Chen, G. Durrett, Understanding dataset design choices for multi-hop reasoning, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4026–4032.