Searching Unseen Sources for Historical Information: Evaluation Design for the NTCIR-18 SUSHI Pilot Task

Douglas W. Oard1, Tokinori Suzuki2, Emi Ishita2 and Noriko Kando3
1 University of Maryland, College Park, MD USA
2 Kyushu University, Fukuoka, Japan
3 National Institute of Informatics, Tokyo, Japan

Abstract
In evaluation of ranked retrieval, the usual assumption is that the documents to be searched can be indexed before the query is received and the search is performed. The NTCIR-18 SUSHI Pilot Task, by contrast, models the case in which only a small sample of the documents to be searched can be indexed before the query is received. This task model arises in the context of searching within large archives of paper documents, for example. The stark difference in what can be indexed before the query is received has consequences for both task design and evaluation design, both of which are discussed in this paper.

Keywords
Information retrieval, Archival access, Evaluation

1. Introduction

Information retrieval has generally been modeled on the idea of the library catalog. We have some collection of materials, we can index those materials in some way, and then in response to a query we can suggest the materials that the searcher might want to see. Archives,1 however, are different. Archives collect unique historical materials, and because those materials are unique, and thus irreplaceable, archives typically must place much greater emphasis on acquisition and preservation than on access. Access is not ignored, but it must be done within stringent resource constraints. It is thus not practical to describe (or digitize) many of the individual documents in an archival collection. Instead, archivists typically describe collections at higher levels of aggregation, such as folders, boxes, or segments of the collection that go by names such as record groups, series, or fonds.
This situation creates challenges for searchers, in part because different parts of an archive often are arranged differently. This happens because archivists can economize on the effort needed to arrange the materials in a collection by taking advantage of the original order in which the materials were organized when they were in active use [1]. For example, materials on space exploration might be originally arranged by program (Mercury, Gemini, Apollo, Shuttle, …), while materials on diplomacy might have been organized by country (Sweden, Uganda, Japan, …). Further down in the organization, we might find the diplomacy materials organized by topic (agriculture, education, economy, …), while the space exploration materials might be organized by function (design, testing, contract management, …). The Swedish agriculture records might then be further organized by author (Kissinger, Smith, Kennan, …), whereas the Apollo design materials might be organized by component (space suit, thruster, radio, …). And so on all the way down. Well, actually not all the way down, since the description process simply must stop before getting to the level of individual documents. After all, the U.K. National Archives has about 14 billion printed pages, which if put in a single stack would stretch halfway around the world. Nobody, and indeed no group of 100 people, could ever hope to look at all of that, much less to describe all of it at the level of individual documents. But even that is just a tiny part of the problem; there are about 26,000 archives in the United States alone [2], many with nowhere near the level of funding (relative to the size of their collections) of the U.K. or U.S. National Archives. Simply put, nobody will ever see all of this stuff.

The net result of this situation is to shift a greater burden onto those who want to find things in an archive. Searchers must learn where collections that might contain what they want are stored, they must know how those collections are organized, they must (given the paucity of digitization) then travel to an archive, request access to containers (e.g., boxes) that might contain what they are looking for, and then look through those materials. This is a process in which time is measured in weeks, and costs might be measured in hundreds or even thousands of U.S. dollars. Per search. Our goal in the NTCIR-18 SUSHI Pilot Task is to begin to reduce the time and expense of finding materials in an archive.

Figure 1: Creation of relevance judgments for folders based on the max judgments for judged documents in that folder. R: Relevant, HR: Highly Relevant.

EMTCIR ’24: The First Workshop on Evaluation Methodologies, Testbeds and Community for Information Access Research, December 12, 2024, Tokyo, Japan. CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
oard@umd.edu (D. W. Oard); tokinori@inf.kyushu-u.ac.jp (T. Suzuki); ishita.emi.982@m.kyushu-u.ac.jp (E. Ishita); kando@nii.ac.jp (N. Kando)
ORCID: 0000-0002-1696-0407 (D. W. Oard); 0000-0002-4715-6198 (T. Suzuki); 000-0002-1398-8906 (E. Ishita); 0000-0002-2133-0215 (N. Kando)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. The term “archive” is often used in computer science to mean a collection (e.g., a zip archive). In this paper, by contrast, we use “archive” to name a type of information institution. According to the Merriam-Webster dictionary, an archive is “a place in which public records or historical materials (such as documents) are preserved.”
Our focus is on materials that are on physical media (e.g., paper or microfilm), and on materials that, like the vast majority of archival holdings (e.g., 97.6% of the U.S. National Archives), have not yet been digitized, or even described at the level of individual documents. We seek to do that by supporting the development of automated systems that can learn from a very limited number of examples that have been digitized, and from whatever metadata at higher levels of aggregation might exist, to suggest where in a collection a searcher might most productively look. SUSHI is designed to support that research by developing test collections that model the real problem, and that do so in a way that supports insightful and affordable evaluation.

2. The Folder Ranking Task

The principal task in the NTCIR-18 SUSHI Pilot Task is Folder Ranking. Figure 1 displays the input (a query) and the output (a ranked list of folders) for this task. Specifically, given a query and an unsorted list of all folders in the collection, the task is to use the metadata describing each of those folders, together with a sparse sample of digitized documents and document-level metadata from some documents in some of those folders, to rank the folders that a searcher might most want to see. Table 1 illustrates some of the metadata that is available in our Folder Ranking Task test collection. The sparse digitized sample in our dry run test collection includes five documents per box, with one document sampled from each of the five largest folders in each box (although the 21 boxes with fewer than five folders have more than one sample drawn from some folders).

Table 1: Examples of some folder-level and document-level metadata in our Folder Ranking Task test collection.

2. Searching Unseen Sources for Historical Information
3. https://sites.google.com/view/ntcir-sushi-task/
For the actual final Folder Ranking Task, we plan to explore a mixture of even sampling (the same number of sampled folders per box) and uneven sampling (with more samples from some boxes than from others). Both approaches can be useful. For early experiments, drawing the same number of samples from each box can help achieve better control over experimental conditions. In real archives, however, digitization and description are both unevenly applied, and uneven sampling can help to characterize the additional challenges that such a situation produces.

2.1. Evaluation for the NTCIR-18 Pilot Task

In order to simulate the real task, we have assembled a collection that we can easily sample and that we can easily judge for relevance. This is a collection of 31,682 U.S. State Department documents from 1,337 folders in 124 boxes. What makes the collection easily judged is that it is fully digitized, and that we have topical metadata for every document. Table 1 shows examples of that metadata, which (like the documents) is from the U.S. National Archives. Since a system under test will know which folders exist, we operationalize the idea of “searching well” as ranking those folders well. We measure how well a system constructs that ranking using nDCG@5 as the principal evaluation measure. A cutoff at 5 corresponds to about a half hour’s work by someone who is actually looking at physical documents in an archive. We estimate this from the fact that a folder contains an average of 31682/1337 ≈ 24 documents, together with our expectation that a skilled searcher could recognize a relevant document in 15 seconds (skilled users of archives flip through documents very quickly). Obtaining the boxes that contain those folders might take another hour or two, but requesting new boxes could be interleaved with examining results from prior requests. In the NTCIR-18 SUSHI Pilot Task we have two ways of getting the topics on which queries are based.
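The half-hour estimate behind the cutoff at 5 can be checked with simple arithmetic; the numbers come from the collection statistics above, and the 15-second figure is the skilled-searcher assumption stated in the text:

```python
# Back-of-the-envelope check of the half-hour budget for a cutoff of 5:
# examine every document in each of the top 5 folders at 15 s/document.
docs_per_folder = 31682 / 1337          # average folder size, about 24 documents
seconds = 5 * docs_per_folder * 15      # 5 folders at 15 s per document
print(f"{seconds / 60:.0f} minutes")    # roughly half an hour
```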
The first approach, used for the dry run, was to randomly select a “query document” that participating systems did not see at training time, and then to use the title metadata for that document as the query.4 We then treat any document with the same title metadata (from anywhere in the collection, not just from the known query document) as being relevant. This is an extended variant of known-item retrieval. Note that participating systems can’t see those titles (because systems can only see document-level metadata for training documents, and we don’t treat training documents as relevant). So systems must perform some sort of inference in order to rank folders without ever having seen a single one of the relevant documents anywhere in the collection [3, 4]. The approach we used to create the dry run test collection can be useful as a basis for initial system development, but exact matching on document titles is at best a weak proxy for true human relevance judgments. We therefore need a second approach in which actual people make those judgments. For our final evaluation, we plan to rely instead on human assessors, preferably graduate students with a background in history or library science. The assessors will initially create search topics in the traditional title/description/narrative format,5 based on their understanding of the collection’s content. They will then check to see if at least a few relevant documents actually exist, using a full-collection document-level search system that we have built using PyTerrier.
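The dry-run relevance rule just described (every document sharing the query document's title metadata is relevant, except training documents) might be sketched as follows; the dictionary layout and the toy ids and titles are our own illustration, not the task's actual data format:

```python
def known_item_qrels(doc_titles, query_doc_id, training_ids):
    """Dry-run qrels: every document whose title metadata exactly matches
    the query document's title is relevant, except training documents
    (participating systems have already seen those, so they are excluded).

    doc_titles: dict mapping document id -> title metadata string
    (a hypothetical layout used only for this sketch).
    """
    query_title = doc_titles[query_doc_id]
    return {doc_id for doc_id, title in doc_titles.items()
            if title == query_title and doc_id not in training_ids}

# Toy example with invented ids and titles.
docs = {"d1": "LAND REFORM", "d2": "LAND REFORM",
        "d3": "TRADE POLICY", "d4": "LAND REFORM"}
print(known_item_qrels(docs, "d1", training_ids={"d4"}))
```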
With this system, assessors can issue queries (either free form, or copied from a topic field), rank documents using that query based on one of several ways of indexing the collection (e.g., OCR-only, Title-only, or both),6 skim the folder label and title for every document in the resulting ranked list, selectively view PDF scans of individual documents, search within a document for any term, and record their tentative relevance judgments for any documents that they encounter during the topic development process. Once they have finalized a topic, we will save their tentative relevance judgments so that they can later finalize those judgments when performing relevance assessment. The relevance assessment process will then be performed in the same way, using the same system, but with enough time allocated for more careful searching, a process known as interactive search and judgment [5]. During relevance assessment we will also ask the assessors to review the tentative relevance judgments that they had created during topic development. We separate the topic development and relevance judgment processes both for convenience (we need the topics sooner) and to encourage assessors to treat the topics as well defined and immutable during the relevance assessment process.

4. This is actually a bit oversimplified. From initial experiments, we learned that we also need to limit the length of the title, because some titles are so long as to be unrealistic surrogates for a human-issued query. We therefore control the query length by first assembling all titles in the collection, making separate randomly-ordered lists for all unique 2, 3, 4, and 5-word titles, and then having two annotators each select 25 of each that look to them like realistic queries.
5. In our dry run collection we use the same topic format, but for the dry run all three topic fields are identical.
6. Other ways include using folder labels to expand document text, or using GPT summaries of OCR text.
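The query-length control described in footnote 4 (randomly ordered lists of unique 2- to 5-word titles, from which annotators pick realistic-looking queries) might be sketched like this; the function name and seed handling are our own:

```python
import random

def title_candidate_lists(titles, lengths=(2, 3, 4, 5), seed=0):
    """Group the unique titles in the collection by word count and
    shuffle each group, producing the randomly-ordered candidate lists
    from which annotators select realistic-looking queries."""
    rng = random.Random(seed)
    by_length = {n: [] for n in lengths}
    for title in sorted(set(titles)):  # sort first so shuffling is reproducible
        n = len(title.split())
        if n in by_length:
            by_length[n].append(title)
    for candidates in by_length.values():
        rng.shuffle(candidates)
    return by_length
```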
In this first year of the task we don’t plan to use pooling to build assessment sets, because participating systems will produce ranked lists of folders, while relevance judgments are made not on folders but on individual documents (documents which the participating systems never saw, and thus could not have ranked). In future evaluations we may consider assessment processes that could benefit from folder pooling (e.g., allocating some assessor time to searching pooled folders more thoroughly). For the NTCIR-18 SUSHI Pilot Task we won’t use folder pooling as a part of our assessment process, but we will look at what folder pooling would have produced in order to see if the density of relevant folders is markedly higher than with random selection, and if it is we might employ folder pooling in the future.

Because the systems to be evaluated produce ranked lists of folders, we must map our document-level relevance judgments to folder-level relevance judgments in some way. As Figure 1 illustrates, for the NTCIR-18 SUSHI Pilot Task we will aggregate document-level judgments to folder-level judgments by simply using as the folder’s judgment the highest judgment for any judged document in that folder. The resulting relevance judgments can then be used directly to compute folder-level nDCG@5, or binarized to compute, for example, Mean Average Precision (MAP).

2.2. Future Evaluation Design Issues

That’s as far as we expect to be able to get for the NTCIR-18 SUSHI Pilot Task, but the task design raises several other important evaluation design issues. Here we highlight three of those questions.

First, our use of nDCG@5 simplifies the goal perhaps more than we might like. All evaluation involves model building, and all models are simplifications of reality [6]. But we might productively complexify our evaluation measure in two ways. First, we might switch from a one-and-done measurement approach to one based on the density of relevant documents in a folder.
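For reference, the max-aggregation and folder-level nDCG@5 computation described above might be sketched as follows. The gain values (HR = 2, R = 1) and the toy document judgments are our assumptions for illustration; the task description does not fix a gain scale here:

```python
import math

# Assumed gain scale for the graded judgments (an illustration, not the
# official task setting).
GAIN = {"HR": 2, "R": 1}

def folder_judgment(doc_judgments):
    """A folder's judgment is the highest judgment ("HR" > "R") of any
    judged document it contains (Figure 1's max aggregation)."""
    return max(doc_judgments, key=lambda j: GAIN[j]) if doc_judgments else None

def ndcg_at_k(ranked_folders, judgments, k=5):
    """nDCG@k over a ranked list of folder ids, given per-folder judgments."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    gains = [GAIN.get(judgments.get(f), 0) for f in ranked_folders]
    ideal = sorted((GAIN[j] for j in judgments.values()), reverse=True)
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Toy document judgments, aggregated by max to the folder level.
docs_by_folder = {
    "Folder 1": ["R"],
    "Folder 2": ["HR", "R"],
    "Folder 3": ["R", "R"],
    "Folder 4": ["HR"],
}
judgments = {f: folder_judgment(js) for f, js in docs_by_folder.items()}
perfect = ["Folder 2", "Folder 4", "Folder 3", "Folder 1"]
print(ndcg_at_k(perfect, judgments))  # a perfect ranking scores 1.0
```

Note that the max aggregation makes the sketch insensitive to how many relevant documents a folder contains, which is exactly the simplification discussed next.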
In our present approach, systems get no more credit for finding a folder with five relevant documents than for finding a folder with just one. It is probably more realistic to give some extra credit for highly ranking a folder with a larger number of relevant documents, and perhaps somewhat more credit for finding folders with fewer documents that have to be looked through (for any given number of relevant documents in the folder). A cost model based on the discovery rate of relevant documents would be one formulation that could address both factors. For this, we might also look for inspiration to the evaluation measures that were designed for the INEX Retrieval In Context task, where the time required to examine ranked elements from hierarchically structured content was the focus (in that case, the time required to examine ranked passages that had been extracted from documents) [7].

7. Thanks to an anonymous reviewer for this suggestion!

The situation is, however, not really even that simple. In the U.S. National Archives, for example, searchers request access not to folders, but to the boxes that contain the folders they want to see. All else equal, we would therefore prefer to find highly ranked folders that happen to be in the same box (or in nearby boxes, since for practical reasons archivists are often equally happy to fetch a short sequence of boxes that are stored together). There’s probably no end to how much we could complexify this (e.g., how about constructing the shortest path through an archive to pick up some set of boxes that contain folders that together contain some given number of relevant documents?). We are not yet ready to commit to a new measure, but we can already see that we will likely want one eventually.

Second, confidence intervals and testing for significant differences are a bit more complex in this environment than in a typical ranked retrieval evaluation. The reason for this is that we need to account
not just for random variations from the choice of query, but also for random variations from our choice of the training set. In our dry run we can see the effect of the training set because we run half our queries with one randomly sampled training set and half with another. But when we compute confidence intervals, we ignore the training set variation and (for convenience) compute the confidence intervals only over the queries. In the future we will likely want to use something along the lines of an ANOVA in an effort to tease apart topic and training set effects [8].

Third, our present approach is vulnerable to the common criticism of classic information retrieval test collections: that they are not typically designed to characterize cross-collection differences. This may be a minor sin when we might expect BM25 or BERT to work about as well on one English news collection as on another, but in SUSHI we are seeking to model a real situation in which different collections can have vastly different metadata structures. Because simply searching the folder metadata is an obvious baseline against which to compare, we need to pay attention to these differences in what metadata is available. And, of course, real archival collections are not all equally amenable to OCR—there are handwritten collections, photograph collections, collections written entirely in hieroglyphics or cuneiform, and (in one memorable case) a collection that consisted of nothing but x-ray film containing images of fish skeletons. We’re not going to be able to explore that entire space of possible collections in finite time, but that’s not the key concern here. Rather, the question is how best to explore any of it. To see why that can be challenging, it may help to articulate what a collection must have. First, it must have content for which we can create topics and for which we can perform relevance judgments. So the fish skeletons are off the table.
Second, we must know at least which box contained each item (e.g., each document), and we would love to also know which folder in that box contained each item. Third, we really want to have scanned images of all the documents. This third one seems non-negotiable – we tried going over to the U.S. National Archives and doing relevance judgments for the top-ranked box for one query. It was a 10-minute drive from our office, but once we got there it took three hours just to get the box. So doing large numbers of relevance judgments by requesting and then examining paper doesn’t seem like a scalable solution. It is not hard to find reasonably large collections of digitized materials, and it is not hard to find large collections that have good box and folder metadata. But it is harder than you might expect to find both together (and even harder if you initially prefer to omit handwritten materials and photographs). Because most archives do not yet make both content and metadata available through an API, building relationships with institutions that have something close to what we need will be key to gaining access to the collections we need, and to getting approval to share those collections broadly with other researchers.

3. The Archival Reference Detection Task

The sizes of our training sets in the Folder Ranking Task are designed to model the sparsity of existing document-level metadata in real collections. For example, sampling an average of 5 documents per box closely approximates the actual fraction of the U.S. National Archives collection that has document-level description. One way of improving the potential for inferring where to look for materials that might match a query is to increase the number of documents for which document-level metadata is available. That is the goal of the Archival Reference Detection Task. Given the text of a footnote or endnote, the task is to determine whether that footnote or endnote contains any references to archival materials.
The key insight that motivates our Archival Reference Detection Task is that when scholars cite materials from archives in their published work, that creates an additional source of document-level description that we could use in search tasks as well. We can use these descriptions in two ways. Most directly, we can parse the archival reference to extract document-level metadata such as a document title and location information (e.g., which archive, which series, which box, and perhaps even which folder). For example:

• Roosevelt to Secretary of War, June 3, 1939, Roosevelt Papers, O.F. 268, Box 10; unsigned memorandum, Jan. 6, 1940, ibid., Box 11.

• Wheeler, D., and R. García-Herrera, 2008: Ships’ logbooks in climatological research: Reflections and prospects. Ann. New York Acad. Sci., 1146, 1-15, doi:10.1196/annals.1446.006. Several archive sources have been used in the preparation of this paper, including the following: Logbook of HMS Richmond. The U.K. National Archives. ADM/51/3949

In the first example, we can see the collection name “Roosevelt Papers,” document descriptions, and some box numbers. As the second example illustrates, scholars sometimes also package descriptive text together with an archival reference in the same footnote or endnote. When present, that could potentially serve as a useful free-form document-level description of some identifiable document in an archive. We could also potentially use the content at and near the point where the corresponding citation was made in the main body of a paper as a free-form description not only of what the cited document contains, but also of how that document’s content might be useful (in at least one context). In our early work on detecting archival references in papers on History [9], we found that 45 of 3,500 automatically extracted footnotes or endnotes were references to archival materials, a prevalence of about 1.3%.
From this we can estimate that if we wish to collect 10,000 archival references, we would need to run Archival Reference Detection on about a million documents. We thus chose one million documents as our target collection size for the Archival Reference Detection Task. This is a classification task in which systems are asked to return a binary decision indicating whether each footnote or endnote includes an archival reference. For footnotes or endnotes that are classified by a system as archival references, the system can optionally also elect to extract the archival reference (i.e., to segment the archival reference from any descriptive text that the author had packaged with it).

3.1. Evaluation for the NTCIR-18 Pilot Task

We provided a dry run test collection for the Archival Reference Detection Task that we had manually annotated for the presence or absence of archival references in our previous research. That collection contains 1,836 footnotes or endnotes from open-access papers in the field of History that we obtained using the Semantic Scholar API.8 Annotations were performed by two annotators, one of whom was a Ph.D. student with expertise in the use of cultural heritage materials. Cohen’s Kappa for agreement on a subset of that collection, with one of us as the second annotator, was 0.8 [9], which Landis and Koch would characterize as substantial agreement [10]. From this we conclude that the manual annotation task is tractable, at least at small scale. We therefore began a larger crawl of Semantic Scholar for use in the Archival Reference Detection Task. Because this is a highly unbalanced binary classification task, we will evaluate participating systems on that larger collection using the F1 measure (the harmonic mean of precision and recall). To control evaluation costs, we will compute F1 using a stratified sample.
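One way such an estimate might be computed is to weight each judged footnote or endnote by its stratum's population-to-sample ratio before accumulating confusion counts (a Horvitz-Thompson-style estimator). This weighting scheme is our assumption about how the estimate could work, not a committed design:

```python
def stratified_f1(samples, system_positive, gold_positive):
    """Estimate F1 for one system from a stratified judgment sample.

    samples: list of (item_id, weight) pairs, where weight is the
        stratum population size divided by the number sampled from that
        stratum (so each judged item "stands in" for weight items).
    system_positive / gold_positive: sets of item ids that the system /
        the human annotator marked as archival references.
    """
    tp = fp = fn = 0.0
    for item, weight in samples:
        sys_pos, gold_pos = item in system_positive, item in gold_positive
        if sys_pos and gold_pos:
            tp += weight
        elif sys_pos:
            fp += weight
        elif gold_pos:
            fn += weight
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Densely sampled strata (all-systems-positive) get small weights, while the sparsely sampled all-systems-negative stratum gets a large weight, so a single missed positive found there correctly pulls the recall estimate down hard.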
Our stratification will be based on the number of participating systems that classified each footnote or endnote as an archival reference. For example, we will most densely sample footnotes or endnotes that are classified as archival references by every participating system, and we will very sparsely sample footnotes or endnotes that are not classified as archival references by any system. We plan to evaluate the optional task of segmenting an archival reference from any text that is packaged with it in the same footnote or endnote using the Jaccard coefficient. We will compute this Jaccard coefficient on characters, dividing the number of characters that both the system and the annotator believe are in the archival reference by the number of characters that either the system or the annotator believe are in the archival reference (i.e., the intersection over the union). To create these annotations, we plan to hire annotators who are graduate students in history or in some related discipline. Because the scholarly papers from which we extract footnotes and endnotes will be in English, we will further require reading fluency in English. Based on our earlier experience with human annotation, we expect that (after training) annotators will be able to classify about two footnotes or endnotes per minute. We thus expect each annotator to produce about 1,000 annotations per week. We will therefore design our sampling to select about 5,000 footnotes or endnotes for annotation. We will then subsample from the annotations marked as archival references by an annotator and ask those annotators to further segment the archival reference from any text that is packaged with it in the same footnote or endnote. In our experience, such packaging is relatively infrequent, occurring perhaps 10% of the time, so we expect this second annotation process to go fairly quickly.

8. https://www.semanticscholar.org/product/api
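The character-level Jaccard computation described above might look like the following, representing each archival-reference selection as (start, end) character offsets into the footnote text (the span representation is our assumption):

```python
def char_jaccard(system_span, gold_span):
    """Intersection-over-union on character positions: characters that
    both the system and the annotator place inside the archival
    reference, divided by characters that either places inside it.

    Spans are (start, end) offsets into the footnote text, end exclusive.
    """
    system_chars = set(range(*system_span))
    gold_chars = set(range(*gold_span))
    union = system_chars | gold_chars
    return len(system_chars & gold_chars) / len(union) if union else 1.0
```

If a selection can be non-contiguous, each side could contribute a set of spans instead; the intersection-over-union itself is unchanged.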
Overall, we expect that annotation will require about one month, although to guard against unexpected delays we will perform assessment in batches that are each sampled in a way that would allow estimation (with broader confidence intervals) even if annotation of some of the later batches cannot be completed in the available time.

3.2. Future Evaluation Design Issues

Our goal in the Archival Reference Detection Task is to begin the process of finding and using archival references by first finding them. In future editions of the task, we can then extend the goal to include not just classification and segmentation, but also extraction of specific fields (such as title, archive, and box) and extraction of associated descriptive text from the main body of the paper that cited the archival reference. Once that has been done, we could progress to extrinsic evaluation, measuring the benefits to an actual search task of having a broader set of document-level metadata (and other forms of document-level description) on which to base its inference. This will ultimately require new test collections for the search task, since our present Folder Ranking Task test collection draws content from too narrow a subset of the full archival universe to be useful for evaluating the impact of footnotes or endnotes that could reference anything in any archive anywhere.

Our initial experience with archival reference detection points to two other potential challenges. One is that our present approach to stratified sampling is better suited to characterizing what has been found than to characterizing what has been missed. This is a natural consequence of the class skew in the classification task. In future shared task evaluations, we might want to consider the use of active learning as a way of better exploring the range of cases that all participating systems are missing [11].
A second challenge is that materials in some archives quite clearly receive more attention in the scholarly literature than materials in other archives. For example, we see anecdotally that materials in the U.K. National Archives have been cited much more often (in the small sets that we have examined to date) than materials in the U.S. National Archives, despite the holdings of those two institutions being of similar sizes. More careful study of this skewed distribution seems to be called for, and of course we can expect that the large-scale results from the Archival Reference Detection Task could serve as one useful basis for such a study.

4. Building a Research Community

Several years ago, one of us wrote a thought piece in which we identified factors that might lead to a decline in the demand for shared task information retrieval evaluations [12]. Without rehashing the complete argument, the basic idea was that many of the important affordances of shared task evaluations can now also be achieved in other ways, and that some of those ways have advantages in cost, friction, lead time, or scalability. There was, however, one exception to that prognostication, and that was the value of shared tasks for building research communities. With that in mind, we are pleased to now have six registered teams (as of late October).9 Looking toward the future, we also know some people who are working on other aspects of archival access, and we are familiar with an earlier metadata-focused cultural heritage task run at CLEF.10 We have tried to spread the word in both of those communities. We have also tried to lower barriers to starting on the SUSHI task by making baseline systems available that potential participants can easily modify, without having to develop code anew for the rather complex data handling that is needed to fully specify the training and test conditions in the Folder Ranking Task.
9. We also note, however, that two of those teams include task organizers.
10. http://ims.dei.unipd.it/data/chic/

5. Conclusion

In this paper we have described the design of the NTCIR-18 SUSHI Pilot Task, and we have identified some new evaluation questions that emerge from that work. Although SUSHI is already an NTCIR-18 Pilot Task, there are still many issues around evaluation design, community building, and scaling up the size of our test collection that we feel could benefit from further discussion at this workshop.

Acknowledgements

This work has been supported in part by the Japan Society for the Promotion of Science KAKENHI Grant Number 23KK0005 and the National Institute of Informatics Open Collaborative Research 2024 (24S0505).

References

[1] G. Wiedeman, The historical hazards of finding aids, The American Archivist 82 (2019) 381–420.
[2] B. Goldman, E. M. Tansey, W. Ray, US archival repository location data, 2023. Website https://osf.io/cft8r/, visited October 3, 2024.
[3] D. W. Oard, Known by the company it keeps: Proximity-based indexing for physical content in archival repositories, in: International Conference on Theory and Practice of Digital Libraries, 2023, pp. 17–30.
[4] T. Suzuki, D. W. Oard, E. Ishita, Y. Tomiura, Searching for physical documents in archival repositories, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2614–2618.
[5] M. Sanderson, H. Joho, Forming test collections with no system pooling, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 33–40.
[6] G. E. Box, Science and statistics, Journal of the American Statistical Association 71 (1976) 791–799.
[7] P. Arvola, J. Kekäläinen, M. Junkkari, Expected reading effort in focused retrieval evaluation, Information Retrieval 13 (2010) 460–484.
[8] N. Ferro, M. Sanderson, How do you test a test?
A multifaceted examination of significance tests, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022, pp. 280–288.
[9] T. Suzuki, D. W. Oard, E. Ishita, Y. Tomiura, Automatically detecting references from the scholarly literature to records in archives, in: Proceedings of the 25th International Conference on Asia-Pacific Digital Libraries, Springer, 2023, pp. 100–107.
[10] J. Landis, G. Koch, The measurement of observer agreement for categorical data, Biometrics (1977).
[11] G. V. Cormack, M. R. Grossman, Evaluation of machine-learning protocols for technology-assisted review in electronic discovery, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014, pp. 153–162.
[12] D. W. Oard, The future of information retrieval evaluation: NTCIR’s legacy of research impact, in: Evaluating Information Retrieval and Access Tasks, Springer, 2021.