<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-element protocol on IR experiments stability: Application to the TREC-COVID test collection</article-title>
      </title-group>
      <contrib-group>
        <aff>Univ. Grenoble Alpes, Grenoble INP, LIG, Grenoble, France</aff>
      </contrib-group>
      <abstract>
        <p>The evaluation of information retrieval systems is performed using test collections. The classical Cranfield evaluation paradigm is defined on one fixed corpus of documents and topics. Following this paradigm, several systems can only be compared over the same test collection (documents, topics, assessments). In this work, we explore in a systematic way the impact of the similarity of test collections on the comparability of the experiments, characterizing the minimal changes between the collections upon which the performance of the evaluated IR systems can be compared. To do that, we create paired instances of sub-test collections from one reference collection with controlled overlapping elements, and we compare the Rankings of Systems (RoS) of a defined list of IR systems. We can then compute the probability that the RoS are the same across the sub-test collections. We experiment with our proposed framework on the TREC-COVID collection, and two of our findings show that: a) the ranking of systems according to the MaP is very stable, even for overlaps smaller than 10%, for the documents, relevance assessments and positive relevance assessments sub-collections, and b) stability is not ensured for the MaP, Rprec, Bpref and ndcg evaluation measures even when considering a large overlap of the topics.</p>
      </abstract>
      <kwd-group>
        <kwd>Comparability</kwd>
        <kwd>Rank of systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Classical evaluation of information retrieval systems follows the Cranfield paradigm, based on
the use of a common test collection to evaluate all the systems in comparison. One evaluation
is then a snapshot of the behaviour of systems on a fixed dataset. To study the quality of a
system more broadly, a common approach is to test the system on several test collections.
Testing a system on several test collections then assesses the system's ability to answer diverse
information needs and to cope with various types of datasets. However, differences between test
collections, in terms of content, structure, or the way the collection has been compiled,
can have a huge impact on the results of a single system's evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this paper, we study
the stability of a system's evaluation across varying datasets by creating multi-dimensional
variations of a test collection.
      </p>
      <p>
        The question we focus on may have an impact on fields of IR other than evaluation:
• knowing how the evolution of a test collection affects the stability of the systems' evaluation
measures may be of great value for Web search engines, where documents, topics and
relevance assessments are constantly changing;
• deep learning approaches for IR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] commonly use N-fold validation techniques during
training. The question of which folds to use for such validation has to be answered.
Our proposal allows one to control which folds to use regarding the documents, topics and
assessments of a test collection.
      </p>
      <p>This problem has been partly studied in the state of the art but, to our knowledge, a
comprehensive study on the documents, topics and assessments does not exist yet. Our proposal is
then a) to create, from one test collection, multiple controlled pairs of sub-collections according
to the documents, topics and assessments, and b) to study the stability of the ranking of several
systems between these sub-collections. With this, we are then able to explore the impact of
these pairs. We show that the topics dimension has a greater impact than the assessments and
documents dimensions.</p>
      <p>In the following, we first present the state of the art in section 2, before detailing our proposal
in section 3. In section 4, we present the experimental setting. Section 5 details the results,
before the discussion in section 6 and the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>We present here existing works that focus on the impact of test collection variations on the
evaluation of the quality of systems.</p>
      <p>
        Classically, different test collections are used to measure and test the reproducibility of
system results [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To do that, the same systems actually have to be applied on each collection,
otherwise they cannot be compared. The problem of similarity between document corpora
has been studied for the transfer of relevance assessments across test collections [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], but such
works do not tackle the problem of comparing systems across test collections.
      </p>
      <p>
        Few works focus specifically on the impact of topics in test collections from a ranking
of systems perspective. Robertson and Kanoulas [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] find that topics are not all equal when
evaluating document retrieval, but they do not provide answers on how to make use of their
findings.
      </p>
      <p>
        Other works study the impact of corpus, assessment or topic changes on the performance of
the systems. Sanderson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Ferro and Sanderson [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] show that evaluations conducted
on several sub-collections (splits of the document corpus) lead to substantial and statistically
significant differences in the relative performance of retrieval systems. Along the same lines, Ferro
and Sanderson [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Voorhees et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] model the performance metrics as several factors
that represent the effects of the system and of the test collection used in the evaluation. They
found significant effects in the evaluation from the topics, the documents and the components of
the systems used. Recently, Zobel and Rashidi [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have shown experimental variability, using
bootstrapping techniques on the document corpus, across different performance metrics.
These works consider only random corpus splits, and they do not focus specifically, as we do
here, on detecting when the same ranking of systems is achieved.
      </p>
      <p>
        Recent work of Rashidi et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] details the impact of three document corpus characteristics:
document length, document source, and high/low rank of the document. They control the test
collection splits by a “meld factor” of the characteristics (level of difference between the splits)
and they show that each characteristic impacts the performance of the systems differently.
However, this work does not define thresholds upon which we can rely to define similar
collections. In conclusion, the state of the art shows that the performance of the systems is
affected by changes in the test collection but, to our knowledge, no focus was put on finding
when collections can be judged comparable according to changes of several of their features.
      </p>
      <p>
        Compared to the state of the art, we investigate here how changes, not limited to the
document corpus but also including topics and assessments, may affect the comparability of sets
of systems. Moreover, we investigate to what extent it is possible to compare systems evaluated
on changing test collections. Our research questions are: How can we quantify the
difference/similarity between test collections? And what differences in the test collection guarantee the
comparability of the systems' results? We hypothesize that similar test collections produce
the same Ranking of Systems (RoS), similarly to Voorhees et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Voorhees et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
as a generalization of the A-vs-B-comparison from Rashidi et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to more than 2 systems.
We investigate whether there exists a measurable level of similarity between the elements of a test
collection (documents, topics, and assessments) above which the sub-collections are considered
comparable.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Comparing Test Collections</title>
      <p>Our goal in this paper is to propose a way to estimate to what extent changes in a test collection
imply changes in the ranking of the systems tested on it. Such a problem is important to solve, as it
may be used to evaluate the stability of a test collection.</p>
      <p>Before going into the details of the framework that we build, we first formally define what
comparable test collections are.</p>
      <p>Definition 1. Two test collections C1 and C2 are comparable according to an evaluation measure
m if, for a given set S of information retrieval systems, the ranking of the systems in S according
to m is the same on C1 and C2.</p>
      <p>
        The performance of the systems evaluated on one test collection depends on the features of this
test collection [
        <xref ref-type="bibr" rid="ref6 ref7 ref9">7, 6, 9</xref>
        ]: systems may not have the same ranking across several test collections. A
test collection C is classically defined by the following components: a set of topics C.T, a set of
documents C.D, a set of Relevance Assessments (RA) C.RA (triplets (t, d, r) ∈ C.T × C.D × {0, 1}
for binary relevance assessments) and a set of evaluation measures C.M.
      </p>
      <p>
        Based on these components, we study the impact of changes using chosen elements, i.e.,
components or subsets of them. The idea of using elements that may differ from the components
allows us to study specific parts of the test collections more closely: we thus follow an
approach similar to Rashidi et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Sanderson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We study the comparability of
test collections based on changes according to these elements, assuming a single fixed evaluation
measure m from C.M (see Definition 1). In order to evaluate the stability of IR systems, we
create artificial test sub-collection pairs built from C. These pairs of sub-collections allow us to
study controlled overlaps between the elements.
      </p>
      <p>Definition 2. For one element e under consideration from a test collection C, let C1 and C2 be
sub-collections of C that differ only by the element e, with |C1.e| = |C2.e|, all the other elements
being equal. The overlapping level o of C1 and C2 is defined as o = |I| / |C1.e| with I = C1.e ∩ C2.e.</p>
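      <p>As an illustration, the overlapping level of Definition 2 can be sketched in a few lines, assuming the varying element is represented as a set of identifiers (the function name is ours, not part of the protocol):</p>
```python
def overlap_level(e1, e2):
    """Overlapping level o = |C1.e intersect C2.e| / |C1.e| (Definition 2)."""
    # The protocol assumes both sub-collections have equal-sized varying elements.
    assert len(e1) == len(e2), "varying elements must have the same size"
    return len(set(e1).intersection(e2)) / len(e1)
```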
      <p>
        Such an overlap, in [0, 1], denotes the similarity between the elements C1.e and C2.e. We force
the size of the varying elements to be constant across the different overlapping levels, to avoid
potential biases due to differences in the size of the elements considered. When studying the
impact of one element, the others are adjusted in a way to ensure the consistency of the test
collection. In our case, a document-based test sub-collection C' from a collection C, with respect
to the element C.D, defines C'.D ⊂ C.D, so that
C'.RA = {(t, d, r) | (t, d, r) ∈ C.RA, t ∈ C'.T, d ∈ C'.D}.
      </p>
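      <p>The consistency rule above amounts to restricting the relevance assessments to the triplets whose topic and document both survive in the sub-collection; a minimal sketch under our notations (not the exact implementation used):</p>
```python
def restrict_assessments(qrels, topics, docs):
    """Keep only the (topic, doc, relevance) triplets whose topic and
    document both belong to the sub-collection (consistency rule)."""
    return [(t, d, r) for (t, d, r) in qrels if t in topics and d in docs]
```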
      <p>According to this, we define an experimental protocol that assesses the comparability of test
collections according to one element e of a test collection.</p>
      <p>Definition 3. The protocol that studies the threshold of comparability for one test collection C,
one evaluation measure m, for a given set of overlap values O, according to one similarity measure
for the ranks of systems Δ applied on n sub-collection pairs, for a set of systems S and a threshold
τ, is defined as follows:
• for each overlapping level o ∈ O, build n controlled overlapping pairs (C_i,1, C_i,2) of subsets
of C according to the element e;
• compare the RoS of the set of systems S evaluated on C_i,1 on one side and on C_i,2 on the other
side, using m; this comparison is done on the ranked lists R_i,1 and R_i,2, in a way to assess the
impact of the overlap o over the element e, through the function Δ which estimates
the similarity between the lists;
• compute the probability P_e,o(Δ(R_i,1, R_i,2) &gt;= τ) that Δ(R_i,1, R_i,2) is larger than
τ for an overlap o on a given element e, for i ∈ [1, n]. This may be computed using a classical
maximum likelihood estimate on the n pairs generated, i.e.
P_e,o(Δ(R_i,1, R_i,2) &gt;= τ) = |{i | i ∈ [1, n], Δ(R_i,1, R_i,2) &gt;= τ}| / n.</p>
      <p>Following this protocol, we are able to define the minimal overlap for which the probability of
having the same RoS is large enough.</p>
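      <p>The maximum likelihood estimate of Definition 3 reduces to a simple proportion over the n generated pairs; a minimal sketch (values are illustrative):</p>
```python
def prob_same_ros(delta_values, tau=0.9):
    """P_e,o(Delta >= tau): fraction of the n sub-collection pairs whose
    RoS similarity Delta reaches the comparability threshold tau."""
    return sum(1 for delta in delta_values if delta >= tau) / len(delta_values)
```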
      <p>
        As an example, if we consider the element C.D, the protocol considers the number of
documents overlapping in the sub-test collections. For C.T, we consider the number of common
topics in both test collections. For C.RA, we extract the proportion of common judged documents
for each topic. In order to obtain robust results, n has to be large enough (typically greater
than 50 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]).
      </p>
      <p>In this part, we defined a protocol for computing the impact of the overlap between different
elements on the retrieval results. To show the feasibility of our proposal, we now present
experimental results using the TREC-COVID collection.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>
        We use the complete TREC-COVID test collection [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to measure the comparability of the RoS.
TREC-COVID is composed of 191,160 different documents, 50 topics and 69,318 assessments
(1,386 assessments per topic on average). This collection is modern (created in 2020) and
reasonably large. The documents, as well as the topics, are related to COVID. We chose not
to use the original rounds of the TREC-COVID collection because it is a residual test collection: the
systems' evaluation measures are not comparable, because the relevant documents from the
previous rounds are removed from the following ones, which affects the performance of the
systems.
      </p>
      <p>
        For each overlapping level o, we create 50 test collection pairs, so n = 50. We evaluate 10
classical IR systems, with and without Bo1 relevance feedback: S = {BM25, DLH, DirichletLM, PL2,
TF_IDF, BM25_Bo1, DLH_Bo1, DirichletLM_Bo1, PL2_Bo1, TF_IDF_Bo1}, implemented using
PyTerrier [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], with default parameter values. Similarly to Sanderson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Voorhees
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we use the Kendall Tau similarity coefficient between the different Rankings of Systems as the
Δ function: it measures the minimum number of pairwise adjacent swaps required to transform one
ranking into the other. For a given set of 50 sub-collection pairs, we average the Kendall Tau coefficients.
The threshold τ of comparability (see Definition 3) between RoS is 90% [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The overlapping
values tested are, in percentages, O = {5, ..., 100} by steps of 5%. The following classical IR
evaluation measures are reported: MaP, Rprec, Bpref and ndcg.
      </p>
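      <p>For illustration, the Δ function can be sketched as a self-contained Kendall Tau over two rankings of the same systems; our experiments rely on standard implementations, so this sketch is ours:</p>
```python
from itertools import combinations

def kendall_tau(ranking_1, ranking_2):
    """Kendall Tau between two Rankings of Systems (lists of system names,
    best first): 1.0 for identical orderings, -1.0 for reversed ones."""
    pos = {system: i for i, system in enumerate(ranking_2)}
    concordant = discordant = 0
    for (i, a), (j, b) in combinations(enumerate(ranking_1), 2):
        # A pair of systems is concordant when both rankings order it the same way.
        if (i - j) * (pos[a] - pos[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(ranking_1)
    return (concordant - discordant) / (n * (n - 1) / 2)
```
      <p>For instance, two rankings of three systems that differ by one adjacent swap yield a Tau of 1/3.</p>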
      <p>
        According to the state of the art, we define the elements for a test collection C in a
non-exhaustive way as follows:
• C.D: the set of documents (similar to Sanderson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). For C.D, 5% of overlap
between two sub-collections corresponds to 4,779 documents (2.5% of the whole collection);
• C.T: the set of topics (following Robertson and Kanoulas [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). For C.T, the full topic set of a sub-collection
is composed of 25 topics. We vary the number of overlapping topics from 4% to 96% (at
each step we include one more topic);
• C.RA: the set of assessments. This is somewhat related to the idea of Yu et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. C.RA
contains 34,659 assessments: 5% of the assessments corresponds to 1,733 assessments;
• C.RRA: in order to show that our protocol is able to cope with subsets of the components,
we study the subset of the assessments which are relevant. Namely, we study the set
C.RRA ⊂ C.RA such that
C.RRA = {(t, d, r) ∈ C.RA such that r = 1},
assuming binary relevance values. A similar question was studied by Ferro and Sanderson [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The full set of relevant assessments C.RRA contains 13,332 assessments. We vary the
number of overlapping relevant assessments from 5% to 95%; an increase of 5% of the number of
overlapping relevant assessments corresponds to 667 assessments.
      </p>
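      <p>Building one controlled overlapping pair for the C.D element can be sketched as follows (function and parameter names are ours, not those of our implementation):</p>
```python
import random

def make_overlapping_pair(corpus_ids, size, o, seed=0):
    """Draw two equal-sized document subsets sharing a fraction o of their
    elements; assumes the corpus holds at least size * (2 - o) documents."""
    rng = random.Random(seed)
    n_shared = round(o * size)
    n_distinct = size - n_shared
    # One draw, then split: a shared part plus two disjoint distinct parts.
    pool = rng.sample(list(corpus_ids), n_shared + 2 * n_distinct)
    shared = pool[:n_shared]
    d1 = set(shared + pool[n_shared:n_shared + n_distinct])
    d2 = set(shared + pool[n_shared + n_distinct:])
    return d1, d2
```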
      <p>Next, we will show the impact of each of these elements considered independently on the
TREC-COVID collection.</p>
      <p>Figure 1: (a) overlap of documents vs. probability of same RoS; (b) overlap of topics vs. probability of same RoS; (c) overlap of assessments vs. probability of same RoS; (d) overlap of positive assessments vs. probability of same RoS.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Figure 1 presents the probabilities of having the same ranking of the considered systems,
respectively on C.D, C.T, C.RA and C.RRA.</p>
      <p>For the evaluation measures presented in Figure 1a, regarding the overlap of C.D, we see, as
expected, that the probability of a similar RoS increases as the overlap increases. In this figure,
the MaP is very stable, as P_D,o(Δ(R_i,1, R_i,2) &gt;= 90%) = 1 for each document overlap o
greater than 5%. This underlines the fact that the corpus is very focused on one topic area
(COVID-related documents). The Bpref evaluation measure (green triangle) has the lowest
probability for all overlaps tested: the probability of having the same RoS is larger than 90%
only for overlaps greater than 65%. A detailed look at the 50 runs for the overlap of 5% shows
that: a) on average, the similarity value for MaP is 0.998 and for Rprec it is
0.886, both values being very large, and b) overall, 50 of the 50 MaP values are larger than 0.9
whereas only 24 of the 50 Rprec values are above 0.9. The large difference mainly comes from the
threshold τ: if τ = 0.8, then Rprec and MaP behave similarly.</p>
      <p>Figure 1b, focusing on topics, exhibits the expected behavior: the larger the overlap of topics,
the higher the probability of the same RoS. However, we see that the MaP is not as stable as
Rprec: Rprec has, most of the time, the highest probability of RoS similarity. Here,
Bpref is still the least stable measure.</p>
      <p>Figures 1c and 1d, corresponding to the overlaps of the assessments and the overlaps of the
positive assessments respectively, are the flattest ones for all the evaluation measures. MaP
and ndcg always reach a probability of 1 for each overlap value considered. For C.RA the
least stable measure is Bpref and for C.RRA it is Rprec. (TREC-COVID download link:
https://ir.nist.gov/covidSubmit/data.html)</p>
      <p>Low slopes in the graphs indicate a stable comparability across overlaps, and the intercept
is interpreted as the projected minimum comparability value when there are no overlapping
elements. The measure with the lowest slope and highest intercept, for three of the four elements,
is MaP. The comparability of test collections for Rprec is higher than for MaP only in
the topics experiments. The measure with the highest slope and lowest intercept is Bpref for all the
analyzed elements, indicating a larger sensitivity of this measure.</p>
      <p>Figure 2 presents, under a radar view, the lowest overlap values for which the probability of
having a RoS similarity larger than 0.9 is equal to 1. An overlap value of 100 means that we
did not get full stability for any partial overlap considered. We see that the only element for
which we are not able to get any stability on the TREC-COVID collection is C.T (i.e., the topic
splits).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>
        From Figures 1a, 1c, 1d and 2, we see that the MaP is able to cope with the differences
between sub-collections. When considering the documents C.D, the relevance assessments C.RA
and the relevant assessments C.RRA, the MaP is able to cope even with very low
overlaps (5%). So, MaP gives us a good comparison over two completely different collections
for the same set of topics. The only element for which none of the evaluation measures
reaches a probability of 1 is C.T, reflecting the low stability of the ranking of
systems across very similar collections according to the topics. This finding is in contradiction
with Carterette et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], in which the authors found that totally separated sets of topics led
to the same ranking of systems: it is possible that our smaller set of considered topics is the
reason. Further studies have to be conducted to validate this hypothesis.
      </p>
      <p>
        In Figure 1a, we see that the MaP, Rprec and ndcg are very high for each overlap considered.
This may be explained by the fact that the corpus focuses on one area of topics related to
COVID, and that there is a large redundancy between the documents. It might simply be the case
that MaP is stable even across different collections, but the table of Fang et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] shows
that this hypothesis does not hold. We surmise that this behavior comes from the fact that the
corpus (and the topics) are related to one quite specific domain. The probabilities for the Bpref
measure are much lower than the probabilities for the other measures, especially for the small
overlaps (below 60%). Our guess is that the splits of documents impact the assessments (if a
document is removed from a split, it is also removed from the assessment file): as there are fewer
relevant documents, there are more chances to have non-relevant documents retrieved before
relevant ones, lowering the Bpref values.
      </p>
      <p>
        Figure 1b exhibits the large impact of the overlap on topics: as the topics do not
behave similarly, the non-overlapping topics lead to very different rankings of systems. We see
on the left part of Figure 1b that the probability of a similar ranking for Bpref is very low (for
instance 0.16 for an overlap of 8%). This finding is mainly caused by the fact that the topics in
a test collection are classically built manually and are supposed to be very different (as shown
in Figure 2 of Banks et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] for instance). Such a constraint does not hold for the documents,
for which no redundancy check is performed. Going further with our protocol, by fixing
τ = 0.8 the MaP measure is able to get similar rankings for an overlap of 72% of the topics. This
shows that the Kendall Tau values obtained for the MaP are still high.
      </p>
      <p>The two elements that consider the assessments, C.RA and C.RRA, have
similar behaviors in Figures 1c and 1d: for almost all overlaps the probabilities are greater than
0.8. This shows that even partial overlaps of assessments smaller than 50% lead to similar
rankings of systems for the MaP and ndcg evaluation measures. The Rprec (precision computed
at the number of relevant documents for a topic) is more sensitive to the overlaps of positive
assessments, as this measure is based on the number of positive relevance assessments. Rprec
is especially sensitive to small overlaps of assessments because the positive assessments form
roughly one third of all available assessments.</p>
      <p>As presented above, Figure 2 graphically describes the minimal overlap, for each element
and each evaluation measure considered, that leads to a probability of 1 of getting a 90% similarity
between rankings of systems: the larger the area, the smaller the overlap. From this figure, we
see on one side that the Rprec and Bpref evaluation measures are more sensitive to the overlaps,
whatever element we consider, and on the other side that the topics are very sensitive to any
overlap ratio (orange line, surface at 100% for all evaluation measures). From this figure, we
conclude that, for the MaP and ndcg evaluation measures, having a test collection composed of only
5% of the assessments and 5% of the positive assessments leads to the same ranking of systems:
such a result may relax the need for N-fold validation in the case of the evaluation of learning-based
IR systems, or may constrain an N-fold validation experiment by using splits that lead to the same
ranking of systems.</p>
      <p>
        Our findings are quite consistent with Table 1 of Ferro and Sanderson [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] on the TREC
Adhoc T07 and T08 test collections: the MaP and ndcg measures are more sensitive to the topic
splits than to the document splits.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We have presented a protocol that supports the study of the impact of element changes of a test
collection on the ranking of systems. Our proposal formalizes this crucial part, which needs to be
defined to perform such a study. We then applied the protocol to the TREC-COVID collection. The
outcomes of this study show that the documents and the topics are the considered elements that
have the most impact on the stability of the ranking of systems. We also showed that each
evaluation measure behaves very differently in our experiments. A future work could extend
our proposal to study sub-collection overlaps across several measures.</p>
      <p>
        As future work, we would also like to extend our proposal to be able to consider
several elements jointly, so that we may detect dependencies between elements, as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Refining
some parts of the protocol will also be considered in the future, as limiting the overlaps on
sets does not cover the semantic aspects of documents and topics. The study achieved here is limited
to one test collection, and we plan to assess the stability of the results on other test collections,
especially collections with a wider and more general range of topics.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the ANR Kodicare bi-lateral project, grant ANR-19-CE23-0029 of
the French Agence Nationale de la Recherche, and by the Austrian Science Fund (FWF).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>Diagnostic evaluation of information retrieval models</article-title>
          ,
          <source>ACM Trans. Inf. Syst</source>
          .
          <volume>29</volume>
          (
          <year>2011</year>
          ). URL: https://doi.org/10.1145/1961209.1961210. doi:10.1145/1961209.1961210.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <article-title>An introduction to neural information retrieval</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>13</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>126</lpage>
          . URL: http://dx.doi.org/10.1561/1500000061.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Breuer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fuhr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schaer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>How to Measure the Reproducibility of System-oriented IR Experiments</article-title>
          ,
          <source>SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (
          <year>2020</year>
          )
          <fpage>349</fpage>
          -
          <lpage>358</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Simple techniques for cross-collection relevance feedback</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>397</fpage>
          -
          <lpage>409</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          ,
          <article-title>On per-topic variance in IR evaluation</article-title>
          ,
          <source>in: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '12,
          Association for Computing Machinery, New York, NY, USA,
          <year>2012</year>
          , p.
          <fpage>891</fpage>
          -
          <lpage>900</lpage>
          . URL: https://doi.org/10.1145/2348283.2348402. doi:10.1145/2348283.2348402.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turpin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <article-title>Differences in effectiveness across sub-collections</article-title>
          ,
          <source>ACM International Conference Proceeding Series</source>
          <volume>2006</volume>
          (
          <year>2012</year>
          )
          <fpage>1965</fpage>
          -
          <lpage>1969</lpage>
          . doi:10.1145/2396761.2398553.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <article-title>Sub-corpora impact on system effectiveness</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>901</fpage>
          -
          <lpage>904</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <article-title>Improving the accuracy of system performance estimation by using shards</article-title>
          ,
          <source>in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>805</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Samarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>Using replicates in information retrieval evaluation</article-title>
          ,
          <source>ACM Transactions on Information Systems (TOIS) 36</source>
          (
          <year>2017</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rashidi</surname>
          </string-name>
          ,
          <article-title>Corpus Bootstrapping for Assessment of the Properties of Effectiveness Measures</article-title>
          ,
          <source>International Conference on Information and Knowledge Management, Proceedings</source>
          (
          <year>2020</year>
          )
          <fpage>1933</fpage>
          -
          <lpage>1952</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rashidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          ,
          <article-title>Evaluating the Predictivity of IR Experiments</article-title>
          ,
          <source>SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (
          <year>2021</year>
          )
          <fpage>1667</fpage>
          -
          <lpage>1671</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Can old TREC collections reliably evaluate modern neural retrieval models</article-title>
          ?,
          <year>2022</year>
          . arXiv:2201.11086.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>The effect of topic set size on retrieval experiment error</article-title>
          ,
          <source>in: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '02,
          Association for Computing Machinery, New York, NY, USA,
          <year>2002</year>
          , p.
          <fpage>316</fpage>
          -
          <lpage>323</lpage>
          . URL: https://doi.org/10.1145/564376.564432. doi:10.1145/564376.564432.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bedrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>TREC-COVID: constructing a pandemic information retrieval test collection</article-title>
          ,
          <source>in: ACM SIGIR Forum</source>
          , volume
          <volume>54</volume>
          , ACM New York, NY, USA,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative experimentation in information retrieval using PyTerrier</article-title>
          ,
          <source>in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>Evaluation by highly relevant documents</article-title>
          ,
          <source>in: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          ,
          <year>2001</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sitaraman</surname>
          </string-name>
          ,
          <article-title>Minimal test collections for retrieval evaluation</article-title>
          ,
          <source>in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '06,
          Association for Computing Machinery, New York, NY, USA,
          <year>2006</year>
          , p.
          <fpage>268</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Banks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Over</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.-F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Blind men and elephants: Six approaches to trec data</article-title>
          ,
          <source>Inf. Retr.</source>
          <volume>1</volume>
          (
          <year>1999</year>
          )
          <fpage>7</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>