<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2024, Grenoble, France</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Neural Re-Ranking and Rank Fusion for Temporal Stability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marlene Gründel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malte Weber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Franke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Heinrich Merker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Friedrich-Schiller-Universität Jena</institution>
          ,
          <addr-line>07743 Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>We describe the participation of team Galápagos Tortoise in the LongEval shared task at CLEF 2024. We aim to construct a highly effective retrieval system that, unlike many popular modern models, retains its effectiveness over a long period of time. To this end, we follow two approaches: First, we experiment with different schemes to aggregate passage scores of monoT5 re-rankings. Second, we propose a weighted rank fusion of retrieval models implementing different paradigms: RankZephyr, a sparse cross-encoder, ColBERT, and BM25. Our key findings indicate that, despite our efforts, all systems exhibit a temporal decline in effectiveness. While using monoT5 with max passage aggregation outperforms mean passage aggregation on all datasets, over longer periods even more significantly, we find that monoT5 is generally too sensitive towards long-term changes to observe meaningful differences when using another aggregation scheme. Moreover, our rank fusion approach, although dominated by RankZephyr, achieves higher effectiveness than the individual fused models but is also more prone to long-term instability. This emphasizes the importance of developing hybrid models combining lexical and neural systems to obtain highly effective retrieval systems, but also shows that to achieve sustainable effectiveness, the fusion components must be selected carefully.</p>
      </abstract>
      <kwd-group>
        <kwd>Longitudinal evaluation</kwd>
        <kwd>neural ranking</kwd>
        <kwd>rank fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern retrieval systems typically use a multi-stage re-ranking architecture, where the results of a
recall-oriented (typically lexical) first-stage ranker are subsequently refined with precision-oriented
(typically neural) re-rankers [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]. Such multi-stage models perform well on test collections like
MS MARCO [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Static ad-hoc test collections, however, are prone to train-test leakage [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ] and
do not resemble the realistic scenario where documents and the use of language change over time or new
documents become available [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Current state-of-the-art models are typically trained on a fixed
dataset containing only documents up to a specific point in time [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. These models trained on
fixed-in-time data struggle to maintain their effectiveness when applied to more recent datasets [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ].
      </p>
      <p>
        The LongEval lab explores the extent to which temporal declines in retrieval effectiveness occur
with different retrieval paradigms and aims to support the development of retrieval systems that are
persistent in their effectiveness over time [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Systems are evaluated on three test sets, each covering three
months of documents and queries from query logs of the French web search engine Qwant1 in 2023.
      </p>
      <p>
        We experiment with combining more stable lexical with less stable but highly effective neural retrieval
systems in order to develop effective and long-term stable systems. We evaluate two distinct approaches
in our submissions to the LongEval shared task: (1) For the popular monoT5 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] cross-encoder model,
we evaluate the effect of using more than the best-scoring passage when aggregating the document
score after a lexical first-stage retrieval. Our assumption is that a good trade-off between effectiveness
and temporal robustness can be achieved when averaging the scores from the top-k passages for an
optimal k. And (2) we test if a rank fusion of a variety of effective lexical and neural retrieval systems
is more robust to temporal changes than a single state-of-the-art re-ranking model based on a large
language model (LLM). By using multiple systems trained on different datasets or completely unaware
of training data, we seek to improve long-term stability while not degrading effectiveness.
      </p>
      <p>To this end, we submit five runs to the LongEval shared task and test hypotheses grounded on the
aforementioned assumptions. Three runs use a combination of BM25 and PL2 lexical retrieval with
Bo1 query expansion and monoT5 re-ranking, then aggregate monoT5’s passage-level scores with
different aggregation schemes by averaging the scores of a subset of the passages. The remaining two
runs are our proposed weighted rank fusion of RankZephyr, a sparse cross-encoder model, ColBERT,
and BM25, as well as just RankZephyr as a baseline.2</p>
      <p>The results for our first group of runs using a combination of lexical first-stage retrieval, query
expansion, and monoT5 re-ranking show (1) that the nDCG effectiveness of monoT5 re-ranking still
declines over time when using top-4 average passage aggregation, (2) that the choice of the passage
aggregation scheme only marginally impacts the overall effectiveness, but also (3) that the difference
in nDCG between the aggregation schemes gets more pronounced over time. Our rank fusion of
RankZephyr with neural and lexical models slightly improves the effectiveness. Yet, both the rank
fusion and RankZephyr demonstrate stronger long-term instability than the other methods examined.
Combinations of lexical and neural systems can therefore increase the effectiveness of retrieval systems,
but are not necessarily accompanied by increased stability. Further research is needed to identify fusion
components that achieve sustainable effectiveness.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Our submission builds on prior work that proposed a way to use monoT5 in a multi-stage document
re-ranking, utilizing document expansion (e.g., using T5 [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]) to enrich documents with their keyword
representation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In our approach, we also perform a query expansion, although with Bo1 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] instead,
and we use PL2 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and a BM25 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] scoring as first-stage retrieval. In particular, BM25 is applied in
many multi-stage re-ranking architectures to retrieve candidate documents for subsequent re-ranking
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As these multi-stage re-rankers are often limited by the models’ context window, documents are
usually split after retrieval into shorter text passages, which are then passed to the re-rankers [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. After
re-ranking, several strategies are applied to aggregate the passage-level scores [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        As our second line of research, rank fusion combines rankings returned by multiple search engines
such that the combination maximizes a certain effectiveness criterion. Previous works have shown
that such combinations consistently improve retrieval effectiveness [
        <xref ref-type="bibr" rid="ref26 ref27 ref28 ref29 ref30">26, 27, 28, 29, 30</xref>
        ]. In our work,
we fuse four different retrieval models: BM25 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], a sparse cross-encoder [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], ColBERT [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and
RankZephyr [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. We employ BM25 for its robustness and frequent use in similar research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Cross-encoders are effective [
        <xref ref-type="bibr" rid="ref34 ref35 ref6">34, 6, 35</xref>
        ] but often inefficient with respect to their inference run time, memory
footprint, and energy consumption [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Compared to full attention as used in monoT5, Schlatt et al.
[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] improved the efficiency while maintaining effectiveness by combining windowed self-attention
and asymmetric cross-attention between sub-sequences [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. We use their efficient yet effective
cross-encoder model as another model for our rank fusion approach. ColBERT [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] is also used in our
rank fusion due to implementing a completely different retrieval paradigm, late interaction. With
late interaction, ColBERT strives to reconcile efficiency and contextualization while estimating the
relevance of a document for a given query. Finally, we integrate RankZephyr [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], an open-source LLM
for listwise zero-shot re-ranking that outperforms GPT-4 [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] in effectiveness on several datasets.
      </p>
      <p>
        In our system implementations, we use ranx.fuse [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], a Python library for rank fusion, and
PyTerrier [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. The PyTerrier framework implements a wide range of lexical first-stage retrieval models, such
as PL2 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and a BM25 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and also allows for composing multi-stage retrieval pipelines [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ]. The
LongEval datasets were accessed via ir_datasets [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] and its TIREx integration [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] which allowed us to
use the same containerized software during development and submission, and to archive the submission
code on TIRA [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ].
2Code and data available online: https://github.com/tira-io/ir-lab-jena-leipzig-wise-2023-galapagos-tortoise/
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>With our participation in the LongEval shared task, we pursue two different ranking approaches: First,
we compare retrieval pipelines that implement neural re-ranking with monoT5 but use differing passage
score aggregations. Second, we tune a weighted rank fusion of RankZephyr, a sparse cross-encoder,
ColBERT, and BM25 towards maximizing nDCG@10 on the LongEval data collection from January
2023 [44].</p>
      <sec id="sec-3-1">
        <title>3.1. Neural Re-Ranking with monoT5</title>
        <p>
          Our initial retrieval pipeline consists of a weighted linear score combination of a PL2 scoring [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and a
BM25 scoring [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] with Bo1 query expansion [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], the latter (BM25+Bo1) being weighted twice as high
as the PL2 score. The motivation behind our choice for this initial retrieval stage is to increase
temporal stability with a fused system of two lexical approaches while, at the same time, not tuning the
weights on the training data, to prevent a temporal bias. The top-50 results of the initial retrieval are
then re-ranked with a monoT5 cross-encoder model3 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] that has been fine-tuned on the MS MARCO
passage dataset [45].
        </p>
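        <p>As a minimal sketch of this weighted first-stage combination (plain Python with hypothetical per-document scores; our actual pipeline expresses this with PyTerrier operators):</p>

```python
def combine_scores(bm25_bo1, pl2, w_bm25_bo1=2.0, w_pl2=1.0):
    """Weighted linear combination of two lexical scorings; BM25+Bo1 is
    weighted twice as high as PL2, as in our first-stage retrieval."""
    docs = set(bm25_bo1) | set(pl2)
    return {d: w_bm25_bo1 * bm25_bo1.get(d, 0.0) + w_pl2 * pl2.get(d, 0.0)
            for d in docs}

# Hypothetical per-document scores for a single query:
bm25_bo1_scores = {"d1": 1.2, "d2": 0.8}
pl2_scores = {"d1": 0.5, "d3": 0.9}
combined = combine_scores(bm25_bo1_scores, pl2_scores)
# d1 receives 2 * 1.2 + 1 * 0.5 = 2.9
```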
        <p>
          To reduce computational complexity, the context length of the model is limited to 512 tokens. Thus,
longer web documents need to be split into shorter text passages using a sliding window approach with
a length of 400 tokens per passage and a stride of 64 tokens. The passages are scored with monoT5,
and finally, the passage-level scores are aggregated after re-ranking. Three aggregation schemes are
commonly used [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]:
• The highest score of one of its passages (max passage aggregation),
• the mean score of all of its passages (mean passage aggregation), or
• the mean score of only the top-k ranked passages (k-max average aggregation).
We have submitted one run for each of the three abovementioned aggregation schemes. To find the
parameter k for the k-max average aggregation, we ran a grid search with k = 2, 4, ..., 20 on the
LongEval data collection from June 2022, which yielded the highest nDCG score [46] at k = 4.
        </p>
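        <p>The three aggregation schemes can be sketched in plain Python (the passage scores below are hypothetical, not taken from our runs):</p>

```python
def max_passage(scores):
    """Document score is the highest score of any of its passages."""
    return max(scores)

def mean_passage(scores):
    """Document score is the mean score of all of its passages."""
    return sum(scores) / len(scores)

def k_max_average(scores, k=4):
    """Mean of the k best-scoring passages (fewer if the document is short)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)

passage_scores = [0.9, 0.7, 0.4, 0.2, 0.1]  # hypothetical monoT5 passage scores
max_passage(passage_scores)         # 0.9
mean_passage(passage_scores)        # 2.3 / 5 = 0.46
k_max_average(passage_scores, k=4)  # (0.9 + 0.7 + 0.4 + 0.2) / 4 = 0.55
```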
        <sec id="sec-3-1-1">
          <title>3.1.1. Hypotheses</title>
          <p>Concerning monoT5 re-ranking, we investigate the following two hypotheses:
Hypothesis 1. In the setting presented above, the nDCG effectiveness (or nDCG@10, respectively) of
monoT5 with max passage score aggregation is significantly higher (α = 0.05) than the effectiveness
obtained with mean passage aggregation.</p>
          <p>Hypothesis 2. In the setting presented above, when choosing k such that the nDCG effectiveness (or
nDCG@10, respectively) of the k-max average aggregation is maximized, monoT5 with k-max average
aggregation yields a significantly higher (α = 0.05) nDCG effectiveness (or nDCG@10, respectively) than
with max passage or mean passage aggregation.</p>
          <p>Hypothesis 1 builds on the intuition that documents containing relevant passages for a given query are
usually considered relevant by users despite possibly also containing irrelevant passages. Hence, the
document’s relevance would be estimated by the highest relevance of any individual passage from the
document. Non-relevant passages should not influence the aggregated scores negatively. However, we
question this rather extreme setting and argue that at least a few relevant passages should often be
required to make a document relevant. For example, even spam pages could sometimes contain relevant
passages by pure chance. Hence, averaging the scores of the best-scoring passages in a document seems
intuitive, which we express in Hypothesis 2.
3https://huggingface.co/castorini/monot5-base-msmarco</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Rank Fusion</title>
        <p>
          Our second approach proposes a weighted rank fusion where we initially retrieve documents with
BM25 [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and re-rank the top-1000 results using a rank fusion model consisting of RankZephyr [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], a
sparse cross-encoder [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], ColBERT [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], and BM25. RankZephyr is a model that surpasses GPT-4 [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]
performance on several datasets [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] but could also be susceptible to a decline in effectiveness on older
data due to its relative novelty. Therefore, other retrieval models are incorporated into the ranking
through rank fusion to offset this potential disadvantage and achieve time-resilient effectiveness. We
chose the sparse cross-encoder, ColBERT, and BM25 for the rank fusion as they are the most effective
models of their respective paradigms (cross-encoder, late interaction, and lexical ranking). Besides this
rank fusion, for comparison, we also provide a run that only uses RankZephyr (i.e., no rank fusion).
        </p>
        <p>
          The rank fusion was implemented as a weighted sum of scores using the Python library ranx.fuse [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ].
In ranx.fuse, the scores of all constituent models are computed and optimal weights are assigned to the
models’ scores based on a given training dataset. Moreover, before the results from different retrieval
models can be fused, the document scores are normalized to make them comparable. This step is
necessary because the retrieval models use different scales for scoring [47]. We used the standard
min-max normalization, shifting the minimum score to 0 and scaling the maximum score to 1 [47]. A
weighted sum was selected as the fusion method, as the weights it assigns to the constituent models’
scores are easy to interpret. We optimized the fusion for nDCG@10 on the LongEval
January 2023 dataset, which yielded the weights listed in Table 1.
        </p>
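        <p>The normalization and weighted-sum fusion can be sketched in plain Python (hypothetical scores and only two constituent systems for brevity; in our implementation this is handled by ranx):</p>

```python
def min_max_normalize(scores):
    """Shift the minimum score to 0 and scale the maximum score to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_sum_fusion(runs, weights):
    """Fuse per-document scores of several systems as a weighted sum
    of their min-max-normalized scores."""
    fused = {}
    for name, run in runs.items():
        for d, s in min_max_normalize(run).items():
            fused[d] = fused.get(d, 0.0) + weights[name] * s
    return fused

runs = {  # hypothetical raw scores for one query
    "rank_zephyr": {"d1": 10.0, "d2": 4.0},
    "bm25": {"d1": 3.0, "d2": 7.0},
}
fused = weighted_sum_fusion(runs, {"rank_zephyr": 0.7, "bm25": 0.1})
# d1: 0.7 * 1.0 + 0.1 * 0.0 = 0.7 ; d2: 0.7 * 0.0 + 0.1 * 1.0 = 0.1
```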
        <sec id="sec-3-2-1">
          <title>3.2.1. Hypotheses</title>
          <p>Based on our rank fusion approach, we investigate the following hypotheses:
Hypothesis 3. The differences in the nDCG effectiveness (or nDCG@10, respectively) observed over time
are significantly smaller (α = 0.05) for the rank fusion model described above than for just RankZephyr,
the sparse cross-encoder, ColBERT, or BM25 alone.</p>
          <p>Hypothesis 4. Retrieving documents with the optimized rank fusion model of RankZephyr, the sparse
cross-encoder, ColBERT, and BM25, as described above, achieves a significantly higher (α = 0.05) nDCG
effectiveness (or nDCG@10, respectively) than using each of these models alone.</p>
          <p>Hypothesis 3 follows the intuition that a fused model that combines different retrieval approaches
should be more persistent in its effectiveness over time because some of the systems it combines could
compensate for errors that other constituent systems make. Since retrieval systems that do not use
time-bound training data often achieve a more stable but overall poorer level of effectiveness than
neural models, we hypothesize that our rank fusion approach yields consistently higher effectiveness
than the single models, as expressed in Hypothesis 4.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Submitted Runs</title>
        <p>To improve the reproducibility of our approaches, the submitted runs are published on TIRA and can
be accessed via TIRA.4 We submitted the following five runs:
Run #1 galapagos-tortoise-bm25-bo1-pl2-monot5-max A weighted linear combination of BM25
(with Bo1 query expansion; weight: 2) and PL2 (weight: 1), re-ranked with monoT5.5 After re-ranking,
passages are aggregated by the max passage score aggregation.</p>
        <p>Run #2 galapagos-tortoise-bm25-bo1-pl2-monot5-mean A weighted linear combination of BM25
(with Bo1 query expansion; weight: 2) and PL2 (weight: 1), re-ranked with monoT5.5 After re-ranking,
passages are aggregated by the mean passage score aggregation.</p>
        <p>Run #3 galapagos-tortoise-bm25-bo1-pl2-monot5-kmax-avg-k-4 A weighted linear
combination of BM25 (with Bo1 query expansion; weight: 2) and PL2 (weight: 1), re-ranked with monoT5.5 After
re-ranking, passages are aggregated by the k-max average passage score aggregation with k = 4, which
yielded the highest nDCG on the LongEval June 2022 dataset.</p>
        <p>Run #4 galapagos-tortoise-rank-zephyr Re-ranking the top-1000 documents from BM25 with a
pre-trained RankZephyr model.</p>
        <p>Run #5 galapagos-tortoise-wsum A rank fusion (weighted sum, optimized on the January 2023
dataset) of BM25 (weight: 0.1), the sparse cross-encoder (weight: 0.1), ColBERT (weight: 0.1), and
RankZephyr (weight: 0.7) re-ranking after retrieving the top-1000 documents with BM25. The fused
models themselves were not fine-tuned.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Neural Re-Ranking with monoT5</title>
        <p>Table 2 lists the nDCG and nDCG@10 scores achieved by the three monoT5 variants on the LongEval
datasets from January, June, and August 2023. It can be seen that deploying a max passage aggregation
yields the highest nDCG and nDCG@10 scores on all three datasets. On the datasets from June and
August 2023, the scores achieved by max passage are even significantly higher than the ones obtained
with mean passage aggregation. On the January dataset, however, max, 4-max average, and mean
passage aggregation behave almost identically. As a result, the p values measured on the January dataset
are far from significant. The difference in retrieval effectiveness between max passage on
the one hand and 4-max average and mean passage on the other hand increases considerably over time.</p>
        <p>Furthermore, it seems counter-intuitive that 4-max average passage aggregation performs worse
than both max and mean passage aggregation on the January 2023 dataset, given that it is actually a
hybrid of the two extremes. It would be interesting to inspect this dataset further to get an intuition on
why it behaves fundamentally differently than the others.</p>
        <p>Re-visiting our hypotheses, we can discard Hypothesis 2, which suspected k-max average aggregation to
yield significantly higher nDCG and nDCG@10 scores than the competing passage aggregation schemes.
Hypothesis 1, stating that max passage aggregation performs significantly better than mean passage
aggregation, deserves a more careful investigation since our experiments convey highly contradictory
signals. Taking all three datasets into account, we cannot confirm Hypothesis 1.
4Submissions on the Jan. 2023 dataset: https://tira.io/task-overview/ir-lab-padua-2024/longeval-2023-01-20240426-training;
submissions on the June 2023 dataset: https://tira.io/task-overview/ir-lab-padua-2024/longeval-2023-06-20240422-training;
submissions on the August 2023 dataset: https://tira.io/task-overview/ir-lab-padua-2024/longeval-2023-08-20240422-training
5https://huggingface.co/castorini/monot5-base-msmarco</p>
        <p>Apart from our specific research questions, we notice a decline in retrieval effectiveness with respect
to all three aggregation schemes. Table 3 lists the differences in the nDCG (nDCG@10, respectively)
scores that were obtained on the January, June, and August 2023 datasets. As can be seen, in each
fixed interval the decline in effectiveness is similar for all aggregation schemes. We conclude that on
our datasets monoT5 is too sensitive towards temporal changes to make fine-tuning its aggregation a
question worth investigating further.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Rank Fusion</title>
        <p>Recall that our rank fusion model, which was trained to optimize its nDCG@10 score on the LongEval
January 2023 dataset, weights RankZephyr with 0.7 and all other models, i.e., the sparse cross-encoder,
ColBERT, and BM25, with 0.1 each. Table 4 compares the nDCG score (nDCG@10 score, respectively)
achieved by our rank fusion approach on the LongEval January, June, and August 2023 datasets with
the respective scores obtained when using each model alone. On the January and August datasets,
the fusion approach outperforms all other systems, although the difference to the scores obtained with
RankZephyr is only a slight one. On the June dataset, RankZephyr beats our fusion approach by a
narrow margin. The ranking of all other systems is stable over all datasets: the sparse cross-encoder
scores better than ColBERT, and BM25 yields the smallest nDCG and nDCG@10 scores. Since all models
we investigated re-rank the top-1000 documents retrieved by BM25, this last observation indicates that
neural re-ranking does not deteriorate nDCG scores.</p>
        <p>The calculated p values suggest that the difference between our rank fusion approach and RankZephyr
is not a significant one. This contradicts the intuition we formulated in Hypothesis 4, but seems plausible
given that in our fusion approach RankZephyr’s score gets weighted with 0.7 and hence dominates
the model. However, on all datasets the nDCG and nDCG@10 scores of our rank fusion approach
are significantly higher than the respective scores of the sparse cross-encoder, ColBERT, and BM25.
Excluding RankZephyr, Hypothesis 4 can therefore be confirmed. However, the fusion model presumably
benefits greatly from the effectiveness of RankZephyr.</p>
        <p>Table 5 visualizes the differences between nDCG scores (nDCG@10 scores, respectively) on the three
collections. Similar to our findings in Subsection 4.1, we witness a temporal decline in the effectiveness of
all retrieval systems. Moreover, we notice that our highest-performing systems, i.e., our rank fusion
approach and RankZephyr, exhibit the greatest overall temporal decline as well. We can therefore
discard Hypothesis 3, which stated that our rank fusion approach is more stable than the other models.</p>
        <p>Inspecting the values in Table 5 further, we compute the pairwise Pearson correlations [49] between
the declines of all evaluated systems and visualize the result in Table 6. As can be seen, our systems are
split into two camps, within which there is a strong pairwise correlation between the declines: the
group of systems with the highest effectiveness on the one hand, i.e., our rank fusion approach, RankZephyr,
and the evaluated sparse cross-encoder, and the group of systems with lower effectiveness on the other,
i.e., the sparse cross-encoder, ColBERT, and BM25. The sparse cross-encoder provides the link between
both camps, as its decline correlates strongly with all systems. This finding is somewhat sobering
because, regardless of how different the selected retrieval paradigms are, the decline behaves similarly
across all systems and is most drastic in our most effective systems.</p>
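        <p>The pairwise correlation of decline rates can be computed as follows (a minimal sketch; the decline values below are hypothetical, not those of Table 5):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical nDCG declines of two systems over the three intervals;
# parallel declines yield a correlation close to 1.0:
fusion_decline = [0.02, 0.05, 0.07]
zephyr_decline = [0.03, 0.06, 0.08]
r = pearson(fusion_decline, zephyr_decline)
```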
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we pursued two different research directions to improve the temporal stability of retrieval
systems. First, we experimented with different passage score aggregation schemes for monoT5
re-ranking. We hypothesized that k-max average aggregation with a tuned k should yield a higher nDCG
and nDCG@10 effectiveness than max passage aggregation, which in turn should outperform mean
passage aggregation. Second, we proposed a weighted rank fusion of RankZephyr, a sparse
cross-encoder, ColBERT, and BM25. Here, we expected the rank fusion approach to be both more effective
and more temporally stable than each of the fused models alone.</p>
      <p>Regarding neural re-ranking with monoT5 and its different aggregation schemes, max passage aggregation
indeed outperforms mean passage aggregation with respect to nDCG and nDCG@10, with more
significant differences on more recent datasets. Additionally, max passage aggregation was found to be
superior to k-max average aggregation, contrary to our hypothesis.</p>
      <p>No significant difference was found between the effectiveness of RankZephyr alone and our rank
fusion approach. Still, the fusion model yielded significantly higher nDCG and nDCG@10 effectiveness
compared to BM25, the sparse cross-encoder, and ColBERT. This improvement, however, is likely an
effect of the high effectiveness of RankZephyr and its high weight within the fusion model.</p>
      <p>We observed that, despite our efforts, the effectiveness of all evaluated retrieval systems declines over
time. Moreover, the rates at which nDCG and nDCG@10 scores decrease are highly pairwise correlated
between the high-performing rank fusion approach, RankZephyr, and the sparse cross-encoder, and are
generally higher than the decline rates of the less effective ColBERT and BM25. Effectiveness and
temporal stability seem to work against each other here.</p>
      <p>Still, it is contrary to intuition that not only the effectiveness of neural re-ranking approaches
but also that of lexical models like BM25 declines over time. While the decline in the effectiveness of neural
models is usually attributed to the increasingly stale data they were trained on, we lack a good intuition
for the temporal decline in BM25’s effectiveness. Hence, it would be worthwhile to investigate whether
the observed decline in retrieval effectiveness of several basic lexical models is statistically significant
over time, to finally distinguish systems with temporal effectiveness decline from those without.</p>
      <p>Further research is also needed to explore the effectiveness of rank fusions whose constituent models
are equally well-performing and more diverse in the conceptual retrieval approach they implement.
Conducting a larger study with diverse fusion candidates could hopefully lead to the development of
effective and temporally stable hybrid models.</p>
      <p>Our research contributes to the understanding of long-term stability in retrieval systems, providing
insights into the performance of various passage score aggregation schemes with monoT5 and rank
fusion methods. Despite observing a general decline in effectiveness over time, our findings highlight
the potential of hybrid models that integrate both neural and lexical approaches and show that further
research into optimized aggregation techniques or fusion strategies with more diverse candidates can
lead to enhanced long-term retrieval performance.</p>
      <p>[44] P. Galuscáková, R. Deveaud, G. G. Sáez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel,
LongEval-retrieval: French-English dynamic test collection for continuous web search evaluation, in: H. Chen,
W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of SIGIR 2023, ACM,
2023, pp. 3086–3094. doi:10.1145/3539618.3591921.
[45] R. F. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained
sequence-to-sequence model, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of EMNLP 2020, volume EMNLP 2020
of Findings of ACL, ACL, 2020, pp. 708–718. doi:10.18653/v1/2020.findings-emnlp.63.
[46] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst.</p>
      <p>20 (2002) 422–446. URL: http://doi.acm.org/10.1145/582415.582418. doi:10.1145/582415.582418.
[47] M. H. Montague, J. A. Aslam, Relevance score normalization for metasearch, in: Proceedings of the
2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta,
Georgia, USA, November 5-10, 2001, ACM, 2001, pp. 427–433. doi:10.1145/502585.502657.
[48] S. E. Robertson, S. Walker, Some simple efective approximations to the 2-poisson model for
probabilistic weighted retrieval, in: W. B. Croft, C. J. van Rijsbergen (Eds.), Proceedings of SIGIR
1994, ACM/Springer, 1994, pp. 232–241. doi:10.1007/978-1-4471-2099-5_24.
[49] K. Pearson, Note on regression and inheritance in the case of two parents, Proceedings of the
Royal Society of London 58 (1895) 240–242.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Matveeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Burkard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laucius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <article-title>High accuracy retrieval with multiple nested ranker</article-title>
          , in: E. N.
          <string-name>
            <surname>Efthimiadis</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hawking</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Järvelin (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2006</year>
          , ACM,
          <year>2006</year>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>444</lpage>
          . doi:10.1145/1148170.1148246.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <article-title>A cascade ranking model for efficient ranked retrieval</article-title>
          , in: W. Ma, J. Nie,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chua</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2011</year>
          , ACM,
          <year>2011</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          . doi:10.1145/2009916.2009934.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Asadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures</article-title>
          , in: G.
          <string-name>
            <surname>J. F. Jones</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Sheridan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Kelly</surname>
          </string-name>
          , M. de Rijke, T. Sakai (Eds.),
          <source>The 36th International ACM SIGIR conference on research and development in Information Retrieval</source>
          , SIGIR '13, Dublin, Ireland, July 28 - August 1,
          <year>2013</year>
          , ACM,
          <year>2013</year>
          , pp.
          <fpage>997</fpage>
          -
          <lpage>1000</lpage>
          . doi:10.1145/2484028.2484132.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Efficient cost-aware cascade ranking in multi-stage retrieval</article-title>
          , in: N.
          <string-name>
            <surname>Kando</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          de Vries, R. W. White (Eds.),
          <source>Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Shinjuku, Tokyo, Japan, August 7-11,
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>454</lpage>
          . doi:10.1145/3077136.3080819.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Mackenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Query driven algorithm selection in early stage retrieval</article-title>
          , in: Y.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
          </string-name>
          , Y. Maarek (Eds.),
          <source>Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM</source>
          <year>2018</year>
          , Marina Del Rey, CA, USA, February 5-9,
          <year>2018</year>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>404</lpage>
          . doi:10.1145/3159652.3159676.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage document ranking with BERT</article-title>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.14424. arXiv:1910.14424.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng, MS MARCO:
          <article-title>A human generated machine reading comprehension dataset</article-title>
          , in: T. R.
          <string-name>
            <surname>Besold</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bordes</surname>
          </string-name>
          , A. S.
          <string-name>
            <surname>d'Avila Garcez</surname>
          </string-name>
          , G. Wayne (Eds.),
          <source>Proceedings of CoCo@NIPS 2016</source>
          , volume
          <volume>1773</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2016</year>
          . URL: https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>The Expando-Mono-Duo design pattern for text ranking with pretrained sequence-to-sequence models</article-title>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2101.05667. arXiv:2101.05667.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Linjordet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <article-title>Sanitizing synthetic training data generation for question answering over knowledge graphs</article-title>
          , in: K.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Setty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , K. Berberich (Eds.),
          <source>ICTIR '20: The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval</source>
          , Virtual Event, Norway,
          <source>September 14-17</source>
          ,
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>128</lpage>
          . doi:10.1145/3409256.3409836.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>Hurdles to progress in long-form question answering</article-title>
          , in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hakkani-Tür</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online</source>
          , June 6-11,
          <year>2021</year>
          , Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>4940</fpage>
          -
          <lpage>4957</lpage>
          . doi:10.18653/v1/2021.naacl-main.393.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Akiki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>How train-test leakage affects zero-shot retrieval</article-title>
          , in: D.
          <string-name>
            <surname>Arroyuelo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          Poblete (Eds.),
          <source>String Processing and Information Retrieval - 29th International Symposium, SPIRE</source>
          <year>2022</year>
          , Concepción, Chile, November 8-
          <issue>10</issue>
          ,
          <year>2022</year>
          , Proceedings, volume
          <volume>13617</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2022</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>161</lpage>
          . doi:10.1007/978-3-031-20643-6_11.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Altmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Pierrehumbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Motter</surname>
          </string-name>
          ,
          <article-title>Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words</article-title>
          ,
          <source>CoRR abs/0901.2349</source>
          (
          <year>2009</year>
          ). URL: http://arxiv.org/abs/0901.2349. arXiv:0901.2349.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Labov</surname>
          </string-name>
          , Principles of linguistic change, volume
          <volume>3</volume>
          , John Wiley and Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yauney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <article-title>A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, &amp; toxicity</article-title>
          ,
          <source>CoRR abs/2305.13169</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.13169. doi:10.48550/arXiv.2305.13169. arXiv:2305.13169.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sultan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Castelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Florian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <article-title>Synthetic target domain supervision for open retrieval QA</article-title>
          ,
          <source>CoRR abs/2204.09248</source>
          (
          <year>2022</year>
          ). doi:10.48550/arXiv.2204.09248. arXiv:2204.09248.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kochkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Building for tomorrow: Assessing the temporal persistence of text classifiers</article-title>
          ,
          <source>Inf. Process. Manag.</source>
          <volume>60</volume>
          (
          <year>2023</year>
          ) 103200. doi:10.1016/j.ipm.2022.103200.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A thorough examination on zero-shot dense retrieval</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          , Singapore, December 6-
          <issue>10</issue>
          ,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>15783</fpage>
          -
          <lpage>15796</lpage>
          . doi:10.18653/v1/2023.findings-emnlp.1057.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Medina-Alias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Medina-Alias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Extended overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings</source>
          , CEUR-WS, Online,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          , CoRR abs/1904.08375 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1904.08375. arXiv:1904.08375.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <article-title>Probability models for information retrieval based on divergence from randomness</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Glasgow, UK,
          <year>2003</year>
          . URL: http://theses.gla.ac.uk/1570/.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          ,
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          . doi:10.1145/582415.582416.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          ,
          <article-title>Okapi at TREC-3</article-title>
          , in: D. K. Harman (Ed.),
          <source>Proceedings of The Third Text REtrieval Conference (TREC 1994)</source>
          , Gaithersburg, Maryland, USA, November 2-4, 1994, volume
          <volume>500</volume>
          -225 of NIST Special Publication, National Institute of Standards and Technology (NIST),
          <year>1994</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          . URL: http://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Figuerola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. A.</given-names>
            <surname>Berrocal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Á. F. Z.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          ,
          <article-title>Segmentation of web documents and retrieval of useful passages</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jijkoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Petras</surname>
          </string-name>
          , D. Santos (Eds.),
          <source>Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum</source>
          , CLEF 2007, Budapest, Hungary, September 19-21,
          <year>2007</year>
          , Revised Selected Papers, volume
          <volume>5152</volume>
          of Lecture Notes in Computer Science, Springer, 2007, pp.
          <fpage>732</fpage>
          -
          <lpage>736</lpage>
          . URL: https://doi.org/10.1007/978-3-540-85760-0_93. doi:10.1007/978-3-540-85760-0_93.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Deeper text understanding for IR with contextual neural language modeling</article-title>
          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>Gaussier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          SIGIR 2019, Paris, France, July 21-25,
          <year>2019</year>
          , ACM, 2019, pp.
          <fpage>985</fpage>
          -
          <lpage>988</lpage>
          . URL: https://doi.org/10.1145/3331184.3331303. doi:10.1145/3331184.3331303.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <article-title>Combination of multiple searches</article-title>
          , in: D. K. Harman (Ed.),
          <source>Proceedings of TREC</source>
          <year>1993</year>
          , volume
          <volume>500</volume>
          -215 of NIST Special Publication, NIST, 1993, pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          . URL: http://trec.nist.gov/pubs/trec2/papers/ps/vpi.ps.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Analyses of multiple evidence combination</article-title>
          , in:
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Belkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Narasimhalu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Willett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Can</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>1997</year>
          , ACM,
          <year>1997</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>276</lpage>
          . doi:10.1145/258525.258587.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Montague</surname>
          </string-name>
          ,
          <article-title>Models for metasearch</article-title>
          , in:
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Kraft</surname>
          </string-name>
          , J. Zobel (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2001</year>
          , ACM,
          <year>2001</year>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>284</lpage>
          . doi:10.1145/383952.384007.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lillis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Toolan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dunnion</surname>
          </string-name>
          ,
          <article-title>ProbFuse: a probabilistic approach to data fusion</article-title>
          , in:
          <string-name>
            <given-names>E. N.</given-names>
            <surname>Efthimiadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2006</year>
          , ACM,
          <year>2006</year>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          . doi:10.1145/1148170.1148197.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Büttcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms Condorcet and individual rank learning methods</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , J. Zobel (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2009</year>
          , ACM,
          <year>2009</year>
          , pp.
          <fpage>758</fpage>
          -
          <lpage>759</lpage>
          . doi:10.1145/1571941.1572114.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schlatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>Investigating the Effects of Sparse Attention on Cross-Encoders</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , I. Ounis (Eds.),
          <source>Proceedings of ECIR</source>
          <year>2024</year>
          , volume
          <volume>14608</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>190</lpage>
          . doi:10.1007/978-3-031-56027-9_11.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</article-title>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2004.12832. arXiv:2004.12832.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharifymoghaddam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!</article-title>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2312.02724. arXiv:2312.02724.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage re-ranking with BERT</article-title>
          , CoRR abs/1901.04085 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1901.04085. arXiv:1901.04085.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Efficient document re-ranking for transformers by precomputing term representations</article-title>
          , in:
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval</source>
          ,
          SIGIR 2020, Virtual Event, China, July 25-30,
          <year>2020</year>
          , ACM, 2020, pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          . doi:10.1145/3397271.3401093.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <article-title>Reduce, reuse, recycle: Green information retrieval research</article-title>
          , in: E. Amigó,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , G. Kazai (Eds.),
          <source>SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Madrid, Spain,
          <source>July 11 - 15</source>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2837</lpage>
          . doi:10.1145/3477495.3531766.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4 technical report</article-title>
          ,
          <year>2024</year>
          . arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romelli</surname>
          </string-name>
          ,
          <article-title>ranx.fuse: A Python library for metasearch</article-title>
          , in:
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of CIKM</source>
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>4808</fpage>
          -
          <lpage>4812</lpage>
          . doi:10.1145/3511808.3557207.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          , S. MacAvaney, I. Ounis,
          <article-title>PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval</article-title>
          , in:
          <source>Proceedings of CIKM</source>
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>4526</fpage>
          -
          <lpage>4533</lpage>
          . doi:10.1145/3459637.3482013.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative experimentation in information retrieval using PyTerrier</article-title>
          , in:
          <source>Proceedings of ICTIR</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>Simplified data wrangling with ir_datasets</article-title>
          , in: F. Diaz,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          , T. Sakai (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>2429</fpage>
          -
          <lpage>2436</lpage>
          . doi:10.1145/3404835.3463254.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Reimer</surname>
          </string-name>
          , S. MacAvaney,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deckers</surname>
          </string-name>
          , S. Reich,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>The information retrieval experiment platform</article-title>
          , in:
          <source>Proceedings of SIGIR</source>
          <year>2023</year>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>2826</fpage>
          -
          <lpage>2836</lpage>
          . doi:10.1145/3539618.3591888.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          , in:
          <source>Proceedings of ECIR 2023, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>