<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2024, Grenoble, France</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Neural Re-Ranking and Rank Fusion for Temporal Stability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marlene Gründel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malte Weber</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Franke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Heinrich Merker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Friedrich-Schiller-Universität Jena</institution>
          ,
          <addr-line>07743 Jena</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>We describe the participation of team Galápagos Tortoise in the LongEval shared task at CLEF 2024. We aim to construct a highly effective retrieval system that, unlike many popular modern models, retains its effectiveness over a long period of time. To this end, we follow two approaches: First, we experiment with different schemes to aggregate passage scores of monoT5 re-rankings. Second, we propose a weighted rank fusion of retrieval models implementing different paradigms: RankZephyr, a sparse cross-encoder, ColBERT, and BM25. Our key findings indicate that, despite our efforts, all systems exhibit a temporal decline in effectiveness. While using monoT5 with max passage aggregation outperforms mean passage aggregation on all datasets, over longer periods even more significantly, we find that monoT5 is generally too sensitive towards long-term changes to observe meaningful differences when using another aggregation scheme. Moreover, our rank fusion approach, although dominated by RankZephyr, achieves higher effectiveness than the individual fused models but is also more prone to long-term instability. This emphasizes the importance of developing hybrid models combining lexical and neural systems to obtain highly effective retrieval systems, but also shows that to achieve sustainable effectiveness, the fusion components must be selected carefully.</p>
      </abstract>
      <kwd-group>
        <kwd>Longitudinal evaluation</kwd>
        <kwd>neural ranking</kwd>
        <kwd>rank fusion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern retrieval systems typically use a multi-stage re-ranking architecture, where the results of a
recall-oriented (typically lexical) first-stage ranker are subsequently refined with precision-oriented
(typically neural) re-rankers [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]. Such multi-stage models perform well on test collections like
MS MARCO [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Static ad-hoc test collections, however, are prone to train-test leakage [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ] and
do not resemble the realistic scenario where documents and the use of language change over time or new
documents become available [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Current state-of-the-art models are typically trained on a fixed
dataset containing only documents up to a specific point in time [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. These models trained on
fixed-in-time data struggle to maintain their effectiveness when applied to more recent datasets [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ].
      </p>
      <p>
        The LongEval lab explores the extent to which temporal declines in retrieval effectiveness occur
with different retrieval paradigms and aims to support the development of retrieval systems that are
persistent in their effectiveness over time [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. Systems are evaluated on three test sets, each covering three
months of documents and queries from query logs of the French web search engine Qwant1 in 2023.
      </p>
      <p>
        We experiment with combining more stable lexical with less stable but highly effective neural retrieval
systems in order to develop effective and long-term stable systems. We evaluate two distinct approaches
in our submissions to the LongEval shared task: (1) For the popular monoT5 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] cross-encoder model,
we evaluate the effect of using more than the best-scoring passage when aggregating the document
score after a lexical first-stage retrieval. Our assumption is that a good trade-off between effectiveness
and temporal robustness can be achieved when averaging the scores from the top-k passages for an
optimal k. And (2) we test if a rank fusion of a variety of effective lexical and neural retrieval systems
is more robust to temporal changes than a single state-of-the-art re-ranking model based on a large
language model (LLM). By using multiple systems trained on different datasets or completely unaware
of training data, we seek to improve long-term stability while not degrading effectiveness.
      </p>
      <p>To this end, we submit five runs to the LongEval shared task and test hypotheses grounded on the
aforementioned assumptions. Three runs use a combination of BM25 and PL2 lexical retrieval with
Bo1 query expansion and monoT5 re-ranking, then aggregate monoT5’s passage-level scores with
different aggregation schemes by averaging the scores of a subset of the passages. The remaining two
runs are our proposed weighted rank fusion of RankZephyr, a sparse cross-encoder model, ColBERT,
and BM25, as well as just RankZephyr as a baseline.2</p>
      <p>The results for our first group of runs using a combination of lexical first-stage retrieval, query
expansion, and monoT5 re-ranking show (1) that the nDCG effectiveness of monoT5 re-ranking still
declines over time when using top-4 average passage aggregation, (2) that the choice of the passage
aggregation scheme only marginally impacts the overall effectiveness, but also (3) that the difference
in nDCG between the aggregation schemes gets more pronounced over time. Our rank fusion of
RankZephyr with neural and lexical models slightly improves the effectiveness. Yet, both the rank
fusion and RankZephyr demonstrate stronger long-term instability than the other methods examined.
Combinations of lexical and neural systems can therefore increase the effectiveness of retrieval systems,
but are not necessarily accompanied by increased stability. Further research is needed to identify fusion
components that achieve sustainable effectiveness.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Our submission builds on prior work that proposed a way to use monoT5 in a multi-stage document
re-ranking, utilizing document expansion (e.g., using T5 [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]) to enrich documents with their keyword
representation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In our approach, we also perform a query expansion, although with Bo1 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] instead,
and we use PL2 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and a BM25 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] scoring as first-stage retrieval. In particular, BM25 is applied in
many multi-stage re-ranking architectures to retrieve candidate documents for subsequent re-ranking
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As these multi-stage re-rankers are often limited by the models’ context window, documents are
usually split after retrieval into shorter text passages, which are then passed to the re-rankers [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. After
re-ranking, several strategies are applied to aggregate the passage-level scores [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
      </p>
      <p>
        As our second line of research, rank fusion combines rankings returned by multiple search engines
such that the combination maximizes a certain effectiveness criterion. Previous works have shown
that such combinations consistently improve retrieval effectiveness [
        <xref ref-type="bibr" rid="ref26 ref27 ref28 ref29 ref30">26, 27, 28, 29, 30</xref>
        ]. In our work,
we fuse four different retrieval models: BM25 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], a sparse cross-encoder [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], ColBERT [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] and
RankZephyr [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. We employ BM25 for its robustness and frequent use in similar research [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Cross-encoders are effective [
        <xref ref-type="bibr" rid="ref34 ref35 ref6">34, 6, 35</xref>
        ] but often inefficient with respect to their inference run time, memory
footprint, and energy consumption [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Compared to full attention as used in monoT5, Schlatt et al.
[
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] improved the efficiency while maintaining effectiveness by combining windowed self-attention
and asymmetric cross-attention between sub-sequences [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. We use their efficient yet effective
cross-encoder model as another model for our rank fusion approach. ColBERT [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] is also used in our
rank fusion due to implementing a completely different retrieval paradigm, late interaction. With
late interaction, ColBERT strives to reconcile efficiency and contextualization while estimating the
relevance of a document for a given query. Finally, we integrate RankZephyr [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], an open-source LLM
for listwise zero-shot re-ranking that outperforms GPT-4 [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] in effectiveness on several datasets.
      </p>
      <p>
        In our system implementations, we use ranx.fuse [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], a Python library for rank fusion, and
PyTerrier [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. The PyTerrier framework implements a wide range of lexical first-stage retrieval models, such
as PL2 [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] and a BM25 [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and also allows for composing multi-stage retrieval pipelines [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ]. The
LongEval datasets were accessed via ir_datasets [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] and its TIREx integration [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] which allowed us to
use the same containerized software during development and submission, and to archive the submission
code on TIRA [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ].
2Code and data available online: https://github.com/tira-io/ir-lab-jena-leipzig-wise-2023-galapagos-tortoise/
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Approach</title>
      <p>With our participation in the LongEval shared task, we pursue two different ranking approaches: First,
we compare retrieval pipelines that implement neural re-ranking with monoT5 but use differing passage
score aggregations. Second, we tune a weighted rank fusion of RankZephyr, a sparse cross-encoder,
ColBERT, and BM25 towards maximizing nDCG@10 on the LongEval data collection from January
2023 [44].</p>
      <sec id="sec-3-1">
        <title>3.1. Neural Re-Ranking with monoT5</title>
        <p>
          Our initial retrieval pipeline consists of a weighted linear score combination of a PL2 scoring [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and a
BM25 scoring [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] with Bo1 query expansion [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], the latter (BM25+Bo1) being weighted twice as high
as the PL2 score. The motivation behind our choice for this initial retrieval stage is to increase
temporal stability with a fused system of two lexical approaches while, at the same time, not tuning the
weights on the training data, to prevent a temporal bias. The top-50 results of the initial retrieval are
then re-ranked with a monoT5 cross-encoder model3 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] that has been fine-tuned on the MS MARCO
passage dataset [45].
        </p>
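        <p>As a minimal sketch of this weighted first-stage combination (plain Python with hypothetical per-document scores; our actual pipeline expresses this with PyTerrier operators):</p>

```python
def combine_scores(bm25_bo1, pl2, w_bm25_bo1=2.0, w_pl2=1.0):
    """Weighted linear combination of two lexical scorings; BM25+Bo1 is
    weighted twice as high as PL2, as in our first-stage retrieval."""
    docs = set(bm25_bo1) | set(pl2)
    return {d: w_bm25_bo1 * bm25_bo1.get(d, 0.0) + w_pl2 * pl2.get(d, 0.0)
            for d in docs}

# Hypothetical per-document scores for a single query:
bm25_bo1_scores = {"d1": 1.2, "d2": 0.8}
pl2_scores = {"d1": 0.5, "d3": 0.9}
combined = combine_scores(bm25_bo1_scores, pl2_scores)
# d1 receives 2 * 1.2 + 1 * 0.5 = 2.9
```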
        <p>
          To reduce computational complexity, the context length of the model is limited to 512 tokens. Thus,
longer web documents need to be split into shorter text passages using a sliding window approach with
a length of 400 tokens per passage and a stride of 64 tokens. The passages are scored with monoT5,
and finally, the passage-level scores are aggregated after re-ranking. Three aggregation schemes are
commonly used [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]:
• The highest score of one of its passages (max passage aggregation),
• the mean score of all of its passages (mean passage aggregation), or
• the mean score of only the top-k ranked passages (k-max average aggregation).
We have submitted one run for each of the three abovementioned aggregation schemes. To find the
parameter k for the k-max average aggregation, we ran a grid search with k = 2, 4, ..., 20 on the
LongEval data collection from June 2022, which yielded the highest nDCG score [46] at k = 4.
        </p>
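        <p>The three aggregation schemes can be sketched in plain Python (the passage scores below are hypothetical, not taken from our runs):</p>

```python
def max_passage(scores):
    """Document score is the highest score of any of its passages."""
    return max(scores)

def mean_passage(scores):
    """Document score is the mean score of all of its passages."""
    return sum(scores) / len(scores)

def k_max_average(scores, k=4):
    """Mean of the k best-scoring passages (fewer if the document is short)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)

passage_scores = [0.9, 0.7, 0.4, 0.2, 0.1]  # hypothetical monoT5 passage scores
max_passage(passage_scores)         # 0.9
mean_passage(passage_scores)        # 2.3 / 5 = 0.46
k_max_average(passage_scores, k=4)  # (0.9 + 0.7 + 0.4 + 0.2) / 4 = 0.55
```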
        <sec id="sec-3-1-1">
          <title>3.1.1. Hypotheses</title>
          <p>Concerning monoT5 re-ranking, we investigate the following two hypotheses:
Hypothesis 1. In the setting presented above, the nDCG effectiveness (or nDCG@10, respectively) of
monoT5 with max passage score aggregation is significantly higher (α = 0.05) than the effectiveness
obtained with mean passage aggregation.</p>
          <p>Hypothesis 2. In the setting presented above, when choosing k such that the nDCG effectiveness (or
nDCG@10, respectively) of the k-max average aggregation is maximized, monoT5 with k-max average
aggregation yields a significantly higher (α = 0.05) nDCG effectiveness (or nDCG@10, respectively) than
with max passage or mean passage aggregation.</p>
          <p>Hypothesis 1 builds on the intuition that documents containing relevant passages for a given query are
usually considered relevant by users despite possibly also containing irrelevant passages. Hence, the
document’s relevance would be estimated by the highest relevance of any individual passage from the
document. Non-relevant passages should not influence the aggregated scores negatively. However, we
question this rather extreme setting and argue that at least a few relevant passages should often be
required to make a document relevant. For example, even spam pages could sometimes contain relevant
passages by pure chance. Hence, averaging the scores of the best-scoring passages in a document seems
intuitive, which we express in Hypothesis 2.
3https://huggingface.co/castorini/monot5-base-msmarco</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Rank Fusion</title>
        <p>
          Our second approach proposes a weighted rank fusion where we initially retrieve documents with
BM25 [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and re-rank the top-1000 results using a rank fusion model consisting of RankZephyr [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], a
sparse cross-encoder [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], ColBERT [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], and BM25. RankZephyr is a model that surpasses GPT-4 [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]
performance on several datasets [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] but could also be susceptible to a decline in effectiveness on older
data due to its relative novelty. Therefore, other retrieval models are incorporated into the ranking
through rank fusion to offset this potential disadvantage and achieve time-resilient effectiveness. We
chose the sparse cross-encoder, ColBERT, and BM25 for the rank fusion as they are the most effective
models of their respective paradigms (cross-encoder, late interaction, and lexical ranking). Besides this
rank fusion, for comparison, we also provide a run that only uses RankZephyr (i.e., no rank fusion).
        </p>
        <p>
          The rank fusion was implemented as a weighted sum of scores using the Python library ranx.fuse [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ].
In ranx.fuse, the scores of all constituent models are computed and optimal weights are assigned to the
models’ scores based on a given training dataset. Moreover, before the results from different retrieval
models can be fused, the document scores are normalized to make them comparable. This step is
necessary because the retrieval models use different scales for scoring [47]. We used the standard
min-max normalization, shifting the minimum score to 0 and scaling the maximum score to 1 [47]. A
weighted sum was selected as the fusion method, as the weights it assigns to the constituent models’
scores are easy to interpret. We optimized the fusion for nDCG@10 on the LongEval
January 2023 dataset, which yielded the weights listed in Table 1.
        </p>
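        <p>The normalization and weighted-sum fusion can be sketched in plain Python (hypothetical scores and only two constituent systems for brevity; in our implementation this is handled by ranx):</p>

```python
def min_max_normalize(scores):
    """Shift the minimum score to 0 and scale the maximum score to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_sum_fusion(runs, weights):
    """Fuse per-document scores of several systems as a weighted sum
    of their min-max-normalized scores."""
    fused = {}
    for name, run in runs.items():
        for d, s in min_max_normalize(run).items():
            fused[d] = fused.get(d, 0.0) + weights[name] * s
    return fused

runs = {  # hypothetical raw scores for one query
    "rank_zephyr": {"d1": 10.0, "d2": 4.0},
    "bm25": {"d1": 3.0, "d2": 7.0},
}
fused = weighted_sum_fusion(runs, {"rank_zephyr": 0.7, "bm25": 0.1})
# d1: 0.7 * 1.0 + 0.1 * 0.0 = 0.7 ; d2: 0.7 * 0.0 + 0.1 * 1.0 = 0.1
```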
        <sec id="sec-3-2-1">
          <title>3.2.1. Hypotheses</title>
          <p>Based on our rank fusion approach, we investigate the following hypotheses:
Hypothesis 3. The differences in the nDCG effectiveness (or nDCG@10, respectively) observed over time
are significantly smaller (α = 0.05) for the rank fusion model described above than for just RankZephyr,
the sparse cross-encoder, ColBERT, or BM25 alone.</p>
          <p>Hypothesis 4. Retrieving documents with the optimized rank fusion model of RankZephyr, the sparse
cross-encoder, ColBERT, and BM25, as described above, achieves a significantly higher (α = 0.05) nDCG
effectiveness (or nDCG@10, respectively) than using each of these models alone.</p>
          <p>Hypothesis 3 follows the intuition that a fused model that combines different retrieval approaches
should be more persistent in its effectiveness over time because some of the systems it combines could
compensate for errors that other constituent systems make. Since retrieval systems that do not use
time-bound training data often achieve a more stable but overall poorer level of effectiveness than
neural models, we hypothesize that our rank fusion approach yields consistently higher effectiveness
than the single models, as expressed in Hypothesis 4.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Submitted Runs</title>
        <p>To improve the reproducibility of our approaches, the submitted runs are published on TIRA and can
be accessed via TIRA.4 We submitted the following five runs:
Run #1 galapagos-tortoise-bm25-bo1-pl2-monot5-max A weighted linear combination of BM25
(with Bo1 query expansion; weight: 2) and PL2 (weight: 1), re-ranked with monoT5.5 After re-ranking,
passages are aggregated by the max passage score aggregation.</p>
        <p>Run #2 galapagos-tortoise-bm25-bo1-pl2-monot5-mean A weighted linear combination of BM25
(with Bo1 query expansion; weight: 2) and PL2 (weight: 1), re-ranked with monoT5.5 After re-ranking,
passages are aggregated by the mean passage score aggregation.</p>
        <p>Run #3 galapagos-tortoise-bm25-bo1-pl2-monot5-kmax-avg-k-4 A weighted linear
combination of BM25 (with Bo1 query expansion; weight: 2) and PL2 (weight: 1), re-ranked with monoT5.5 After
re-ranking, passages are aggregated by the k-max average passage score aggregation with k = 4, which
yielded the highest nDCG on the LongEval June 2022 dataset.</p>
        <p>Run #4 galapagos-tortoise-rank-zephyr Re-ranking the top-1000 documents from BM25 with a
pre-trained RankZephyr model.</p>
        <p>Run #5 galapagos-tortoise-wsum A rank fusion (weighted sum, optimized on the January 2023
dataset) of BM25 (weight: 0.1), the sparse cross-encoder (weight: 0.1), ColBERT (weight: 0.1), and
RankZephyr (weight: 0.7) re-ranking after retrieving the top-1000 documents with BM25. The fused
models themselves were not fine-tuned.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Neural Re-Ranking with monoT5</title>
        <p>Table 2 lists the nDCG and nDCG@10 scores achieved by the three monoT5 variants on the LongEval
datasets from January, June, and August 2023. It can be seen that deploying a max passage aggregation
yields the highest nDCG and nDCG@10 scores on all three datasets. On the datasets from June and
August 2023, the scores achieved by max passage are even significantly higher than the ones obtained
with mean passage aggregation. On the January dataset, however, max, 4-max average, and mean
passage aggregation behave almost identically. As a result, the p values measured on the January dataset
are far from significant. The difference in retrieval effectiveness between max passage on
the one hand and 4-max average and mean passage on the other hand increases considerably over time.</p>
        <p>Furthermore, it seems counter-intuitive that 4-max average passage aggregation performs worse
than both max and mean passage aggregation on the January 2023 dataset, given that it is actually a
hybrid of the two extremes. It would be interesting to inspect this dataset further to get an intuition on
why it behaves fundamentally differently than the others.</p>
        <p>Re-visiting our hypotheses, we can discard Hypothesis 2, which suspected k-max average aggregation to
yield significantly higher nDCG and nDCG@10 scores than the competing passage aggregation schemes.
Hypothesis 1, stating that max passage aggregation performs significantly better than mean passage
aggregation, deserves a more careful investigation since our experiments convey highly contradictory
signals. Taking all three datasets into account, we cannot confirm Hypothesis 1.
4Submissions on the Jan. 2023 dataset: https://tira.io/task-overview/ir-lab-padua-2024/longeval-2023-01-20240426-training;
submissions on the June 2023 dataset: https://tira.io/task-overview/ir-lab-padua-2024/longeval-2023-06-20240422-training;
submissions on the August 2023 dataset: https://tira.io/task-overview/ir-lab-padua-2024/longeval-2023-08-20240422-training
5https://huggingface.co/castorini/monot5-base-msmarco</p>
        <p>Apart from our specific research questions, we notice a decline in retrieval effectiveness with respect
to all three aggregation schemes. Table 3 lists the differences in the nDCG (nDCG@10, respectively)
scores that were obtained on the January, June, and August 2023 datasets. As can be seen, in each
fixed interval the decline in effectiveness is similar for all aggregation schemes. We conclude that on
our datasets monoT5 is too sensitive towards temporal changes to make fine-tuning its aggregation a
question worth investigating further.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Rank Fusion</title>
        <p>Recall that our rank fusion model, which was trained to optimize its nDCG@10 score on the LongEval
January 2023 dataset, weights RankZephyr with 0.7 and all other models, i.e., the sparse cross-encoder,
ColBERT, and BM25, with 0.1 each. Table 4 compares the nDCG score (nDCG@10 score, respectively)
achieved by our rank fusion approach on the LongEval January, June, and August 2023 datasets with
the respective scores obtained when using each model alone. On the January and August datasets,
the fusion approach outperforms all other systems, although the difference to the scores obtained with
RankZephyr is only a slight one. On the June dataset, RankZephyr beats our fusion approach by a
narrow margin. The ranking of all other systems is stable over all datasets: the sparse cross-encoder
scores better than ColBERT, and BM25 yields the smallest nDCG and nDCG@10 scores. Since all models
we investigated re-rank the top-1000 documents retrieved by BM25, this last observation indicates that
neural re-ranking does not deteriorate nDCG scores.</p>
        <p>The calculated p values suggest that the difference between our rank fusion approach and RankZephyr
is not a significant one. This contradicts the intuition we formulated in Hypothesis 4, but seems plausible
given that in our fusion approach RankZephyr’s score gets weighted with 0.7 and hence dominates
the model. However, on all datasets the nDCG and nDCG@10 scores of our rank fusion approach
are significantly higher than the respective scores of the sparse cross-encoder, ColBERT, and BM25.
Excluding RankZephyr, Hypothesis 4 can therefore be confirmed. However, the fusion model presumably
benefits greatly from the effectiveness of RankZephyr.</p>
        <p>Table 5 visualizes the differences between nDCG scores (nDCG@10 scores, respectively) on the three
collections. Similar to our findings in Subsection 4.1, we witness a temporal decline in the effectiveness of
all retrieval systems. Moreover, we notice that our highest-performing systems, i.e., our rank fusion
approach and RankZephyr, exhibit the greatest overall temporal decline as well. We can therefore
discard Hypothesis 3, which stated that our rank fusion approach is more stable than the other models.</p>
        <p>Inspecting the values in Table 5 further, we compute the pairwise Pearson correlations [49] between
the declines of all evaluated systems and visualize the result in Table 6. As can be seen, our systems are
split into two camps, within which there is a strong pairwise correlation between the declines: the
group of systems with the highest effectiveness on the one hand, i.e., our rank fusion approach, RankZephyr,
and the evaluated sparse cross-encoder, and the group of systems with lower effectiveness on the other,
i.e., the sparse cross-encoder, ColBERT, and BM25. The sparse cross-encoder provides the link between
both camps, as its decline correlates strongly with all systems. This finding is somewhat sobering
because, regardless of how different the selected retrieval paradigms are, the decline behaves similarly
across all systems and is most drastic in our most effective systems.</p>
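        <p>The pairwise correlation of decline rates can be computed as follows (a minimal sketch; the decline values below are hypothetical, not those of Table 5):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical nDCG declines of two systems over the three intervals;
# parallel declines yield a correlation close to 1.0:
fusion_decline = [0.02, 0.05, 0.07]
zephyr_decline = [0.03, 0.06, 0.08]
r = pearson(fusion_decline, zephyr_decline)
```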
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we pursued two different research directions to improve the temporal stability of retrieval
systems. First, we experimented with different passage score aggregation schemes for monoT5
re-ranking. We hypothesized that k-max average aggregation with a tuned k should yield a higher nDCG
and nDCG@10 effectiveness than max passage aggregation, which in turn should outperform mean
passage aggregation. Second, we proposed a weighted rank fusion of RankZephyr, a sparse
cross-encoder, ColBERT, and BM25. Here, we expected the rank fusion approach to be both more effective
and more temporally stable than each of the fused models alone.</p>
      <p>Regarding neural re-ranking with monoT5 and its different aggregation schemes, max passage aggregation
indeed outperforms mean passage aggregation with respect to nDCG and nDCG@10, with more
significant differences on more recent datasets. Additionally, max passage aggregation was found to be
superior to k-max average aggregation, contrary to our hypothesis.</p>
      <p>No significant difference was found between the effectiveness of RankZephyr alone and our rank
fusion approach. Still, the fusion model yielded significantly higher nDCG and nDCG@10 effectiveness
compared to BM25, the sparse cross-encoder, and ColBERT. This improvement, however, is likely an
effect of the high effectiveness of RankZephyr and its high weight within the fusion model.</p>
      <p>We observed that, despite our efforts, the effectiveness of all evaluated retrieval systems declines over
time. Moreover, the rates at which nDCG and nDCG@10 scores decrease are highly pairwise correlated
between the high-performing rank fusion approach, RankZephyr, and the sparse cross-encoder, and are
generally higher than the decline rates of the less effective ColBERT and BM25. Effectiveness and
temporal stability seem to work against each other here.</p>
      <p>Still, it is contrary to intuition that not only the effectiveness of neural re-ranking approaches
but also that of lexical models like BM25 declines over time. While the decline in the effectiveness of neural
models is usually attributed to the increasingly stale data they were trained on, we lack a good intuition
for the temporal decline in BM25’s effectiveness. Hence, it would be worthwhile to investigate whether
the observed decline in retrieval effectiveness of several basic lexical models is statistically significant
over time, to finally distinguish systems with temporal effectiveness decline from those without.</p>
      <p>Further research is also needed to explore the effectiveness of rank fusions whose constituent models
are equally well-performing and more diverse in the conceptual retrieval approach they implement.
Conducting a larger study with diverse fusion candidates could hopefully lead to the development of
effective and temporally stable hybrid models.</p>
      <p>Our research contributes to the understanding of long-term stability in retrieval systems, providing
insights into the performance of various passage score aggregation schemes with monoT5 and rank
fusion methods. Despite observing a general decline in effectiveness over time, our findings highlight
the potential of hybrid models that integrate both neural and lexical approaches and show that further
research into optimized aggregation techniques or fusion strategies with more diverse candidates can
lead to enhanced long-term retrieval performance.</p>
      <p>[44] P. Galuscáková, R. Deveaud, G. G. Sáez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel,
LongEval-retrieval: French-English dynamic test collection for continuous web search evaluation, in: H. Chen,
W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of SIGIR 2023, ACM,
2023, pp. 3086–3094. doi:10.1145/3539618.3591921.
[45] R. F. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained
sequence-to-sequence model, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of EMNLP 2020, volume EMNLP 2020
of Findings of ACL, ACL, 2020, pp. 708–718. doi:10.18653/v1/2020.findings-emnlp.63.
[46] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst.</p>
      <p>20 (2002) 422–446. URL: http://doi.acm.org/10.1145/582415.582418. doi:10.1145/582415.582418.
[47] M. H. Montague, J. A. Aslam, Relevance score normalization for metasearch, in: Proceedings of the
2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta,
Georgia, USA, November 5-10, 2001, ACM, 2001, pp. 427–433. doi:10.1145/502585.502657.
[48] S. E. Robertson, S. Walker, Some simple efective approximations to the 2-poisson model for
probabilistic weighted retrieval, in: W. B. Croft, C. J. van Rijsbergen (Eds.), Proceedings of SIGIR
1994, ACM/Springer, 1994, pp. 232–241. doi:10.1007/978-1-4471-2099-5_24.
[49] K. Pearson, Note on regression and inheritance in the case of two parents, Proceedings of the
Royal Society of London 58 (1895) 240–242.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Matveeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Burkard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laucius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <article-title>High accuracy retrieval with multiple nested ranker</article-title>
          , in: E. N.
          <string-name>
            <surname>Efthimiadis</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hawking</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Järvelin (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2006</year>
          , ACM,
          <year>2006</year>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>444</lpage>
          . doi:10.1145/1148170.1148246.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <article-title>A cascade ranking model for efficient ranked retrieval</article-title>
          , in: W. Ma, J. Nie,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chua</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2011</year>
          , ACM,
          <year>2011</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          . doi:10.1145/2009916.2009934.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Asadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures</article-title>
          , in: G.
          <string-name>
            <surname>J. F. Jones</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Sheridan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Kelly</surname>
          </string-name>
          , M. de Rijke, T. Sakai (Eds.),
          <source>The 36th International ACM SIGIR conference on research and development in Information Retrieval</source>
          , SIGIR '13, Dublin, Ireland, July 28 - August 1,
          <year>2013</year>
          , ACM,
          <year>2013</year>
          , pp.
          <fpage>997</fpage>
          -
          <lpage>1000</lpage>
          . doi:10.1145/2484028.2484132.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gallagher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Efficient cost-aware cascade ranking in multi-stage retrieval</article-title>
          , in: N.
          <string-name>
            <surname>Kando</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          de Vries, R. W. White (Eds.),
          <source>Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Shinjuku, Tokyo, Japan, August 7-11,
          <year>2017</year>
          , ACM,
          <year>2017</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>454</lpage>
          . doi:10.1145/3077136.3080819.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Mackenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Query driven algorithm selection in early stage retrieval</article-title>
          , in: Y.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
          </string-name>
          , Y. Maarek (Eds.),
          <source>Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM</source>
          <year>2018</year>
          , Marina Del Rey, CA, USA, February 5-9,
          <year>2018</year>
          , ACM,
          <year>2018</year>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>404</lpage>
          . doi:10.1145/3159652.3159676.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage document ranking with BERT</article-title>
          (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1910.14424. arXiv:1910.14424.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng, MS MARCO:
          <article-title>A human generated machine reading comprehension dataset</article-title>
          , in: T. R.
          <string-name>
            <surname>Besold</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bordes</surname>
          </string-name>
          , A. S.
          <string-name>
            <surname>d'Avila Garcez</surname>
          </string-name>
          , G. Wayne (Eds.),
          <source>Proceedings of CoCo@NIPS 2016</source>
          , volume
          <volume>1773</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2016</year>
          . URL: https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>The Expando-Mono-Duo design pattern for text ranking with pretrained sequence-to-sequence models</article-title>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2101.05667. arXiv:2101.05667.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Linjordet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <article-title>Sanitizing synthetic training data generation for question answering over knowledge graphs</article-title>
          , in: K.
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Setty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , K. Berberich (Eds.),
          <source>ICTIR '20: The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval</source>
          , Virtual Event, Norway,
          <source>September 14-17</source>
          ,
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>121</fpage>
          -
          <lpage>128</lpage>
          . doi:10.1145/3409256.3409836.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>Hurdles to progress in long-form question answering</article-title>
          , in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hakkani-Tür</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online</source>
          , June 6-11,
          <year>2021</year>
          , Association for Computational Linguistics,
          <year>2021</year>
          , pp.
          <fpage>4940</fpage>
          -
          <lpage>4957</lpage>
          . doi:10.18653/v1/2021.naacl-main.393.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Akiki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>How train-test leakage affects zero-shot retrieval</article-title>
          , in: D.
          <string-name>
            <surname>Arroyuelo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          Poblete (Eds.),
          <source>String Processing and Information Retrieval - 29th International Symposium, SPIRE</source>
          <year>2022</year>
          , Concepción, Chile, November 8-
          <issue>10</issue>
          ,
          <year>2022</year>
          , Proceedings, volume
          <volume>13617</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2022</year>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>161</lpage>
          . doi:10.1007/978-3-031-20643-6_11.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Altmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Pierrehumbert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Motter</surname>
          </string-name>
          ,
          <article-title>Beyond word frequency: Bursts, lulls, and scaling in the temporal distributions of words</article-title>
          ,
          <source>CoRR abs/0901.2349</source>
          (
          <year>2009</year>
          ). URL: http://arxiv.org/abs/0901.2349. arXiv:0901.2349.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Labov</surname>
          </string-name>
          , Principles of linguistic change, volume
          <volume>3</volume>
          , John Wiley and Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yauney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <article-title>A pretrainer's guide to training data: Measuring the effects of data age, domain coverage, quality, &amp; toxicity</article-title>
          ,
          <source>CoRR abs/2305.13169</source>
          (
          <year>2023</year>
          ). URL: https://doi.org/10.48550/arXiv.2305.13169. doi:10.48550/arXiv.2305.13169. arXiv:2305.13169.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sultan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Castelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Florian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <article-title>Synthetic target domain supervision for open retrieval QA</article-title>
          ,
          <source>CoRR abs/2204.09248</source>
          (
          <year>2022</year>
          ). doi:10.48550/arXiv.2204.09248. arXiv:2204.09248.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kochkina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Building for tomorrow: Assessing the temporal persistence of text classifiers</article-title>
          ,
          <source>Inf. Process. Manag.</source>
          <volume>60</volume>
          (
          <year>2023</year>
          ) 103200. doi:10.1016/j.ipm.2022.103200.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A thorough examination on zero-shot dense retrieval</article-title>
          , in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          , Singapore, December 6-
          <issue>10</issue>
          ,
          <year>2023</year>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>15783</fpage>
          -
          <lpage>15796</lpage>
          . doi:10.18653/v1/2023.findings-emnlp.1057.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Medina-Alias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alkhalifa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Borkakoty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Ebshihy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iommi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Liakata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Madabushi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Medina-Alias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Extended overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings</source>
          , CEUR-WS, Online,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          , CoRR abs/1904.08375 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1904.08375. arXiv:1904.08375.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <article-title>Probability models for information retrieval based on divergence from randomness</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Glasgow, UK,
          <year>2003</year>
          . URL: http://theses.gla.ac.uk/1570/.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          ,
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          . doi:10.1145/582415.582416.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          ,
          <article-title>Okapi at TREC-3</article-title>
          , in: D. K. Harman (Ed.),
          <source>Proceedings of The Third Text REtrieval Conference (TREC 1994)</source>
          , Gaithersburg, Maryland, USA, November 2-4, 1994, volume
          <volume>500</volume>
          -225 of NIST Special Publication, National Institute of Standards and Technology (NIST),
          <year>1994</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          . URL: http://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Figuerola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. A.</given-names>
            <surname>Berrocal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Á. F. Z.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          ,
          <article-title>Segmentation of web documents and retrieval of useful passages</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jijkoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Petras</surname>
          </string-name>
          , D. Santos (Eds.),
          <source>Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum</source>
          , CLEF 2007, Budapest, Hungary, September 19-21,
          <year>2007</year>
          , Revised Selected Papers, volume
          <volume>5152</volume>
          of Lecture Notes in Computer Science, Springer, 2007, pp.
          <fpage>732</fpage>
          -
          <lpage>736</lpage>
          . URL: https://doi.org/10.1007/978-3-540-85760-0_93. doi:10.1007/978-3-540-85760-0_93.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Deeper text understanding for IR with contextual neural language modeling</article-title>
          , in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>Gaussier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          SIGIR 2019, Paris, France, July 21-25,
          <year>2019</year>
          , ACM, 2019, pp.
          <fpage>985</fpage>
          -
          <lpage>988</lpage>
          . URL: https://doi.org/10.1145/3331184.3331303. doi:10.1145/3331184.3331303.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <article-title>Combination of multiple searches</article-title>
          , in: D. K. Harman (Ed.),
          <source>Proceedings of TREC</source>
          <year>1993</year>
          , volume
          <volume>500</volume>
          -215 of NIST Special Publication, NIST, 1993, pp.
          <fpage>243</fpage>
          -
          <lpage>252</lpage>
          . URL: http://trec.nist.gov/pubs/trec2/papers/ps/vpi.ps.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Analyses of multiple evidence combination</article-title>
          , in:
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Belkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Narasimhalu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Willett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Can</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>1997</year>
          , ACM,
          <year>1997</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>276</lpage>
          . doi:10.1145/258525.258587.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Montague</surname>
          </string-name>
          ,
          <article-title>Models for metasearch</article-title>
          , in:
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Harper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Kraft</surname>
          </string-name>
          , J. Zobel (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2001</year>
          , ACM,
          <year>2001</year>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>284</lpage>
          . doi:10.1145/383952.384007.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lillis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Toolan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dunnion</surname>
          </string-name>
          ,
          <article-title>ProbFuse: a probabilistic approach to data fusion</article-title>
          , in:
          <string-name>
            <given-names>E. N.</given-names>
            <surname>Efthimiadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2006</year>
          , ACM,
          <year>2006</year>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          . doi:10.1145/1148170.1148197.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Büttcher</surname>
          </string-name>
          ,
          <article-title>Reciprocal rank fusion outperforms Condorcet and individual rank learning methods</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          , J. Zobel (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2009</year>
          , ACM,
          <year>2009</year>
          , pp.
          <fpage>758</fpage>
          -
          <lpage>759</lpage>
          . doi:10.1145/1571941.1572114.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schlatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>Investigating the Effects of Sparse Attention on Cross-Encoders</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , I. Ounis (Eds.),
          <source>Proceedings of ECIR</source>
          <year>2024</year>
          , volume
          <volume>14608</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2024</year>
          , pp.
          <fpage>173</fpage>
          -
          <lpage>190</lpage>
          . doi:10.1007/978-3-031-56027-9_11.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</article-title>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2004.12832. arXiv:2004.12832.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharifymoghaddam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!</article-title>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2312.02724. arXiv:2312.02724.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage re-ranking with BERT</article-title>
          , CoRR abs/1901.04085 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1901.04085. arXiv:1901.04085.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Efficient document re-ranking for transformers by precomputing term representations</article-title>
          , in:
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Murdock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval</source>
          ,
          SIGIR 2020, Virtual Event, China, July 25-30,
          <year>2020</year>
          , ACM, 2020, pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          . doi:10.1145/3397271.3401093.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <article-title>Reduce, reuse, recycle: Green information retrieval research</article-title>
          , in: E. Amigó,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carterette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , G. Kazai (Eds.),
          <source>SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Madrid, Spain,
          <source>July 11 - 15</source>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2837</lpage>
          . doi:10.1145/3477495.3531766.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4 technical report</article-title>
          ,
          <year>2024</year>
          . arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romelli</surname>
          </string-name>
          ,
          <article-title>ranx.fuse: A Python library for metasearch</article-title>
          , in:
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of CIKM</source>
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>4808</fpage>
          -
          <lpage>4812</lpage>
          . doi:10.1145/3511808.3557207.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          , S. MacAvaney, I. Ounis,
          <article-title>PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval</article-title>
          , in:
          <source>Proceedings of CIKM</source>
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>4526</fpage>
          -
          <lpage>4533</lpage>
          . doi:10.1145/3459637.3482013.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Declarative experimentation in information retrieval using PyTerrier</article-title>
          , in:
          <source>Proceedings of ICTIR</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>Simplified data wrangling with ir_datasets</article-title>
          , in: F. Diaz,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          , T. Sakai (Eds.),
          <source>Proceedings of SIGIR</source>
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>2429</fpage>
          -
          <lpage>2436</lpage>
          . doi:10.1145/3404835.3463254.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Reimer</surname>
          </string-name>
          , S. MacAvaney,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deckers</surname>
          </string-name>
          , S. Reich,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>The information retrieval experiment platform</article-title>
          , in:
          <source>Proceedings of SIGIR</source>
          <year>2023</year>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>2826</fpage>
          -
          <lpage>2836</lpage>
          . doi:10.1145/3539618.3591888.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous Integration for Reproducible Shared Tasks with TIRA.io</article-title>
          , in:
          <source>Proceedings of ECIR 2023, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/978-3-031-28241-6_20.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>