Team OpenWebSearch at CLEF 2024: LongEval

Daria Alexander1, Maik Fröbe2, Gijs Hendriksen1, Ferdinand Schlatt2, Matthias Hagen2, Djoerd Hiemstra1, Martin Potthast3 and Arjen P. de Vries1
1 Radboud Universiteit Nijmegen
2 Friedrich-Schiller-Universität Jena
3 University of Kassel, hessian.AI, ScaDS.AI

Abstract
We describe the OpenWebSearch group's participation in the CLEF 2024 LongEval IR track. Our submitted runs explore how historical relevance data can be transferred into future retrieval systems. To this end, we incorporate relevance information from past click logs into the query reformulation process via keyqueries and into the indexing process via a reverted index, and ultimately integrate both into learning-to-rank pipelines to ensure that retrieval is also possible for novel queries that were not seen before. Our evaluation shows that keyqueries substantially outperform other approaches for queries with historical click data available.

Keywords
learning-to-rank, query logs, keyqueries

1. Introduction

Historical data obtained from query logs may substantially help to improve the rankings of future retrieval models. The scenario of the LongEval retrieval task [1, 2, 3, 4, 5, 6, 7] aims to study this area, where retrieval models have access to relevance labels estimated from past query logs with click models to provide effective rankings in the future. Especially queries that have been seen before, i.e., for which past relevance information is available, have a high potential for highly effective rankings if the intent of the query did not drift. For example, under the simplest assumption that queries keep the same intent and that documents do not change, almost perfect rankings can be derived by simply ordering the documents for a query by their estimated relevance from past query logs. However, as query intents and document content might change substantially over time, this transfer of old relevance information to future retrieval tasks might not be straightforward.

We implement this relevance transfer for queries that overlap from past query logs to future retrieval tasks via two orthogonal concepts: (1) query reformulation with keyqueries, and (2) document reformulation. For the query reformulation, we leverage the concept of keyqueries [8, 9] that try, for a set of target documents, to identify a query that ranks the target documents highly while ensuring that the resulting query does not overfit on the target documents. For the document reformulation, we combine the concept of the corpus graph [10] with the concept of the reverted index [11]. Specifically, we identify which documents are highly similar to documents that were relevant to some query in the past (i.e., some form of corpus graph construction) and subsequently index those documents with the queries to which they were relevant in the past (i.e., some form of a reverted index).

If documents did not change their meaning and queries did not change their intent, both concepts, the query reformulation and the document reformulation, would yield ideal rankings. Still, a realistic search engine also needs to produce good rankings for new queries as well as for queries and documents whose content or meaning has changed. To address this problem and to generalize to new queries and potentially changed query intents, we incorporate our query and document reformulations into learning-to-rank models.
Learning-to-rank aims to identify a combination of features that produces an effective ranking [12]. Even in the era of pre-trained transformers [13], feature-based learning-to-rank remains important as it can integrate features not available to transformers, compensating for knowledge to which transformers have no access [14, 15]. Especially commercial search engines might combine many features, e.g., a recent leak claims that Google search incorporates more than 14,000 features into its ranking (https://sparktoro.com/blog/an-anonymous-source-shared-thousands-of-leaked-google-search-api-documents-with-me-everyone-in-seo-should-see-them). Overall, we create a set of over 100 features derived from submissions to the Workshop on Open Web Search [16, 17] and combine them with learning-to-rank in our submissions. Our code and trained LambdaMART models are available online (https://github.com/OpenWebSearch/LONGEVAL-24).

2. Related Work

We review related work on redundancy in information retrieval setups, keyqueries, and the corpus graph and reverted index.

Redundancy in Information Retrieval Setups
Normally, it is good practice to avoid redundancy between training, validation, and test splits in experiments, as otherwise the effectiveness could be overestimated due to train–test leakage [18, 19]. Especially for IR experiments, redundant documents might cause effectiveness scores to be overestimated because retrieval models get a reward for showing the same document multiple times [20, 21]. Similar problems can occur for learned models that might overfit to redundancy in the training data [22]. However, in the LongEval scenario, redundancy emerges naturally, as queries and documents might overlap over time, which is not a form of train–test leakage as the datasets are partitioned over time [1, 3]. In this setting, redundant data might be especially helpful, e.g., as previously showcased when relevance judgments were transferred from the ClueWeb09 corpus to ClueWeb12 via near-duplicate detection [23]. We follow this approach and transfer the relevance judgments to the newer dataset splits in the LongEval scenario via keyqueries and the corpus graph.

Keyqueries
The concept of keyqueries [9] aims to formulate a query that retrieves a set of target documents at the top positions and has been applied to scholarly search [24], medical search [25], privacy scenarios [26], etc. For a set 𝐷 of documents, a query 𝑞 is a keyquery against some retrieval system 𝑆 iff 𝑞 fulfills the following three conditions [9]: (1) every 𝑑 ∈ 𝐷 is in the top-𝑘 results returned by 𝑆 for 𝑞, (2) 𝑞 has at least 𝑙 results, and (3) no 𝑞′ ⊂ 𝑞 fulfills the first two conditions. The first two conditions (i.e., the parameters 𝑘 and 𝑙) determine the desired specificity and generality of a keyquery, while the third condition is a minimality constraint that avoids adding further terms to a query that already retrieves the target documents at high ranks. Previous work applied this concept only to static corpora, but we now extend it to evolving corpora in the LongEval scenario.
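To make this definition concrete, the following minimal sketch checks the three keyquery conditions for a candidate query against a generic retrieval function. The retrieve callable, the parameter defaults, and the exhaustive subquery enumeration are illustrative assumptions rather than the implementation used for our runs.

from itertools import combinations

def is_keyquery(query_terms, target_docs, retrieve, k=10, l=20):
    # `retrieve` maps a query string to a ranked list of document ids
    # (an arbitrary retrieval system S); k and l control specificity and generality.
    results = retrieve(" ".join(query_terms))
    # Condition (1): every target document is in the top-k results for the query.
    if not set(target_docs) <= set(results[:k]):
        return False
    # Condition (2): the query has at least l results.
    if len(results) < l:
        return False
    # Condition (3): no proper subquery already fulfills conditions (1) and (2).
    for size in range(1, len(query_terms)):
        for subquery in combinations(query_terms, size):
            sub_results = retrieve(" ".join(subquery))
            if set(target_docs) <= set(sub_results[:k]) and len(sub_results) >= l:
                return False
    return True

Since the number of subqueries grows exponentially with the query length, a practical implementation would restrict or prune this enumeration.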
Corpus Graph and Reverted Indexes
The corpus graph [10] consists of nodes that correspond to documents in the corpus and edges that are formed based on the similarity of documents. This similarity is either lexical or semantic and is used in a re-ranking scenario to also consider documents that are highly similar to the top-ranked documents to improve recall [10]. The reverted index [11] directly stores which documents should be ranked for which queries. We combine both concepts in the LongEval scenario: by building a corpus graph between the documents that were relevant to some query in the past and the documents in the current corpus, we index the current documents into a reverted index.

3. Methodology

Our participation at LongEval last year aimed at finding out whether generating multiple query variants for the same information need improves retrieval effectiveness. We generated query variants using ChatGPT and fused the ranking results obtained with the original query and the different query variants. We found that query variant generation improves over time and follows the same trend as the BM25 baseline, demonstrating its robustness [27]. Still, the improvements were only minor. However, last year we did not explore the information that is provided by the documents from the past and whether this information can be useful for the future. Therefore, we decided to extend the queries with terms from the previously relevant documents. Also, for the non-overlapping queries, we wanted a system that does not rely on information from the past. For that, we used a learning-to-rank approach that utilizes features from the components submitted to the Workshop on Open Web Search (WOWS) [16].

3.1. Keyqueries

We noticed that queries overlap across different time slots, and in case their intent stays the same, we aim to transfer their relevance information to the new time slots. Consequently, for those queries we know which documents were clicked a few months ago. We decided to use this feedback together with query expansion via the BO1 model [28] to create keyqueries, following the same approach as [25]. Thereby, we use BO1 to obtain candidate expansion terms, as pilot experiments showed that BO1 expansion terms yield higher effectiveness than RM3 [29] expansions. We inserted the clicked documents into the current corpus and reformulated the queries with the BO1 model until those documents were in the top positions. Afterwards, we removed the old documents from the ranking. This implementation of the keyquery concept is not the most effective one; more effective approaches that leverage a generate-and-test paradigm [26] (i.e., explicitly generating many query variants and selecting the variants that are highly effective) exist and are interesting directions for future work.

3.2. Reverted index

First, we identify documents in the new corpus that are highly similar to documents that were relevant to some query in the past (we use all available past data). We find those candidates by building a PyTerrier index of the new corpus and submitting every relevant document from the past as a query against the new corpus to retrieve its 10 nearest neighbors according to BM25. We then create a reverted index by indexing the document at position 1 with 10 copies of the query to which the past document was relevant, the document at position 2 with 9 copies, etc. For the final retrieval, we use BM25 against this constructed reverted index.
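The following sketch illustrates this reverted-index construction with PyTerrier. The new_corpus and past_qrels variables, the index paths, and the punctuation cleanup are illustrative assumptions; the submitted runs may differ in implementation details.

import re
from collections import defaultdict
import pyterrier as pt

if not pt.started():
    pt.init()

# new_corpus: iterable of {"docno": ..., "text": ...} for the current snapshot (assumption)
# past_qrels: list of (query_text, relevant_document_text) pairs from past click logs (assumption)
index_ref = pt.IterDictIndexer("./new-corpus-index").index(new_corpus)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")

def clean(text):
    # Terrier's query parser rejects most punctuation, so keep only word characters.
    return re.sub(r"[^\w\s]", " ", text)

reverted_docs = defaultdict(list)
for query_text, rel_doc_text in past_qrels:
    # Submit the past relevant document as a query to find its 10 nearest neighbors.
    neighbours = bm25.search(clean(rel_doc_text)).head(10)
    for rank, docno in enumerate(neighbours["docno"]):
        # The neighbor at position 1 gets 10 copies of the query, position 2 gets 9, etc.
        reverted_docs[docno] += [query_text] * (10 - rank)

reverted_index = pt.IterDictIndexer("./reverted-index").index(
    {"docno": docno, "text": " ".join(queries)} for docno, queries in reverted_docs.items()
)
# Final retrieval: BM25 against the constructed reverted index.
reverted_bm25 = pt.BatchRetrieve(reverted_index, wmodel="BM25")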
3.3. Learning to Rank

While a large share of the queries in the test collections overlap with the queries in the training splits, this is of course not the case for all queries. Hence, we also needed a system that can be used when information from the past cannot be exploited directly. For these cases, we developed a simple learning-to-rank approach that uses features from a large number of components submitted to the Workshop on Open Web Search (WOWS) [16]. For our learning-to-rank systems, we re-ranked the top 100 BM25 results using LambdaMART [30]. We implemented our pipelines with PyTerrier [31], using LightGBM [32] for the LambdaMART implementation. The feature extraction components were all executed in TIREx [33] once, after which their outputs were cached for easy repeated experimentation. We split the 2024 training set into a training and a validation split, which we used to tune LightGBM's hyperparameters. We performed several runs, each with different subsets of features:

ows-ltr-wows-base-rerank: Query-only scores (QPP scores [34], classified intents [35], and health-relatedness [36]); document-only scores (health-relatedness [36], classified genre [37], and readability scores [37]); and lexical matching models built into PyTerrier (BM25, PL2, DirichletLM, DLH, and LGD).

ows-ltr-wows-all-rerank: The features from ows-ltr-wows-base-rerank plus additional neural-based query-document scores (RankZephyr [38], Sparse Cross Encoder [39], LiT5 [40], SBERT [41], MonoT5 [42], ColBERT [43], and ANCE [44]).

ows-ltr-wows-rerank-and-reverted-index: The features from ows-ltr-wows-all-rerank, plus three features related to the reverted index: 1) whether the query-document pair has been encountered in the past, 2) the maximum score for this query-document pair in the past, and 3) the mean score for this query-document pair in the past.

ows-ltr-wows-rerank-and-keyquery: The features from ows-ltr-wows-all-rerank, plus two keyquery-related features: 1) whether this query-document pair has been encountered in the keyquery run, and 2) the score of this query-document pair in the keyquery run.

ows-ltr-all: A combination of all features described above.

Table 1: The effectiveness of the seven submitted runs and the BM25 baseline on the June and August 2023 test sets. We report nDCG and nDCG@10, as well as nDCG and nDCG@10 when unjudged documents are removed (Cond. nDCG and Cond. nDCG@10).

Approach / Run                             |     nDCG      |    nDCG@10    |  Cond. nDCG   | Cond. nDCG@10
                                           | June   August | June   August | June   August | June   August
ows-bm25-bo1-keyqueries                    | 0.332  0.242  | 0.240  0.190  | 0.471  0.350  | 0.448  0.343
ows-bm25-reverted-index                    | 0.305  0.228  | 0.241  0.192  | 0.400  0.307  | 0.390  0.305
ows-ltr-all                                | 0.293  0.223  | 0.226  0.186  | 0.395  0.305  | 0.384  0.303
ows-ltr-wows-rerank-and-reverted-index     | 0.289  0.212  | 0.219  0.172  | 0.402  0.305  | 0.393  0.302
ows-ltr-wows-rerank-and-keyquery           | 0.283  0.216  | 0.212  0.176  | 0.396  0.306  | 0.386  0.303
ows-ltr-wows-all-rerank                    | 0.245  0.204  | 0.155  0.158  | 0.389  0.304  | 0.378  0.301
ows-ltr-wows-base-rerank                   | 0.239  0.177  | 0.151  0.120  | 0.390  0.301  | 0.378  0.298
BM25                                       | 0.252  0.191  | 0.166  0.141  | 0.388  0.300  | 0.375  0.297

Note that some of the features – especially the neural query-document features – can be prohibitively expensive to compute in a real-world system. Our learning-to-rank results thus indicate the theoretical performance of a system using all of these models together, while in practice, a system might only use a small subset of them.
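The following is a reduced sketch of such a PyTerrier learning-to-rank pipeline, using only a handful of built-in lexical weighting models in place of the full set of cached WOWS component scores; the index path, the topic and qrel variables, and the LightGBM hyperparameters are illustrative assumptions.

import lightgbm as lgb
import pyterrier as pt

if not pt.started():
    pt.init()

index = pt.IndexFactory.of("./new-corpus-index")

# First stage: BM25 candidates with additional lexical feature scores per document.
features = pt.FeaturesBatchRetrieve(
    index,
    wmodel="BM25",
    features=["WMODEL:PL2", "WMODEL:DirichletLM", "WMODEL:DLH", "WMODEL:LGD"],
) % 100  # re-rank only the top 100 BM25 results

# LambdaMART as provided by LightGBM's LGBMRanker.
lmart = lgb.LGBMRanker(objective="lambdarank", metric="ndcg",
                       n_estimators=500, learning_rate=0.05)
pipeline = features >> pt.ltr.apply_learned_model(lmart, form="ltr")

# Fit on one part of the 2024 training data, validate on the held-out part.
pipeline.fit(train_topics, train_qrels, validation_topics, validation_qrels)
run = pipeline.transform(test_topics)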
4. Results

We evaluate our submitted runs on all queries and on only the overlapping queries.

4.1. Results for all queries and overlapping queries

We report the nDCG [45] without a cutoff and at a cutoff of 10, as well as condensed variants in which all unjudged documents are removed [46] (although this handles the effects of unjudged documents better than dedicated measures like Bpref [46], it is known to overestimate the effectiveness [47], which was only recently confirmed [48]). The share of unjudged documents is 68–77% for June and 73–82% for August 2023 (cutoff at 10), depending on the run. Table 1 shows that most of the runs outperform the baseline, and the baseline is never the best approach. This is a clear improvement compared to last year, when the baseline still outperformed several of our runs. Keyqueries outperform the other approaches, with the reverted index performing on par in terms of nDCG@10.

In Table 2, we present the results for the queries that overlap between January, June, and August. Overall, our scores are higher when considering only the overlapping queries rather than all queries. We observe that utilizing the information from the past click logs is especially beneficial for the queries that were used before. Also, the approaches that use previously clicked documents perform much better than the approaches that do not use any historical information.

Table 2: The effectiveness of the seven submitted runs and the BM25 baseline on queries that overlap between the January 2023 train set, the June 2023 test set, and the August 2023 test set: 126 queries in June and 141 queries in August. We report nDCG and nDCG@10, as well as nDCG and nDCG@10 when unjudged documents are removed (Cond. nDCG and Cond. nDCG@10).

Approach / Run                             |     nDCG      |    nDCG@10    |  Cond. nDCG   | Cond. nDCG@10
                                           | June   August | June   August | June   August | June   August
ows-bm25-bo1-keyqueries                    | 0.408  0.315  | 0.267  0.223  | 0.606  0.494  | 0.572  0.488
ows-bm25-reverted-index                    | 0.334  0.266  | 0.263  0.219  | 0.439  0.366  | 0.429  0.371
ows-ltr-all                                | 0.305  0.242  | 0.224  0.175  | 0.426  0.352  | 0.414  0.355
ows-ltr-wows-rerank-and-reverted-index     | 0.301  0.244  | 0.219  0.187  | 0.432  0.361  | 0.424  0.364
ows-ltr-wows-rerank-and-keyquery           | 0.293  0.240  | 0.206  0.178  | 0.425  0.358  | 0.414  0.362
ows-ltr-wows-all-rerank                    | 0.246  0.232  | 0.138  0.162  | 0.419  0.353  | 0.405  0.354
ows-ltr-wows-base-rerank                   | 0.253  0.194  | 0.157  0.112  | 0.421  0.352  | 0.408  0.353
BM25                                       | 0.270  0.213  | 0.168  0.143  | 0.423  0.346  | 0.407  0.346

4.2. Learning to rank feature importance

Since our learning-to-rank approach uses a large number of different features, we were curious to see which features have the largest impact on the performance of the model. We inspect the 'gain' feature importance scores as reported by LightGBM [32], i.e., for each feature, the total gain obtained by splits in the decision trees in which that feature was used.

[Figure 1: Feature importance per feature type (reverted index, keyquery, query, document, lexical, and neural features). For feature importance, we use the 'gain' metric, which measures, per feature, the total performance gain obtained from splits using that feature.]

Figure 1 shows the feature importances per feature type. As can be expected, the query-document scores have the largest impact on the performance of the model, with the neural matching models being most important overall. Interestingly, the reverted index and keyquery features do not seem to help the model all that much, even though we have seen large improvements in performance when we use those techniques directly (as opposed to only using them as features in the learning-to-rank model).
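For reference, the 'gain' importances can be read directly from the fitted LightGBM model roughly as follows; the lmart ranker is assumed to be the one from the pipeline sketch in Section 3.3, and feature_names is a hypothetical list of feature labels.

import pandas as pd

# 'gain' importance: total gain of all tree splits that use the feature.
importances = pd.DataFrame({
    "feature": feature_names,  # labels in the same order as the feature columns (assumption)
    "gain": lmart.booster_.feature_importance(importance_type="gain"),
}).sort_values("gain", ascending=False)

print(importances.head(5))  # the five most important features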
In Figure 2, we explore the five most important features among the query-only, document-only, and query-document features.

[Figure 2: Top 5 most important features for query-only, document-only, and query-document (lexical and neural matching scores) features. For feature importance, we use the 'gain' metric, which measures, per feature, the total performance gain obtained from splits using that feature.]

For the lexical matching models, we see that BM25 is the least important by a large margin. This could be caused by the fact that we already use BM25 to select the top 100 documents before re-ranking, so the BM25 scores are already incorporated in the ranking. The neural ranking models, which were the most important features, still vary quite a bit in their importance. Interestingly, the sparse cross-encoder is weighted more heavily than models with full attention mechanisms like MonoT5, LiT5, and even more powerful models like RankZephyr. Similarly, SBERT, a bi-encoder model, is also deemed quite important by LightGBM. Importantly, this teaches us that we might not even need the most performant (e.g., full-attention cross-encoder) models in our pipeline; using more lightweight models in a learning-to-rank setting might already boost performance by a large margin.

5. Conclusion

We presented the Open Web Search (OWS) team's submission to the LongEval shared task at CLEF 2024. The motivation behind our approach was twofold. For previously encountered queries, we made explicit use of the documents clicked in the past, either through a keyquery approach or by finding documents in the new corpus that are similar to the clicked documents. For unseen queries, we applied a learning-to-rank model with a variety of query-only, document-only, and query-document features. Our results show that making explicit use of clicked documents for previously encountered queries heavily improves the performance of our system, even when the corpus has evolved in the meantime.

Acknowledgments

This work has received funding from the European Union's Horizon Europe research and innovation program under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).

References

[1] R. Alkhalifa, I. M. Bilal, H. Borkakoty, J. Camacho-Collados, R. Deveaud, A. El-Ebshihy, L. E. Anke, G. G. Sáez, P. Galuscáková, L. Goeuriot, E. Kochkina, M. Liakata, D. Loureiro, H. T. Madabushi, P. Mulhem, F. Piroi, M. Popel, C. Servan, A. Zubiaga, Longeval: Longitudinal evaluation of model performance at CLEF 2023, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part III, volume 13982 of Lecture Notes in Computer Science, Springer, 2023, pp. 499–505. URL: https://doi.org/10.1007/978-3-031-28241-6_58. doi:10.1007/978-3-031-28241-6_58.

[2] R. Alkhalifa, I. M. Bilal, H. Borkakoty, J. Camacho-Collados, R. Deveaud, A. El-Ebshihy, L. E. Anke, G. N. G. Sáez, P. Galuscáková, L. Goeuriot, E. Kochkina, M. Liakata, D. Loureiro, P. Mulhem, F. Piroi, M. Popel, C. Servan, H. T. Madabushi, A. Zubiaga, Extended overview of the CLEF-2023 longeval lab on longitudinal evaluation of model performance, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2181–2203. URL: https://ceur-ws.org/Vol-3497/paper-184.pdf.
[3] P. Galuscáková, R. Deveaud, G. G. Sáez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, Longeval-retrieval: French-english dynamic test collection for continuous web search evaluation, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 3086–3094. URL: https://doi.org/10.1145/3539618.3591921. doi:10.1145/3539618.3591921.

[4] R. Alkhalifa, H. Borkakoty, R. Deveaud, A. El-Ebshihy, L. E. Anke, T. Fink, G. G. Sáez, P. Galuscáková, L. Goeuriot, D. Iommi, M. Liakata, H. T. Madabushi, P. Medina-Alias, P. Mulhem, F. Piroi, M. Popel, C. Servan, A. Zubiaga, Longeval: Longitudinal evaluation of model performance at CLEF 2024, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval - 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24-28, 2024, Proceedings, Part VI, volume 14613 of Lecture Notes in Computer Science, Springer, 2024, pp. 60–66. URL: https://doi.org/10.1007/978-3-031-56072-9_8. doi:10.1007/978-3-031-56072-9_8.

[5] R. Alkhalifa, H. Borkakoty, R. Deveaud, A. El-Ebshihy, L. E. Anke, T. Fink, G. G. Sáez, P. Galuscáková, L. Goeuriot, D. Iommi, M. Liakata, H. T. Madabushi, P. Medina-Alias, P. Mulhem, F. Piroi, M. Popel, C. Servan, A. Zubiaga, Overview of the CLEF-2023 LongEval Lab on Longitudinal Evaluation of Model Performance, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.

[6] R. Alkhalifa, H. Borkakoty, R. Deveaud, A. El-Ebshihy, L. Espinosa-Anke, T. Fink, P. Galuščáková, G. Gonzalez-Saez, L. Goeuriot, D. Iommi, M. Liakata, H. T. Madabushi, P. Medina-Alias, P. Mulhem, F. Piroi, M. Popel, A. Zubiaga, Overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. D. Nunzio, P. Galuščáková, A. G. S. de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany, 2024.

[7] R. Alkhalifa, H. Borkakoty, R. Deveaud, A. El-Ebshihy, L. Espinosa-Anke, T. Fink, P. Galuščáková, G. Gonzalez-Saez, L. Goeuriot, D. Iommi, M. Liakata, H. T. Madabushi, P. Medina-Alias, P. Mulhem, F. Piroi, M. Popel, A. Zubiaga, Extended overview of the CLEF 2024 LongEval Lab on Longitudinal Evaluation of Model Performance, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS, Online, 2024.

[8] T. Gollub, M. Hagen, M. Michel, B. Stein, From Keywords to Keyqueries: Content Descriptors for the Web, in: C. Gurrin, G. Jones, D. Kelly, U. Kruschwitz, M. de Rijke, T. Sakai, P. Sheridan (Eds.), 36th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2013), ACM, 2013, pp. 981–984. doi:10.1145/2484028.2484181.
[9] M. Hagen, A. Beyer, T. Gollub, K. Komlossy, B. Stein, Supporting Scholarly Search with Keyqueries, in: N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. Di Nunzio, C. Hauff, G. Silvello (Eds.), Advances in Information Retrieval. 38th European Conference on IR Research (ECIR 2016), volume 9626 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2016, pp. 507–520. doi:10.1007/978-3-319-30671-1_37.

[10] S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in: M. A. Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, ACM, 2022, pp. 1491–1500. URL: https://doi.org/10.1145/3511808.3557231. doi:10.1145/3511808.3557231.

[11] J. Pickens, M. Cooper, G. Golovchinsky, Reverted indexing for feedback and expansion, in: J. X. Huang, N. Koudas, G. J. F. Jones, X. Wu, K. Collins-Thompson, A. An (Eds.), Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010, ACM, 2010, pp. 1049–1058. URL: https://doi.org/10.1145/1871437.1871571. doi:10.1145/1871437.1871571.

[12] T. Liu, Learning to Rank for Information Retrieval, Springer, 2011. URL: https://doi.org/10.1007/978-3-642-14267-3. doi:10.1007/978-3-642-14267-3.

[13] J. Lin, R. F. Nogueira, A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond, Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, 2021. URL: https://doi.org/10.2200/S01123ED1V01Y202108HLT053. doi:10.2200/S01123ED1V01Y202108HLT053.

[14] D. Dato, S. MacAvaney, F. M. Nardini, R. Perego, N. Tonellotto, The istella22 dataset: Bridging traditional and neural learning to rank evaluation, in: E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J. S. Culpepper, G. Kazai (Eds.), SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, ACM, 2022, pp. 3099–3107. URL: https://doi.org/10.1145/3477495.3531740. doi:10.1145/3477495.3531740.

[15] M. Fröbe, S. Günther, M. Probst, M. Potthast, M. Hagen, The Power of Anchor Text in the Neural Retrieval Era, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V. Setty (Eds.), Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022.

[16] S. M. Farzana, M. Fröbe, M. Granitzer, G. Hendriksen, D. Hiemstra, M. Potthast, S. Zerhoudi, The first international workshop on open web search (wows), in: European Conference on Information Retrieval, Springer, 2024, pp. 426–431.

[17] S. M. Farzana, M. Fröbe, M. Granitzer, G. Hendriksen, D. Hiemstra, M. Potthast, S. Zerhoudi (Eds.), Proceedings of the first International Workshop on Open Web Search co-located with the 46th European Conference on Information Retrieval ECIR 2024, number 3689 in CEUR Workshop Proceedings, 2024. URL: https://ceur-ws.org/Vol-3689/.

[18] K. Krishna, A. Roy, M. Iyyer, Hurdles to progress in long-form question answering, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 4940–4957. URL: https://doi.org/10.18653/v1/2021.naacl-main.393. doi:10.18653/V1/2021.NAACL-MAIN.393.
[19] M. Fröbe, C. Akiki, M. Potthast, M. Hagen, How Train-Test Leakage Affects Zero-shot Retrieval, in: D. Arroyuelo, B. Poblete (Eds.), 29th International Symposium on String Processing and Information Retrieval (SPIRE 2022), volume 13617, Concepción, Chile, 2022. doi:10.1007/978-3-031-20643-6_11.

[20] Y. Bernstein, J. Zobel, Redundant documents and search effectiveness, in: O. Herzog, H. Schek, N. Fuhr, A. Chowdhury, W. Teiken (Eds.), Proceedings of the 2005 ACM CIKM International Conference on Information and Knowledge Management, Bremen, Germany, October 31 - November 5, 2005, ACM, 2005, pp. 736–743. URL: https://doi.org/10.1145/1099554.1099733. doi:10.1145/1099554.1099733.

[21] M. Fröbe, J. Bittner, M. Potthast, M. Hagen, The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines, in: J. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. Silva, F. Martins (Eds.), Advances in Information Retrieval. 42nd European Conference on IR Research (ECIR 2020), volume 12036 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2020, pp. 12–19. doi:10.1007/978-3-030-45442-5_2.

[22] M. Fröbe, J. Bevendorff, J. Reimer, M. Potthast, M. Hagen, Sampling Bias Due to Near-Duplicates in Learning to Rank, in: 43rd International ACM Conference on Research and Development in Information Retrieval (SIGIR 2020), ACM, 2020, pp. 1997–2000. doi:10.1145/3397271.3401212.

[23] M. Fröbe, J. Bevendorff, L. Gienapp, M. Völske, B. Stein, M. Potthast, M. Hagen, CopyCat: Near-Duplicates within and between the ClueWeb and the Common Crawl, in: F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai (Eds.), 44th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2021), ACM, 2021, pp. 2398–2404. doi:10.1145/3404835.3463246.

[24] M. Völske, T. Gollub, M. Hagen, B. Stein, A keyquery-based classification system for CORE, D Lib Mag. 20 (2014). URL: https://doi.org/10.1045/november14-voelske. doi:10.1045/NOVEMBER14-VOELSKE.

[25] M. Fröbe, S. Günther, A. Bondarenko, J. Huck, M. Hagen, Using keyqueries to reduce misinformation in health-related search results, in: ROMCIR 2022: The 2nd Workshop on Reducing Online Misinformation through Credible Information Retrieval, held as part of ECIR 2022: the 44th European Conference on Information Retrieval, 2022.

[26] M. Fröbe, E. O. Schmidt, M. Hagen, Efficient Query Obfuscation with Keyqueries, in: 20th International IEEE/WIC/ACM Conference on Web Intelligence (WI-IAT 2021), ACM, 2021. doi:10.1145/3486622.3493950.

[27] M. Fröbe, G. Hendriksen, A. P. de Vries, M. Potthast, Open web search at longeval 2023: Reciprocal rank fusion on automatically generated query variants, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2432–2440. URL: https://ceur-ws.org/Vol-3497/paper-195.pdf.

[28] G. Amati, Divergence from randomness, Ph.D. thesis, Department of Computer Science, University of Glasgow, 2003.

[29] N. A. Jaleel, J. Allan, W. B. Croft, F. Diaz, L. S. Larkey, X. Li, M. D. Smucker, C. Wade, Umass at TREC 2004: Novelty and HARD, in: E. M. Voorhees, L. P. Buckland (Eds.), Proceedings of the Thirteenth Text REtrieval Conference, TREC 2004, Gaithersburg, Maryland, USA, November 16-19, 2004, volume 500-261 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2004. URL: http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf.
[30] Q. Wu, C. J. Burges, K. M. Svore, J. Gao, Adapting boosting for information retrieval measures, Information Retrieval 13 (2010) 254–270.

[31] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using pyterrier, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, 2020, pp. 161–168.

[32] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, Advances in Neural Information Processing Systems 30 (2017).

[33] M. Fröbe, J. H. Reimer, S. MacAvaney, N. Deckers, S. Reich, J. Bevendorff, B. Stein, M. Hagen, M. Potthast, The information retrieval experiment platform, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 2826–2836.

[34] O. Zendel, M. Fröbe, G. Faggioli, Qpptk@tirex: Simplified query performance prediction for ad-hoc retrieval experiments, in: [17], 2024, pp. 50–62. URL: https://ceur-ws.org/Vol-3689/.

[35] D. Alexander, W. Kusa, A. P. de Vries, Orcas-i query intent predictor as component of tira, in: [17], 2024, pp. 23–29. URL: https://ceur-ws.org/Vol-3689/.

[36] F. Schlatt, Efficiently scoring the health-relatedness of web pages, in: [17], 2024, pp. 14–22. URL: https://ceur-ws.org/Vol-3689/.

[37] L. Erben, M. Hampel, M.-C. Kuns, V. Melisch, P. Natzschka, W. Pertsch, L. Razouk, R. Stolle, R. T. Thoss, T. G. Trinh, et al., Assembling four open web search components, in: [17], 2024, pp. 73–93. URL: https://ceur-ws.org/Vol-3689/.

[38] R. Pradeep, S. Sharifymoghaddam, J. Lin, Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze!, arXiv preprint arXiv:2312.02724 (2023).

[39] F. Schlatt, M. Fröbe, M. Hagen, Investigating the effects of sparse attention on cross-encoders, in: European Conference on Information Retrieval, Springer, 2024, pp. 173–190.

[40] M. S. Tamber, R. Pradeep, J. Lin, Scaling down, litting up: Efficient zero-shot listwise reranking with seq2seq encoder-decoder models, arXiv preprint arXiv:2312.16098 (2023).

[41] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.

[42] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718. URL: https://aclanthology.org/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.

[43] O. Khattab, M. Zaharia, Colbert: Efficient and effective passage search via contextualized late interaction over bert, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 39–48.

[44] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, arXiv preprint arXiv:2007.00808 (2020).
[45] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (2002) 422–446. URL: http://doi.acm.org/10.1145/582415.582418. doi:10.1145/582415.582418.

[46] T. Sakai, Alternatives to bpref, in: W. Kraaij, A. P. de Vries, C. L. A. Clarke, N. Fuhr, N. Kando (Eds.), SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam, The Netherlands, July 23-27, 2007, ACM, 2007, pp. 71–78. URL: https://doi.org/10.1145/1277741.1277756. doi:10.1145/1277741.1277756.

[47] T. Sakai, Comparing metrics across TREC and NTCIR: the robustness to system bias, in: J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K. Choi, A. Chowdhury (Eds.), Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, Napa Valley, California, USA, October 26-30, 2008, ACM, 2008, pp. 581–590. URL: https://doi.org/10.1145/1458082.1458159. doi:10.1145/1458082.1458159.

[48] M. Fröbe, L. Gienapp, M. Potthast, M. Hagen, Bootstrapped nDCG Estimation in the Presence of Unjudged Documents, in: Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), volume 13980 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 313–329. doi:10.1007/978-3-031-28244-7_20.