Leveraging Prior Relevance Signals in Web Search
Notebook for the LongEval Lab at CLEF 2024

Jüri Keller, Timo Breuer and Philipp Schaer
TH Köln (University of Applied Sciences), Claudiusstr. 1, Cologne, 50678, Germany

Abstract
This work reports our participation in the retrieval task of the second LongEval lab iteration at CLEF 2024. As part of this year's contribution, we analyze to what extent prior relevance signals on the document level and term level can be used to improve retrieval effectiveness. In order to exploit these kinds of signals, we fetch corresponding document identifiers pointing to the same document in the different dataset slices of all timestamps. Based on several heuristics, we submit and evaluate a total of five systems that either follow our previous year's methodology or combine baseline rankings with prior relevance signals. Our evaluations provide insights into the extent to which these signals can be used but also let us conclude with several recommendations for future work. Most notably, we envision a companion resource that ties together all slices of the dataset by unified document identifiers to enable a better understanding of more rigorous data splits and to avoid potential data leakage that might affect the evaluation of (deep) learning-based systems.

Keywords
Web Search, Longitudinal Evaluation, Continuous Evaluation, Replicability

1. Introduction
The overall goal of Information Retrieval (IR) systems is to assist in finding relevant information for given information needs. These systems are designed to cope with the ongoing flood of information. Since the information landscape is continuously changing, the underlying data basis is ever-evolving, constantly exposing systems to new and updated information. While these changes directly influence retrieval effectiveness, they are rarely considered during evaluation. Web search is an especially dynamic search scenario since websites and queries change quickly. This evolving search setting is put to the test in the LongEval shared task, whose main goal is to evaluate how systems cope with changes over time. Therefore, retrieval systems are evaluated on progressing snapshots of a test collection. While the systems are exposed to changing data, users expect consistently good effectiveness. To provide and maintain this, it is essential to quantify the effectiveness regularly. However, since the underlying data source, including documents, topics, and qrels, is evolving, it is an open question for how long evaluations remain valid or can be generalized. While the LongEval shared task provides a one-of-a-kind test bed for the described endeavors, previous submissions mainly investigate how the changes in the test collection affect systems that remain static across time. The systems submitted last year did not use the temporal aspects of the test collection and instead treated the snapshots as independent search tasks. Exploiting the historical data as prior relevance signals in ranking systems is an important step in learning about the temporal connection between the snapshots and what differentiates longitudinal evaluations from cross-test-collection evaluations. In our contribution, we aim to investigate different approaches that exploit prior relevance signals on different levels in intuitive and explainable ways. Our goal is to provide initial approaches to adapt the system to the evolving retrieval setup by using past snapshots, including ranked documents and qrels.
By that, the system itself becomes an evolving component in this evaluation, like the documents, topics, and qrels. While these methods are mainly limited to topics with a known history, i.e., queries that were already issued previously, the Average Retrieval Performance (ARP) clearly improves. In short, the contributions of this work are:

• Three experimental systems that exploit information from past snapshots (past rankings and qrels) as prior relevance signals and two baselines are submitted to the LongEval shared task.
• An extensive evaluation of the submitted systems regarding their behavior and effectiveness across time through reproducibility and replicability measures.
• We outline and discuss in detail three directions of future work, namely the need for intellectual relevance labels, possible data leakage, and a companion resource that makes the snapshots more accessible.

To facilitate reproducibility, we make the code publicly available on GitHub (https://github.com/irgroup/CLEF2024-LongEval-CIR).

2. LongEval 2024 Dataset
The LongEval dataset is a retrieval test collection introduced in the LongEval CLEF lab (https://clef-longeval.github.io/) in 2023 [1]. It contains multiple sub-collections that resemble snapshots of different points in time of the French, privacy-focused search engine Qwant (https://www.qwant.com/). The goal of the LongEval lab is longitudinal evaluation in IR. The original test collection that was used in last year's iteration of the lab contained three sub-collections (WT, ST, and LT). In this year's iteration, three additional sub-collections are released and added to the test collection. By that, the test collection grows over time, covering longer time spans. Since no standardized naming of the different sub-collections exists, in this work, we name them by simply iterating over all sub-collections in chronological order, starting with 𝑡0 for last year's WT sub-collection. The new snapshots that are added in this iteration are therefore:

• 𝑡3: used as the train split, captured in January 2023.
• 𝑡4: used as a test split, capturing data from June 2023. The snapshot has a gap of five months to 𝑡3 and the runs are submitted for lag6.
• 𝑡5: also used as a test split, capturing data from August 2023. The snapshot has a gap of seven months to 𝑡3 and the runs are submitted for lag8.

The test collection is originally available in French, but an English automatic translation is additionally provided, on which this work mainly relies. While we generally consider the translations to be good, some mismatches and near duplicates occur. This is also reflected in a higher effectiveness for runs using the French version [2]. The test collection statistics of the newly added snapshots are summarized in Table 1. The snapshots contain between 1.7 and 2.5 million documents and 404 to 1518 topics. Compared to last year's snapshots, they contain more documents but fewer topics on average. The qrels classify the documents as not relevant, relevant, and highly relevant. The number of qrels is balanced almost evenly.
Roughly the same number of positive and negative labels is present, while among the positive labels, only one-third are highly relevant. The distribution across topics is highly skewed, as visualized in Figure 1.

Table 1
Test collection statistics of the three newly added snapshots. Besides the count of documents, topics, and qrels, the total number of qrels distributed across the three relevance labels (not_rel, rel, high_rel) and the summarizing statistics (mean, min, max) of all qrels per topic are displayed. The qrels are filtered to the ones used in the shared task.

      Documents   Topics   Qrels    not_rel   rel    high_rel   mean   min   max
𝑡3    2,049,729   599      9,785    5423      2891   1471       16.3   6     56
𝑡4    1,790,028   404      5,835    3146      1699   990        14.4   1     50
𝑡5    2,531,614   1518     24,861   14602     7078   3181       16.4   1     59

Figure 1: Histograms of the qrels distribution across topics.

The dataset's different snapshots overlap, strengthening the connection between them. Similar to the snapshots investigated last year, some topics appear in multiple or even all snapshots. Likewise, the document collection holds websites present in all snapshots. Additionally, many more topics and documents are added and removed over time. Besides creating and updating operations on the collection, some documents that are present in multiple snapshots also change over time. For components that were already present in a previous snapshot, the history of this component can be constructed by relating the current version to its previous ones. Tracking these different changes, even on an abstract level, remains challenging since each snapshot uses different identifiers. Therefore, no direct connection between topics, documents, or qrels is possible. A detour must be made via additional information to connect the components across time. We chose to link the documents by the provided URLs and the topics by exact string matches of the French query string. Linking the components by exact matches on the strings is straightforward but not reliable because this will not identify (near) duplicates for which the URL or query slightly changed. Even though the English version of the dataset is used in general, the French queries are used for the topic linking to avoid differences introduced by the translation, especially considering the larger time gap between the snapshots from this year and the dataset from last year's iteration, between which the translation system might have changed. Linking between different temporal versions of the same topics and documents allows us to exploit the history of the test collection for the ranking. The snapshots appear to be rather loosely connected since the overlap in topics is small. While in total 404 to 1518 topics are available, only 98 topics are present in all three new snapshots. In comparison, 124 topics that are available in all snapshots from last year could be found with the same matching method [3]. Presumably, even fewer topics are present in the snapshots of both iterations combined. We define the overlap of topics between sub-collections as core queries. Limiting the topic set to these core queries reduces the changes in the test collection to the document and relevance components.
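To make the linking step concrete, the following is a minimal sketch of how document identifiers could be connected across two snapshots via their URLs. The file layout, column names, and URL normalization are illustrative assumptions and not the exact implementation used for the submitted runs; topics can be linked analogously by exact matches of the French query strings.

```python
import pandas as pd

def load_url_mapping(path: str) -> pd.DataFrame:
    """Load (doc_id, url) pairs of one snapshot and normalize the URLs."""
    df = pd.read_csv(path, sep="\t", names=["doc_id", "url"])
    # Simple normalization; near-duplicate URLs are deliberately not resolved,
    # mirroring the exact-match limitation described above.
    df["url"] = df["url"].str.strip().str.lower().str.rstrip("/")
    return df

def link_snapshots(prior_path: str, current_path: str) -> dict:
    """Map current doc_ids to prior doc_ids for documents whose (normalized)
    URL is present in both snapshots."""
    prior = load_url_mapping(prior_path)
    current = load_url_mapping(current_path)
    merged = current.merge(prior, on="url", suffixes=("_current", "_prior"))
    return dict(zip(merged["doc_id_current"], merged["doc_id_prior"]))
```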
3. Approaches and Implementations
Based on the data analysis of corresponding document identifiers described in the previous section, we can identify the same document in different snapshots of the dataset, i.e., the test collection at different timestamps. The following approaches are based on the heuristic that prior snapshots, or rankings retrieved from earlier snapshots, bear some kind of relevance information that can be exploited for the ranking at the current timestamp. Generally, we derive relevance signals for the ranking at timestamp 𝑡𝑛 from the sub-collection or a ranking at one or more earlier timestamps, e.g., 𝑡𝑛−1. This history of snapshots can also include the sub-collections from last year (𝑡0, 𝑡1, 𝑡2). Compared to an earlier snapshot, a current ranking at 𝑡𝑛 can contain new documents that were added after 𝑡𝑛−1, documents that were also present at 𝑡𝑛−1, and documents that were present at 𝑡𝑛−1 but now have different content. An overview of all runs and their general methodology is given in Table 2.

Table 2
Overview of our submitted runs and the underlying methodology.

Run                   Method                                                              Modality of Prior Relevance Signal
CIR_BM25              Baseline ranking based on BM25 (cf. 3.1)                            -
CIR_BM25+monoT5       Reranking based on pre-trained monoT5 (cf. 3.1)                     -
CIR_BM25+time_boost   Fusion based on weighting prior documents (cf. 3.2)                 Documents in prior ranking
CIR_BM25+qrel_boost   Boosting based on weighting prior, judged documents (cf. 3.2)       Judged prior documents
CIR_BM25+RF           Query expansion with terms of prior, relevant documents (cf. 3.3)   Terms from prior relevant documents

3.1. "Off-The-Shelf" Baseline and Transformer-Based Reranking
Given the queries of timestamp 𝑡𝑛, we run BM25 [4] on the corresponding index. BM25 is a well-established lexical retrieval method that is still competitive for many applications where training data is sparse. We run it with default parameters as implemented in the PyTerrier toolkit [5]. The corresponding run submission is entitled CIR_BM25. The underlying retrieval method of the run submission CIR_BM25+monoT5 reranks the BM25 baseline run CIR_BM25 with monoT5 [6]. We make use of the monoT5 version that was pre-trained on MS MARCO passages (https://huggingface.co/castorini/monot5-base-msmarco). To use T5 as the reranking model, the documents are truncated after 512 sub-word tokens. This system was submitted as an additional baseline besides BM25 since it performed well in last year's iteration of the LongEval lab [3]. The same implementation as last year is reused for this run.

3.2. Direct Boost by Prior Ranking and Relevance Information
The run submission BM25+time_boost is based on the hypothesis that if a document appears in both the ranking at the current timestamp 𝑡𝑛 and at an earlier timestamp 𝑡𝑛−1, it is relevant, at least to the degree that it was not removed from the index and is still considered as a potential document in the ranking. For this approach, besides the baseline BM25 ranking at timestamp 𝑡𝑛, we also determine rankings at the earlier timestamp 𝑡𝑛−1 with the same set of queries but with the corresponding index at timestamp 𝑡𝑛−1. Both the scores and the document identifiers are normalized. This procedure results in two runs for the same set of topics that have possibly different documents ranked. Then, the documents that are also ranked at 𝑡𝑛−1 are boosted by

\rho_{q,d}(\lambda) = \begin{cases} \lambda^2 & \text{if } d \in r_{q,t_{n-1}} \\ (1-\lambda)^2 & \text{otherwise} \end{cases} \qquad (1)

where 𝑟𝑞,𝑡𝑛−1 is the ranking 𝑟𝑞 corresponding to the query 𝑞 at timestamp 𝑡𝑛−1 and 𝜆 is a free parameter. Figure 2 illustrates the weighting score 𝜌𝑞,𝑑 based on different 𝜆 values. The basic intuition is that we can control the ratio between the weights that are assigned to documents that are either contained in a ranking at an earlier timestamp or those that are not.

Figure 2: Weighting scheme for (relevant) documents of prior rankings.

Suppose we assign a 𝜆 value larger than 0.5. In that case, we would emphasize prior, i.e., older documents in the ranking, which is a rather conservative ranking policy in a real-world setting. On the other hand, 𝜆 values lower than 0.5 yield a higher weight for new documents that were not contained in a prior ranking, corresponding to a more experimental, risky ranking policy in a real-world setting. If 𝜆 = 0.5, the ranking remains the same, and the effect of the weighting is negligible. For the submitted run, 𝜆 = 0.503 was selected based on a grid search on the train set at 𝑡3. Notably, this run only relies on the fact that documents were ranked earlier, but neglects any direct relevance information.
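A minimal sketch of this weighting is given below, assuming the current run is available as a dictionary of min-max-normalized scores per query and the prior ranking as a set of document identifiers already mapped to their 𝑡𝑛 counterparts (e.g., via the URL linking sketched above). The additive combination with the retrieval score is an assumption for illustration, not necessarily the exact fusion used in the submitted run.

```python
from typing import Dict, Set

def rho(in_prior_ranking: bool, lam: float) -> float:
    """Eq. 1: weight of a document depending on whether it was also ranked
    for the same query at the earlier timestamp t_{n-1}."""
    return lam ** 2 if in_prior_ranking else (1.0 - lam) ** 2

def time_boost(current_scores: Dict[str, float],
               prior_docs: Set[str],
               lam: float = 0.503) -> Dict[str, float]:
    """Fuse the min-max-normalized BM25 scores at t_n with the prior-ranking
    weight. The additive combination is one plausible reading of the fusion
    step; the submitted run may combine the two signals differently."""
    return {doc_id: score + rho(doc_id in prior_docs, lam)
            for doc_id, score in current_scores.items()}

# With lambda slightly above 0.5 (0.503 in the submitted run), documents that
# already appeared in the prior ranking receive a marginally larger weight.
boosted = time_boost({"d1": 0.9, "d2": 0.4}, prior_docs={"d2"}, lam=0.503)
```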
The run submission BM25+qrel_boost extends this rather weak heuristic by taking the relevance label directly into account and by considering not only the 𝑡𝑛−1 snapshot but further prior snapshots. In this sense, this run follows a similar ranking policy (conservative vs. experimental) but with a focus on relevance information obtained from prior labels. Independent of any prior ranking, all documents 𝑑 ranked for a query 𝑞 at 𝑡𝑛 that were judged at a previous snapshot, e.g., 𝑡𝑛−1, are boosted by

\rho_{q,d}(\lambda) = \begin{cases} \lambda^2 & \text{if } qrel_{q,d} = 1 \\ \lambda^2\mu & \text{if } qrel_{q,d} > 1 \\ (1-\lambda)^2 & \text{if } qrel_{q,d} = 0 \end{cases} \qquad (2)

This extends the boost described in Eq. 1 to additionally account for highly relevant documents. The additional free parameter 𝜇 can be used to assign a different boost to documents with higher relevance labels. The boosting can be repeated for further points in time by using more distant qrels, e.g., from 𝑡𝑛−2. This leads to increasing scores for documents that are labeled relevant at multiple points in time, which is in line with our initial assumption. For the submitted run, 𝜆 = 0.7 was chosen, and all available previous snapshots (𝑡3, 𝑡2, 𝑡1, 𝑡0) were used as history. Due to a bug that was only found after submission, instead of only the highly relevant documents, all relevant documents were boosted again by 𝜆²𝜇 with 𝜇 = 2.
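The following sketch illustrates Eq. 2 and the accumulation over several prior snapshots. How the boost is combined with the normalized retrieval score and how unjudged documents are treated is not fully specified above, so the additive combination and the pass-through of unjudged documents are assumptions made for illustration.

```python
from typing import Dict, List

def qrel_boost_weight(label: int, lam: float = 0.7, mu: float = 2.0) -> float:
    """Eq. 2: weight for a document given its relevance label at one prior
    snapshot (0 = not relevant, 1 = relevant, >1 = highly relevant)."""
    if label > 1:
        return lam ** 2 * mu
    if label == 1:
        return lam ** 2
    return (1.0 - lam) ** 2

def qrel_boost(current_scores: Dict[str, float],
               prior_qrels: List[Dict[str, int]],
               lam: float = 0.7, mu: float = 2.0) -> Dict[str, float]:
    """Accumulate the boost over all available prior snapshots (t_3 ... t_0).
    prior_qrels holds one dict per snapshot, mapping the (linked) document
    identifiers for the query at hand to their relevance labels. Unjudged
    documents keep their normalized retrieval score; the additive accumulation
    mirrors the increasing scores for documents judged relevant at multiple
    points in time."""
    boosted = dict(current_scores)
    for qrels in prior_qrels:
        for doc_id, label in qrels.items():
            if doc_id in boosted:
                boosted[doc_id] += qrel_boost_weight(label, lam, mu)
    return boosted
```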
3.3. Query Expansion Based on Prior Relevance Feedback
In addition to the run submissions described above, we propose one additional approach that aligns with the general idea of exploiting prior relevance signals. BM25+RF follows a query expansion method, making use of the relevance feedback provided by prior documents, i.e., those documents with a relevant label at earlier timestamps. The relevant documents of all earlier timestamps are normalized and deduplicated based on their URLs. Then, the vocabulary of all remaining relevant documents corresponding to a query is unified. After stopwords are removed, the top ten expansion terms are extracted based on their TF-IDF scores and are used to expand the original query string. Hypothetically, this approach allows us to profit from the prior relevance information independent of a particular ranking or the labels for query-document pairs. Since not all topics have a history with relevant documents available, the rankings for unknown topics are reranked by pseudo-relevance feedback based on RM3 [7]. Similarly, the query is expanded by ten terms extracted from the top three documents ranked by BM25.

4. Results
The ARP measured by P@10, bpref [8], nDCG [9], and also the Mean Reciprocal Rank (MRR) [10] is reported in Table 3. Since the test collection has shallow pools, we report bpref instead of MAP [11]. Additionally, the MRR is reported since it relies only on the first relevant result ranked and fits the web search use case well. Following on from last year's evaluation [3], we report the replicability measures Delta Relative Improvement (ΔRI) and Effect Ratio (ER) [12] to quantify how the effectiveness changes across time. Since the snapshots are obtained from the same search engine, they are naturally related, but due to the changes over time, they are still very different. Thus, a comparison is not straightforward, and an advanced comparison strategy is needed to factor in the changed recall base [13, 14]. The ΔRI basically describes the changes in effectiveness compared to 𝑡3, and the ER describes how well the effect measured at 𝑡3 is recovered. Both measures rely on BM25 as a pivot system in the sense that not the direct scores at two points in time are compared but the deltas to the pivot system. Since both the experimental and the pivot system are exposed to the same snapshot, the impact of the evaluation environment should be reduced, and the results should be more comparable. Additionally, Kendall's Tau [15] and RBO [16] are reported in Table 4 to directly compare the rankings at later points in time to the ranking at 𝑡3, independently of any effectiveness.

The retrieval effectiveness changes over time for all systems and measures. For almost all systems, it is decreasing compared to 𝑡3. It can be observed that the newer the snapshot, the lower the measured effectiveness. The ΔRI shows smaller changes in effectiveness; in contrast to the ARP trends, the negative ΔRI values even indicate a slightly increasing relative effectiveness. All runs per snapshot are tested for significance compared to the BM25 baseline using paired t-tests with 𝛼 = 0.05 and Bonferroni correction. Most runs show significant differences, except BM25+time_boost at 𝑡4 and 𝑡5 for all measures, or BM25+monoT5 at 𝑡4 measured by bpref or at 𝑡3 by MRR. The ranking of systems appears to be relatively consistent across time. The BM25 baseline is outperformed by BM25+monoT5, but only by a little. The system BM25+time_boost initially performs the worst but is on par with BM25 for the later snapshots. These rankings are so similar that no significant differences can be measured. The two systems BM25+qrel_boost and BM25+RF often outperform BM25+monoT5. In particular, the BM25+qrel_boost system performs substantially better. The ranking similarity between two points in time measured by Kendall's Tau and RBO appears to be generally low. Especially the BM25+RF system shows a low similarity regarding Kendall's Tau, maybe because of the query rewriting that depends on the history. If the effectiveness changes are compared to the ranking similarity as displayed in Table 4, even small changes in effectiveness can lead to vastly different rankings. For example, the system BM25+time_boost evaluated at 𝑡3 and 𝑡4 shows only small improvements on all measures but also among the smallest similarities in the rankings for Kendall's Tau and RBO.
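For reference, the following aggregate-level sketch shows how ΔRI and ER relate the scores in Table 3 to the BM25 pivot. The original definitions in [12] average per-topic improvements, so the values computed here from the aggregated ARP scores are only approximations of the reported numbers.

```python
def delta_ri(pivot_t3: float, sys_t3: float, pivot_tn: float, sys_tn: float) -> float:
    """Delta Relative Improvement: difference between the relative improvement
    over the BM25 pivot at t_3 and at a later snapshot t_n. Negative values
    mean the relative improvement grew over time."""
    ri_t3 = (sys_t3 - pivot_t3) / pivot_t3
    ri_tn = (sys_tn - pivot_tn) / pivot_tn
    return ri_t3 - ri_tn

def effect_ratio(pivot_t3: float, sys_t3: float, pivot_tn: float, sys_tn: float) -> float:
    """Effect Ratio: fraction of the improvement over the pivot measured at t_3
    that is recovered at t_n."""
    return (sys_tn - pivot_tn) / (sys_t3 - pivot_t3)

# P@10 of BM25 (pivot) and BM25+monoT5 at t_3 and t_4, taken from Table 3:
print(delta_ri(0.1624, 0.1776, 0.1370, 0.1591))      # ~ -0.068 (Table 3: -0.0675)
print(effect_ratio(0.1624, 0.1776, 0.1370, 0.1591))  # ~ 1.45  (Table 3: 1.4513)
```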
Table 3
Experimental results of the different retrieval systems across all snapshots. Statistically significant differences in the ARP from BM25 at the same point in time are marked with an asterisk (*). For the replicability measures, BM25 is used as the pivot system, and the changes are measured in comparison to 𝑡3. For these rows, the ideal values are indicated.

                            P@10                        bpref                       nDCG                        MRR
System        t      ARP      ΔRI      ER        ARP      ΔRI      ER        ARP      ΔRI      ER        ARP      ΔRI      ER
BM25          𝑡3     0.1624   -        -         0.4373   -        -         0.3638   -        -         0.3660   -        -
BM25          𝑡4     0.1370   -        -         0.3572   -        -         0.2817   -        -         0.3162   -        -
BM25          𝑡5     0.1076   -        -         0.2791   -        -         0.2106   -        -         0.3116   -        -
+monoT5       𝑡3     0.1776*  0        1         0.4571*  0        1         0.3839*  0        1         0.3929   0        1
+monoT5       𝑡4     0.1591*  -0.0675  1.4513    0.3719   0.0043   0.7401    0.3081*  -0.0384  1.3122    0.3847*  -0.1436  2.5532
+monoT5       𝑡5     0.1246*  -0.0640  1.1155    0.2862*  0.0198   0.3591    0.2303*  -0.0385  0.9824    0.3551*  -0.0662  1.6190
+time_boost   𝑡3     0.1007*  0        1         0.4178*  0        1         0.2758*  0        1         0.2774*  0        1
+time_boost   𝑡4     0.1380   -0.3873  -0.0161   0.3583   -0.0477  -0.0562   0.2850   -0.2534  -0.0369   0.3308   -0.2882  -0.1651
+time_boost   𝑡5     0.1065   -0.3702  0.0171    0.2798   -0.0472  -0.0372   0.2116   -0.2469  -0.0120   0.3210*  -0.2722  -0.1065
+qrel_boost   𝑡3     0.1870*  0        1         0.4515*  0        1         0.3910*  0        1         0.4394*  0        1
+qrel_boost   𝑡4     0.1975*  -0.2906  2.4630    0.3815*  -0.0354  1.7037    0.3536*  -0.1805  2.6442*   0.4922*  -0.3562  2.3981
+qrel_boost   𝑡5     0.1349*  -0.1021  1.1097    0.2881*  0.0005   0.6289    0.2419*  -0.0740  1.1515*   0.4063*  -0.1033  1.2897
+RF           𝑡3     0.1758*  0        1         0.4819*  0        1         0.3955*  0        1         0.3997*  0        1
+RF           𝑡4     0.1608*  -0.0915  1.7806    0.3958*  -0.0059  0.8642    0.3197*  -0.0476  1.1966    0.3821*  -0.1164  1.9556
+RF           𝑡5     0.1230*  -0.0606  1.1504    0.2990*  0.0306   0.4467    0.2293*  -0.0013  0.5878    0.3373*  0.0097   0.7620

Table 4
The similarity of the document rankings from different systems according to Kendall's Tau and RBO. Since for some topics less than 1000 documents were retrieved, only the top 100 documents are considered for this comparison.

System        t      𝜏@100     RBO@100
BM25          𝑡3     1         1
BM25          𝑡4     0.0132    0.4203
BM25          𝑡5     0.0152    0.3961
+monoT5       𝑡3     1         1
+monoT5       𝑡4     0.0126    0.4211
+monoT5       𝑡5     0.0122    0.3960
+time_boost   𝑡3     1         1
+time_boost   𝑡4     -0.0167   0.2086
+time_boost   𝑡5     0.0051    0.1847
+qrel_boost   𝑡3     1         1
+qrel_boost   𝑡4     0.0234    0.3223
+qrel_boost   𝑡5     0.0150    0.3275
+RF           𝑡3     1         1
+RF           𝑡4     -0.0006   0.2266
+RF           𝑡5     0.0003    0.2336

Interestingly, the system BM25+monoT5 shows similar effectiveness to the system BM25+qrel_boost considering nDCG but different ranking similarities measured by RBO. BM25 has the most similar rankings, followed by BM25+monoT5 and BM25+qrel_boost. The systems that rely on previous relevance signals, especially BM25+qrel_boost but also BM25+RF, most often show more variance over time in the ΔRI and ER values compared to the other systems. For example, the system BM25+qrel_boost indicates an increased effectiveness (nDCG ΔRI of -0.1805) at 𝑡4 but only a smaller increase (nDCG ΔRI of -0.0740) at 𝑡5. Simultaneously, the nDCG ΔRI for BM25+monoT5 only differs by 0.0001 points.

Figure 3: Effectiveness of the submitted systems across all snapshots measured by nDCG (left), Bpref (center), and P@10 (right).

5. Discussion and Future Work
Based on our experimental results, we basically see three directions for future work, namely:
1. the addition of deep relevance judgment pools,
2. a more in-depth analysis of potential data leakage,
3. the curation of a novel resource that ties together the different snapshots of the dataset (by identifying corresponding document identifiers that point to the same document); a possible shape of such a resource is sketched below.
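To make the third direction more concrete, a hypothetical record layout for such a companion resource could look as follows. All field names are assumptions and only serve to illustrate the idea of unified document identifiers across snapshots.

```python
from dataclasses import dataclass

@dataclass
class UnifiedDocRecord:
    unified_id: str      # stable across snapshots, e.g., derived from the URL
    snapshot: str        # "t0" ... "t5"
    local_doc_id: str    # identifier used inside that snapshot
    url: str             # normalization key used to build the mapping
    content_hash: str    # allows detecting content changes between snapshots

# Records sharing a unified_id but differing in content_hash mark documents
# that changed over time; identical (unified_id, content_hash) pairs across
# train and test snapshots make potential data leakage explicit.
```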
In general, we can see a performance drop for all systems and all measures over the three timestamps. None of our submitted systems specifically addressed the goal of identifying new, potentially relevant documents in their methodology. Instead, the submissions rather followed a retrospective approach where the relevance information of prior rankings was exploited. As expected, the baseline method BM25 is outperformed when it is reranked in a second stage with monoT5, which is in line with our observations from the lab's first iteration. Even though the performance drops over time for the BM25 baseline and the monoT5 reranking alike, the latter outperforms BM25 with respect to every measure at each timestamp. The approach underlying CIR_BM25+time_boost is less effective at timestamp 𝑡3 but on par with BM25 at 𝑡4 and 𝑡5. This lets us conclude that it is not advisable to simply emphasize prior documents for the sake of better retrieval effectiveness without considering any other relevance information. However, we have to point out that at timestamps 𝑡4 and 𝑡5 the retrieval effectiveness does not deteriorate, which is an important insight, especially when considering that the rank correlation between the rankings at different timestamps is low. That means the rankings are quite different but have the same retrieval effectiveness in the end. From a user perspective, this could imply tremendous differences that are simply not reflected by the effectiveness scores. We assume that this can be explained by the dataset currently only having shallow relevance judgment pools. Suppose that more documents had been assigned a relevance label; in that case, we would probably see more diverse outcomes here. In the future, we would strongly recommend curating a companion dataset or resource with intellectual relevance labels that complements the dataset with deep relevance judgment pools. A good example of such a combination is that of the TripClick [17] and TripJudge [18] datasets, which covers both click-based and annotator-based relevance labels. Of course, the use of large language models for generating relevance labels could be a viable option if the costs for human annotators are too high [19]. However, when comparing the monoT5 results to the other run submissions, we also see that specifically CIR_BM25+qrel_boost and CIR_BM25+RF are on par with or even outperform monoT5, which opens up several interesting points for discussion. These systems are the best-performing ones among our submissions. Regarding CIR_BM25+RF, we can conclude that query expansion with terms from prior relevant documents is a viable option to make use of available relevance signals. While our experimental results show rather minor improvements over the monoT5 reranking, this approach reliably yields better scores. In the future, other language models for generating the expansion terms should be considered and evaluated. CIR_BM25+qrel_boost opens up several other interesting perspectives for a more in-depth analysis. The approach of this run submission specifically boosts those documents that were judged as relevant earlier. This general approach was rather successful in the TREC OpenSearch track and its CLEF predecessor LL4IR [20, 21]. Instead of (derived) relevance judgments, the original implementation used click data to approximate relevance. Nevertheless, it has to be considered that these documents and judgments could have served as training data for deep learning retrieval methods.
This is critical, as data leakage might be an issue when the datasets are used as provided, and we recommend that future work analyzes this circumstance in more detail [22]. We suspect that this circumstance was not readily apparent because documents with the same URL and the same content have different identifiers in the different versions of the datasets at different timestamps. That finally leads us to our third and final direction for future work, which would be a companion resource to the LongEval datasets that ties together all six of the dataset's snapshots by unifying the document identifiers and by making potential document overlaps, which may cause data leakage in a learning-based scenario, explicit. A higher variance in ΔRI and ER could be observed for the systems that rely on the relevance signals of previous snapshots. This can be interpreted as a validation of the comparison strategy. The runs utilize the relevance signals in a more or less immediate way. However, for both test points in time (𝑡4 and 𝑡5) the same history of snapshots (𝑡0 to 𝑡3) is used. This means that the relevance signals for 𝑡4 are more up-to-date than those for 𝑡5, and a higher improvement can be expected. In addition, the similarity of the rankings further supports this interpretation. Both BM25+qrel_boost and BM25+RF show especially consistent rankings over time, caused by boosting the same documents in both runs.

6. Conclusion
As part of this year's contributions to the LongEval lab, we conclude that leveraging prior relevance signals from existing logs for the sake of better retrieval effectiveness is a promising direction. Furthermore, in line with our earlier observations from the first iteration of the lab, we confirm that the retrieval effectiveness changes with different snapshots of the dataset, and that addressing the need for making retrieval systems more robust, reliable, and predictable is an exciting direction for future work that should receive more attention. In particular, we applied some heuristics to make use of the available relevance information that can be obtained from prior snapshots of the dataset to improve a recent ranking. To do so, it was a requirement to identify the same document across the different snapshots. Our preliminary analysis revealed that the same documents are indeed present in different snapshots and that using these (relevance) signals helps to improve retrieval effectiveness. Based on these outcomes, we envision the contribution of a companion resource that ties together the dataset's snapshots to identify documents and duplicates, which would allow a more in-depth analysis of document relevance over time but also make a more rigorous experimental setup possible.

Acknowledgments
We would like to express our gratitude to the LongEval Shared Task organizers for their invaluable efforts in constructing the LongEval dataset. Their dedication and hard work have provided an essential foundation for our research. We also gratefully acknowledge the support of the German Research Foundation (DFG) through project grant No. 407518790.

References
[1] P. Galuscáková, R. Deveaud, G. G. Sáez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation, in: SIGIR, ACM, 2023, pp. 3086–3094.
[2] R. Alkhalifa, I. M. Bilal, H. Borkakoty, J. Camacho-Collados, R. Deveaud, A. El-Ebshihy, L. E. Anke, G. G. Sáez, P. Galuscáková, L. Goeuriot, E. Kochkina, M. Liakata, D. Loureiro, P. Mulhem, F. Piroi, M. Popel, C. Servan, H. T. Madabushi, A. Zubiaga, Overview of the CLEF-2023 LongEval lab on longitudinal evaluation of model performance, in: CLEF, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 440–458.
[3] J. Keller, T. Breuer, P. Schaer, Evaluating temporal persistence using replicability measures, in: CLEF (Working Notes), volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 2441–2457.
[4] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-3, in: D. K. Harman (Ed.), Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, volume 500-225 of NIST Special Publication, National Institute of Standards and Technology (NIST), 1994, pp. 109–126. URL: http://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
[5] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using PyTerrier, in: ICTIR, ACM, 2020, pp. 161–168.
[6] R. Pradeep, R. F. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, CoRR abs/2101.05667 (2021).
[7] N. A. Jaleel, J. Allan, W. B. Croft, F. Diaz, L. S. Larkey, X. Li, M. D. Smucker, C. Wade, UMass at TREC 2004: Novelty and HARD, in: TREC, volume 500-261 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2004.
[8] C. Buckley, E. M. Voorhees, Retrieval evaluation with incomplete information, in: M. Sanderson, K. Järvelin, J. Allan, P. Bruza (Eds.), SIGIR 2004: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, UK, July 25-29, 2004, ACM, 2004, pp. 25–32. URL: https://doi.org/10.1145/1008992.1009000. doi:10.1145/1008992.1009000.
[9] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (2002) 422–446. URL: http://doi.acm.org/10.1145/582415.582418. doi:10.1145/582415.582418.
[10] E. M. Voorhees, The TREC-8 question answering track report, in: E. M. Voorhees, D. K. Harman (Eds.), Proceedings of The Eighth Text REtrieval Conference, TREC 1999, Gaithersburg, Maryland, USA, November 17-19, 1999, volume 500-246 of NIST Special Publication, National Institute of Standards and Technology (NIST), 1999. URL: http://trec.nist.gov/pubs/trec8/papers/qa_report.pdf.
[11] E. M. Voorhees, C. Buckley, Retrieval system evaluation, in: E. M. Voorhees, D. K. Harman, National Institute of Standards and Technology (U.S.) (Eds.), TREC: Experiment and Evaluation in Information Retrieval, Digital Libraries and Electronic Publishing, MIT Press, Cambridge, Mass, 2005.
[12] T. Breuer, N. Ferro, N. Fuhr, M. Maistro, T. Sakai, P. Schaer, I. Soboroff, How to measure the reproducibility of system-oriented IR experiments, in: J. X. Huang, Y. Chang, X. Cheng, J. Kamps, V. Murdock, J. Wen, Y. Liu (Eds.), Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020, ACM, 2020, pp. 349–358. URL: https://doi.org/10.1145/3397271.3401036. doi:10.1145/3397271.3401036.
[13] G. G. Sáez, Continuous Evaluation Framework for Information Retrieval Systems, PhD thesis, Université Grenoble Alpes, 2023. URL: https://theses.hal.science/tel-04547265.
[14] J. Keller, T. Breuer, P. Schaer, Evaluation of temporal change in IR test collections, in: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington, DC, USA, ACM, 2024. doi:10.1145/3664190.3672530.
[15] M. G. Kendall, Rank correlation methods, Griffin, 1948.
[16] W. Webber, A. Moffat, J. Zobel, A similarity measure for indefinite rankings, ACM Trans. Inf. Syst. 28 (2010) 20:1–20:38. URL: https://doi.org/10.1145/1852102.1852106. doi:10.1145/1852102.1852106.
[17] N. Rekabsaz, O. Lesota, M. Schedl, J. Brassey, C. Eickhoff, TripClick: The log files of a large health web search engine, in: F. Diaz, C. Shah, T. Suel, P. Castells, R. Jones, T. Sakai (Eds.), SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, ACM, 2021, pp. 2507–2513. URL: https://doi.org/10.1145/3404835.3463242. doi:10.1145/3404835.3463242.
[18] S. Althammer, S. Hofstätter, S. Verberne, A. Hanbury, TripJudge: A relevance judgement test collection for TripClick health retrieval, in: M. A. Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, October 17-21, 2022, ACM, 2022, pp. 3801–3805. URL: https://doi.org/10.1145/3511808.3557714. doi:10.1145/3511808.3557714.
[19] O. Zendel, J. S. Culpepper, F. Scholer, P. Thomas, Enhancing human annotation: Leveraging large language models and efficient batch processing, in: P. D. Clough, M. Harvey, F. Hopfgartner (Eds.), Proceedings of the 2024 ACM SIGIR Conference on Human Information Interaction and Retrieval, CHIIR 2024, Sheffield, United Kingdom, March 10-14, 2024, ACM, 2024, pp. 340–345. URL: https://doi.org/10.1145/3627508.3638322. doi:10.1145/3627508.3638322.
[20] P. Schaer, N. Tavakolpoursaleh, Historical clicks for product search: GESIS at CLEF LL4IR 2015, in: L. Cappellato, N. Ferro, G. J. F. Jones, E. SanJuan (Eds.), Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015, volume 1391 of CEUR Workshop Proceedings, CEUR-WS.org, 2015. URL: https://ceur-ws.org/Vol-1391/26-CR.pdf.
[21] N. Tavakolpoursaleh, M. Neumann, P. Schaer, IR-Cologne at TREC 2017 OpenSearch track: Rerunning popularity ranking experiments in a living lab, in: E. M. Voorhees, A. Ellis (Eds.), Proceedings of The Twenty-Sixth Text REtrieval Conference, TREC 2017, Gaithersburg, Maryland, USA, November 15-17, 2017, volume 500-324 of NIST Special Publication, National Institute of Standards and Technology (NIST), 2017. URL: https://trec.nist.gov/pubs/trec26/papers/IR-Cologne-O.pdf.
[22] S. Kapoor, A. Narayanan, Leakage and the reproducibility crisis in machine-learning-based science, Patterns 4 (2023) 100804. URL: https://doi.org/10.1016/j.patter.2023.100804. doi:10.1016/j.patter.2023.100804.