Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability

Soyoung Yoon1,†, Jongyoon Kim1,† and Seung-won Hwang1,*
1 Seoul National University (SNU), 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Korea

Abstract
Benchmarking the performance of information retrieval (IR) methods is mostly conducted within a fixed set of documents (static corpora). However, in real-world web search engine environments, the document set is continuously updated and expanded. Addressing this discrepancy and measuring the temporal persistence of IR systems is crucial. By investigating the LongEval benchmark, specifically designed for such dynamic environments, our findings demonstrate the effectiveness of a listwise reranking approach, which proficiently handles inaccuracies induced by temporal distribution shifts. Among listwise rerankers, our findings show that ListT5, which effectively mitigates the positional bias problem by adopting the Fusion-in-Decoder architecture, is especially effective, and increasingly so as temporal drift grows, on the test-long subset.

Keywords
information retrieval, listwise reranking, temporal misalignment, positional bias, fusion-in-decoder, reranking, generative retrieval

1. Introduction
The majority of studies on information retrieval systems concentrate on benchmarks that target static snapshots of knowledge. This leaves a gap in our understanding of how these models fare in dynamic environments where knowledge is temporal and constantly accumulating, and where adaptiveness to new information becomes valuable. Moreover, unlike statistical retrieval systems such as BM25, neural retrieval models have been found to underperform on unseen data without prior training, posing a challenge for direct application to temporal updates [1]. Attempting to navigate these issues with naïve full fine-tuning is computationally expensive, prone to excessive forgetting, and ultimately impractical. Under these circumstances, improving the temporal persistence of retrieval models, i.e., their robustness to change over time, is an important research direction that deserves more attention. The LongEval Retrieval Challenge [2] specifically targets this problem, aligning more closely with real-world retrieval applications and scenarios. Meanwhile, temporal change can also be viewed as one specific form of distribution shift, so retrieval methods that are effective on out-of-domain data could also be effective for temporal persistence. BEIR [3] is a well-known benchmark for evaluating out-of-distribution retrieval performance, and findings so far [4] suggest that re-ranking is very effective for out-of-distribution retrieval. Specifically, a line of work on listwise reranking, a format that considers multiple passages at once when reranking [5, 6, 7, 8, 9, 10], has been shown to achieve state-of-the-art performance on BEIR. Listwise rerankers can condition on and compare multiple passages to better calibrate relevance scores, and can reduce the prediction inaccuracies arising from domain shift, as theoretically supported by [11] and empirically evidenced by works such as [5, 10].

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
soyoung.yoon@snu.ac.kr (S. Yoon); john.jongyoon.kim@snu.ac.kr (J. Kim); seungwonh@snu.ac.kr (S. Hwang)
https://soyoung97.github.io/profile/ (S. Yoon); https://artemisdicotiar.github.io/cv.html (J. Kim); https://seungwonh.github.io/ (S. Hwang)
0009-0004-8669-8741 (S. Yoon); 0009-0004-5617-8999 (J. Kim); 0000-0003-0782-0661 (S. Hwang)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: Overview of the LongEval retrieval challenge. The task is evaluated in two parts: short-term persistence and long-term persistence.

However, naive application of listwise reranking with LLMs is known to face the lost-in-the-middle problem, favoring passages presented at the first and last positions of the listwise input. Also, the large parameter count of such models incurs high computational cost and hurts efficiency. Recently, ListT5 [7], which uses the Fusion-in-Decoder (FiD) architecture to conduct listwise reranking, has been shown to be effective in both efficiency and performance, mitigating positional bias through FiD and performing well despite its relatively small model size.

In this paper, we aim to bridge this gap by participating in the competition for temporal knowledge retrieval, the LongEval Retrieval Challenge [2], which aligns more closely with real-world retrieval applications and scenarios, illustrated in Figure 1. In particular, we frame temporal knowledge adaptation as another form of zero-shot domain retrieval and investigate the effectiveness of ListT5, listwise reranking with positional invariance, on the LongEval task. Findings show that ListT5, despite its smaller model size, is more effective than RankZephyr [12] or other reranking variants, and is especially effective as the temporal shift becomes longer, showing superior performance on the long subset. Our experiments on this benchmark reveal that applying listwise reranking greatly helps generalization under temporal misalignment, ensuring flexibility for temporal knowledge accumulation, even without further training.

2. The LongEval Challenge
Started in 2023, the LongEval Challenge [2], shown in Figure 1, is a shared task designed to evaluate the temporal persistence of information retrieval systems. It addresses the challenge of maintaining model performance over time as test data becomes increasingly distant from the training data. LongEval sets itself apart from traditional IR tasks by focusing on the development of systems that can adapt to dynamic temporal text changes, introducing time as a new dimension of ranking model performance. Our method was evaluated using the LongEval retrieval benchmark datasets from the 2023 and 2024 competitions. The LongEval retrieval task aims to address the distribution shift between the training and test datasets, which occurs due to differences in the timing of data collection. To assess the resilience and temporal alignment of a retrieval system, the task offers two test datasets: test-short, with a small time shift from the training dataset, and test-long, with a more significant time shift. In 2023, the test-short dataset exhibited a 1-month shift from the training dataset, while the test-long dataset showed a 3-month shift. In 2024, the test-short dataset had a 5-month shift, and the test-long dataset had a 7-month shift.
Table 1: Detailed statistics of the six datasets from the two years of LongEval retrieval competitions. This table presents the number of queries, the number of documents, the average query and document length, and the number of relevance annotations.

| Dataset | 2023 Train | 2023 Test-Short | 2023 Test-Long | 2024 Train | 2024 Test-Short | 2024 Test-Long |
| Time Shift (months) | 0 | +1 | +3 | 0 | +5 | +7 |
| Total # Queries | 672 | 882 | 923 | 599 | 407 | 1518 |
| Total # Documents | 1.57M | 1.59M | 1.08M | 2.05M | 1.79M | 2.53M |
| Average Query Length (words) | 2.76 | 2.71 | 2.55 | 2.45 | 2.41 | 2.48 |
| Average Document Length (words) | 793.1 | 792.9 | 806.1 | 770.5 | 624.8 | 429.1 |
| Total # Relevance Annotations | 9655 | 12217 | 13467 | 9785 | 88301 | 156170 |
| Relevant Documents / Query | 4.0 | 3.96 | 4.32 | 7.28 | 4.15 | 5.48 |

Across the train, test-short, and test-long datasets of both years, queries have a length of approximately 2 words, indicating that almost all queries are keyword queries. Unlike the queries, the documents, which are crawled and collected from Qwant search logs, are often long, at around 800 words. Each corpus contains roughly 1 to 2.5 million documents. The LongEval retrieval benchmark provides relevance annotations constructed from the attractiveness probability of a Dynamic Bayesian Network (DBN) click model trained on Qwant data. Since the click model uses search logs as implicit feedback (whether the user clicked a document, or stayed on it beyond a certain time threshold), the number of relevance annotations per query is roughly 4. All statistics can be found in Table 1.

3. Related Work
3.1. Temporal Retrieval
While there is a large body of prior work on the temporal update of language models themselves [13] or real-time QA models [14, 15], there are relatively few works [16] on the development of adaptive retrieval systems. In an era where retrieval-augmented generation systems are well known and widely used for their effectiveness and performance [17], it is crucial to develop retrieval models that better adapt to new information, alongside adapting the generation model to downstream tasks.

3.2. Listwise Reranking
Pointwise vs. listwise reranking. Until now, the field of zero-shot reranking has been largely driven by cross-encoder models [4] such as MonoT5 [18]. As shown in Figure 2, MonoT5 is a pointwise reranker, which leverages pre-defined token probabilities (i.e., true / false) as relevance scores at inference time. However, while efficient (requiring O(n) forward passes to rank n passages), these models rely on pointwise reranking of each passage and thus lack the ability to compare passages against each other at inference time. This can lead to sub-optimal solutions for the reranking task, where the discrimination and ordering between passages are crucial. Unlike MonoT5, listwise rerankers consider the relative relevance between documents and are thus more robust to domain shift [11]. How pointwise and listwise rerankers differ is illustrated in Fig. 2.

Figure 2: Explanation of listwise reranking models with respect to the pointwise ranking variants. Pointwise reranking individually assigns a relevance score to each document, whereas listwise reranking feeds a list of documents to the model at once and lets the model generate the relative order of the documents.

Listwise reranking with LLM prompting. Listwise reranking [5, 6, 7, 8, 9, 10, 12] is a line of work that gives multiple passages to the model at once and outputs a permutation, i.e., an ordering of the passages by their relevance to the input query. Specifically, the prompts are formulated in a manner similar to the following:
I will provide you with 20 passages, each indicated by numerical identifier [].
Rank the passages based on their relevance to the search query: {query}.
[1] {passage_1}
[2] {passage_2}
[3] {passage_3}
[4] {passage_4}
...
[20] {passage_20}
Search Query: {query}
Rank the 20 passages above based on their relevance to the search query. All the passages should be included and listed using identifiers, in descending order of relevance. The output format should be [] > [], e.g., [4] > [2]. Only respond with the ranking results, do not say any word or explain.

Given the above prompt, listwise reranking models are trained to output an ordering of k=20 passages, e.g., [1] > [3] > [20] > ... > [19]. The ordering is then parsed into a list of identifiers, and the passages are sorted accordingly. A sliding window approach is usually adopted to rank the top-n passages when n is larger than the window size (k) the model can accept, as explained in the next paragraph. Due to its ability to exploit the generative capability of large language models via zero-shot prompting, listwise reranking is gaining wide attention. However, this format presents a challenge due to its monolithic and lengthy input, since the model must process the information of multiple passages at once; as a result, listwise reranking has largely been limited to LLMs trained with long context lengths. Application of listwise reranking to smaller models such as T5 [19] has therefore been limited to seeing a pair of passages at a time [20], which exhibits quadratic time complexity and is far from practical.

Listwise reranking with ListT5. The ListT5 [7] model jointly considers the relevance of multiple candidate passages at both training and inference time, based on the Fusion-in-Decoder (FiD) [21] architecture. By using the FiD architecture, ListT5 effectively addresses the issue of positional bias, as each passage is processed by the encoder with identical positional encodings; consequently, the decoder cannot exploit positional bias. ListT5 also effectively reduces the input length at the encoder level by computing the listwise comparison in the decoder. Additionally, to extend to diverse scenarios, e.g., reranking k passages given a much larger number n of candidate passages than the model can see at once, ListT5 adopts a hierarchical tournament sort rather than a sliding window, which efficiently caches outputs and does not require multiple evaluations over all n passages. How the sliding window approach and the tournament sort (used by ListT5) work is illustrated in Fig. 3. Typically, LLM-based listwise rerankers rank the top-n (n=100) passages with a window size of 20 and a stride of 10, whereas ListT5 [7] uses listwise reranking with a list size of 5 with FiD and conducts tournament sort for efficient reranking.

Figure 3: Illustration of different sorting strategies for listwise reranking, namely the sliding window approach used by LLM-based listwise rerankers and the tournament sort approach used by ListT5. In the example, the number of total candidate passages n is 8, the window size is 4, and the stride is 2, while the hyperparameter r for ListT5 is 2. Please refer to the ListT5 paper [7] for a more detailed explanation of tournament sort.
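To make the sliding window procedure concrete, the following is a minimal sketch, assuming a hypothetical rerank_window(query, passages) call that stands in for the prompted LLM and returns a string such as "[4] > [2] > ...". The window size of 20 and stride of 10 follow the values mentioned above; the function names and the parsing fallbacks are illustrative rather than taken from any specific implementation.

from typing import Callable, List

def parse_permutation(output: str, window_len: int) -> List[int]:
    # Parse a model output like "[4] > [2] > [20] > ..." into 0-based indices,
    # ignoring malformed or duplicate identifiers and appending any the model omitted.
    seen, order = set(), []
    for token in output.replace(">", " ").split():
        token = token.strip("[]")
        if token.isdigit():
            idx = int(token) - 1
            if 0 <= idx < window_len and idx not in seen:
                seen.add(idx)
                order.append(idx)
    order += [i for i in range(window_len) if i not in seen]  # forgotten passages keep their place
    return order

def sliding_window_rerank(
    query: str,
    passages: List[str],  # top-n candidates, best-first, from the first-stage retriever
    rerank_window: Callable[[str, List[str]], str],  # hypothetical LLM call returning "[4] > [2] > ..."
    window_size: int = 20,
    stride: int = 10,
) -> List[str]:
    # Slide the window from the bottom of the list toward the top so that
    # strong passages can bubble upward across overlapping windows.
    ranked = list(passages)
    start = max(len(ranked) - window_size, 0)
    while True:
        window = ranked[start:start + window_size]
        order = parse_permutation(rerank_window(query, window), len(window))
        ranked[start:start + window_size] = [window[i] for i in order]
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked

ListT5 replaces this repeated windowing with its tournament sort: passages are compared in small groups of 5, the top candidates of each group advance (controlled by the hyperparameter r in Fig. 3), and intermediate outputs are cached so the full candidate set need not be re-scored repeatedly.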
Unlike other listwise reranking models, which require large models with long context lengths, ListT5 is much more computationally efficient, using the relatively small T5 architecture. This is made possible by combining the Fusion-in-Decoder [21] architecture with tournament sort [22]. ListT5 also effectively mitigates the lost-in-the-middle problem, leading to superior performance on zero-shot retrieval [3].

4. Baseline Models
In this section, we detail the models (both baselines and ours) used for the submission and for the additional ablation experiments on the LongEval challenge.

First-stage retrieval models: BM25, RepLLaMA. We mainly used two kinds of first-stage retrieval models: neural bi-encoder models and lexical statistical methods. Lexical statistical models like BM25 [23] measure the relevance between a query and a document based on term statistics. Bi-encoder models, such as DPR [24], measure the similarity of the output embeddings of the query and the document, typically with cosine similarity or a dot product. Since document embeddings can be precomputed offline and reused at inference time, the bi-encoder approach can be served very efficiently with the aid of optimized similarity search engines such as FAISS. While neural first-stage retrievers like ColBERT [25] could be a good option for their effectiveness, efficiency, and ability to capture semantic similarity, BM25 [23], a statistical retriever, consistently demonstrates robust performance, especially on zero-shot retrieval benchmarks [3]. Therefore, in this project, we experiment with both neural and statistical retrievers as the first-stage retrieval system, and treat them as the weakest baselines on top of which various reranking models are applied. Specifically, we combine both retrieval models in a hybrid approach for the submission, and use them independently in further ablation experiments to see the impact of the first-stage retriever on final reranking performance.

Figure 4: Overview of the retrieval process of the submission to the LongEval challenge.

Pointwise reranking - MonoT5. To compare the effectiveness of listwise reranking with that of pointwise reranking under temporal misalignment, we experiment with MonoT5 [18]. MonoT5 is widely known for its effectiveness on zero-shot retrieval. We used the model with the Hugging Face identifier castorini/monot5-3b-msmarco-10k, with a maximum input size of 1024.

Listwise reranking - RankZephyr, RankVicuna, Llama-3-8B zero-shot, and ListT5. Among the listwise reranking models, we experiment with works that have gained attention for their superior performance on the BEIR [3] benchmark and that are fully reproducible with open-source models, namely RankZephyr [12] and RankVicuna [9]. In addition, to test the effectiveness of zero-shot listwise reranking, we experiment with the Llama-3-8B-Instruct [26] model, which was not fine-tuned on the listwise reranking task (and is thus completely zero-shot). All listwise reranking models except ListT5 were run by modifying code provided by the RankLLM repository1. The RankLLM repository provides code to evaluate the RankVicuna and RankZephyr models on pyserini-indexed datasets (such as TREC-DL, MS MARCO, or BEIR).
We applied slight modifications to the original code to accept custom datasets (the LongEval test sets) and to add the Llama-3-8B model to the system (the inference codebase is based on the FastChat2 library, which already supports the Llama-3 model family). The results are saved in the same output format as the other models (MonoT5, ListT5) and then go through the same evaluation process. We used a maximum input size of 4096 for RankZephyr, RankVicuna, and Llama-3-8B, and 1024 for ListT5-3B.

5. Submission Details
5.1. Overview
In this section, we describe the detailed process of our submission to the challenge. Fig. 4 shows the overview of our system. We first describe the corpus selection and the data cleanup process, and then explain the hybrid retrieval process followed by reranking. We submitted two versions: one with MonoT5 and one with ListT5. In summary, the submitted runs used BM25 as the first-stage retrieval model to select the top-1000 documents, RepLLaMA to select the top-100 documents among them, and either MonoT5 or ListT5 for the final reranking, as described in Fig. 4.

1 http://rankllm.ai/
2 https://github.com/lm-sys/FastChat

5.2. Dataset Selection
Selection - language. The LongEval challenge provides the dataset in two languages, French and English. As the dataset was originally collected from the search logs of Qwant, a French search engine, the organizers provide the French version of the documents and queries. For non-French researchers, the LongEval challenge organizers designed an automatic French-to-English translation pipeline that utilizes fastText [27] to detect the language of each sentence in a document and the French-English CUBBITT system [28] to ensure high translation quality [29]. Although Galuscáková et al. [29] state that they limited translation to 500 bytes at a time to reduce catastrophic error propagation, some unintended translation errors inherited from the translator may remain. Even though there may be such unknown translation errors in the English dataset, as non-French researchers we selected the English dataset so that we could qualitatively analyze the results in detail.

Selection - documents. The challenge dataset includes URLs, but we only focused on the text field of the corpus and kept the content without any additional post-processing, except for document cleanup. Some participants in 2023 used the URL field to crawl additional information from the original documents or to simulate the search. As we assume that such methods cannot be applied to the general setting of temporal shift, we decided not to collect or simulate additional data. We focused only on the text field provided in the corpus to ensure that our approach can be applied to other temporal-shift tasks that receive a query and a corpus as input.

5.3. Evaluation
5.3.1. Proxy Metric
To validate our method, we utilized the dataset released in the previous round (2023), as the relevance annotations for 2024 were not released before the challenge submission deadline. We initially set up the evaluation logic with the 2023 test datasets and employed it as a preliminary performance measurement for various experiments to estimate performance on the 2024 test datasets.
5.3.2. Detailed Explanation of Evaluation
For ease of analysis, we recorded all available information at each stage of retrieval, including:
• query id
• query string
• retrieved document id
• retrieved document string
• retrieved document model score
• true relevance annotation (if available)
This information is recorded in JSON Lines format, where each line contains the retrieval results and true relevance annotations of one query from one test dataset. The JSON Lines files are then passed to evaluation logic written with the TREC evaluation code3.

3 We used pytrec_eval (https://github.com/cvangysel/pytrec_eval), the Python wrapper of trec_eval.

5.4. Data Preprocessing - Data Cleanup
We cleaned the documents before running any other experiments. Note that we did not apply any modification techniques to the queries themselves (e.g., query expansion); we used the original queries without modification and only cleaned up the documents. The LongEval retrieval dataset was constructed by extracting search logs from Qwant for selected topics. Because the corpus was extracted from the SERPs of Qwant, its text includes HTML tags, broken encoded strings, and unwanted emails and URLs, as well as line-break characters (\n) and broken encoding artifacts (\u00b), as shown in the following box.

Uncleaned Document
Top Message by razzmouette \u00bb 30 Oct 2012, 17:33\nGiven the MTO I will be comfirme Grevenlingendam tomorrow from 15 to 17 with claques of 25-27 it is quite a lot.\nTop Location:\nSt. Gilles Message by Xav \u00bb Oct.\n2012\n, 17:41 Pareil, Grevenlingendam !\nWe organise a departure from BXL ...\nI have a place in my small car, leaving St Gilles.\nTop Message by Dimitri \u00bb 30 Oct 2012, 18:22 Salute to all, I am looking to sail tomorrow in Grevelingendam too!\nI am leaving from Brussels (Montgommery), if possible to reach a car I would be ultra boiling.\nThe following table shows the results of the survey.\nThank you in advance!\nDim \u00a0 PS: on this website via Manu LVR Top\nResponding\nDeveloped\nby phpBB \u00ae\nForum\nSoftware\n\u00a9 phpBB Limited\n ...

This makes the document difficult to read even for humans, and it also affects the language model, since the model is trained on clean data. Therefore, we apply the dataset cleanup code4, which preprocesses the corpus by:
• replacing line-break characters with the empty string,
• transliterating to the closest ASCII representation,
• fixing Unicode errors,
• replacing phone numbers, URLs, and emails with the empty string,
• replacing HTML tags with the empty string.
For the first four types, we used the cleantext library with the following code.
from cleantext import clean

cleaned_text = clean(
    text,
    fix_unicode=True,            # fix various unicode errors
    to_ascii=True,               # transliterate to closest ASCII representation
    lower=False,                 # lowercase text
    no_line_breaks=True,         # fully strip line breaks as opposed to only normalizing them
    no_urls=True,                # replace all URLs with a special token
    no_emails=True,              # replace all email addresses with a special token
    no_phone_numbers=True,       # replace all phone numbers with a special token
    no_numbers=False,            # replace all numbers with a special token
    no_digits=False,             # replace all digits with a special token
    no_currency_symbols=False,   # replace all currency symbols with a special token
    no_punct=False,              # remove punctuations
    replace_with_punct="",       # instead of removing punctuations you may replace them
    replace_with_url="",
    replace_with_email="",
    replace_with_phone_number="",
    replace_with_number="",
    replace_with_digit="0",
    replace_with_currency_symbol="",
    lang="en"                    # set to 'de' for German special handling
)

4 https://github.com/prasanthg3/cleantext

Then we removed HTML tags with a simple regular expression that matches all text surrounded by angle brackets (e.g., a pattern such as <[^>]+>). After this pre-processing, we obtain a more readable document, which also keeps the retrieval model from suffering from unintended characters, as shown in the following box.

Cleaned Document
Top Message by razzmouette 3̈0 Oct 2012, 17:33 Given the MTO I will be comfirme Grevenlingendam tomorrow from 15 to 17 with claques of 25-27 it is quite a lot. Top Location: St. Gilles Message by Xav Öct. 2012 , 17:41 Pareil, Grevenlingendam ! We organise a departure from BXL ... I have a place in my small car, leaving St Gilles. Top Message by Dimitri 3̈0 Oct 2012, 18:22 Salute to all, I am looking to sail tomorrow in Grevelingendam too! I am leaving from Brussels (Montgommery), if possible to reach a car I would be ultra boiling. The following table shows the results of the survey. Thank you in advance! Dim PS: on this website via Manu LVR Top Responding Developed by phpBB R Forum Software C phpBB Limited

This process reduces the document length by less than 1%, as described in Table 2. These statistics indicate that the majority of the content remains while unreadable characters are successfully removed or replaced.

Table 2: The document length in characters before and after pre-processing. The pre-processing replaces, removes, and transliterates the unreadable characters.

| Dataset | Original Document Length | Cleaned Document Length |
| Train | 4687.3 | 4655.0 |
| Test-Short | 3802.5 | 3775.6 |
| Test-Long | 2611.6 | 2593.1 |

To verify that a more human-readable document improves retrieval performance, we conducted a simple experiment with BM25 on the 2023 dataset. We indexed both the uncleaned and the cleaned corpus with BM25 and measured performance with nDCG@10 and nDCG@100, reported in Table 3.

Table 3: Comparative evaluation of nDCG@10 and nDCG@100 scores between BM25 indices built on the cleaned and uncleaned LongEval 2023 corpus. Applying document cleanup improves ranking scores on average for both nDCG@10 and nDCG@100.

| Metric | Method | Train | Test-Long | Test-Short | Average |
| nDCG@10 | BM25 (uncleaned) | 16.88 | 17.37 | 17.47 | 17.24 |
| nDCG@10 | BM25 (cleaned) | 16.76 | 17.38 | 17.66 | 17.27 |
| nDCG@100 | BM25 (uncleaned) | 24.23 | 25.27 | 24.72 | 24.74 |
| nDCG@100 | BM25 (cleaned) | 24.20 | 25.46 | 25.02 | 24.89 |

The results show that applying document cleanup by filtering out unwanted content yields a slight improvement in performance.
Compared with the uncleaned version, the cleaned BM25 index gained +0.03 nDCG@10 and +0.15 nDCG@100 on average. Seeing this improvement, we decided to use the same cleaned data for the 2024 test set.

5.5. First Stage Retrieval: BM25
BM25 is a lexical sparse retrieval model that utilizes corpus statistics. It is often used by search engines to estimate the relevance of documents to a given search query and is part of the family of probabilistic information retrieval models. BM25 incorporates several heuristics to balance term frequency and document length normalization, making it effective in practical applications. For the implementation, we utilized the Python library castorini/pyserini [30] and used the default arguments for the pyserini.index.lucene command.

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input ${input_corpus_dir} \
  --index ${index_save_dir} \
  --generator DefaultLuceneDocumentGenerator \
  --threads 36 \
  --storePositions --storeDocvectors --storeRaw

python -m pyserini.search.lucene \
  --index ${index_save_dir} \
  --topics ${query_path} \
  --output ${ranking_save_path} \
  --bm25

Using this code with the default parameters, we retrieved the top 1000 documents with BM25. We verified that the recall of the top 1000 documents was already about 0.9 with respect to the true relevance annotations, which indicates that reranking the top 1000 documents can improve the retrieval result.

5.6. Hybrid Retrieval: RepLLaMA-7B
RepLLaMA [31] is a fine-tuned version of the LLaMA language model tailored for multi-stage text retrieval tasks. As RepLLaMA is a dense bi-encoder retrieval model, it encodes the query and the document as separate inputs and measures their similarity with a dot product. The model demonstrates strong zero-shot performance on out-of-domain datasets. Using the Hugging Face checkpoint castorini/repllama-v1-7b-lora-passage and the implementation from [32], we tested RepLLaMA in parallel with BM25. RepLLaMA tokenizes the first 512 tokens of both the document and the query, but takes the hidden state at the position of the EOS token, which is not necessarily the 512th output. For each query, given the top 1000 documents retrieved by BM25, we conduct additional hybrid retrieval with RepLLaMA-7B and select the top 100 among them, which are then handed to the reranking models as candidate documents. All commands and hyperparameters used to index and retrieve documents follow the description in [32]5.

5 https://github.com/texttron/tevatron/tree/main/examples/repllama

5.7. Reranking
We performed reranking with pointwise (MonoT5) and listwise (ListT5) rerankers on the top 100 documents retrieved by BM25 and RepLLaMA.

MonoT5. MonoT5 is widely known for its effectiveness on zero-shot retrieval and performs pointwise reranking. MonoT5 takes a concatenated string of a query q and a document d as input, in the form Query: q Document: d. The model is fine-tuned to return either "true" or "false" to indicate whether the document is relevant to the query. The returned logits are softmaxed to compute the probability of the "true" token, which is used as the relevance score. As MonoT5 is designed for re-ranking, the model iteratively takes each of the top-k documents, concatenated with the query, and outputs its relevance score.
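The following is a minimal sketch of this pointwise scoring, assuming the castorini/monot5-3b-msmarco-10k checkpoint named above and the standard MonoT5 input template ending with "Relevant:"; the exact template and the single-token encoding of "true"/"false" should be verified against the checkpoint, and the helper names are ours.

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "castorini/monot5-3b-msmarco-10k"  # checkpoint used in our experiments
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

# Token ids of the relevance labels MonoT5 was fine-tuned to emit.
TRUE_ID = tokenizer.encode("true", add_special_tokens=False)[0]
FALSE_ID = tokenizer.encode("false", add_special_tokens=False)[0]

@torch.no_grad()
def monot5_score(query: str, document: str, max_length: int = 1024) -> float:
    # Score one (query, document) pair: probability of "true" after one decoding step.
    enc = tokenizer(
        f"Query: {query} Document: {document} Relevant:",
        truncation=True, max_length=max_length, return_tensors="pt",
    )
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**enc, decoder_input_ids=start).logits[0, -1, [FALSE_ID, TRUE_ID]]
    return torch.softmax(logits, dim=-1)[1].item()

def rerank_pointwise(query: str, docs):
    # Pointwise reranking: score each candidate independently and sort by score.
    return sorted(docs, key=lambda d: monot5_score(query, d), reverse=True)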
For a fair comparison between MonoT5 and ListT5, we set k to 30, since the competition measures not only metrics @10 but also beyond @10. The model checkpoint we used for MonoT5 is castorini/monot5-3b-msmarco-10k.

ListT5. In our official submission to Codalab, we used the ListT5-3B model (Hugging Face identifier Soyoung97/ListT5-3b). While the default setup of ListT5 uses r=2 and reranks the top-10 passages, we keep r=2 but run the model to rerank the top-30 passages, to see improvements in nDCG@100 along with nDCG@10. Due to time limitations and the deadline schedule, we used the top-100 retrieved results from BM25 as the first-stage candidates, reranked the top 20 passages with ListT5, and appended the top 1000 results from RepLLaMA.

5.8. Other Details
For the hardware specifications, each model runs on a different system.
• BM25 (Lucene) used CPU only, with around 100 GB of RAM, for about 30 minutes to build indices of 3 GB (short) and 8 GB (long) on SSD.
• RepLLaMA used approximately 30 GB of an NVIDIA A6000 48 GB GPU (batch size 16) for about 17 hours to index 4 shards of 3 GB (short) and 8 GB (long) on SSD.
• MonoT5-3B used the full memory of a single NVIDIA H100 80 GB (batch size 25).
• ListT5-3B used the full memory of a single NVIDIA H100 80 GB (batch size 16).
• The other listwise rerankers (using the rankllm.ai repository) were run on a single NVIDIA H100 80 GB; it took about 20 hours to finish inference on the approx. 2000 short and long queries.

5.9. Submission Results
We conducted multiple experiments and submitted results using MonoT5 and ListT5. Our method reranks only the top 100 for efficiency, but the challenge measures some metrics @all, where @all indicates @1000, as the submission guidelines state that results are taken up to 1000 ranked documents. Therefore, we filled the remaining positions up to the top 1000 with the BM25 results. The evaluation of the submitted runs can be found in Table 4. The results confirm the effectiveness of ListT5, which outperforms MonoT5 on all metrics: by +5.29 (short) and +3.8 (long) nDCG@10, and by +1.51 (short) and +1.66 (long) nDCG@all. From these results, we conclude that, compared with pointwise reranking, listwise reranking also helps to mitigate the temporal misalignment of information retrieval systems. Moreover, to compare retrieval performance under fair conditions, we inferred the language used by most of the submissions on the leaderboard. We found that the highest-ranked participant who used English is mam10eks from team OWS. As far as we can verify, when compared with teams that used only the English dataset, we achieved the top rank in all metrics except for nDCG@all, MAP@all, and P@10 on test-short, where we ranked second.

6. What Makes ListT5 Effective on Temporal Shift?
After the competition, the gold relevance annotations for the 2024 test-short / test-long subsets were released, and we used them to conduct additional experiments that further analyze the effectiveness of ListT5. We first verify whether temporal shift is related to domain shift and whether test-long is more shifted than test-short. Subsequently, we conduct experiments to answer three specific research questions.
Table 4: The official evaluation of our submissions and the best performance with the English dataset (conjectured; the mam10eks team) on test-short (denoted as S) and test-long (denoted as L). The best results on each metric are highlighted in bold.

| Method | nDCG@10 S | nDCG@10 L | nDCG@all S | nDCG@all L | MAP@all S | MAP@all L | P@10 S | P@10 L | Recall@1000 S | Recall@1000 L |
| (Ours) ListT5 | 33.45 | 25.07 | 23.00 | 19.33 | 19.67 | 14.23 | 18.09 | 14.29 | 59.92 | 40.18 |
| (Ours) MonoT5 | 28.16 | 21.27 | 21.49 | 17.67 | 17.67 | 12.74 | 16.78 | 13.13 | 42.45 | 29.45 |
| (Other team) mam10eks | 33.32 | 24.26 | 24.06 | 19.02 | 19.94 | 13.87 | 18.31 | 13.96 | 55.29 | 36.59 |

Figure 5: IDF similarity between each dataset and MS MARCO, measured with the Jensen-Shannon divergence. A larger value indicates a more similar vocabulary distribution.

6.1. Hypothesis: Invariance of ListT5 is Pronounced under Bigger Shifts
We hypothesize that the most important factor distinguishing ListT5 from other listwise reranking variants is its permutation-invariance property, and that this property becomes more pronounced under bigger domain shifts. To show this, along with other analyses, we first analyze the degree of domain shift between the short and long subsets.

Temporal shift is correlated with domain shift. The distribution shift of documents is a primary concern in information retrieval. Neural models demonstrate strong performance on test datasets that share the attributes of their training data, but struggle to retrieve documents from test corpora whose distribution differs from the training corpus. This is also pointed out in the BEIR paper [3], which concludes that there is no correlation between models that excel on in-domain test sets and those that excel on out-of-domain test sets. In real search scenarios, continuously retraining the retrieval model as the corpus is updated is computationally expensive and difficult, so the model is trained up to a certain point, fixed, and then used. The retrieval model then experiences a performance drop as time elapses, since it struggles to retrieve recent documents whose distribution has changed relative to the training data. This scenario closely resembles the LongEval challenge, where the document collection is updated over time, which is why we assume that temporal shift is one form of distribution shift and that handling temporal shift can be attained by resolving distribution shift. To measure the extent of the distribution shift between the train, test-short, and test-long subsets of LongEval-2024, we employ inverse document frequency (IDF) and measure similarity with the Jensen-Shannon divergence. Each document is tokenized with the T5 tokenizer (Soyoung97/ListT5-3b) and truncated to 1024 tokens, with special tokens excluded. In addition to the LongEval-2024 training corpus, MS MARCO is included as an in-domain reference corpus, since we target a zero-shot setting.
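As a rough illustration, the sketch below shows one way such an IDF-based comparison could be computed. The paper text does not specify the exact IDF weighting or how the similarity values in Figure 5 are scaled, so the smoothing, normalization, and the 1 minus JS-divergence similarity used here are our assumptions and will not reproduce the reported numbers exactly.

import numpy as np
from collections import Counter
from scipy.spatial.distance import jensenshannon
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Soyoung97/ListT5-3b")  # T5 tokenizer referenced above

def idf_distribution(corpus, max_length=1024):
    # Build a smoothed IDF vector over the tokenizer vocabulary and normalize it
    # so it can be compared as a probability distribution.
    df = Counter()
    for doc in corpus:
        ids = tokenizer(doc, add_special_tokens=False, truncation=True,
                        max_length=max_length)["input_ids"]
        df.update(set(ids))  # document frequency: count each token at most once per document
    n_docs = len(corpus)
    idf = np.zeros(len(tokenizer))
    for token_id, freq in df.items():
        idf[token_id] = np.log((1 + n_docs) / (1 + freq))  # smoothed, non-negative IDF
    return idf / idf.sum()

def idf_similarity(corpus_a, corpus_b):
    # Higher = more similar vocabulary distributions (1 minus the JS divergence).
    p, q = idf_distribution(corpus_a), idf_distribution(corpus_b)
    return 1.0 - jensenshannon(p, q, base=2) ** 2  # jensenshannon returns the JS distance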
Results in Figure 5 show that the test-long subset (47.3) is more out-of-domain than the test-short subset (49.8), since lower values indicate a greater distance from the in-domain distribution. From this experiment, we conclude that temporal shift is correlated with domain shift and that test-long is more shifted than test-short, which follows our original conjecture.

6.2. Research Questions & Experimental Setup
Figure 6 gives an overview of our experiments testing the effectiveness of ListT5 from various aspects.

Figure 6: Overview of the additional experiment setup. Compared with the hybrid retrieval process used in the actual submission (Fig. 4), we independently use two different first-stage retrievers (BM25 vs. RepLLaMA) and compare diverse rerankers, including MonoT5 (pointwise), RankVicuna, RankZephyr, Llama-3-8B (zero-shot prompting), and ListT5-3B, on the test-short and test-long subsets of the LongEval 2024 test set.

Having established that temporal shift is related to domain shift, we try to answer three research questions regarding the effectiveness of ListT5 in the temporal domain:
1. Is listwise reranking more effective than pointwise reranking for temporal persistence?
2. Is the invariance property of ListT5 (from FiD) more pronounced under a bigger shift?
3. Does this effect hold when using different first-stage retrievers?
To answer these three research questions, we conduct additional experiments and report results for different baseline models. As shown in Fig. 6, we experiment with two different first-stage retrieval models and five different reranking models, including pointwise and listwise rerankers as well as ListT5, and report nDCG, MAP, Precision, and Recall. For each evaluation metric, we report both @10 and @100 for a thorough analysis. Mainly, we compare the nDCG@10 and nDCG@100 results in Table 5, and use the results on the other metrics (Tables 6, 7, 8) as supporting evidence.

Table 5: Comparative evaluation of nDCG@10 and nDCG@100 scores between the baselines and ours on the test-short and test-long subsets. The best performance on each dataset is highlighted in bold.

| Reranker | BM25 nDCG@10 short | BM25 nDCG@10 long | BM25 nDCG@100 short | BM25 nDCG@100 long | RepLLaMA nDCG@10 short | RepLLaMA nDCG@10 long | RepLLaMA nDCG@100 short | RepLLaMA nDCG@100 long |
| Initial | 16.09 | 13.74 | 24.40 | 18.57 | 21.04 | 17.17 | 29.82 | 22.29 |
| MonoT5-3B | 21.49 | 17.67 | 28.16 | 21.27 | 21.72 | 18.77 | 30.35 | 23.27 |
| Llama3-8B | 15.03 | 12.72 | 23.49 | 17.65 | 20.19 | 16.22 | 29.08 | 21.41 |
| RankVicuna-7B | 18.68 | 15.28 | 26.41 | 19.91 | 19.74 | 16.46 | 29.22 | 22.19 |
| RankZephyr-7B | 22.72 | 17.70 | 29.06 | 21.47 | 23.96 | 19.18 | 31.90 | 23.75 |
| ListT5-3B | 22.41 | 18.21 | 28.62 | 21.65 | 23.11 | 19.45 | 31.27 | 23.80 |

6.2.1. RQ1. Listwise Reranking vs. Pointwise Reranking
Looking at the tables, we see that listwise reranking is generally more effective than pointwise reranking: RankZephyr and ListT5 score considerably higher than MonoT5-3B on both the short and long subsets. However, fine-tuning appears to be crucial for rerankers, since the zero-shot results of Llama3-8B are not as strong as those of its fine-tuned counterparts.

6.2.2. RQ2. Invariant Listwise Reranking (ListT5)
Comparing ListT5 with the other listwise reranking models, we notice an interesting finding. On the short subset (with less temporal shift), RankZephyr is more effective; on the long subset, however, ListT5, with its invariance property, is more effective. This trend is consistent across all metrics, including nDCG, MAP, Precision, and Recall, regardless of the cutoff (@10, @100). We believe that robustness to positional bias helped ListT5 outperform its counterparts on the long subset, where the initial ordering is less reliable. We hypothesize that the invariance property of ListT5 will become even more beneficial as the temporal shift grows larger.

6.2.3. RQ3. Impact of First-Stage Retrievers
While we used hybrid retrieval, a combination of BM25 and RepLLaMA, for the submission, we believed it was necessary to compare statistical and neural retrievers separately, to see how the effectiveness of the initial retriever affects reranking performance. Thus, we experiment and report results using either BM25 or RepLLaMA.
The results show that while RepLLaMA is a better first-stage retrieval model than BM25 (higher precision and higher recall), the relative ordering between ListT5 and the baseline models does not change much, and most of the observed properties hold without significant differences.

6.3. Summary of Findings
By investigating the three research questions, we conclude that permutation-invariant listwise reranking (ListT5) is an effective method not only for general out-of-domain data but also for mitigating temporal drift, and that these findings hold regardless of the choice of first-stage retriever. It is particularly interesting that ListT5 performs better than RankZephyr-7B on the test-long subset, where the distribution shift is more pronounced.

7. Conclusion and Future Work
In this paper, we analyze the effectiveness of listwise reranking (ListT5) on the LongEval Challenge set to investigate its behavior in the temporal misalignment scenario. Our findings (based on the Jensen-Shannon divergence) suggest that temporal misalignment can be viewed as one form of out-of-distribution scenario. Compared with other listwise reranking methods such as RankZephyr, we find that permutation-invariant listwise reranking becomes more effective as temporal drift increases, with ListT5 achieving higher performance on the test-long subset at roughly half the model size. Building on this, we aim to develop a search engine capable of delivering robust and stable results even as the available document set changes over time.

Table 6: Comparative evaluation of MAP@10 and MAP@100 scores between the baselines and ours on the test-short and test-long subsets. The best performance on each dataset is highlighted in bold.

| Reranker | BM25 MAP@10 short | BM25 MAP@10 long | BM25 MAP@100 short | BM25 MAP@100 long | RepLLaMA MAP@10 short | RepLLaMA MAP@10 long | RepLLaMA MAP@100 short | RepLLaMA MAP@100 long |
| Initial | 10.39 | 8.24 | 13.44 | 9.70 | 13.94 | 10.66 | 17.63 | 12.35 |
| MonoT5-3B | 14.74 | 11.41 | 17.67 | 12.74 | 14.86 | 11.95 | 18.46 | 13.54 |
| Llama3-8B | 9.21 | 7.29 | 12.35 | 8.76 | 12.71 | 9.70 | 16.45 | 11.40 |
| RankVicuna-7B | 13.06 | 9.64 | 16.14 | 11.12 | 13.52 | 10.30 | 17.32 | 12.17 |
| RankZephyr-7B | 15.94 | 11.50 | 18.74 | 12.79 | 16.43 | 12.49 | 20.04 | 14.04 |
| ListT5-3B | 15.71 | 12.01 | 18.32 | 13.23 | 16.26 | 12.74 | 19.51 | 14.24 |

Table 7: Comparative evaluation of Precision@10 and Precision@100 scores between the baselines and ours on the test-short and test-long subsets. The best performance on each dataset is highlighted in bold. Notice that the Precision@100 scores are identical across rerankers (2.91 short / 1.98 long with BM25 and 3.41 short / 2.29 long with RepLLaMA), since all model variants only rerank the candidates provided by the first-stage retrieval models.

| Reranker | BM25 P@10 short | BM25 P@10 long | RepLLaMA P@10 short | RepLLaMA P@10 long |
| Initial | 12.82 | 10.30 | 16.88 | 12.82 |
| MonoT5-3B | 16.78 | 13.13 | 17.35 | 14.15 |
| Llama3-8B | 12.57 | 10.15 | 16.68 | 12.65 |
| RankVicuna-7B | 14.79 | 10.92 | 15.99 | 11.87 |
| RankZephyr-7B | 17.23 | 12.57 | 18.94 | 13.96 |
| ListT5-3B | 17.56 | 13.40 | 18.17 | 14.38 |

Table 8: Comparative evaluation of Recall@10 and Recall@100 scores between the baselines and ours on the test-short and test-long subsets. The best performance on each dataset is highlighted in bold. Notice that the Recall@100 scores are identical across rerankers (42.45 short / 29.45 long with BM25 and 49.28 short / 33.81 long with RepLLaMA), since all model variants only rerank the candidates provided by the first-stage retrieval models.
| Reranker | BM25 Recall@10 short | BM25 Recall@10 long | RepLLaMA Recall@10 short | RepLLaMA Recall@10 long |
| Initial | 19.23 | 15.78 | 25.29 | 19.65 |
| MonoT5-3B | 25.45 | 20.10 | 26.29 | 21.90 |
| Llama3-8B | 18.90 | 15.53 | 24.99 | 19.41 |
| RankVicuna-7B | 22.23 | 16.67 | 23.88 | 18.13 |
| RankZephyr-7B | 25.74 | 19.28 | 28.16 | 21.25 |
| ListT5-3B | 26.56 | 20.52 | 27.40 | 22.19 |

8. Limitation
The LongEval Challenge dataset is collected from the French search engine Qwant. The dataset was therefore collected in French (e.g., it mostly uses the Euro (€) to represent currency). However, as we are not native French speakers, it is difficult for us to utilize the provided French document subset, so we conducted all experiments in English. As the challenge leaderboard shows, however, the methods that utilized French exhibited a large performance gap over the English-based results. For future work, we hope to build a multilingual version of ListT5, which is currently limited to English. We believe a multilingual ListT5 would drastically improve its applicability to a much wider range of domains. Investigating and improving temporal-shift handling in a multilingual setting also seems to be an interesting next step. Lastly, due to time limitations, we had to choose the hybrid approach for the challenge submission. Our results would likely have been better if we had used RepLLaMA as the single first-stage retrieval model and reranked the top 1000 passages instead of the top 100; however, due to time and computing resource limitations, we were unable to submit such results on time. Acknowledging these limitations, we would like to participate again if the competition is held in the future.

References
[1] P. Röttger, J. Pierrehumbert, Temporal adaptation of BERT and performance on downstream document classification: Insights from social media, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2400–2412. URL: https://aclanthology.org/2021.findings-emnlp.206. doi:10.18653/v1/2021.findings-emnlp.206.
[2] P. Galuscáková, R. Deveaud, G. Gonzalez-Saez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, Longeval-retrieval: French-english dynamic test collection for continuous web search evaluation, 2023. arXiv:2303.03229.
[3] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models, 2021. arXiv:2104.08663.
[4] G. Rosa, L. Bonifacio, V. Jeronymo, H. Abonizio, M. Fadaee, R. Lotufo, R. Nogueira, In defense of cross-encoders for zero-shot retrieval, 2022. arXiv:2212.06121.
[5] X. Ma, X. Zhang, R. Pradeep, J. Lin, Zero-shot listwise document reranking with a large language model, 2023. arXiv:2305.02156.
[6] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is chatgpt good at search? investigating large language models as re-ranking agents, 2023. arXiv:2304.09542.
[7] S. Yoon, E. Choi, J. Kim, Y. Kim, H. Yun, S.-w. Hwang, Listt5: Listwise reranking with fusion-in-decoder improves zero-shot retrieval, 2024. arXiv:2402.15838.
[8] H. Zhuang, Z. Qin, R. Jagerman, K. Hui, J. Ma, J. Lu, J. Ni, X. Wang, M. Bendersky, Rankt5: Fine-tuning t5 for text ranking with ranking losses, 2022. arXiv:2210.10634.
[9] R. Pradeep, S. Sharifymoghaddam, J. Lin, Rankvicuna: Zero-shot listwise document reranking with open-source large language models, 2023. arXiv:2309.15088.
[10] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Instruction distillation makes large language models efficient zero-shot rankers, 2023. arXiv:2311.01555.
[11] R. Xian, H. Zhuang, Z. Qin, H. Zamani, J. Lu, J. Ma, K. Hui, H. Zhao, X. Wang, M. Bendersky, Learning list-level domain-invariant representations for ranking, 2023. arXiv:2212.10764.
[12] R. Pradeep, S. Sharifymoghaddam, J. Lin, Rankzephyr: Effective and robust zero-shot listwise reranking is a breeze!, 2023. arXiv:2312.02724.
[13] B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, W. W. Cohen, Time-aware language models as temporal knowledge bases, Transactions of the Association for Computational Linguistics 10 (2022) 257–273. URL: https://aclanthology.org/2022.tacl-1.15. doi:10.1162/tacl_a_00459.
[14] K. Luu, D. Khashabi, S. Gururangan, K. Mandyam, N. A. Smith, Time waits for no one! analysis and challenges of temporal misalignment, 2022. arXiv:2111.07408.
[15] J. Kasai, K. Sakaguchi, Y. Takahashi, R. L. Bras, A. Asai, X. Yu, D. Radev, N. A. Smith, Y. Choi, K. Inui, Realtime qa: What's the answer right now?, 2024. arXiv:2207.13332.
[16] Z. Zhang, M. Fang, L. Chen, Retrievalqa: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering, 2024. arXiv:2402.16457.
[17] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[18] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.63. doi:10.18653/v1/2020.findings-emnlp.63.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[20] R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, 2021. arXiv:2101.05667.
[21] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question answering, 2021. arXiv:2007.01282.
[22] K. McLuckie, A. Barber, Tournament Sort, Macmillan Education UK, London, 1986, pp. 68–86. URL: https://doi.org/10.1007/978-1-349-08147-9_5. doi:10.1007/978-1-349-08147-9_5.
[23] S. Robertson, H. Zaragoza, The probabilistic relevance framework: Bm25 and beyond, Foundations and Trends in Information Retrieval 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.
[24] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://www.aclweb.org/anthology/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
[25] O. Khattab, M. Zaharia, Colbert: Efficient and effective passage search via contextualized late interaction over bert, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39–48.
[26] AI@Meta, Llama 3 model card, 2024. URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[27] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, arXiv preprint arXiv:1607.01759 (2016).
[28] M. Popel, M. Tomkova, J. Tomek, Ł. Kaiser, J. Uszkoreit, O. Bojar, Z. Žabokrtský, Transforming machine translation: a deep learning system reaches news translation quality comparable to human professionals, Nature Communications 11 (2020) 1–15.
[29] P. Galuscáková, R. Deveaud, G. González Sáez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel, Longeval-retrieval: French-english dynamic test collection for continuous web search evaluation, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 3086–3094.
[30] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 2356–2362.
[31] X. Ma, L. Wang, N. Yang, F. Wei, J. Lin, Fine-tuning llama for multi-stage text retrieval, arXiv preprint arXiv:2310.08319 (2023).
[32] L. Gao, X. Ma, J. Lin, J. Callan, Tevatron: An efficient and flexible toolkit for dense retrieval, ArXiv abs/2203.05765 (2022).