<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analyzing the Effectiveness of Listwise Reranking with Positional Invariance on Temporal Generalizability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soyoung Yoon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jongyoon Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seung-won Hwang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Seoul National University (SNU)</institution>
          ,
          <addr-line>1 Gwanak-ro, Gwanak-gu, Seoul 08826</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Benchmarking of information retrieval (IR) methods is mostly conducted on a fixed set of documents (static corpora). However, in real-world web search engine environments, the document set is continuously updated and expanded. Addressing this discrepancy and measuring the temporal persistence of IR systems is crucial. By investigating the LongEval benchmark, which is specifically designed for such dynamic environments, we demonstrate the effectiveness of a listwise reranking approach, which proficiently handles inaccuracies induced by temporal distribution shifts. Among listwise rerankers, we find that ListT5, which effectively mitigates the positional bias problem by adopting the Fusion-in-Decoder architecture, is especially effective, and more so as temporal drift increases, as seen on the test-long subset.</p>
      </abstract>
      <kwd-group>
        <kwd>information retrieval</kwd>
        <kwd>listwise reranking</kwd>
        <kwd>temporal misalignment</kwd>
        <kwd>positional bias</kwd>
        <kwd>fusion-in-decoder</kwd>
        <kwd>reranking</kwd>
        <kwd>generative retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The majority of studies on information retrieval systems concentrate on benchmarks that target
static snapshots of knowledge. This leaves a gap in our understanding of how these models fare in
dynamic environments where knowledge is temporal and constantly accumulating, and where
adaptiveness to new information is especially valuable. Moreover, unlike statistical retrieval systems such as
BM25, neural retrieval models have been found to underperform on unseen data without prior
training, posing a challenge for direct application to temporal updates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Attempting to navigate
these issues with naïve full fine-tuning is computationally expensive, prone to excessive forgetting,
and ultimately impractical. Under these circumstances, improving the temporal persistence of
the retrieval model, i.e., its robustness with respect to changes over time, is an
important research direction that deserves attention. The LongEval Retrieval Challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims to
specifically target this problem, aligning more closely with real-world retrieval applications and
scenarios. Meanwhile, we believe that temporal change can also be viewed as one specific form
of distribution shift, and that retrieval methods that are effective in handling
out-of-domain data could also be effective for temporal persistence. BEIR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is regarded as one
of the well-known benchmarks that evaluate a model's out-of-distribution retrieval performance.
Findings so far suggest [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that reranking is very effective for out-of-distribution (OOD) retrieval. Specifically,
a line of work on listwise reranking, a format that sees multiple passages at once when
reranking [
        <xref ref-type="bibr" rid="ref10 ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9, 10</xref>
        ], has been shown to achieve state-of-the-art performance on BEIR. Listwise
rerankers can condition on and compare multiple passages to better calibrate relevance scores, and
can reduce the inaccuracy of predictions arising from domain shift, as theoretically supported by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
and empirically evidenced by a line of work such as [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ].
      </p>
      <p>
        However, naive application of listwise reranking with LLMs is known to suffer from the lost-in-the-middle
problem [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], favoring passages presented in the first and last positions of the listwise input. Also, the
large parameter count of such models incurs high computational cost and negatively affects
efficiency. Recently, a model named ListT5 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which uses the Fusion-in-Decoder (FiD) architecture to conduct listwise
reranking, has been shown to be both efficient and effective, mitigating positional
bias through FiD and performing well despite its relatively small model size.
      </p>
      <p>
        In this paper, we aim to bridge this gap by participating in the LongEval Retrieval Challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a competition for temporal knowledge retrieval that aligns more closely with real-world
retrieval applications and scenarios, as illustrated in Figure 1. In particular, we frame temporal knowledge
adaptation as another form of zero-shot domain retrieval and investigate the effectiveness of ListT5,
a listwise reranker with positional invariance, on the LongEval task. Our findings show that ListT5, despite
its smaller model size, is more effective than RankZephyr [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and other reranking variants, and is
especially effective as the temporal shift becomes longer, showing superior performance on the test-long
subset.
      </p>
      <p>Our experiments on this benchmark reveal that applying listwise reranking greatly helps generalization
under temporal misalignment, ensuring flexibility for temporal knowledge accumulation, even without
further training.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The LongEval Challenge</title>
      <p>
        Started in 2023, the LongEval Challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], illustrated in Figure 1, is a shared task designed to evaluate the
temporal persistence of information retrieval systems. It addresses the challenge of maintaining model
performance over time as test data becomes increasingly distant from the training data. LongEval
sets itself apart from traditional IR tasks by focusing on the development of systems that can adapt to
dynamic temporal text changes, introducing time as a new dimension for ranking model performance.
      </p>
      <p>Our method was evaluated on the LongEval retrieval benchmark datasets from the 2023 competition.
The LongEval retrieval task aims to address the distribution shift between the training and test datasets,
which occurs due to differences in the timing of data collection. To assess the resilience and temporal
alignment of the retrieval system, the task offers two test datasets: test-short, which has a
small time shift from the training dataset, and test-long, which has a more significant time shift.
In 2023, the test-short dataset exhibited a 1-month shift from the training dataset, while the test-long
dataset showed a 3-month shift. In 2024, the test-short dataset had a 5-month shift, and the test-long
dataset had a 7-month shift.</p>
      <p>Across the train, test-short, and test-long datasets of both years, queries are approximately 2 words
long, which suggests that almost all of them are keyword queries.
Documents, in contrast, are crawled and collected from Qwant search logs and are
often long, with around 800 words each. The number of documents is about 1.5 million in most cases.</p>
      <p>The LongEval retrieval benchmark provides relevance annotations, which are constructed from
the attractiveness probability of a Dynamic Bayesian Network (DBN) click model trained on Qwant
data. Because the click model uses search logs as implicit feedback, i.e., whether the user clicked a document or
stayed on it beyond a certain threshold time, each query has roughly
4 relevance annotations. All statistics can be found in Table 1.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <sec id="sec-3-1">
        <title>3.1. Temporal Retrieval</title>
        <p>
          While there is a large body of prior work on temporal updates of language models themselves [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
or real-time QA models [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], there are relatively few works [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] on the development of
adaptive retrieval systems. In an era where retrieval-augmented generation systems are well known
and widely used for their effectiveness and performance [17], it is crucial, along with developing the adaptivity of
the generation model for downstream tasks, to develop retrieval models that better adapt to
new information.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Listwise Reranking</title>
        <p>
          Pointwise vs. listwise reranking. Until now, the field of zero-shot reranking has been generally
driven by cross-encoder models [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] such as MonoT5 [18]. As shown in Figure 2, MonoT5 is a pointwise
reranker, which leverages the probabilities of pre-defined tokens (i.e., true / false) as relevance scores at
inference time. However, while being efficient (requiring O(n) forward passes to rank n
passages), these models score each passage independently, and thus lack the ability to
compare passages against each other at inference time. This could lead to a sub-optimal solution in the
task of reranking, where the discrimination and ordering between passages are crucial. Unlike MonoT5,
listwise rerankers consider the relative relevance between documents, and are thus more robust to domain
shift [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Figure 2 illustrates how pointwise and listwise rerankers differ.
        </p>
        <p>
          Listwise reranking with LLM prompting. Listwise reranking [
          <xref ref-type="bibr" rid="ref10 ref12 ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9, 10, 12</xref>
          ] is a line of work
that gives multiple passages to the model at once and outputs a permutation, i.e., an ordering of the
passages by their relevance to the input query. Specifically, the prompts are formulated similarly to the
following:
        </p>
        <p>I will provide you with 20 passages, each indicated by numerical identifier [].
Rank the 20 passages above based on their relevance to the search query.
All the passages should be included and listed using identifiers, in descending
order of relevance. The output format should be [] &gt; [], e.g., [4] &gt; [2].
Only respond with the ranking results, do not say any word or explain.</p>
        <p>
          Given the above prompt, listwise reranking models are trained to output an ordering of the 20 passages,
e.g., [1] &gt; [3] &gt; [20] &gt; ... &gt; [19]. The ordering is then parsed into a list of numbers, and the
passages are sorted accordingly. A sliding window approach is usually adopted when the number of
passages to rank is larger than the window size the model can accept, as explained
below. Because it can exploit the generative capability of large language models through
zero-shot prompting, listwise reranking is gaining wide attention. However, this format presents a
challenge due to its monolithic and lengthy input, since the model must process the information of
multiple passages at once. As a result, listwise reranking has been limited to
LLMs trained with large context lengths, and its application to smaller models
such as T5 [19] has been limited to seeing a pair of passages at once [20], which exhibits quadratic time
complexity and is far from practical.
        </p>
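        <p>As an illustration of the parsing step described above, the following minimal Python sketch converts a model response into a passage ordering. The function name parse_permutation and the fallback for omitted or invalid identifiers are assumptions made for this example; they are not taken from any of the cited implementations.</p>
        <preformat><![CDATA[
```python
import re


def parse_permutation(response, num_passages):
    """Convert a response such as "[2] > [1] > [3]" into 0-based passage indices."""
    order = []
    for token in re.findall(r"\[(\d+)\]", response):
        index = int(token) - 1  # identifiers in the prompt are 1-based
        # Skip identifiers that are out of range or repeated.
        if 0 <= index < num_passages and index not in order:
            order.append(index)
    # Assumption of this sketch: passages the model omitted keep their original relative order.
    missing = [i for i in range(num_passages) if i not in order]
    order.extend(missing)
    return order


passages = ["passage a", "passage b", "passage c", "passage d"]
ranking = parse_permutation("[3] > [1] > [3] > [9]", len(passages))
reranked = [passages[i] for i in ranking]
```
]]></preformat>
        <p>In this toy call, the repeated identifier [3] and the out-of-range identifier [9] are ignored, and the two passages the response never mentions are appended in their original order.</p>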
        <p>
          Listwise reranking with ListT5. The ListT5 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] model jointly considers the relevance of multiple
candidate passages at both training and inference time, based on the Fusion-in-Decoder (FiD) [21]
architecture. By using the FiD architecture, ListT5 effectively addresses the issue of positional bias,
as each passage is processed by the encoder with identical positional encoding. Consequently, the
decoder cannot exploit positional bias. ListT5 also reduces the input length at the
encoder level, since the listwise comparison is computed in the decoder. Additionally, to better extend to
diverse scenarios, e.g., reranking the top k passages when the number n of candidate passages is much larger than
what the model can see at once, ListT5 adopts a hierarchical tournament sort approach rather than a sliding
window approach. Tournament sort allows outputs to be cached efficiently and does not require multiple evaluations over the entire
n passages.
        </p>
        <p>Figure 3 illustrates the two strategies for listwise reranking: the sliding window approach
and the tournament sort used by ListT5. Typically, sliding-window rerankers rank the top-n
(n=100) passages with a window size of 20 and a stride of 10.</p>
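        <p>The sliding window strategy can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code of any cited system: rank_window is a hypothetical callable standing in for one listwise model call, and the windows are processed from the end of the list towards the front, which is the convention we assume here for prompting-based rerankers.</p>
        <preformat><![CDATA[
```python
def sliding_window_rerank(candidates, rank_window, window_size=20, stride=10):
    """Rerank candidates back to front with overlapping windows.

    rank_window is a hypothetical callable that returns the passages of one
    window ordered from most to least relevant (e.g., one listwise model call).
    """
    ranked = list(candidates)
    end = len(ranked)
    while True:
        start = max(0, end - window_size)
        ranked[start:end] = rank_window(ranked[start:end])
        if start == 0:
            break
        end -= stride  # the best passages of this window join the next one
    return ranked


# Toy stand-in for the model: larger numbers are treated as more relevant.
calls = []


def toy_rank_window(window):
    calls.append(len(window))
    return sorted(window, reverse=True)


result = sliding_window_rerank(list(range(100)), toy_rank_window)
```
]]></preformat>
        <p>With 100 candidates, a window size of 20, and a stride of 10, the sketch issues 9 window calls, and the 10 most relevant items of the toy input end up at the front even though they started at the back. The stride should not exceed the window size, otherwise some passages would never be seen.</p>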
        <p>
          ListT5 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] uses listwise reranking with a list size of 5 with FiD and conducts tournament sort for
efficient reranking. Unlike other listwise reranking models, which need large models with long context
lengths, ListT5 is much more computationally efficient, using the relatively small T5 architecture.
This was made possible by combining the Fusion-in-Decoder [21] architecture with tournament sort [22].
ListT5 also effectively mitigates the lost-in-the-middle problem, thereby achieving superior performance
on zero-shot retrieval [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Baseline Models</title>
      <p>In this section, we detail the models used for both the submission and the additional
ablation experiments (both baselines and ours) for the LongEval challenge.</p>
      <p>
        First-stage retrieval model: BM25, RepLLaMA. We mainly used two different types of first-stage retrieval
models: neural bi-encoder models and lexical statistical methods. Lexical statistical
models like BM25 [23] measure the relevance between query and document based on the words they contain.
Bi-encoder models, such as DPR [24], measure the similarity between the output embeddings of
the query and the document, typically with cosine similarity or dot product. Since the embeddings of
the documents can be precomputed offline and reused at inference time, the bi-encoder
approach can be computed very efficiently with the aid of optimized similarity search engines such
as FAISS. While neural first-stage retrievers like ColBERT [25] could be a good option for their
effectiveness, efficiency, and ability to capture semantic similarities, BM25 [23], which is a
statistical retriever, consistently demonstrates robust performance, especially on zero-shot retrieval
benchmarks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, in this project, we experiment with both neural and
statistical retrievers as the first-stage retrieval system, and treat them as the weakest baselines on top of
which the various reranking models are applied. Specifically, we used both retrieval models in a hybrid
approach for submission, and used them independently in further ablation experiments to see the impact
of first-stage retrievers on final reranking performance.
      </p>
      <p>Pointwise reranking - MonoT5. To compare the effectiveness of listwise reranking against
pointwise reranking under temporal misalignment, we experiment with MonoT5 [18]. MonoT5 is widely
known for its effectiveness on zero-shot retrieval. We used the model with the Hugging Face identifier
castorini/monot5-3b-msmarco-10k, with a maximum input size of 1024.</p>
      <p>
        Listwise reranking - RankZephyr, RankVicuna, Llama-3-8B zero-shot, and ListT5. Among
the listwise reranking models, we experiment with those that are gaining attention for their superior
performance on the BEIR [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] benchmark and that are fully reproducible with
open-source models, namely RankZephyr [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and RankVicuna [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In addition, to test the effectiveness
of zero-shot listwise reranking, we experiment with the Llama-3-8B-Instruct [26] model, which was not
fine-tuned on the listwise reranking task (and is thus completely zero-shot). All listwise reranking
models except ListT5 were run by modifying the code provided by the RankLLM repository<sup>1</sup>.
The RankLLM repository provides code to evaluate the RankVicuna and RankZephyr models on
Pyserini-indexed datasets (such as TREC-DL, MS MARCO, or BEIR). We applied slight modifications to
the original code so that it accepts custom datasets (the LongEval test set) and the Llama-3-8B model
(the inference codebase is built on the FastChat<sup>2</sup> library, which already supports
Llama-3 family models). The results are saved in the same output format as for the other models (MonoT5,
ListT5) and then go through the same evaluation process. We used a maximum input size
of 4096 for RankZephyr, RankVicuna, and Llama-3-8B, and 1024 for ListT5-3B.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Submission Details</title>
      <sec id="sec-5-1">
        <title>5.1. Overview</title>
        <p>In this section, we describe the detailed process of our submission to the challenge. Figure 4 gives an overview
of our system. We first describe the choice of corpus and the data cleanup process, and then explain
the hybrid retrieval process followed by reranking. We submitted two versions:
one with MonoT5 and one with ListT5. In summary, the submitted runs used BM25 as the first-stage
retrieval model to select the top 1000 documents, RepLLaMA to select the top 100 documents among them,
and either MonoT5 or ListT5 for the final reranking process, as depicted in Figure 4.</p>
        <sec id="sec-5-1-1">
          <title>1 http://rankllm.ai/ 2 https://github.com/lm-sys/FastChat</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Dataset Selection</title>
        <p>Selection-language. The LongEval challenge provides the dataset in two languages, French and
English. As the dataset was originally collected from the search logs of Qwant, a French search engine,
the documents and queries are originally in French. For non-French researchers, the LongEval
challenge organizers designed an automatic French-to-English translation pipeline that uses fastText [27] to
detect the language of each sentence in the document and the French-English CUBBITT [28] system to ensure
high translation quality [29]. Although Galuscáková et al. [29] state that they limited the translation length
to 500 bytes at a time to reduce catastrophic error propagation, some unintended
translations may remain due to faults inherent in the translator. Even though the English dataset might
contain unknown translation errors, we selected it because, as non-French researchers, we wanted
to analyze the results qualitatively in detail.</p>
        <p>Selection-documents. The challenge dataset includes URLs, but we focused only on the text field
of the corpus and kept its content without any additional post-processing other than document
cleanup. Some participants in 2023 used the URL fields to crawl additional
information from the original documents or to simulate the search. As we assume that these methods
cannot be applied to the general setting of temporal shift, we decided not to collect or
simulate additional data. We used only the text field provided in the corpus to ensure that our approach
can be applied to other temporal shift tasks that take a query and a corpus as input.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation</title>
        <sec id="sec-5-3-1">
          <title>5.3.1. Proxy Metric</title>
          <p>To validate our method, we used the dataset released in the previous round (2023), as the relevance
annotations for 2024 were not released before the challenge submission deadline. We initially
set up the evaluation logic with the 2023 test datasets and employed it as a preliminary performance
measurement for various experiments to estimate performance on the 2024 test datasets.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Detailed Explanation of Evaluation</title>
          <p>For ease of analysis, we recorded all available information at each stage of retrieval, including:
• query id
• query string
• retrieved document id
• retrieved document string
• retrieved document model score
• true relevance annotation (if available)</p>
          <p>This information is recorded in JSON Lines format, where each line contains the retrieval results and
true relevance annotations of one query from one test dataset. The JSON Lines files are then passed to
evaluation logic written with TREC code<sup>3</sup>.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Data Preprocessing - Data Cleanup</title>
        <p>We cleaned the documents before running any experiments. Note that we did not
apply any modification techniques to the queries (e.g., query expansion): we used the original
queries as they are and only cleaned up the documents. The LongEval retrieval dataset was
constructed by extracting search logs from Qwant for selected topics. Because the corpus was
extracted from the SERPs of Qwant, its text includes HTML tags, broken encoded strings such as
line-break characters (\n) or escaped Unicode sequences (\u00bb), and unwanted
emails and URLs, as shown in the following box.</p>
        <p>3 We used pytrec_eval (https://github.com/cvangysel/pytrec_eval), the Python wrapper of trec_eval.</p>
        <sec id="sec-5-4-1">
          <title>Uncleaned Document</title>
          <p>Top Message by razzmouette \u00bb 30 Oct 2012, 17:33\nGiven the MTO I will be comfirme
Grevenlingendam tomorrow from 15 to 17 with claques of 25-27 it is quite a lot.\nTop
Location:\nSt. Gilles Message by Xav \u00bb Oct.\n2012\n, 17:41 Pareil, Grevenlingendam !\nWe
organise a departure from BXL ...\nI have a place in my small car, leaving St Gilles.\nTop
Message by Dimitri \u00bb 30 Oct 2012, 18:22 Salute to all, I am looking to sail tomorrow in
Grevelingendam too!\nI am leaving from Brussels (Montgommery), if possible to reach a car
I would be ultra boiling.\nThe following table shows the results of the survey.\nThank you in
advance!\nDim \u00a0 PS: on this website via Manu LVR Top\nResponding\nDeveloped\nby
phpBB \u00ae\nForum\nSoftware\n\u00a9 phpBB Limited\n ...</p>
          <p>This makes the documents difficult to read even for humans, which also affects the language
model, since it is trained on cleaned data. Therefore, we apply dataset cleanup code<sup>4</sup>, which
preprocesses the corpus by:
• replacing line-break characters with the empty string.
• transliterating to the closest ASCII representation.
• fixing Unicode errors.
• replacing phone numbers, URLs, and emails with the empty string.
• replacing HTML tags with the empty string.</p>
          <p>For cleantext, we used the following code to remove and replace the first 4 types.
from cleantext import clean

# `text` is the raw document string to be cleaned
cleaned_text = clean(
    text,
    fix_unicode=True,  # fix various unicode errors
    to_ascii=True,  # transliterate to closest ASCII representation
    lower=False,  # lowercase text
    no_line_breaks=True,  # fully strip line breaks as opposed to only normalizing them
    no_urls=True,  # replace all URLs with a special token
    no_emails=True,  # replace all email addresses with a special token
    no_phone_numbers=True,  # replace all phone numbers with a special token
    no_numbers=False,  # replace all numbers with a special token
    no_digits=False,  # replace all digits with a special token
    no_currency_symbols=False,  # replace all currency symbols with a special token
    no_punct=False,  # remove punctuations
    replace_with_punct="",  # instead of removing punctuations you may replace them
    replace_with_url="",
    replace_with_email="",
    replace_with_phone_number="",
    replace_with_number="&lt;NUMBER&gt;",
    replace_with_digit="0",
    replace_with_currency_symbol="&lt;CUR&gt;",
    lang="en",  # set to 'de' for German special handling
)</p>
          <p>We then removed HTML tags with a simple regular expression that matches all text surrounded by angle
brackets.</p>
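          <p>The HTML tag removal described above can be sketched as follows. The exact pattern is not given in the text, so the regular expression and the names TAG_PATTERN and strip_html_tags are assumptions made for this example.</p>
          <preformat><![CDATA[
```python
import re

# Assumed pattern: an opening angle bracket, any characters that are not a
# closing angle bracket, then a closing angle bracket.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_html_tags(text):
    """Replace every angle-bracketed span with the empty string."""
    return TAG_PATTERN.sub("", text)


example = "<div>Top <b>Message</b> by razzmouette</div>"
cleaned = strip_html_tags(example)
```
]]></preformat>
          <p>Such a pattern is deliberately simple: it also removes non-HTML text that happens to be enclosed in angle brackets, which is acceptable for a coarse cleanup pass but would be too aggressive for text that contains inequalities.</p>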
          <p>After this preprocessing, we obtain a more readable document, which also prevents the
retrieval model from suffering from unintended characters, as shown in the following box.</p>
        </sec>
        <sec id="sec-5-4-2">
          <title>Cleaned Document</title>
          <p>Top Message by razzmouette 3¨0 Oct 2012, 17:33 Given the MTO I will be comfirme
Grevenlingendam tomorrow from 15 to 17 with claques of 25-27 it is quite a lot. Top Location: St. Gilles
Message by Xav Öct. 2012 , 17:41 Pareil, Grevenlingendam ! We organise a departure from BXL
... I have a place in my small car, leaving St Gilles. Top Message by Dimitri 3¨0 Oct 2012, 18:22
Salute to all, I am looking to sail tomorrow in Grevelingendam too! I am leaving from Brussels
(Montgommery), if possible to reach a car I would be ultra boiling. The following table shows
the results of the survey. Thank you in advance! Dim PS: on this website via Manu LVR Top
Responding Developed by phpBB R Forum Software C phpBB Limited</p>
          <p>This process reduces the document length by about 0.1%, as described in Table 2. This statistic
indicates that the majority of the content remains while unreadable characters are successfully
removed or replaced.</p>
          <p>To verify that human-readable documents improve retrieval performance, we conducted
a simple experiment with BM25 on the 2023 dataset. We indexed both the uncleaned and the cleaned
corpus with BM25 and measured performance with nDCG@10 and nDCG@100, as reported in
Table 3.</p>
          <p>Table 3: Performance of BM25 (uncleaned) and BM25 on the Train, Test-Long, and Test-Short subsets, with their average (columns: Method, Train, Test-Long, Test-Short, Average).</p>
          <p>The results show that document cleanup, which filters out unwanted content, slightly
improves performance. Compared with the BM25 (uncleaned) version, BM25 on the cleaned corpus gains
+0.03 nDCG@10 and +0.15 nDCG@100 on average. Given
this improvement, we decided to apply the same cleanup to the 2024 test set.</p>
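          <p>For readers unfamiliar with the metric, nDCG@k can be sketched in plain Python as below. This is an illustrative re-implementation, not the pytrec_eval code used for the reported numbers; it assumes the common formulation with linear gains and a log2(rank + 1) discount, and the names dcg_at_k and ndcg_at_k are hypothetical.</p>
          <preformat><![CDATA[
```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2))."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps document id to graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents are known for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy relevance judgments and a ranking that places an irrelevant document first.
toy_qrels = {"d1": 2, "d2": 1}
score = ndcg_at_k(["d3", "d1", "d2"], toy_qrels, k=10)
```
]]></preformat>
          <p>Averaging this per-query value over all queries of a test set gives the kind of aggregate number that is compared between the uncleaned and cleaned corpus.</p>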
        </sec>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. First Stage Retrieval: BM25</title>
        <p>BM25 is a lexical sparse retrieval model that uses statistics of the corpus. It is often used by
search engines to estimate the relevance of documents to a given search query, and it is part of the family
of probabilistic information retrieval models. BM25 incorporates several heuristics to balance term
frequency and document length normalization, making it effective in practical applications.</p>
        <p>For the implementation, we used the Python library castorini/pyserini [30]. We followed the default
arguments for the pyserini.index.lucene command and ran retrieval as follows:
python -m pyserini.search.lucene \
  --index ${index_save_dir} \
  --topics ${query_path} \
  --output ${ranking_save_path} \
  --bm25</p>
        <p>Using this command and its parameters, we retrieved the top 1000 documents with BM25. We verified that
recall was already 0.9 within the top 1000 documents, i.e., most relevant documents would be surfaced if these candidates were sorted by the true relevance annotations.
This indicates that reranking the top 1000 documents can improve the retrieval result.</p>
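        <p>The recall check mentioned above can be illustrated with a small helper. This is a sketch under the assumption that recall is computed per query as the fraction of annotated relevant documents found among the top k candidates; recall_at_k is a hypothetical name, not a function of Pyserini.</p>
        <preformat><![CDATA[
```python
def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=1000):
    """Fraction of relevant documents that appear in the top k of the ranking."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0  # nothing to recall for a query without relevant documents
    retrieved = set(ranked_doc_ids[:k])
    return len(relevant.intersection(retrieved)) / len(relevant)


# Toy example: two of the three relevant documents are within the cutoff.
toy_ranking = ["d7", "d2", "d9", "d4", "d1"]
toy_recall = recall_at_k(toy_ranking, ["d2", "d4", "d5"], k=4)
```
]]></preformat>
        <p>A high value of this quantity for the first-stage candidates means that a reranker has most of the relevant documents available to move up the list.</p>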
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Hybrid Retrieval: RepLLaMA-7B</title>
        <p>RepLLaMA [31] is a fine-tuned version of the LLaMA language model tailored for multi-stage text
retrieval tasks. As RepLLaMA is a dense bi-encoder retrieval model, it encodes the query and the document
separately and takes the dot product of the two embeddings to measure the similarity between the
query and the document. The model demonstrates strong zero-shot performance on out-of-domain
datasets.</p>
        <p>Using the Hugging Face checkpoint castorini/repllama-v1-7b-lora-passage and the
implementation of [32], we tested RepLLaMA in parallel with BM25. RepLLaMA tokenizes the first 512
tokens of both document and query, but it takes the hidden state at the index of the EOS token, which
may not be located at the 512th position. For each query, given the top 1000 documents retrieved by
BM25, we conducted additional hybrid retrieval with RepLLaMA-7B and selected the top
100 among them, which were then passed as candidate documents to the reranking models. All
commands and hyperparameters used to index and retrieve documents follow the description in [32]<sup>5</sup>.</p>
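        <p>The second-stage selection described above, in which the BM25 candidates are re-scored with a dense bi-encoder and only the best ones are kept, can be sketched as follows. The embeddings here are toy vectors and the names dot and rescore_candidates are hypothetical; in the actual pipeline the vectors would be produced by the RepLLaMA encoder.</p>
        <preformat><![CDATA[
```python
def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def rescore_candidates(query_vec, candidate_ids, doc_vecs, keep=100):
    """Score first-stage candidates by dot product and keep the best `keep`."""
    scored = [(doc_id, dot(query_vec, doc_vecs[doc_id])) for doc_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


# Toy embeddings standing in for encoder outputs.
toy_query = [1.0, 0.0]
toy_docs = {"a": [0.2, 0.9], "b": [0.8, 0.1], "c": [0.5, 0.5]}
top = rescore_candidates(toy_query, ["a", "b", "c"], toy_docs, keep=2)
```
]]></preformat>
        <p>Because only the first-stage candidates are scored, the dense model never has to search the full collection at query time in this sketch.</p>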
      </sec>
      <sec id="sec-5-7">
        <title>5.7. Reranking</title>
        <p>We performed reranking with pointwise (MonoT5) and listwise (ListT5) rerankers on the top 100
documents retrieved by BM25 and RepLLaMA.</p>
        <p>MonoT5. MonoT5 is widely known for its effectiveness on zero-shot retrieval and uses pointwise
reranking. It takes as input a query q and a document d concatenated into a single string,
Query: q Document: d. The model is fine-tuned to return the word "true" or "false" to indicate
whether the document is relevant to the query. A softmax is applied to the returned logits to obtain the
probability of the "true" token, which is used as the relevance score. As MonoT5 is
designed for reranking, the model iteratively takes each of the top-ranked documents, concatenated with the
query, and outputs its relevance score.</p>
        <sec id="sec-5-7-1">
          <title>5 https://github.com/texttron/tevatron/tree/main/examples/repllama</title>
          <p>For a fair comparison between MonoT5 and ListT5, we rerank the top 30 documents, since the competition measured
metrics not only at a cutoff of 10 but also beyond it. The MonoT5 checkpoint we used is
castorini/monot5-3b-msmarco-10k.</p>
          <p>ListT5. In our official submission to Codalab, we used the ListT5-3b model (Hugging Face
identifier Soyoung97/ListT5-3b). While the default setup of ListT5 uses =2 and reranks the
top-10 passages, we modify the setup to use =2 and run the model to rerank the top-30 passages, to see
improvements on NDCG@100 along with NDCG@10. Due to time limitations and the deadline
schedule, we used the top-100 results retrieved by BM25 as the first-stage retrieval, reranked the
top 20 passages with ListT5, and appended the top 1000 results from RepLLaMA.</p>
        </sec>
      </sec>
      <sec id="sec-5-8">
        <title>5.8. Other Details</title>
        <p>For the hardware specifications, each model runs on a different system.</p>
        <p>• BM25 (Lucene) used CPU only, with around 100GB of RAM, and took about 30 minutes to build
indexes of 3GB (short) and 8GB (long) on SSD.
• RepLLaMA used approximately 30GB of memory on an NVIDIA A6000 48GB (batch size 16) and took about 17
hours to index 4 shards, 3GB (short) and 8GB (long) in size, on SSD.
• MonoT5-3B used the full memory (batch size 25) of a single NVIDIA H100 80GB.
• ListT5-3B used the full memory (batch size 16) of a single NVIDIA H100 80GB.
• Other listwise rerankers (using the rankllm.ai repository) were run on a single NVIDIA H100 80GB.</p>
        <p>It took about 20 hours to finish inference on approx. 2000 short and long queries.</p>
      </sec>
      <sec id="sec-5-9">
        <title>5.9. Submission Results</title>
        <p>We conducted multiple experiments and submitted results using MonoT5 and ListT5. Our method
reranks only the top 100 for efficiency, but we found that the challenge measures some metrics @all,
where @all means @1000, since the submission details state that results are evaluated up to rank
1000. Therefore, we filled the remaining positions up to 1000 with the BM25 results. The evaluation
of the submitted results can be found in Table 4. The results confirm the effectiveness of ListT5,
which outperforms MonoT5 on all metrics: by +5.29 (short) and
+3.8 (long) on nDCG@10, and by +1.51 (short) and +1.66 (long) on nDCG@100. From these results, we
conclude that listwise reranking also helps mitigate the temporal misalignment of information
retrieval systems, compared with pointwise reranking methods.</p>
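        <p>The fill-up step for the @all metrics can be sketched as below. This is a minimal illustration rather than our submission script: the function name pad_run and the document identifiers are hypothetical, and we assume that documents already present in the reranked head are skipped when appending BM25 results.</p>
        <preformat>
```python
def pad_run(reranked_ids, bm25_ids, depth=1000):
    """Keep the reranked head, then append unseen BM25 documents up to `depth`."""
    seen = set(reranked_ids)
    run = list(reranked_ids)
    for doc_id in bm25_ids:
        if len(run) >= depth:
            break
        if doc_id not in seen:
            run.append(doc_id)
            seen.add(doc_id)
    return run[:depth]


# Toy example with a depth of 3 instead of 1000.
final_run = pad_run(["d3", "d1"], ["d1", "d2", "d3", "d4"], depth=3)
```
        </preformat>
        <p>The reranked documents keep their positions at the top, so metrics at small cutoffs are unaffected by the padding, while the appended tail only contributes to the metrics computed at deeper cutoffs.</p>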
        <p>Moreover, to compare retrieval performance under fair conditions, we inferred the language
used by most of the submissions on the leaderboard. We found that the highest-ranked participant
who used English is mam10eks from team OWS. As far as we can verify, when compared with teams
that used only the English dataset, we ranked first on all metrics except nDCG@all, MAP@all,
and P@10 on test-short, where we ranked second.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. What Makes ListT5 Effective on Temporal Shift?</title>
      <p>After the competition, the gold relevance annotations for the 2024 test-short and test-long subsets were released,
and we used them to conduct additional experiments that further analyze the effectiveness of
ListT5. We first verify whether temporal shift is related to domain shift, and whether
test-long is more shifted than test-short. We then conduct experiments to answer three
specific research questions.
      </p>
      <sec id="sec-6-1">
        <title>6.1. Hypothesis: Invariance of ListT5 pronounced in bigger shift</title>
        <p>We hypothesized that the most important factor distinguishing ListT5 from other listwise reranking variants
is its permutation invariance property, and that this property becomes more pronounced under a bigger domain shift.
To show this property along with other analyses, we first analyzed the degree of domain shift
between the short and long subsets.</p>
        <p>
          Temporal shift is correlated with domain shift. The distribution shift in documents is a primary
concern in information retrieval. Neural models demonstrate strong performance on test datasets that
share attributes with their training datasets, but they struggle to retrieve documents from a
test corpus whose distribution differs from that of the training corpus. This is also pointed out in the
BEIR paper [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which concludes that there is no correlation between models excelling on in-domain test
sets and on out-of-domain test sets.
        </p>
        <p>In real search scenarios, continuously retraining the retrieval model whenever the corpus is
updated is computationally expensive and difficult, so we train the retrieval model up to a certain point, fix
the model, and use it. The retrieval model then suffers a performance drop as time elapses,
because it struggles to retrieve recent documents whose distribution has changed relative to the
training dataset. This scenario closely resembles the LongEval challenge, where the document set is updated
over time. This is why we assume that temporal shift is one form of distribution
shift, and that the temporal shift can be addressed by resolving the distribution shift.</p>
        <p>To measure the extent of the distribution shift between the train, test-short, and test-long subsets
of LongEval-2024, we employ inverse document frequency (IDF) and measure similarity with
the Jensen-Shannon divergence. Each document is tokenized with the T5 tokenizer 6 and truncated
to 1024 tokens. 7 In addition to the LongEval-2024 training corpus, we include MS-MARCO
as the in-domain dataset for comparison, as we target a zero-shot setting. Results in
Figure 5 show that the test-long subset (47.3) is more out-of-domain than the test-short
subset (49.8), since lower values indicate that a subset is farther from the reference domain. From this
experiment, we conclude that temporal shift is correlated with domain shift, and that test-long is more
shifted than test-short, which agrees with the original conjecture.</p>
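        <p>A minimal sketch of an IDF-based Jensen-Shannon comparison between two corpora is given below. It rests on several assumptions that are not specified in the text: whitespace tokenization stands in for the T5 tokenizer, IDF is smoothed so that terms missing from one corpus stay finite, and the IDF weights are normalized over the shared vocabulary to form a probability distribution. All function names are hypothetical, and the mapping from this divergence to the scores reported in Figure 5 is not reproduced here.</p>
        <preformat>
```python
import math
from collections import Counter


def idf_distribution(docs, vocab):
    """Normalize smoothed IDF weights over a shared vocabulary into a distribution."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.split()))
    weights = [math.log((n_docs + 1) / (df[tok] + 1)) + 1.0 for tok in vocab]
    total = sum(weights)
    return [w / total for w in weights]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def corpus_divergence(docs_a, docs_b):
    """Compare two corpora through their IDF distributions (0 means identical)."""
    vocab = sorted({tok for doc in docs_a + docs_b for tok in doc.split()})
    return js_divergence(idf_distribution(docs_a, vocab), idf_distribution(docs_b, vocab))


# Toy corpora standing in for a training subset and a later test subset.
train_docs = ["cheap flights paris", "paris hotels"]
later_docs = ["election results paris", "paris election"]
shift = corpus_divergence(train_docs, later_docs)
```
        </preformat>
        <p>In this sketch a larger value means the two corpora are farther apart; a similarity score can be derived from it, for example by subtracting it from one, if lower-is-farther reporting is preferred.</p>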
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Research Questions &amp; Experimental Setup</title>
        <p>Figure 6 gives an overview of the experiments testing the effectiveness of ListT5 from various aspects. After
finding that temporal shift is related to domain shift, we try to answer 3 research questions
regarding the effectiveness of ListT5 applied to the temporal domain:
1. Is listwise reranking more effective than pointwise reranking for temporal persistence?
2. Will the invariance property of ListT5 (by FiD) be more pronounced under a bigger shift?
3. Does this effect hold when using different first-stage retrievers?
6 Soyoung97/ListT5-3b
7 Special tokens are neglected.</p>
        <p>To answer the 3 research questions, we conduct additional experiments and report results
on different baseline models. As shown in Fig. 6, we experiment with 2 different first-stage retrieval
models and 5 different reranking models, including pointwise rerankers, listwise rerankers, and ListT5, and
report NDCG, MAP, Precision, and Recall. For each evaluation metric, we report
measures at both @10 and @100 for thorough analysis. We mainly compare the NDCG@10 and
NDCG@100 results in Table 5, and use the results on other metrics (Tables 6, 7, 8) as supporting evidence.</p>
        <sec id="sec-6-2-1">
          <title>6.2.1. RQ1. Listwise Reranking vs Pointwise Reranking</title>
          <p>Looking at the tables, we can see that listwise reranking is generally much more effective than pointwise
reranking. The performance of RankVicuna, RankZephyr, and ListT5 is much higher than that of
MonoT5-3B, for both the short and long subsets. However, fine-tuning appears to be crucial for rerankers,
since the zero-shot results from Llama3-8B were not as effective as those of the fine-tuned counterparts.</p>
        </sec>
        <sec id="sec-6-2-2">
          <title>6.2.2. RQ2. Invariant Listwise Reranking (ListT5)</title>
          <p>Comparing ListT5 with other listwise reranking models, we notice an interesting finding. On the
short subset (with less temporal shift), RankZephyr was more effective. On the long subset, however,
ListT5, which has invariant properties, was more effective. This pattern was consistent across all
metrics, including NDCG, MAP, Precision, and Recall, regardless of the cutoff (@10, @100). We believe that
robustness to positional bias helped ListT5 perform better than its counterparts on the long
subset, where the initial ordering was not very reliable. We hypothesize that ListT5, with its invariant
properties, will be even more effective as the temporal shift becomes larger.</p>
        </sec>
        <sec id="sec-6-2-3">
          <title>6.2.3. RQ3. Impact of First-Stage Retrievers</title>
          <p>While we used hybrid retrieval, a combination of BM25 and RepLLaMA, for submission, we believed
it was necessary to compare statistical and neural retrievers separately, to see how the reranking
performance depends on the effectiveness of the initial retriever. Thus, we experiment and report
results using either BM25 or RepLLaMA. The results show that while RepLLaMA is a better first-stage
retrieval model than BM25 (higher precision, higher recall), the ordering between ListT5 and the baseline
models does not change much, and most properties hold without significant differences.</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Summary of findings</title>
        <p>By investigating the 3 research questions, we conclude that permutation-invariant listwise reranking
(ListT5) is an effective method not only for general out-of-domain data but also for mitigating temporal
drift, and that the findings hold regardless of the choice of first-stage retriever. It is particularly
interesting that ListT5 performs better than RankZephyr-7B on the test-long subset, where the distribution shift is
more pronounced.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In this paper, we analyze the effectiveness of listwise reranking (ListT5) on the LongEval
Challenge set, to investigate its effectiveness in the temporal misalignment scenario. Our findings (based on
the Jensen-Shannon divergence) suggest that temporal misalignment can be viewed as one form of
out-of-distribution scenario. Compared with other listwise reranking methods such as RankZephyr, we
find that permutation-invariant listwise reranking becomes more effective as temporal drift
increases, with ListT5 achieving higher performance on the test-long subset with half the parametric model
size. Building on this, we aim to develop a search engine capable of delivering robust and stable results, even
as the available document sets change over time.
      </p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitation</title>
      <p>The LongEval Challenge dataset is collected from the French search engine Qwant. Therefore,
the dataset was originally collected in French (e.g., it mostly uses the Euro (€) to represent currency). However,
as we are not native French speakers, it was difficult for us to use the provided French document subset, so
we conducted all experiments in English. According to the challenge leaderboard, however,
methods that used French performed much better than those that used English.
For future work, we hope to build a multilingual version of ListT5, which is currently limited to
English. We believe a multilingual ListT5 would drastically improve its applicability to a
much wider domain. Investigating and improving robustness to temporal shift in a multilingual setting
also seems to be an interesting next step. Lastly, due to time limitations, we had to choose the
hybrid approach for our challenge submission. We expect our results would have been better if we had
used RepLLaMA as the single first-stage retrieval model and reranked the top-1000 passages instead of the top 100.
However, due to time and computing resource limitations, we were unable to submit those results on time.
Acknowledging these limitations, we would like to participate again if the competition is held in
the future.
      </p>
      <p>[17] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented
generation for large language models: A survey, 2024. arXiv:2312.10997.
[18] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence
model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for
Computational Linguistics, Online, 2020, pp. 708–718. URL: https://www.aclweb.org/anthology/2020.findings-emnlp.63.
doi:10.18653/v1/2020.findings-emnlp.63.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[20] R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with
pretrained sequence-to-sequence models, 2021. arXiv:2101.05667.
[21] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question
answering, 2021. arXiv:2007.01282.
[22] K. McLuckie, A. Barber, Tournament Sort, Macmillan Education UK, London, 1986, pp. 68–86. URL:
https://doi.org/10.1007/978-1-349-08147-9_5. doi:10.1007/978-1-349-08147-9_5.
[23] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations
and Trends in Information Retrieval 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019.
doi:10.1561/1500000019.
[24] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage
retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Association for Computational
Linguistics, Online, 2020, pp. 6769–6781. URL: https://www.aclweb.org/anthology/2020.emnlp-main.550.
doi:10.18653/v1/2020.emnlp-main.550.
[25] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late
interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research
and Development in Information Retrieval, 2020, pp. 39–48.
[26] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[27] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, arXiv
preprint arXiv:1607.01759 (2016).
[28] M. Popel, M. Tomkova, J. Tomek, Ł. Kaiser, J. Uszkoreit, O. Bojar, Z. Žabokrtský, Transforming
machine translation: a deep learning system reaches news translation quality comparable to
human professionals, Nature Communications 11 (2020) 1–15.
[29] P. Galuščáková, R. Deveaud, G. González Sáez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel,
LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation, in:
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in
Information Retrieval, 2023, pp. 3086–3094.
[30] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A Python toolkit for
reproducible information retrieval research with sparse and dense representations, in: Proceedings of the
44th Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval (SIGIR 2021), 2021, pp. 2356–2362.
[31] X. Ma, L. Wang, N. Yang, F. Wei, J. Lin, Fine-tuning LLaMA for multi-stage text retrieval, arXiv
preprint arXiv:2310.08319 (2023).
[32] L. Gao, X. Ma, J. J. Lin, J. Callan, Tevatron: An efficient and flexible toolkit for dense retrieval,
arXiv preprint arXiv:2203.05765 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Röttger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pierrehumbert</surname>
          </string-name>
          ,
          <article-title>Temporal adaptation of BERT and performance on downstream document classification: Insights from social media</article-title>
          , in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>2400</fpage>
          -
          <lpage>2412</lpage>
          . URL: https://aclanthology.org/2021.findings-emnlp.206. doi:10.18653/v1/2021.findings-emnlp.206.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deveaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gonzalez-Saez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popel</surname>
          </string-name>
          ,
          <article-title>LongEval-Retrieval: French-English dynamic test collection for continuous web search evaluation</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.03229.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <year>2021</year>
          . arXiv:2104.08663.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bonifacio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jeronymo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abonizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fadaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lotufo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <article-title>In defense of cross-encoders for zero-shot retrieval</article-title>
          ,
          <year>2022</year>
          . arXiv:2212.06121.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Zero-shot listwise document reranking with a large language model</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.02156.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          , S. Wang,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Is ChatGPT good at search? Investigating large language models as re-ranking agents</article-title>
          ,
          <year>2023</year>
          . arXiv:2304.09542.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-w.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <article-title>ListT5: Listwise reranking with fusion-in-decoder improves zero-shot retrieval</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.15838.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jagerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hui</surname>
          </string-name>
          , J. Ma, J. Lu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <article-title>RankT5: Fine-tuning T5 for text ranking with ranking losses</article-title>
          ,
          <year>2022</year>
          . arXiv:2210.10634.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharifymoghaddam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>RankVicuna: Zero-shot listwise document reranking with open-source large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.15088.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          , L. Yan,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Instruction distillation makes large language models efficient zero-shot rankers</article-title>
          ,
          <year>2023</year>
          . arXiv:2311.01555.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          , J. Ma,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <article-title>Learning list-level domain-invariant representations for ranking</article-title>
          ,
          <year>2023</year>
          . arXiv:2212.10764.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharifymoghaddam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!</article-title>
          ,
          <year>2023</year>
          . arXiv:2312.02724.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Eisenschlos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gillick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <article-title>Time-aware language models as temporal knowledge bases</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>257</fpage>
          -
          <lpage>273</lpage>
          . URL: https://aclanthology.org/2022.tacl-1.15. doi:10.1162/tacl_a_00459.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Khashabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mandyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Time waits for no one! Analysis and challenges of temporal misalignment</article-title>
          ,
          <year>2022</year>
          . arXiv:2111.07408.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <article-title>RealTime QA: What's the answer right now?</article-title>
          ,
          <year>2024</year>
          . arXiv:2207.13332.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>RetrievalQA: Assessing adaptive retrieval-augmented generation for short-form open-domain question answering</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.16457.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>