Enhancing Fusion-in-Decoder for Multi-Granularity Ranking

Haeju Park*, Kyungjae Lee, Sunghyun Park and Moontae Lee
LG AI Research, Republic of Korea
* Corresponding author.

Information Retrieval's Role in RAG Systems (IR-RAG) - 2024

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across various natural language tasks, leveraging extensive knowledge from massive datasets. However, their reliance solely on parametric knowledge often leads to the generation of inaccurate or outdated content, particularly in domain-specific tasks. Retrieval Augmented Generation (RAG) has emerged as a promising approach to address this limitation by incorporating external knowledge without necessitating re-training. While RAG enhances the accuracy of LLM-generated content, effectively retrieving external knowledge remains a challenge due to potential noise and computational costs. To address this, traditional information retrieval systems adopt two-stage approaches, utilizing efficient retrievers followed by reranking mechanisms. Recently, transformer-based architectures, including BERT and T5 models, have shown promise as effective rerankers. However, such models have limited context size and perform only single-granularity ranking at a time, hindering their effectiveness and efficiency. In this paper, we first examine existing rerankers such as RankT5 and RFiD, highlighting the challenges of multi-granularity ranking. We then introduce PFiD (Passage Fusion-in-Decoder), a simple yet efficient approach that ranks both documents and passages simultaneously. Through empirical evaluation, we demonstrate the efficacy of PFiD in improving effectiveness and efficiency, offering a promising direction for further research in this domain.

Keywords

Information Systems, Retrieval Augmented Generation, Large Language Model

1. Introduction

Despite their remarkable capabilities and growth, Large Language Models (LLMs) [1, 2, 3, 4] still tend to generate factually incorrect or outdated content, as they rely solely on their parametric knowledge, especially in domain-specific or knowledge-intensive tasks [5, 6, 7]. Retrieval Augmented Generation (RAG) approaches [8, 9, 10, 11] have gained significant attention; they improve the quality of LLM-generated output by grounding it on external knowledge that supplements the LLMs' parametric knowledge, without having to re-train the LLMs. RAG leverages a powerful information retrieval model designed to search large datasets or knowledge bases. The retrieved information is then incorporated into the LLM, enabling it to generate more accurate and contextually relevant content. By incorporating external knowledge, RAG can effectively reduce the problem of generating factually incorrect or outdated content in LLMs [12, 13].

However, current RAG frameworks face major challenges regarding the effectiveness and efficiency of their information retrieval systems. First, LLMs tend to generate inaccurate responses given distracting (or noisy) contexts, so the performance of the retrieval model has a significant impact on the quality of RAG's responses [14, 15, 11]. Second, the retrieval component of RAG requires searching through large-scale knowledge bases or the web, which can be computationally expensive and slow [11]. Due to these challenges, existing retrieval systems adopt two-stage approaches: an efficient first-stage retriever such as BM25 [16] or DPR [17] retrieves a set of documents from a larger collection, and a second-stage reranker then reranks the retrieved documents for precise ranking. Recently, with the advent of transformer-based models such as BERT [18] and T5 [19], further architectures, including bi-encoder [17], cross-encoder [20], encoder-decoder [21, 22], and decoder-only models [23], have gradually shown their effectiveness as rerankers. However, these models have limited context size and perform only single-granularity ranking during inference, which hinders their effectiveness and efficiency in real-world RAG scenarios.

To this end, in this paper we focus on the multi-granularity ranking task, which ranks both documents and passages simultaneously. Specifically, we first investigate single-passage cross-encoder models such as MonoT5 [22] and RankT5 [21]. They achieve superior performance across various ranking tasks, but due to the constraint on input tokens, their efficiency is limited in real-world RAG scenarios. Next, we examine multi-passage cross-encoders such as FiD [9] and RFiD [24]. These models alleviate the input-token limit by leveraging multiple passages, but they directly use the cross-attention scores of the decoder, which are only implicitly learned, as passage relevance, and they encounter difficulty in distinguishing relative differences between passages. Thereafter, we propose a simple and effective PFiD (Passage Fusion-in-Decoder) for multi-granularity ranking. PFiD extends the FiD model by generating a document-level relevance token, enabling both document retrieval and passage ranking. Furthermore, PFiD adopts an inter-passage attention mechanism to learn relative passage relevance explicitly, using the special tokens at the beginning of the input text to represent the entire context.

Experiments on the MIRACL passage ranking dataset [25] demonstrate that PFiD improves effectiveness and efficiency compared to existing approaches, especially in RAG scenarios.

2. Preliminaries

2.1. Task definition

Given a user query q and a document (or passage) corpus C = {D_1, D_2, ..., D_n}, the goal of document retrieval is to find the k documents that are most relevant to the query q.
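The two-stage retrieve-then-rerank setup described in the introduction can be sketched as follows. This is a toy illustration: the naive term-overlap scorers stand in for BM25 and a neural reranker, and the function names (`first_stage_retrieve`, `rerank`) are ours, not from the paper.

```python
# Toy sketch of a two-stage retrieval pipeline: a cheap first-stage
# scorer proposes top-k candidates, then an expensive second-stage
# scorer reranks only that small candidate set.

def first_stage_retrieve(query, corpus, k):
    """Cheap scoring over the whole corpus (BM25's role in practice)."""
    scored = [(doc_id, len(set(query.split()) & set(text.split())))
              for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in scored[:k]]

def rerank(query, doc_ids, corpus, relevance_fn):
    """Expensive scoring over the small candidate set (reranker's role)."""
    return sorted(doc_ids, key=lambda d: -relevance_fn(query, corpus[d]))

corpus = {
    "d1": "cats purr when content",
    "d2": "dogs bark at strangers",
    "d3": "dogs and cats can coexist",
}
candidates = first_stage_retrieve("why do dogs bark", corpus, k=2)
ranked = rerank("why do dogs bark", candidates, corpus,
                relevance_fn=lambda q, t: len(set(q.split()) & set(t.split())))
print(ranked)  # ['d2', 'd3']
```

The point of the design is that the quadratic-cost interaction model only ever sees k candidates rather than the full corpus.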
In our multi-granularity ranking setting, which consists of document retrieval and passage ranking tasks, the document retrieval task is to rerank the BM25-retrieved top-k documents. While traditional passage ranking tasks typically involve ranking entire passage collections, in this paper the passage ranking task focuses solely on ranking the passages within the retrieved document itself, which aligns more closely with real-world RAG scenarios and is thus more feasible.

(© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ISSN 1613-0073.)

2.2. Ranking models

Pre-trained Language Models (PLMs) are currently the most effective ranking models and can be categorized into bi-encoders and cross-encoders. Bi-encoders encode a query and a passage separately to obtain semantic representations [17], emerging as powerful first-stage retrievers by pre-computing the passage representations offline. In contrast, cross-encoders take the concatenation of the query and a passage and perform query-passage interactions [20]; they have been conceived as second-stage rerankers, designed to explicitly refine the results provided by the first-stage retrieval. In this paper, for brevity, we also refer to other PLMs that perform query-passage interactions simultaneously, such as encoder-only [17, 20], decoder-only [23], and encoder-decoder [21, 22] models, as cross-encoders.

There are several PLM-based cross-encoders, including sequence-to-sequence language models such as MonoT5 [22] and RankT5 [21] for ranking tasks, as well as multi-passage reader models like FiD [9] and RFiD [24] for RAG tasks, which have demonstrated superior effectiveness.

MonoT5. MonoT5 [22] is the first work to cast the ranking task as a text generation task by leveraging the T5 [19] encoder-decoder model. A query-document pair is concatenated into an input sequence "Query: q Document: D_n Relevant:", and the tokens true and false are used as targets to represent their relevance. The model is then fine-tuned on this text generation task. After training, the ranking score is derived from the logit of the true token, with a softmax applied only over the logits of the true and false tokens.

RankT5. Following MonoT5 [22], the input sequence is similar except that RankT5 does not include the "Relevant:" postfix. The model then uses a special target token whose unnormalized logit serves as the ranking score. The model is trained directly with a list-wise ranking loss rather than the text generation loss used in MonoT5 [22]. However, these models cannot be directly applied to long-document retrieval due to the maximum input length constraint shared by most PLMs, which hinders their effectiveness in the document retrieval task.

FiD. The FiD model further extends the T5 [19] encoder-decoder model by taking multiple (k) passages as input, encoding them separately, and feeding the concatenated k encoder hidden states into the T5 decoder to generate the answer. Relevance scores for passages are computed from cross-attention scores, by averaging the attention scores across all tokens within the passage and all layers and heads of the decoder [26].

RFiD. FiD [9] treats all passages equally within its encoders and depends solely on the cross-attention mechanism to establish correlations between the decoder and the encoders, which may lead to an incorrect answer by referring to spurious passages. RFiD [24] improves FiD by identifying potential answer-containing passages (rationales) among the candidates and guiding the decoder with the identified rationales. Afterward, the cross-attention scores are directly regarded as passage relevance scores, the same as in [9]. However, even with the rationales, the cross-attention mechanism still struggles to distinguish relative differences between passages, as it is only implicitly guided by a rationale classifier trained solely with a point-wise binary classification loss.

3. Method

In this section, we describe a simple but effective Passage Fusion-in-Decoder (PFiD) for multi-granularity ranking. PFiD adopts the FiD [9] architecture as a base model and further extends FiD by using the true and false tokens as target tokens to model document-level relevance, enabling multi-granularity ranking simultaneously. Additionally, PFiD integrates inter-passage attention to learn relative passage relevances explicitly, similar in spirit to the list-wise training objective of RankT5 [21].

Fusion-in-Decoder for Document Retrieval. Formally, given a question q and a set of k passages within the document D_n = {P_1^n, P_2^n, ..., P_k^n}, the FiD encoder outputs the k-th passage embeddings H_k ∈ R^{L×d}, where L denotes the maximum token length and d denotes the dimension of the hidden states; these are then concatenated as the input [H_1, H_2, ..., H_k] of the fusion decoder:

    H_k = FiD-Encoder(q + P_k^n)    (1)

The FiD decoder uses [H_1, H_2, ..., H_k] to generate the target token T = true or false. Therefore, the loss function can be defined as follows:

    L_FiD = - Σ_{i=1}^{T} log p(y_i | y_1, y_2, ..., y_{i-1}, [H_1, H_2, ..., H_k])    (2)

Inter-passage Attention. Previous work [24] tackled the issue of spurious passages by employing a binary classifier on the first token's encoder hidden states H_{k,1} to determine whether a passage is a rationale passage for the query, and then guided the decoder by appending additional embeddings to the end of the encoder's hidden states, [H_1, H_2, ..., H_k, H_{k+1}], where H_{k+1} ∈ R^{2×d} is a trainable rationale embedding. However, as Table 2 shows, this approach drastically underperforms on passage ranking tasks by a large margin, as it does not explicitly model relative passage relevance.

Instead, to mitigate this, we utilize inter-passage attention to model interactions between passages explicitly. PFiD builds an input sequence by stacking the first-token hidden states of each query-passage pair as B = [H_{1,1}, H_{2,1}, ..., H_{k,1}], where H_{i,j} denotes the j-th token embedding of the i-th passage. In a standard cross-encoder, the first token of the encoder aggregates query-passage information to compute a relevance score. We further use this token to capture relative semantics via a self-attention mechanism. Inspired by [27], we consider a single-layer transformer model to capture relative passage relevance as follows:

    B̃ = softmax(Q K^T / √d) V,  where Q = B W_Q, K = B W_K, V = B W_V    (3)

in which the matrices W_Q, W_K, W_V ∈ R^{d×d} are learnable parameters. The information from different passages is fused and exchanged via the self-attention mechanism. The training loss used for inter-passage attention can be defined as follows:

    p_k = softmax(B̃_k W_B) ∈ R^2,
    L_passage = -(y log(p_k) + (1 - y) log(1 - p_k))    (4)

where y is the passage relevance label. The overall training objective of PFiD is:

    L_all = L_FiD + λ L_passage,    (5)

where λ is a hyperparameter that balances the two losses.

4. Experimental setup

4.1. Datasets
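As a concrete illustration, Eq. (3)-(4) can be sketched in a few lines of numpy. This is our own minimal re-implementation for intuition only: the weights are randomly initialized and untrained, so the relevance probabilities are not yet meaningful, and the toy sizes (k = 4 passages, d = 8) are assumptions, not the paper's configuration.

```python
# Minimal numpy sketch of inter-passage attention (Eq. 3-4):
# the first-token states of the k passages attend to one another,
# then a linear head + softmax yields a per-passage relevance pair.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_passage_attention(B, W_Q, W_K, W_V, W_B):
    """B: (k, d) matrix of first-token hidden states, one row per passage.
    Returns p: (k, 2) per-passage relevance distributions (Eq. 4)."""
    d = B.shape[1]
    Q, K, V = B @ W_Q, B @ W_K, B @ W_V
    B_tilde = softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V   # Eq. (3)
    return softmax(B_tilde @ W_B, axis=-1)                 # Eq. (4)

rng = np.random.default_rng(0)
k, d = 4, 8   # toy sizes: 4 passages, hidden size 8
B = rng.standard_normal((k, d))
p = inter_passage_attention(B,
                            rng.standard_normal((d, d)),
                            rng.standard_normal((d, d)),
                            rng.standard_normal((d, d)),
                            rng.standard_normal((d, 2)))
```

Because every passage attends to every other passage, each row of p depends on the whole candidate set, which is exactly what lets the model express relative rather than point-wise relevance.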
We use the MIRACL [25] passage ranking dataset for our experiments. MIRACL is a large-scale, open-domain, human-generated multi-document ranking dataset similar to MS MARCO [28], but it has the advantage of providing a segmented document collection, enabling both document retrieval and passage ranking.¹ For the document retrieval task, we construct the document retrieval dataset by regarding a document with at least one positive passage as a positive document. Table 1 shows the statistics of the datasets.

Table 1: Statistics of Datasets.

Task               | # train | # dev | avg. # judgements | # corpus
Document Retrieval | 22,548  | 6,404 | 2.22              | 5,758,285
Passage Ranking    | 29,416  | 8,350 | 2.75              | 32,893,221

4.2. Baselines

We compare PFiD against the following three types of ranking baselines. The first is Single-Passage Cross-encoder (SPC) baselines, including MonoT5 [22] and RankT5 [21]. Due to the constraint on input tokens, we take only the first k tokens in the document retrieval task. An alternative approach is to score each passage independently and then take the highest-scoring passage as the representative for ranking the document, or to perform retrieval directly over the segmented passages. However, we omit these approaches, as the former lacks efficiency and the latter is not scalable for real-world RAG scenarios. The model is trained list-wisely with negatives randomly sampled from the entire passage set. The second is Multi-Passage Cross-encoder (MPC) baselines, including FiD [9] and RFiD [24]. For comparison in our experimental setting, both the FiD and RFiD models are trained with the target tokens true or false, enabling both document retrieval and passage ranking. All SPC and MPC baselines in this experiment are initialized with the T5-base model. The third is the most frequently employed lexical ranker, BM25 [16]. We use the Elasticsearch engine with the default parameters k1 = 1.2 and b = 0.75.

4.3. Experimental Details

We adopt T5-base [19] as our base model, using Adam [29] with a learning rate of 10^-4 and a dropout rate of 0.1. For both training and inference, we use the top-100 passages and truncate them to a maximum token length of 200. The hyperparameter λ is set to 0.5. For the document retrieval task, we perform ranking on the BM25 top-100 retrieved documents, whereas passage ranking ranks the passages within the given positive document. We also conduct experiments on real-world RAG scenarios, considering both document retrieval and passage ranking simultaneously. We use nDCG [30], Recall, and MRR scores as evaluation metrics for effectiveness. All experiments are conducted on a single NVIDIA A100 GPU (40GB). In this work, we do not consider other training approaches such as data augmentation, knowledge distillation, or negative sampling strategies, as delving into their effects falls outside the scope of our objectives.

Figure 1: Passage ranking results in the real-world RAG scenarios. We first retrieve # of documents and rerank # of passages within the retrieved documents. (Four panels: # of Documents = 1, 2, 5, 10; x-axis: # of Passages, 1-5; y-axis: nDCG@10; methods: RankT5, RFiD, PFiD.)

5. Results and Analysis

Retrieval and Ranking. Table 2 presents our evaluation results on the document retrieval and passage ranking tasks. The key observations are as follows: (i) MPC significantly outperforms SPC on the document retrieval task by aggregating multiple (k) passages, alleviating the limited-context problem of SPC. In particular, one can see that PFiD outperforms RFiD by a large margin on both the document ranking and passage ranking tasks. This indicates that by leveraging passage-wise context to guide the decoder, we can better identify relative passage relevance. Note that, compared with the existing SPC baselines, our method achieves ranking efficiency by explicitly removing the need for a separate pass for each granularity.
PFiD directly consumes the entire document and scores the relevance of all passages and the document simultaneously. (ii) RFiD, which implicitly guides the decoder with rationale embeddings, improves over FiD by a large margin; however, it is still worse than BM25. This suggests that implicit guidance does benefit the model's ranking ability to some extent. However, when ranking various passages from multiple documents, traditional MPC is completely indistinguishable, suggesting that the cross-attention score from the decoder is not suited for the passage ranking task.

Table 2: The evaluation results of different baselines. For document retrieval, we rank the top 100 documents retrieved by BM25, while the passage ranking task ranks the passages within the retrieved document. N denotes the number of documents to rank, whereas P denotes the number of passages in the document. The best performances are marked with †. Latency indicates the total inference time from document retrieval to passage ranking, measured by averaging the time taken per query with a single thread and a single batch on the GPU.

Model       | Category | top-k | Doc. MRR@10 | Doc. Recall@5 | Doc. Recall@10 | Pas. MRR@10 | Pas. nDCG@5 | Pas. nDCG@10 | Complexity | Latency (s)
BM25        | -        | |C|   | 0.3951      | 0.3683        | 0.4736         | 0.7366      | 0.7718      | 0.7856       | -          | 0.32 (x1.00)
MonoT5      | SPC      | 100   | 0.6204      | 0.5141        | 0.5794         | 0.8571      | 0.8774      | 0.8803       | O(N + NP)  | 5.65 (x17.65)
RankT5      | SPC      | 100   | 0.6352      | 0.4992        | 0.5605         | 0.8778†     | 0.8916†     | 0.8952†      | O(N + NP)  | 5.64 (x17.62)
FiD         | MPC      | 100   | 0.6322      | 0.5139        | 0.5821         | 0.3464      | 0.3725      | 0.4260       | O(N)       | 1.17 (x3.65)
RFiD        | MPC      | 100   | 0.7177      | 0.5743        | 0.6407         | 0.5617      | 0.6036      | 0.6359       | O(N)       | 1.21 (x3.78)
PFiD (Ours) | MPC      | 100   | 0.7231†     | 0.5937†       | 0.6516†        | 0.8530      | 0.8726      | 0.8780       | O(N)       | 1.23 (x3.84)

¹ MS MARCO also provides a segmented document collection, but the segmented corpus does not align with the passages in the passage ranking tasks.
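For reference, the ranking metrics reported in Table 2 (MRR@k and nDCG@k) can be computed as follows. This is a generic sketch using the standard linear-gain nDCG formulation over a binary relevance list, not the authors' evaluation script.

```python
# Sketch of MRR@k and nDCG@k over a relevance list ordered by the
# system's ranking (rels[0] is the top-ranked item's relevance).
import numpy as np

def mrr_at_k(rels, k):
    """Reciprocal rank of the first relevant item within the top k."""
    for rank, rel in enumerate(rels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return float((rels / np.log2(np.arange(2, rels.size + 2))).sum())

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# e.g. the only relevant passage was ranked second:
print(mrr_at_k([0, 1, 0], 10))   # 0.5
print(ndcg_at_k([0, 1, 0], 10))  # 1/log2(3) ≈ 0.6309
```

Both metrics reward placing relevant passages early, which is why the decoder cross-attention scores of FiD, being nearly uniform across passages, score so poorly here.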
(iii) SPC achieves superior performance over MPC on the passage ranking task, as it is trained with rich negative samples from other documents, while MPC is trained only with in-document negatives. Additionally, even with only in-document negatives, when trained with inter-passage attention, PFiD achieves ranking effectiveness that rivals that of SPC, suggesting that incorporating an additional module to identify relevant passages is more effective than relying solely on the cross-attention mechanism of the decoder.

Results in real-world RAG scenarios. Next, we investigate the effectiveness of PFiD in real-world RAG scenarios. We first retrieve a number of documents from the candidates and rerank the passages within the retrieved documents. Figure 1 presents the results of this evaluation. Notably, although Table 2 showed that MPC outperforms SPC on document retrieval tasks, its performance drops drastically in this setting, as the cross-attention scores from the decoder are indistinguishable across passages from multiple documents. Additionally, despite RankT5 reaching the best effectiveness on the passage ranking task, it did not exhibit any improvement over our method in real-world RAG scenarios, suggesting the importance of multi-granularity ranking. Instead, PFiD consistently outperforms all baselines by leveraging the complementary nature of SPC and MPC. PFiD retrieves documents and ranks passages more efficiently, and captures the relative semantic correlation between different passages, leading to superior performance.

Cross-attention vs. PFiD. As discussed above, PFiD has the advantage of identifying relevant passages compared to previous models like RFiD, since it explicitly models relative passage relevance. We investigate the effects of the cross-attention scores of the decoder and our passage ranking scores on the passage ranking task. Figure 2 illustrates the distribution of the ranks of positive passages. As depicted in Figure 2, PFiD is more strongly correlated with passage relevance than the cross-attention scores, suggesting that PFiD focuses more on positive passages by explicitly learning relative passage relevance. Our experimental results show that this enhanced ability to identify relevant passages contributes to the overall performance improvement.

Figure 2: Distribution of the rank of positive passages. (Two panels: cross attention vs. PFiD; x-axis: rank, 0-40; y-axis: frequency.)

References

[1] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.
[2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[3] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, et al., PaLM 2 technical report, 2023. arXiv:2305.10403.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[5] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 9802-9822. URL: https://aclanthology.org/2023.acl-long.546. doi:10.18653/v1/2023.acl-long.546.
[6] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1906-1919. URL: https://aclanthology.org/2020.acl-main.173. doi:10.18653/v1/2020.acl-main.173.
[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. arXiv:2311.05232.
[8] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. arXiv:2005.11401.
[9] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question answering, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 874-880. URL: https://aclanthology.org/2021.eacl-main.74. doi:10.18653/v1/2021.eacl-main.74.
[10] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Atlas: Few-shot learning with retrieval augmented language models, Journal of Machine Learning Research 24 (2023) 1-43. URL: http://jmlr.org/papers/v24/23-0037.html.
[11] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[12] H. He, H. Zhang, D. Roth, Rethinking with retrieval: Faithful large language model inference, 2022. arXiv:2301.00303.
[13] N. Thakur, L. Bonifacio, X. Zhang, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, B. Chen, M. Rezagholizadeh, J. Lin, NoMIRACL: Knowing when you don't know for robust multilingual retrieval-augmented generation, 2024. arXiv:2312.11361.
[14] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, D. Zhou, Large language models can be easily distracted by irrelevant context, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 31210-31227. URL: https://proceedings.mlr.press/v202/shi23a.html.
[15] A. Asai, Z. Wu, Y. Wang, A. Sil, H. Hajishirzi, Self-RAG: Learning to retrieve, generate, and critique through self-reflection, 2023. arXiv:2310.11511.
[16] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009) 333-389. doi:10.1561/1500000019.
[17] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. arXiv:1910.10683.
[20] R. Nogueira, K. Cho, Passage re-ranking with BERT, 2020. arXiv:1901.04085.
[21] H. Zhuang, Z. Qin, R. Jagerman, K. Hui, J. Ma, J. Lu, J. Ni, X. Wang, M. Bendersky, RankT5: Fine-tuning T5 for text ranking with ranking losses, 2022. arXiv:2210.10634.
[22] R. Nogueira, Z. Jiang, J. Lin, Document ranking with a pretrained sequence-to-sequence model, 2020. arXiv:2003.06713.
[23] X. Ma, L. Wang, N. Yang, F. Wei, J. Lin, Fine-tuning LLaMA for multi-stage text retrieval, 2023. arXiv:2310.08319.
[24] C. Wang, H. Yu, Y. Zhang, RFiD: Towards rational fusion-in-decoder for open-domain question answering, 2023. arXiv:2305.17041.
[25] X. Zhang, N. Thakur, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, M. Rezagholizadeh, J. Lin, MIRACL: A multilingual retrieval dataset covering 18 diverse languages, Transactions of the Association for Computational Linguistics 11 (2023) 1114-1131.
[26] G. Izacard, E. Grave, Distilling knowledge from reader to retriever for question answering, 2022. arXiv:2012.04584.
[27] J. Yang, Z. Liu, C. Li, G. Sun, X. Xie, Longtriever: A pre-trained long text encoder for dense document retrieval, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3655-3665. URL: https://aclanthology.org/2023.emnlp-main.223. doi:10.18653/v1/2023.emnlp-main.223.
[28] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A human generated machine reading comprehension dataset, 2018. arXiv:1611.09268.
[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
[30] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems 20 (2002) 422-446. URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.