Enhancing Fusion-in-Decoder for Multi-Granularity Ranking

Haeju Park*, Kyungjae Lee, Sunghyun Park and Moontae Lee
LG AI Research, Republic of Korea
* Corresponding author.

Information Retrieval's Role in RAG Systems (IR-RAG) - 2024

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across various natural language tasks, leveraging extensive knowledge from massive datasets. However, their reliance solely on parametric knowledge often leads to the generation of inaccurate or outdated content, particularly in domain-specific tasks. Retrieval Augmented Generation (RAG) has emerged as a promising approach to address this limitation by incorporating external knowledge without necessitating re-training. While RAG enhances the accuracy of LLM-generated content, effectively retrieving external knowledge remains a challenge due to potential noise and computational costs. To address this, traditional information retrieval systems adopt two-stage approaches, utilizing efficient retrievers followed by reranking mechanisms. Recently, transformer-based architectures, including BERT and T5 models, have shown promise as effective rerankers. However, such models have limited context size and perform only single-granularity ranking at a time, hindering their effectiveness and efficiency. In this paper, we first examine existing rerankers such as RankT5 and RFiD, highlighting the challenges of multi-granularity ranking. We then introduce PFiD (Passage Fusion-in-Decoder), a simple yet efficient approach that ranks both documents and passages simultaneously. Through empirical evaluation, we demonstrate the efficacy of PFiD in improving effectiveness and efficiency, offering a promising direction for further research in this domain.

Keywords

Information Systems, Retrieval Augmented Generation, Large Language Model

1. Introduction

Despite their remarkable capabilities and growth, Large Language Models (LLMs) [1, 2, 3, 4] still tend to generate factually incorrect or outdated content, as they rely solely on their parametric knowledge, especially in domain-specific or knowledge-intensive tasks [5, 6, 7]. Retrieval Augmented Generation (RAG) approaches [8, 9, 10, 11] have gained significant attention; they improve the quality of LLM-generated output by grounding it on external knowledge that supplements the LLMs' parametric knowledge, without having to re-train the LLMs. RAG leverages a powerful information retrieval model designed to search large datasets or knowledge bases. The retrieved information is then incorporated into the LLM, enabling it to generate more accurate and contextually relevant content. By incorporating external knowledge, RAG can effectively reduce the problem of generating factually incorrect or outdated content in LLMs [12, 13].

However, current RAG frameworks face major challenges regarding the effectiveness and efficiency of their information retrieval systems. First, LLMs tend to generate inaccurate responses given distracting (or noisy) contexts, so the performance of the retrieval model has a significant impact on the quality of RAG's responses [14, 15, 11]. Second, the retrieval component of RAG requires searching through large-scale knowledge bases or the web, which can be computationally expensive and slow [11]. Due to these challenges, existing retrieval systems adopt two-stage approaches: an efficient first-stage retriever such as BM25 [16] or DPR [17] retrieves a set of documents from a larger collection, and a second-stage reranker then reranks the retrieved documents for precise ranking. Recently, with the advent of transformer-based models such as BERT [18] and T5 [19], further architectures, including bi-encoder [17], cross-encoder [20], encoder-decoder [21, 22], and decoder-only models [23], have gradually shown their effectiveness as rerankers. However, these models have limited context size and perform only single-granularity ranking during inference, which hinders their effectiveness and efficiency in real-world RAG scenarios.

To this end, in this paper we focus on the multi-granularity ranking task, which ranks both documents and passages simultaneously. Specifically, we first investigate single-passage cross-encoder models such as MonoT5 [22] and RankT5 [21]. They achieve superior performance across various ranking tasks, but due to the constraint on input tokens, their efficiency is limited in real-world RAG scenarios. Next, we examine multi-passage cross-encoders such as FiD [9] and RFiD [24]. These models alleviate the input-token limit by leveraging multiple passages, but they directly use the cross-attention scores of the decoder, which are only implicitly learned, as passage relevance, and they encounter difficulty in distinguishing relative differences between passages. Thereafter, we propose a simple and effective PFiD (Passage Fusion-in-Decoder) for multi-granularity ranking. PFiD extends the FiD model by generating a document-level relevance token, enabling both document retrieval and passage ranking. Furthermore, PFiD adopts an inter-passage attention mechanism to learn relative passage relevance explicitly, using the special tokens at the beginning of the input text to represent the entire context.

Experiments on the MIRACL passage ranking dataset [25] demonstrate that PFiD improves effectiveness and efficiency compared to existing approaches, especially in RAG scenarios.

2. Preliminaries

2.1. Task definition

Given a user query q and a document (or passage) corpus C = {D_1, D_2, ..., D_n}, the goal of document retrieval is to find the k documents that are most relevant to the query q.
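The two-stage retrieve-then-rerank setup described in the introduction can be sketched as follows. This is a toy illustration: the naive term-overlap scorers stand in for BM25 and a neural reranker, and the function names (`first_stage_retrieve`, `rerank`) are ours, not from the paper.

```python
# Toy sketch of a two-stage retrieval pipeline: a cheap first-stage
# scorer proposes top-k candidates, then an expensive second-stage
# scorer reranks only that small candidate set.

def first_stage_retrieve(query, corpus, k):
    """Cheap scoring over the whole corpus (BM25's role in practice)."""
    scored = [(doc_id, len(set(query.split()) & set(text.split())))
              for doc_id, text in corpus.items()]
    scored.sort(key=lambda pair: -pair[1])
    return [doc_id for doc_id, _ in scored[:k]]

def rerank(query, doc_ids, corpus, relevance_fn):
    """Expensive scoring over the small candidate set (reranker's role)."""
    return sorted(doc_ids, key=lambda d: -relevance_fn(query, corpus[d]))

corpus = {
    "d1": "cats purr when content",
    "d2": "dogs bark at strangers",
    "d3": "dogs and cats can coexist",
}
candidates = first_stage_retrieve("why do dogs bark", corpus, k=2)
ranked = rerank("why do dogs bark", candidates, corpus,
                relevance_fn=lambda q, t: len(set(q.split()) & set(t.split())))
print(ranked)  # ['d2', 'd3']
```

The point of the design is that the quadratic-cost interaction model only ever sees k candidates rather than the full corpus.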
In our multi-granularity ranking setting, which consists of document retrieval and passage ranking tasks, the document retrieval task is to rerank the BM25-retrieved top-k documents. While traditional passage ranking tasks typically involve ranking entire passage collections, in this paper the passage ranking task focuses solely on ranking the passages within the retrieved document itself, which aligns more closely with real-world RAG scenarios and is thus more feasible.

(© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings, ISSN 1613-0073.)

2.2. Ranking models

Pre-trained Language Models (PLMs) are currently the most effective ranking models and can be categorized into bi-encoders and cross-encoders. Bi-encoders encode a query and a passage separately to obtain semantic representations [17], emerging as powerful first-stage retrievers by pre-computing the passage representations offline. In contrast, cross-encoders take the concatenation of the query and a passage and perform query-passage interactions [20]; they have been conceived as second-stage rerankers, designed to explicitly refine the results provided by the first-stage retrieval. In this paper, for brevity, we also refer to other PLMs that perform query-passage interactions simultaneously, such as encoder-only [17, 20], decoder-only [23], and encoder-decoder [21, 22] models, as cross-encoders.

There are several PLM-based cross-encoders, including sequence-to-sequence language models such as MonoT5 [22] and RankT5 [21] for ranking tasks, as well as multi-passage reader models like FiD [9] and RFiD [24] for RAG tasks, which have demonstrated superior effectiveness.

MonoT5. MonoT5 [22] is the first work to cast the ranking task as a text generation task by leveraging the T5 [19] encoder-decoder model. A query-document pair is concatenated into an input sequence "Query: q Document: D_n Relevant:", and the tokens true and false are used as targets to represent their relevance. The model is then fine-tuned on this text generation task. After training, the ranking score is derived from the logit of the true token, with a softmax applied only over the logits of the true and false tokens.

RankT5. Following MonoT5 [22], the input sequence is similar except that RankT5 does not include the "Relevant:" postfix. The model then uses a special target token whose unnormalized logit serves as the ranking score. The model is trained directly with a list-wise ranking loss rather than the text generation loss used in MonoT5 [22]. However, these models cannot be directly applied to long-document retrieval due to the maximum input length constraint shared by most PLMs, which hinders their effectiveness in the document retrieval task.

FiD. The FiD model further extends the T5 [19] encoder-decoder model by taking multiple (k) passages as input, encoding them separately, and feeding the concatenated k encoder hidden states into the T5 decoder to generate the answer. Relevance scores for passages are computed from cross-attention scores, by averaging the attention scores across all tokens within the passage and all layers and heads of the decoder [26].

RFiD. FiD [9] treats all passages equally within its encoders and depends solely on the cross-attention mechanism to establish correlations between the decoder and the encoders, which may lead to an incorrect answer by referring to spurious passages. RFiD [24] improves FiD by identifying potential answer-containing passages (rationales) among the candidates and guiding the decoder with the identified rationales. Afterward, the cross-attention scores are directly regarded as passage relevance scores, the same as in [9]. However, even with the rationales, the cross-attention mechanism still struggles to distinguish relative differences between passages, as it is only implicitly guided by a rationale classifier trained solely with a point-wise binary classification loss.

3. Method

In this section, we describe a simple but effective Passage Fusion-in-Decoder (PFiD) for multi-granularity ranking. PFiD adopts the FiD [9] architecture as a base model and further extends FiD by using the true and false tokens as target tokens to model document-level relevance, enabling multi-granularity ranking simultaneously. Additionally, PFiD integrates inter-passage attention to learn relative passage relevances explicitly, similar in spirit to the list-wise training objective of RankT5 [21].

Fusion-in-Decoder for Document Retrieval. Formally, given a question q and a set of k passages within the document D_n = {P_1^n, P_2^n, ..., P_k^n}, the FiD encoder outputs the k-th passage embeddings H_k ∈ R^{L×d}, where L denotes the maximum token length and d denotes the dimension of the hidden states; these are then concatenated as the input [H_1, H_2, ..., H_k] of the fusion decoder:

    H_k = FiD-Encoder(q + P_k^n)    (1)

The FiD decoder uses [H_1, H_2, ..., H_k] to generate the target token T = true or false. Therefore, the loss function can be defined as follows:

    L_FiD = - Σ_{i=1}^{T} log p(y_i | y_1, y_2, ..., y_{i-1}, [H_1, H_2, ..., H_k])    (2)

Inter-passage Attention. Previous work [24] tackled the issue of spurious passages by employing a binary classifier on the first token's encoder hidden states H_{k,1} to determine whether a passage is a rationale passage for the query, and then guided the decoder by appending additional embeddings to the end of the encoder's hidden states, [H_1, H_2, ..., H_k, H_{k+1}], where H_{k+1} ∈ R^{2×d} is a trainable rationale embedding. However, as Table 2 shows, this approach drastically underperforms on passage ranking tasks by a large margin, as it does not explicitly model relative passage relevance.

Instead, to mitigate this, we utilize inter-passage attention to model interactions between passages explicitly. PFiD builds an input sequence by stacking the first-token hidden states of each query-passage pair as B = [H_{1,1}, H_{2,1}, ..., H_{k,1}], where H_{i,j} denotes the j-th token embedding of the i-th passage. In a standard cross-encoder, the first token of the encoder aggregates query-passage information to compute a relevance score. We further use this token to capture relative semantics via a self-attention mechanism. Inspired by [27], we consider a single-layer transformer model to capture relative passage relevance as follows:

    B̃ = softmax(Q K^T / √d) V,  where Q = B W_Q, K = B W_K, V = B W_V    (3)

in which the matrices W_Q, W_K, W_V ∈ R^{d×d} are learnable parameters. The information from different passages is fused and exchanged via the self-attention mechanism. The training loss used for inter-passage attention can be defined as follows:

    p_k = softmax(B̃_k W_B) ∈ R^2,
    L_passage = -(y log(p_k) + (1 - y) log(1 - p_k))    (4)

where y is the passage relevance label. The overall training objective of PFiD is:

    L_all = L_FiD + λ L_passage,    (5)

where λ is a hyperparameter that balances the two losses.

4. Experimental setup

4.1. Datasets
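As a concrete illustration, Eq. (3)-(4) can be sketched in a few lines of numpy. This is our own minimal re-implementation for intuition only: the weights are randomly initialized and untrained, so the relevance probabilities are not yet meaningful, and the toy sizes (k = 4 passages, d = 8) are assumptions, not the paper's configuration.

```python
# Minimal numpy sketch of inter-passage attention (Eq. 3-4):
# the first-token states of the k passages attend to one another,
# then a linear head + softmax yields a per-passage relevance pair.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_passage_attention(B, W_Q, W_K, W_V, W_B):
    """B: (k, d) matrix of first-token hidden states, one row per passage.
    Returns p: (k, 2) per-passage relevance distributions (Eq. 4)."""
    d = B.shape[1]
    Q, K, V = B @ W_Q, B @ W_K, B @ W_V
    B_tilde = softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V   # Eq. (3)
    return softmax(B_tilde @ W_B, axis=-1)                 # Eq. (4)

rng = np.random.default_rng(0)
k, d = 4, 8   # toy sizes: 4 passages, hidden size 8
B = rng.standard_normal((k, d))
p = inter_passage_attention(B,
                            rng.standard_normal((d, d)),
                            rng.standard_normal((d, d)),
                            rng.standard_normal((d, d)),
                            rng.standard_normal((d, 2)))
```

Because every passage attends to every other passage, each row of p depends on the whole candidate set, which is exactly what lets the model express relative rather than point-wise relevance.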
We use the MIRACL [25] passage ranking dataset for our experiments. MIRACL is a large-scale, open-domain, human-generated multi-document ranking dataset similar to MS MARCO [28], but it has the advantage of providing a segmented document collection, enabling both document retrieval and passage ranking.¹ For the document retrieval task, we construct the document retrieval dataset by regarding a document with at least one positive passage as a positive document. Table 1 shows the statistics of the datasets.

Table 1: Statistics of Datasets.

Task               | # train | # dev | avg. # judgements | # corpus
Document Retrieval | 22,548  | 6,404 | 2.22              | 5,758,285
Passage Ranking    | 29,416  | 8,350 | 2.75              | 32,893,221

4.2. Baselines

We compare PFiD against the following three types of ranking baselines. The first is Single-Passage Cross-encoder (SPC) baselines, including MonoT5 [22] and RankT5 [21]. Due to the constraint on input tokens, we take only the first k tokens in the document retrieval task. An alternative approach is to score each passage independently and then take the highest-scoring passage as the representative for ranking the document, or to perform retrieval directly over the segmented passages. However, we omit these approaches, as the former lacks efficiency and the latter is not scalable for real-world RAG scenarios. The model is trained list-wisely with negatives randomly sampled from the entire passage set. The second is Multi-Passage Cross-encoder (MPC) baselines, including FiD [9] and RFiD [24]. For comparison in our experimental setting, both the FiD and RFiD models are trained with the target tokens true or false, enabling both document retrieval and passage ranking. All SPC and MPC baselines in this experiment are initialized with the T5-base model. The third is the most frequently employed lexical ranker, BM25 [16]. We use the Elasticsearch engine with the default parameters k1 = 1.2 and b = 0.75.

4.3. Experimental Details

We adopt T5-base [19] as our base model, using Adam [29] with a learning rate of 10^-4 and a dropout rate of 0.1. For both training and inference, we use the top-100 passages and truncate them to a maximum token length of 200. The hyperparameter λ is set to 0.5. For the document retrieval task, we perform ranking on the BM25 top-100 retrieved documents, whereas passage ranking ranks the passages within the given positive document. We also conduct experiments on real-world RAG scenarios, considering both document retrieval and passage ranking simultaneously. We use nDCG [30], Recall, and MRR scores as evaluation metrics for effectiveness. All experiments are conducted on a single NVIDIA A100 GPU (40GB). In this work, we do not consider other training approaches such as data augmentation, knowledge distillation, or negative sampling strategies, as delving into their effects falls outside the scope of our objectives.

Figure 1: Passage ranking results in the real-world RAG scenarios. We first retrieve # of documents and rerank # of passages within the retrieved documents. (Four panels: # of Documents = 1, 2, 5, 10; x-axis: # of Passages, 1-5; y-axis: nDCG@10; methods: RankT5, RFiD, PFiD.)

5. Results and Analysis

Retrieval and Ranking. Table 2 presents our evaluation results on the document retrieval and passage ranking tasks. The key observations are as follows: (i) MPC significantly outperforms SPC on the document retrieval task by aggregating multiple (k) passages, alleviating the limited-context problem of SPC. In particular, one can see that PFiD outperforms RFiD by a large margin on both the document ranking and passage ranking tasks. This indicates that by leveraging passage-wise context to guide the decoder, we can better identify relative passage relevance. Note that, compared with the existing SPC baselines, our method achieves ranking efficiency by explicitly removing the need for a separate pass for each granularity.
PFiD directly consumes the entire document and scores the relevance of all passages and the document simultaneously. (ii) RFiD, which implicitly guides the decoder with rationale embeddings, improves over FiD by a large margin; however, it is still worse than BM25. This suggests that implicit guidance does benefit the model's ranking ability to some extent. However, when ranking various passages from multiple documents, traditional MPC is completely indistinguishable, suggesting that the cross-attention score from the decoder is not suited for the passage ranking task.

Table 2: The evaluation results of different baselines. For document retrieval, we rank the top 100 documents retrieved by BM25, while the passage ranking task ranks the passages within the retrieved document. N denotes the number of documents to rank, whereas P denotes the number of passages in the document. The best performances are marked with †. Latency indicates the total inference time from document retrieval to passage ranking, measured by averaging the time taken per query with a single thread and a single batch on the GPU.

Model       | Category | top-k | Doc. MRR@10 | Doc. Recall@5 | Doc. Recall@10 | Pas. MRR@10 | Pas. nDCG@5 | Pas. nDCG@10 | Complexity | Latency (s)
BM25        | -        | |C|   | 0.3951      | 0.3683        | 0.4736         | 0.7366      | 0.7718      | 0.7856       | -          | 0.32 (x1.00)
MonoT5      | SPC      | 100   | 0.6204      | 0.5141        | 0.5794         | 0.8571      | 0.8774      | 0.8803       | O(N + NP)  | 5.65 (x17.65)
RankT5      | SPC      | 100   | 0.6352      | 0.4992        | 0.5605         | 0.8778†     | 0.8916†     | 0.8952†      | O(N + NP)  | 5.64 (x17.62)
FiD         | MPC      | 100   | 0.6322      | 0.5139        | 0.5821         | 0.3464      | 0.3725      | 0.4260       | O(N)       | 1.17 (x3.65)
RFiD        | MPC      | 100   | 0.7177      | 0.5743        | 0.6407         | 0.5617      | 0.6036      | 0.6359       | O(N)       | 1.21 (x3.78)
PFiD (Ours) | MPC      | 100   | 0.7231†     | 0.5937†       | 0.6516†        | 0.8530      | 0.8726      | 0.8780       | O(N)       | 1.23 (x3.84)

¹ MS MARCO also provides a segmented document collection, but the segmented corpus does not align with the passages in the passage ranking tasks.
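For reference, the ranking metrics reported in Table 2 (MRR@k and nDCG@k) can be computed as follows. This is a generic sketch using the standard linear-gain nDCG formulation over a binary relevance list, not the authors' evaluation script.

```python
# Sketch of MRR@k and nDCG@k over a relevance list ordered by the
# system's ranking (rels[0] is the top-ranked item's relevance).
import numpy as np

def mrr_at_k(rels, k):
    """Reciprocal rank of the first relevant item within the top k."""
    for rank, rel in enumerate(rels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return float((rels / np.log2(np.arange(2, rels.size + 2))).sum())

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# e.g. the only relevant passage was ranked second:
print(mrr_at_k([0, 1, 0], 10))   # 0.5
print(ndcg_at_k([0, 1, 0], 10))  # 1/log2(3) ≈ 0.6309
```

Both metrics reward placing relevant passages early, which is why the decoder cross-attention scores of FiD, being nearly uniform across passages, score so poorly here.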
(iii) SPC achieves superior performance over MPC on the passage ranking task, as it is trained with rich negative samples from other documents, while MPC is trained only with in-document negatives. Additionally, even with only in-document negatives, when trained with inter-passage attention, PFiD achieves ranking effectiveness that rivals that of SPC, suggesting that incorporating an additional module to identify relevant passages is more effective than relying solely on the cross-attention mechanism of the decoder.

Results in real-world RAG scenarios. Next, we investigate the effectiveness of PFiD in real-world RAG scenarios. We first retrieve a number of documents from the candidates and rerank the passages within the retrieved documents. Figure 1 presents the results of this evaluation. Notably, although Table 2 showed that MPC outperforms SPC on document retrieval tasks, its performance drops drastically in this setting, as the cross-attention scores from the decoder are indistinguishable across passages from multiple documents. Additionally, despite RankT5 reaching the best effectiveness on the passage ranking task, it did not exhibit any improvement over our method in real-world RAG scenarios, suggesting the importance of multi-granularity ranking. Instead, PFiD consistently outperforms all baselines by leveraging the complementary nature of SPC and MPC. PFiD retrieves documents and ranks passages more efficiently, and captures the relative semantic correlation between different passages, leading to superior performance.

Cross-attention vs. PFiD. As discussed above, PFiD has the advantage of identifying relevant passages compared to previous models like RFiD, since it explicitly models relative passage relevance. We investigate the effects of the cross-attention scores of the decoder and our passage ranking scores on the passage ranking task. Figure 2 illustrates the distribution of the ranks of positive passages. As depicted in Figure 2, PFiD is more strongly correlated with passage relevance than the cross-attention scores, suggesting that PFiD focuses more on positive passages by explicitly learning relative passage relevance. Our experimental results show that this enhanced ability to identify relevant passages contributes to the overall performance improvement.

Figure 2: Distribution of the rank of positive passages. (Two panels: cross attention vs. PFiD; x-axis: rank, 0-40; y-axis: frequency.)

References

[1] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.
[2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[3] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, et al., PaLM 2 technical report, 2023. arXiv:2305.10403.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[5] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 9802-9822. URL: https://aclanthology.org/2023.acl-long.546. doi:10.18653/v1/2023.acl-long.546.
[6] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1906-1919. URL: https://aclanthology.org/2020.acl-main.173. doi:10.18653/v1/2020.acl-main.173.
[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. arXiv:2311.05232.
[8] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. arXiv:2005.11401.
[9] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question answering, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 874-880. URL: https://aclanthology.org/2021.eacl-main.74. doi:10.18653/v1/2021.eacl-main.74.
[10] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Atlas: Few-shot learning with retrieval augmented language models, Journal of Machine Learning Research 24 (2023) 1-43. URL: http://jmlr.org/papers/v24/23-0037.html.
[11] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[12] H. He, H. Zhang, D. Roth, Rethinking with retrieval: Faithful large language model inference, 2022. arXiv:2301.00303.
[13] N. Thakur, L. Bonifacio, X. Zhang, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, B. Chen, M. Rezagholizadeh, J. Lin, NoMIRACL: Knowing when you don't know for robust multilingual retrieval-augmented generation, 2024. arXiv:2312.11361.
[14] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, D. Zhou, Large language models can be easily distracted by irrelevant context, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 31210-31227. URL: https://proceedings.mlr.press/v202/shi23a.html.
[15] A. Asai, Z. Wu, Y. Wang, A. Sil, H. Hajishirzi, Self-RAG: Learning to retrieve, generate, and critique through self-reflection, 2023. arXiv:2310.11511.
[16] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009) 333-389. doi:10.1561/1500000019.
[17] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. arXiv:1910.10683.
[20] R. Nogueira, K. Cho, Passage re-ranking with BERT, 2020. arXiv:1901.04085.
[21] H. Zhuang, Z. Qin, R. Jagerman, K. Hui, J. Ma, J. Lu, J. Ni, X. Wang, M. Bendersky, RankT5: Fine-tuning T5 for text ranking with ranking losses, 2022. arXiv:2210.10634.
[22] R. Nogueira, Z. Jiang, J. Lin, Document ranking with a pretrained sequence-to-sequence model, 2020. arXiv:2003.06713.
[23] X. Ma, L. Wang, N. Yang, F. Wei, J. Lin, Fine-tuning LLaMA for multi-stage text retrieval, 2023. arXiv:2310.08319.
[24] C. Wang, H. Yu, Y. Zhang, RFiD: Towards rational fusion-in-decoder for open-domain question answering, 2023. arXiv:2305.17041.
[25] X. Zhang, N. Thakur, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, M. Rezagholizadeh, J. Lin, MIRACL: A multilingual retrieval dataset covering 18 diverse languages, Transactions of the Association for Computational Linguistics 11 (2023) 1114-1131.
[26] G. Izacard, E. Grave, Distilling knowledge from reader to retriever for question answering, 2022. arXiv:2012.04584.
[27] J. Yang, Z. Liu, C. Li, G. Sun, X. Xie, Longtriever: A pre-trained long text encoder for dense document retrieval, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3655-3665. URL: https://aclanthology.org/2023.emnlp-main.223. doi:10.18653/v1/2023.emnlp-main.223.
[28] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A human generated machine reading comprehension dataset, 2018. arXiv:1611.09268.
[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
[30] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems 20 (2002) 422-446. URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.