=Paper=
{{Paper
|id=Vol-3784/short7
|storemode=property
|title=Enhancing Fusion-in-Decoder for Multi-Granularity Ranking
|pdfUrl=https://ceur-ws.org/Vol-3784/short7.pdf
|volume=Vol-3784
|authors=Haeju Park,Kyungjae Lee,Sunghyun Park,Moontae Lee
|dblpUrl=https://dblp.org/rec/conf/ir-rag/ParkL0L24
}}
==Enhancing Fusion-in-Decoder for Multi-Granularity Ranking==
Haeju Park*, Kyungjae Lee, Sunghyun Park and Moontae Lee
LG AI Research, Republic of Korea
*Corresponding author.
Information Retrieval's Role in RAG Systems (IR-RAG), 2024.
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance across various natural language tasks, leveraging extensive
knowledge from massive datasets. However, their reliance solely on parametric knowledge often leads to the generation of inaccurate or
outdated content, particularly in domain-specific tasks. Retrieval Augmented Generation (RAG) has emerged as a promising approach
to address this limitation by incorporating external knowledge without necessitating re-training. While RAG enhances the accuracy
of LLM-generated content, effectively retrieving external knowledge remains a challenge due to potential noise and computational
costs. To address this, traditional information retrieval systems adopt two-stage approaches, utilizing efficient retrievers followed by
reranking mechanisms. Recently, transformer-based architectures, including BERT and T5 models, have shown promise as effective
rerankers. However, such models have limited context size and only perform single-granularity ranking at a time, hindering their
effectiveness and efficiency. In this paper, we first explore the existing rerankers such as RankT5 and RFiD, highlighting challenges in
multi-granularity ranking. Subsequently, we introduce PFiD (Passage Fusion-in-Decoder), a simple yet efficient approach that ranks documents and passages simultaneously. Through empirical evaluation, we demonstrate the efficacy of PFiD in improving both effectiveness and efficiency, offering a promising direction for further research in this domain.
Keywords
Information Systems, Retrieval Augmented Generation, Large Language Model
1. Introduction

Despite their remarkable capabilities and growth, Large Language Models (LLMs) [1, 2, 3, 4] still tend to generate factually incorrect or outdated content because they rely solely on their parametric knowledge, especially in domain-specific or knowledge-intensive tasks [5, 6, 7]. Retrieval Augmented Generation (RAG) approaches [8, 9, 10, 11] have therefore gained significant attention: they improve the quality of LLM-generated output by grounding it in external knowledge that supplements the LLMs' parametric knowledge, without having to re-train the LLMs. RAG leverages a powerful information retrieval model designed to search large datasets or knowledge bases. The retrieved information is then incorporated into the LLM, enabling it to generate more accurate and contextually relevant content. By incorporating external knowledge, RAG can effectively reduce the problem of generating factually incorrect or outdated content [12, 13].

However, current RAG frameworks face major challenges in the effectiveness and efficiency of their information retrieval systems. First, LLMs tend to generate inaccurate responses when given distracting (or noisy) contexts, so the performance of the retrieval model has a significant impact on the quality of RAG's responses [14, 15, 11]. Second, the retrieval component of RAG must search through large-scale knowledge bases or the web, which can be computationally expensive and slow [11]. Due to these challenges, existing retrieval systems adopt two-stage approaches: an efficient first-stage retriever such as BM25 [16] or DPR [17] retrieves a set of documents from a larger collection, and a second-stage reranker then reranks the retrieved documents for precise ranking. Recently, with the advent of transformer-based models such as BERT [18] and T5 [19], architectures including bi-encoders [17], cross-encoders [20], encoder-decoders [21, 22], and decoder-only models [23] have gradually shown their effectiveness as rerankers. However, these models have limited context size and only perform single-granularity ranking during inference, which hinders their effectiveness and efficiency in real-world RAG scenarios.

To this end, in this paper we focus on the multi-granularity ranking task, which ranks both documents and passages simultaneously. Specifically, we first investigate single-passage cross-encoder models such as MonoT5 [22] and RankT5 [21]. These models achieve superior performance across various ranking tasks, but due to their input-token constraint, their efficiency is limited in real-world RAG scenarios. Next, we examine multi-passage cross-encoders such as FiD [9] and RFiD [24]. These models alleviate the input-token limit by processing multiple passages, but they directly use the decoder's cross-attention scores as passage relevance, which is learned only implicitly, so they have difficulty distinguishing relative differences between passages. Thereafter, we propose PFiD (Passage Fusion-in-Decoder), a simple and effective model for multi-granularity ranking. PFiD extends the FiD model by generating a document-level relevance token, enabling both document retrieval and passage ranking, and it adopts an inter-passage attention mechanism to learn relative passage relevance explicitly, using the special tokens at the beginning of the input text to represent the entire context.

Experiments on the MIRACL passage ranking dataset [25] demonstrate that PFiD improves effectiveness and efficiency compared to existing approaches, especially in RAG scenarios.

2. Preliminaries

2.1. Task definition

Given a user query $q$ and a document (or passage) corpus $C = \{D_1, D_2, \dots, D_n\}$, the goal of document retrieval is to find the $k$ documents that are most relevant to the query $q$. In our multi-granularity ranking setting, which consists of document retrieval and passage ranking tasks, the document retrieval task is to rerank the top-$k$ documents retrieved by BM25. While traditional passage ranking tasks typically involve ranking entire passages, in this paper the passage ranking task focuses solely on ranking the passages within the retrieved document itself, which aligns more closely with real-world RAG scenarios and is thus more feasible.
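To make this setting concrete, here is a minimal sketch of the two-stage, multi-granularity pipeline; `bm25_retrieve` and `score_document` are hypothetical placeholders, not code from this paper.

```python
from typing import Callable, List, Tuple

# Two-stage multi-granularity ranking: a first-stage retriever returns top-k
# candidate documents, and a second-stage reranker assigns one document-level
# score plus per-passage scores to each candidate.
Document = List[str]  # a document is a list of passages

def multi_granularity_rank(
    query: str,
    bm25_retrieve: Callable[[str, int], List[Document]],
    score_document: Callable[[str, Document], Tuple[float, List[float]]],
    k: int = 100,
) -> List[Tuple[Document, List[float]]]:
    docs = bm25_retrieve(query, k)  # first stage: efficient lexical retrieval
    scored = [(doc, *score_document(query, doc)) for doc in docs]
    scored.sort(key=lambda item: item[1], reverse=True)  # rank by document score
    return [(doc, passage_scores) for doc, _, passage_scores in scored]
```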
2.2. Ranking models

Pre-trained Language Models (PLMs) are currently the most effective ranking models and can be categorized into bi-encoders and cross-encoders. Bi-encoders encode a query and a passage separately to obtain semantic representations [17]; they have emerged as powerful first-stage retrievers because the passage representations can be pre-computed offline. In contrast, cross-encoders take the concatenation of the query and a passage and perform query-passage interactions [20]; they are conceived as second-stage rerankers, designed to explicitly refine the results provided by first-stage retrieval. In this paper, for brevity, we also refer to other PLMs that perform query-passage interactions simultaneously, such as encoder-only [17, 20], decoder-only [23], and encoder-decoder [21, 22] models, as cross-encoders.

There are several PLM-based cross-encoders, including sequence-to-sequence language models such as MonoT5 [22] and RankT5 [21] for ranking tasks, as well as multi-passage reader models like FiD [9] and RFiD [24] for RAG tasks, all of which have demonstrated superior effectiveness.

MonoT5. MonoT5 [22] is the first work to cast ranking as a text generation task by leveraging the T5 [19] encoder-decoder model. A query-document pair is concatenated into the input sequence "Query: $q$ Document: $D_i$ Relevant:", and the tokens true and false serve as target tokens representing relevance. The model is then fine-tuned on this text generation task. After training, ranking scores are derived from the logit of the true token, with a softmax applied only over the logits of the true and false tokens.
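As a concrete illustration of this scoring step, here is a minimal sketch; the token ids are placeholders for illustration, not the actual T5 vocabulary ids.

```python
import torch

# MonoT5-style scoring sketch: the relevance score is the probability of the
# "true" token under a softmax restricted to the "true" and "false" logits.
# `logits` stands for the decoder's first-step output over the vocabulary.
def monot5_score(logits: torch.Tensor, true_id: int, false_id: int) -> float:
    pair = torch.stack([logits[true_id], logits[false_id]])
    probs = torch.softmax(pair, dim=0)
    return probs[0].item()  # P("true" | query, document)

logits = torch.tensor([0.1, 2.3, -1.0, 0.5])        # toy vocabulary of 4 tokens
print(monot5_score(logits, true_id=1, false_id=2))  # ~0.96
```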
RankT5. Following MonoT5 [22], the input sequence is similar except that RankT5 does not include the "Relevant:" postfix. The model then uses the unnormalized logit of a special target token as the ranking score and is trained directly with a list-wise ranking loss instead of the text generation loss used in MonoT5 [22]. However, these models cannot be directly used for long document retrieval due to the maximum input length constraint common to most PLMs, which hinders their effectiveness in the document retrieval task.
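This list-wise objective can be read as a softmax cross-entropy over the candidates' ranking logits. The sketch below is an illustrative reconstruction under that reading, not RankT5's reference implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative list-wise softmax cross-entropy: each candidate passage gets one
# unnormalized score (the special-token logit), and the loss pushes probability
# mass toward the relevant candidates within the list.
def listwise_softmax_ce(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """scores: (list_size,) ranking logits; labels: (list_size,) 0/1 relevance."""
    log_probs = F.log_softmax(scores, dim=0)
    return -(labels * log_probs).sum() / labels.sum().clamp(min=1)

scores = torch.tensor([1.2, -0.3, 0.8, 2.0])  # one positive, three negatives
labels = torch.tensor([0.0, 0.0, 0.0, 1.0])
print(listwise_softmax_ce(scores, labels))    # low when the positive scores highest
```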
FiD. The FiD model further extends the T5 [19] encoder-decoder model: it takes multiple ($n$) passages as input, encodes them separately, and then feeds the $n$ concatenated encoder hidden states into the T5 decoder to generate the answer. Relevance scores for passages are computed from cross-attention scores, which entails averaging the attention scores across all tokens within a passage and across all layers and heads of the decoder [26].
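This aggregation can be sketched as follows, assuming a tensor of decoder cross-attention weights; all shapes and the uniform passage length are illustrative assumptions.

```python
import torch

# Cross-attention-based passage relevance, as described above: average the
# decoder's cross-attention over all layers, heads, and target positions, then
# sum the attention mass over each passage's token span.
def fid_passage_scores(cross_attn: torch.Tensor, n_passages: int, passage_len: int):
    """cross_attn: (layers, heads, target_len, n_passages * passage_len)."""
    per_source_token = cross_attn.mean(dim=(0, 1, 2))                  # (source_len,)
    return per_source_token.view(n_passages, passage_len).sum(dim=1)   # (n_passages,)

attn = torch.rand(12, 12, 8, 3 * 200)  # 12 layers, 12 heads, 3 passages of 200 tokens
print(fid_passage_scores(attn, n_passages=3, passage_len=200))
```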
RFiD. FiD [9] treats all passages equally within its encoders and depends solely on the cross-attention mechanism to establish correlations between the decoder and the encoders, which may lead it to identify an incorrect answer by referring to spurious passages. RFiD [24] instead improves FiD by identifying potential answer-containing (rationale) passages among the candidates and guiding the decoder with the identified rationales. Afterward, the cross-attention scores are directly regarded as passage relevance scores, the same as in [9]. However, even with the rationales, the cross-attention mechanism still falls short of distinguishing relative differences between passages, as it is only implicitly guided by a rationale classifier trained solely with a point-wise binary classification loss.

3. Method

In this section, we describe a simple but effective Passage Fusion-in-Decoder (PFiD) for multi-granularity ranking. PFiD adopts the FiD [9] architecture as a base model and extends it by using the true and false tokens as target tokens to model document-level relevance, enabling multi-granularity ranking in a single pass. Additionally, PFiD integrates inter-passage attention to learn relative passage relevance explicitly, which is similar in spirit to the list-wise training objective of RankT5 [21].

Fusion-in-Decoder for Document Retrieval. Formally, given a question $q$ and a set of $n$ passages within a document $D_i = \{p_1^i, p_2^i, \dots, p_n^i\}$, the FiD encoder outputs the $k$-th passage embeddings $\mathbf{H}_k \in \mathbb{R}^{L \times d}$, where $L$ denotes the maximum token length and $d$ the dimension of the hidden states:

$$\mathbf{H}_k = \text{FiD-Encoder}(q + p_k^i) \tag{1}$$

These are then concatenated as the input of the fusion decoder, $[\mathbf{H}_1, \mathbf{H}_2, \dots, \mathbf{H}_n]$. The FiD decoder uses $[\mathbf{H}_1, \mathbf{H}_2, \dots, \mathbf{H}_n]$ to generate the target token $T = \texttt{true}$ or $\texttt{false}$. The loss function can therefore be defined as:

$$\mathcal{L}_{FiD} = -\sum_{t=1}^{T} \log p\big(y_t \mid y_1, y_2, \dots, y_{t-1}, [\mathbf{H}_1, \mathbf{H}_2, \dots, \mathbf{H}_n]\big) \tag{2}$$
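A schematic of the data flow in Eq. (1)-(2), with toy dimensions and a stand-in encoder; in the actual model the encoder and decoder are T5 weights and the target is the true/false token.

```python
import torch

# Eq. (1)-(2) schematically: encode each passage (paired with the query)
# independently, concatenate the per-passage hidden states along the sequence
# axis, and let the decoder attend over the fused sequence. `encode` is a
# placeholder for FiD-Encoder(q + p_k); dimensions are illustrative.
L, d, n = 200, 768, 3                                  # tokens, hidden size, passages
encode = lambda query, passage: torch.randn(L, d)      # stand-in for FiD-Encoder
query = "example question"
passages = [f"passage {k}" for k in range(n)]
H = [encode(query, p) for p in passages]               # each H_k in R^{L x d}
fused = torch.cat(H, dim=0)                            # [H_1; ...; H_n] in R^{nL x d}
assert fused.shape == (n * L, d)
# the decoder would attend over `fused` to emit T = true or false (Eq. 2)
```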
Inter-passage Attention. Previous work [24] tackled the issue of spurious passages by employing a binary classifier on the first token's encoder hidden states $\mathbf{H}_{k,1}$ to determine whether a passage is a rationale passage for the query, and then guided the decoder by appending additional embeddings at the end of the encoder's hidden states, $[\mathbf{H}_1, \mathbf{H}_2, \dots, \mathbf{H}_k, \mathbf{H}_{k+1}]$, where $\mathbf{H}_{k+1} \in \mathbb{R}^{2 \times d}$ is a trainable rationale embedding. However, as Table 2 shows, this drastically underperforms on the passage ranking task by a large margin, as it does not explicitly model relative passage relevance.

Instead, to mitigate this, we utilize inter-passage attention to model interactions between passages explicitly. PFiD builds an input sequence by collecting the first-token hidden states of each query-passage pair, $\mathbf{B} = [\mathbf{H}_{1,1}, \mathbf{H}_{2,1}, \dots, \mathbf{H}_{k,1}]$, where $\mathbf{H}_{i,j}$ denotes the $j$-th token embedding of the $i$-th passage. In a standard cross-encoder, the first token of the encoder aggregates query-passage information to compute a relevance score; we further use this token to capture relative semantics via a self-attention mechanism. Inspired by [27], we employ a single-layer transformer to model relative passage relevance as follows:

$$\tilde{\mathbf{B}} = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\right)\mathbf{V}, \quad \text{where } \mathbf{Q} = \mathbf{B}\mathbf{W}_Q, \; \mathbf{K} = \mathbf{B}\mathbf{W}_K, \; \mathbf{V} = \mathbf{B}\mathbf{W}_V \tag{3}$$

in which the matrices $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}$ are learnable parameters. The information from different passages is fused and exchanged via the self-attention mechanism.
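Eq. (3) is an ordinary single-layer self-attention applied to the per-passage first-token states. A minimal sketch with randomly initialized weights (in the model these parameters are learned jointly with the rest of PFiD):

```python
import torch

# Inter-passage attention (Eq. 3): self-attention over one first-token vector
# per passage, so each passage representation is contextualized by the others.
k, d = 5, 768
B = torch.randn(k, d)                              # B = [H_{1,1}, ..., H_{k,1}]
W_q, W_k, W_v = (torch.nn.Linear(d, d, bias=False) for _ in range(3))
Q, K, V = W_q(B), W_k(B), W_v(B)
attn = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # (k, k) passage-to-passage weights
B_tilde = attn @ V                                 # fused representations, (k, d)
print(B_tilde.shape)
```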
The training loss used for inter-passage attention can be defined as follows:

$$s_i = \text{softmax}(\tilde{\mathbf{B}}_i \mathbf{W}_B) \in \mathbb{R}^2, \qquad \mathcal{L}_{passage} = -\big(y \log(s_i) + (1 - y) \log(1 - s_i)\big) \tag{4}$$

where $y$ is the passage relevance label. The overall training objective of PFiD is:

$$\mathcal{L}_{all} = \mathcal{L}_{FiD} + \lambda \mathcal{L}_{passage} \tag{5}$$

where $\lambda$ is a hyperparameter that balances the two losses.
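Putting Eq. (4)-(5) together, a sketch of the overall objective; the two-class cross-entropy below is equivalent to the binary form of Eq. (4), and `loss_fid` is a placeholder for the generation loss of Eq. (2).

```python
import torch
import torch.nn.functional as F

# Overall PFiD objective (Eq. 4-5): a relevance head on the fused first-token
# states plus the FiD generation loss, balanced by lambda. Sizes are toy values.
k, d, lam = 5, 768, 0.5
B_tilde = torch.randn(k, d)                  # output of inter-passage attention
W_B = torch.nn.Linear(d, 2)                  # projects to (irrelevant, relevant)
logits = W_B(B_tilde)                        # s_i before the softmax of Eq. (4)
y = torch.tensor([0, 0, 1, 0, 0])            # per-passage relevance labels
loss_passage = F.cross_entropy(logits, y)    # two-class form of Eq. (4)
loss_fid = torch.tensor(1.0)                 # placeholder for Eq. (2)
loss_all = loss_fid + lam * loss_passage     # Eq. (5)
print(loss_all)
```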
4. Experimental setup

4.1. Datasets
We use the MIRACL [25] passage ranking dataset for our experiments. MIRACL is a large-scale, open-domain, human-generated multi-document ranking dataset similar to MS MARCO [28], but it has the advantage of providing a segmented document collection, enabling both document retrieval and passage ranking. (MS MARCO also provides a segmented document collection, but the segmented corpus does not align with the passages in its passage ranking tasks.) For the document retrieval task, we construct the document retrieval dataset by regarding any document with at least one positive passage as a positive document. Table 1 shows the statistics of the datasets.

Table 1
Statistics of the datasets.

| Task | # train | # dev | # avg. judgements | # corpus |
|---|---|---|---|---|
| Document Retrieval | 22,548 | 6,404 | 2.22 | 5,758,285 |
| Passage Ranking | 29,416 | 8,350 | 2.75 | 32,893,221 |
4.2. Baselines

We compare PFiD against three types of ranking baselines. The first is Single-Passage Cross-encoder (SPC) baselines, including MonoT5 [22] and RankT5 [21]; due to the input-token constraint, we take only the first $k$ tokens of each document in the document retrieval task, and the models are trained list-wise with negatives randomly sampled from the entire passage set. An alternative approach is to score each passage independently and either take the highest-scoring passage as the representative for ranking the document, or perform retrieval directly over the segmented passages; however, we omit these approaches, as the former lacks efficiency and the latter is not scalable to real-world RAG scenarios. The second is Multi-Passage Cross-encoder (MPC) baselines, including FiD [9] and RFiD [24]; for comparison in our experimental setting, both FiD and RFiD are trained with the target tokens true or false, enabling both document retrieval and passage ranking. All SPC and MPC baselines in this experiment are initialized from the T5-base model. The third is the most frequently employed lexical ranker, BM25 [16]; we use the Elasticsearch engine with the default parameters $k_1 = 1.2$ and $b = 0.75$ (the scoring function is sketched below).
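For reference, a textbook BM25 scoring function with these defaults; this follows the Lucene-style formulation and is a self-contained sketch, not the exact Elasticsearch internals.

```python
import math

# BM25 with k1 = 1.2 and b = 0.75, the defaults used by the baseline.
def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    score, dl = 0.0, len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0 or term not in doc_freq:
            continue
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score

doc = "retrieval augmented generation grounds llms in external knowledge".split()
print(bm25_score(["retrieval", "llms"], doc, {"retrieval": 50, "llms": 120},
                 n_docs=1_000, avg_len=60.0))
```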
scores the relevance of the entire passages and document
simultaneously. (ii) RFiD, implicitly guiding the decoder
with rationale embedding shows improvement over FiD by
a large margin, however, it is still even worse than BM25. It
1
suggests that implicitly guiding indeed benefits the modelβs
MS MARCO also provide segmented document collection, but the
segmented corpus do not align with passages in passage ranking tasks.
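For quick reference, the reported setup can be collected in one place; the field names below are our own shorthand for the values stated above, not keys from any released configuration.

```python
# Summary of the hyperparameters reported in Section 4.3 (names are ours).
config = dict(
    base_model="t5-base",        # T5-base [19]
    optimizer="adam",            # Adam [29]
    learning_rate=1e-4,
    dropout=0.1,
    top_k_passages=100,          # passages used for training and inference
    max_tokens_per_passage=200,  # truncation length
    loss_lambda=0.5,             # balance between L_FiD and L_passage (Eq. 5)
)
```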
5. Results and Analysis
Table 2
Evaluation results of the different baselines. For document retrieval (DR), we rank the top-100 documents retrieved by BM25, while the passage ranking (PR) task ranks the passages within the retrieved document. $N$ denotes the number of documents to rank, and $N_p$ the number of passages in the document. The best performances are marked with *. Latency indicates the total inference time from document retrieval to passage ranking, measured by averaging the time taken for each query with a single thread and a single batch on the GPU.

| Model | Category | top-$k$ | DR MRR@10 | DR Recall@5 | DR Recall@10 | PR MRR@10 | PR nDCG@5 | PR nDCG@10 | Complexity | Latency (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| BM25 | - | \|C\| | 0.3951 | 0.3683 | 0.4736 | 0.7366 | 0.7718 | 0.7856 | - | 0.32 (x1.00) |
| MonoT5 | SPC | 100 | 0.6204 | 0.5141 | 0.5794 | 0.8571 | 0.8774 | 0.8803 | O(N + N_p) | 5.65 (x17.65) |
| RankT5 | SPC | 100 | 0.6352 | 0.4992 | 0.5605 | 0.8778* | 0.8916* | 0.8952* | O(N + N_p) | 5.64 (x17.62) |
| FiD | MPC | 100 | 0.6322 | 0.5139 | 0.5821 | 0.3464 | 0.3725 | 0.4260 | O(N) | 1.17 (x3.65) |
| RFiD | MPC | 100 | 0.7177 | 0.5743 | 0.6407 | 0.5617 | 0.6036 | 0.6359 | O(N) | 1.21 (x3.78) |
| PFiD (Ours) | MPC | 100 | 0.7231* | 0.5937* | 0.6516* | 0.8530 | 0.8726 | 0.8780 | O(N) | 1.23 (x3.84) |
Retrieval and Ranking. Table 2 presents our evaluation results on the document retrieval and passage ranking tasks. The key observations are as follows: (i) MPC significantly outperforms SPC on the document retrieval task by aggregating multiple ($n$) passages, alleviating SPC's limited context size. In particular, PFiD outperforms RFiD by a large margin on both the document retrieval and passage ranking tasks, indicating that leveraging passage-wise context to guide the decoder helps identify relative passage relevance. Note that, compared with the existing SPC baselines, our method achieves ranking efficiency by removing the need for a separate pass per granularity: PFiD directly consumes the entire document and scores the relevance of all passages and the document simultaneously. (ii) RFiD, which implicitly guides the decoder with rationale embeddings, improves over FiD by a large margin; however, it still underperforms even BM25 on the passage ranking task. This suggests that implicit guidance does benefit the model's ranking ability to some extent, but when ranking various passages from multiple documents, traditional MPC remains completely indistinguishable, suggesting the decoder's cross-attention score is not suited to the passage ranking task. (iii) SPC achieves superior performance over MPC on the passage ranking task, as it is trained with rich negative samples from other documents, while MPC is trained only with in-document negatives. Nevertheless, even with in-document negatives, PFiD trained with inter-passage attention achieves ranking effectiveness that rivals that of SPC, suggesting that incorporating an additional module to identify relevant passages is more effective than relying solely on the decoder's cross-attention mechanism.

Results on real-world RAG scenarios. Next, we investigate the effectiveness of PFiD in real-world RAG scenarios. We first retrieve a number of documents from the candidates and then rerank the passages within the retrieved documents. Figure 1 presents the results of this evaluation.

[Figure 1: Passage ranking results in real-world RAG scenarios (nDCG@10 vs. number of passages, for 1, 2, 5, and 10 retrieved documents; methods: RankT5, RFiD, PFiD). We first retrieve a number of documents and rerank the passages within the retrieved documents.]

Notably, although Table 2 shows that MPC outperforms SPC on document retrieval, its performance drops drastically in this setting, as the decoder's cross-attention scores are indistinguishable across passages from multiple documents. Additionally, despite RankT5 reaching the best effectiveness on the passage ranking task, it did not exhibit any improvement over our method in real-world RAG scenarios, underscoring the importance of multi-granularity ranking. Instead, PFiD consistently outperforms all baselines by leveraging the complementary nature of SPC and MPC: it retrieves documents and ranks passages more efficiently while capturing the relative semantic correlation between different passages, leading to superior performance.

Cross-attention vs. PFiD. As discussed above, PFiD has the advantage of identifying relevant passages compared to previous models like RFiD, since it explicitly models relative passage relevance. We compare the decoder's cross-attention scores with our passage ranking scores on the passage ranking task. Figure 2 illustrates the distribution of the rank of positive passages.

[Figure 2: Distribution of the rank of positive passages, for cross-attention scores (left) vs. PFiD (right).]

As depicted in Figure 2, PFiD correlates more strongly with passage relevance than the cross-attention scores, suggesting that PFiD focuses more on positive passages by explicitly learning relative passage relevance. Our experimental results show that this enhanced ability to identify relevant passages contributes to the overall performance improvement.

References
[1] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.
[2] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. B. et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[3] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. C. et al., PaLM 2 technical report, 2023. arXiv:2305.10403.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.
[5] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 9802–9822. URL: https://aclanthology.org/2023.acl-long.546. doi:10.18653/v1/2023.acl-long.546.
[6] J. Maynez, S. Narayan, B. Bohnet, R. McDonald, On faithfulness and factuality in abstractive summarization, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1906–1919. URL: https://aclanthology.org/2020.acl-main.173. doi:10.18653/v1/2020.acl-main.173.
[7] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, 2023. arXiv:2311.05232.
[8] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. tau Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. arXiv:2005.11401.
[9] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question answering, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 874–880. URL: https://aclanthology.org/2021.eacl-main.74. doi:10.18653/v1/2021.eacl-main.74.
[10] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, E. Grave, Atlas: Few-shot learning with retrieval augmented language models, Journal of Machine Learning Research 24 (2023) 1–43. URL: http://jmlr.org/papers/v24/23-0037.html.
[11] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[12] H. He, H. Zhang, D. Roth, Rethinking with retrieval: Faithful large language model inference, 2022. arXiv:2301.00303.
[13] N. Thakur, L. Bonifacio, X. Zhang, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, B. Chen, M. Rezagholizadeh, J. Lin, NoMIRACL: Knowing when you don't know for robust multilingual retrieval-augmented generation, 2024. arXiv:2312.11361.
[14] F. Shi, X. Chen, K. Misra, N. Scales, D. Dohan, E. H. Chi, N. Schärli, D. Zhou, Large language models can be easily distracted by irrelevant context, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 31210–31227. URL: https://proceedings.mlr.press/v202/shi23a.html.
[15] A. Asai, Z. Wu, Y. Wang, A. Sil, H. Hajishirzi, Self-RAG: Learning to retrieve, generate, and critique through self-reflection, 2023. arXiv:2310.11511.
[16] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Foundations and Trends in Information Retrieval 3 (2009) 333–389. doi:10.1561/1500000019.
[17] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W. tau Yih, Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.
[19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. arXiv:1910.10683.
[20] R. Nogueira, K. Cho, Passage re-ranking with BERT, 2020. arXiv:1901.04085.
[21] H. Zhuang, Z. Qin, R. Jagerman, K. Hui, J. Ma, J. Lu, J. Ni, X. Wang, M. Bendersky, RankT5: Fine-tuning T5 for text ranking with ranking losses, 2022. arXiv:2210.10634.
[22] R. Nogueira, Z. Jiang, J. Lin, Document ranking with a pretrained sequence-to-sequence model, 2020. arXiv:2003.06713.
[23] X. Ma, L. Wang, N. Yang, F. Wei, J. Lin, Fine-tuning LLaMA for multi-stage text retrieval, 2023. arXiv:2310.08319.
[24] C. Wang, H. Yu, Y. Zhang, RFiD: Towards rational fusion-in-decoder for open-domain question answering, 2023. arXiv:2305.17041.
[25] X. Zhang, N. Thakur, O. Ogundepo, E. Kamalloo, D. Alfonso-Hermelo, X. Li, Q. Liu, M. Rezagholizadeh, J. Lin, MIRACL: A multilingual retrieval dataset covering 18 diverse languages, Transactions of the Association for Computational Linguistics 11 (2023) 1114–1131.
[26] G. Izacard, E. Grave, Distilling knowledge from reader to retriever for question answering, 2022. arXiv:2012.04584.
[27] J. Yang, Z. Liu, C. Li, G. Sun, X. Xie, Longtriever: A pre-trained long text encoder for dense document retrieval, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3655–3665. URL: https://aclanthology.org/2023.emnlp-main.223. doi:10.18653/v1/2023.emnlp-main.223.
[28] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, MS MARCO: A human generated machine reading comprehension dataset, 2018. arXiv:1611.09268.
[29] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.
[30] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inf. Syst. 20 (2002) 422–446. URL: https://doi.org/10.1145/582415.582418. doi:10.1145/582415.582418.