<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Enhancing Fusion-in-Decoder for Multi-Granularity Ranking</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Haeju</forename><surname>Park</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">LG AI Research</orgName>
								<address>
									<country key="KR">Republic of Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kyungjae</forename><surname>Lee</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">LG AI Research</orgName>
								<address>
									<country key="KR">Republic of Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sunghyun</forename><surname>Park</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">LG AI Research</orgName>
								<address>
									<country key="KR">Republic of Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Moontae</forename><surname>Lee</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">LG AI Research</orgName>
								<address>
									<country key="KR">Republic of Korea</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Enhancing Fusion-in-Decoder for Multi-Granularity Ranking</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">AEC23638FD7028479DE81C6D21979ECF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:09+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information Systems</term>
					<term>Retrieval Augmented Generation</term>
					<term>Large Language Model</term>
					<term>Information Retrieval&apos;s Role in RAG Systems (IR-RAG) 2024</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have demonstrated exceptional performance across various natural language tasks, leveraging extensive knowledge from massive datasets. However, their reliance solely on parametric knowledge often leads to the generation of inaccurate or outdated content, particularly in domain-specific tasks. Retrieval Augmented Generation (RAG) has emerged as a promising approach to address this limitation by incorporating external knowledge without necessitating re-training. While RAG enhances the accuracy of LLM-generated content, effectively retrieving external knowledge remains a challenge due to potential noise and computational costs. To address this, traditional information retrieval systems adopt two-stage approaches, utilizing efficient retrievers followed by reranking mechanisms. Recently, transformer-based architectures, including BERT and T5 models, have shown promise as effective rerankers. However, such models have limited context size and only perform single-granularity ranking at a time, hindering their effectiveness and efficiency. In this paper, we first explore the existing rerankers such as RankT5 and RFiD, highlighting challenges in multi-granularity ranking. Subsequently, we introduce PFiD (Passage Fusion-in-Decoder), a simple yet efficient approach aimed at effectively ranking both document and passage simultaneously. Through empirical evaluation, we demonstrate the efficacy of PFiD in improving effectiveness and efficiency, offering a promising direction for further research in this domain.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Despite their remarkable capabilities and growth, Large Language Models (LLMs) <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref> still tend to generate factually incorrect or outdated content, as they rely solely on their parametric knowledge, especially in domain-specific or knowledge-intensive tasks <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>. Retrieval Augmented Generation (RAG) approaches <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10,</ref><ref type="bibr" target="#b10">11]</ref> have gained significant attention; they improve the quality of LLM-generated output by grounding it in external knowledge that supplements the LLMs' parametric knowledge, without having to re-train the LLMs. RAG leverages a powerful information retrieval model designed to search large datasets or knowledge bases. The retrieved information is then incorporated into LLMs, enabling them to generate more accurate and contextually relevant content. By incorporating external knowledge, RAG can effectively reduce the problem of generating factually incorrect or outdated content in LLMs <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>.</p><p>However, current RAG frameworks face major challenges in the effectiveness and efficiency of their information retrieval systems. First, LLMs tend to generate inaccurate responses given distracting (or noisy) contexts, so the performance of retrieval models has a significant impact on the quality of RAG's responses <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b10">11]</ref>. 
Second, the retrieval component of RAG requires searching through large-scale knowledge bases or the web, which can be computationally expensive and slow <ref type="bibr" target="#b10">[11]</ref>. Due to the above challenges, existing retrieval systems adopt two-stage approaches: an efficient first-stage retriever such as BM25 <ref type="bibr" target="#b15">[16]</ref> or DPR <ref type="bibr" target="#b16">[17]</ref> retrieves a set of documents from a larger dataset, and a second-stage reranker then reranks the retrieved documents for precise ranking. Recently, with the advent of transformer-based models such as BERT <ref type="bibr" target="#b17">[18]</ref> and T5 <ref type="bibr" target="#b18">[19]</ref>, more architectures including bi-encoder <ref type="bibr" target="#b16">[17]</ref>, cross-encoder <ref type="bibr" target="#b19">[20]</ref>, encoder-decoder <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref>, and decoder-only models <ref type="bibr" target="#b22">[23]</ref> have gradually shown their effectiveness as rerankers. However, these models have limited context size and only perform single-granularity ranking during inference, which hinders their effectiveness and efficiency in real-world RAG scenarios.</p><p>To this end, in this paper, we focus on the multi-granularity ranking task, which ranks both document and passage simultaneously. Specifically, we first investigate single-passage cross-encoder models such as MonoT5 <ref type="bibr" target="#b21">[22]</ref> and RankT5 <ref type="bibr" target="#b20">[21]</ref>. These models achieve superior performance across various ranking tasks, but due to the input token constraint, their efficiency is limited in real-world RAG scenarios. Next, we examine multi-passage cross-encoders, such as FiD <ref type="bibr" target="#b8">[9]</ref> and RFiD <ref type="bibr" target="#b23">[24]</ref>. 
These models alleviate the input token limit by processing multiple passages, but they directly use the decoder's cross-attention scores, which are learned only implicitly, as passage relevance, and thus struggle to distinguish relative differences between passages. We then propose PFiD (Passage Fusion-in-Decoder), a simple and effective model for multi-granularity ranking. PFiD extends the FiD model by generating a document-level relevance token, enabling both document retrieval and passage ranking. Furthermore, PFiD adopts an inter-passage attention mechanism to learn relative passage relevance explicitly, using the special tokens at the beginning of the input text to represent the entire context.</p><p>Experiments on the MIRACL passage ranking dataset <ref type="bibr" target="#b24">[25]</ref> demonstrate that PFiD improves effectiveness and efficiency compared to existing approaches, especially in RAG scenarios.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Task definition</head><p>Given a user query 𝑞 and a document (or passage) corpus 𝐶 = {𝐷1, 𝐷2, ..., 𝐷𝑛}, the goal of document retrieval is to find the 𝑘 documents that are most relevant to the query 𝑞. In our multi-granularity ranking setting, which consists of document retrieval and passage ranking tasks, the document retrieval task is to perform reranking on the BM25-retrieved top-𝑘 documents. While traditional passage ranking tasks typically involve ranking over the entire passage collection, in this paper the passage ranking task focuses solely on ranking passages within the retrieved document itself, which aligns more closely with real-world RAG scenarios and is thus more feasible.</p></div>
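The two-stage setting above can be sketched in code. The function names, toy corpus, and word-overlap scorer below are illustrative assumptions for exposition only, not part of the paper:

```python
# Sketch of the multi-granularity ranking pipeline: a first-stage retriever
# supplies top-k documents, a reranker reorders them, and passages are then
# ranked only *within* each retrieved document.

def rerank_documents(query, bm25_top_k, score_fn):
    """Reorder first-stage retrieved documents by a relevance score."""
    return sorted(bm25_top_k, key=lambda d: score_fn(query, d), reverse=True)

def rank_passages(query, document, score_fn):
    """Rank only the passages inside one retrieved document."""
    return sorted(document["passages"], key=lambda p: score_fn(query, p), reverse=True)

# Toy corpus and word-overlap scorer, purely for illustration.
corpus = [
    {"id": "D1", "passages": ["paris is the capital of france", "france is in europe"]},
    {"id": "D2", "passages": ["berlin is the capital of germany"]},
]
overlap = lambda q, text: len(set(q.split()) & set(text.split()))
doc_score = lambda q, d: max(overlap(q, p) for p in d["passages"])

query = "capital of france"
docs = rerank_documents(query, corpus, doc_score)
passages = rank_passages(query, docs[0], overlap)
print(docs[0]["id"], passages[0])
```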
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Ranking models</head><p>Pre-trained Language Models (PLMs) are currently the most effective ranking models, and can be categorized into two types: bi-encoders and cross-encoders. Bi-encoders encode a query and a passage separately to obtain semantic representations <ref type="bibr" target="#b16">[17]</ref>, emerging as powerful first-stage retrievers by pre-computing the passage representations offline. In contrast, cross-encoders take the concatenation of the query and a passage and perform query-passage interactions <ref type="bibr" target="#b19">[20]</ref>; they have been conceived as second-stage rerankers, designed to explicitly refine the results provided by the first-stage retrieval. In this paper, for brevity, we also refer to other PLMs that perform query-passage interactions simultaneously, such as encoder-only <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b19">20]</ref>, decoder-only <ref type="bibr" target="#b22">[23]</ref>, and encoder-decoder <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b21">22]</ref> models, as cross-encoders.</p><p>There are several PLM-based cross-encoders, including sequence-to-sequence language models such as MonoT5 <ref type="bibr" target="#b21">[22]</ref> and RankT5 <ref type="bibr" target="#b20">[21]</ref> for ranking tasks, as well as multi-passage reader models like FiD <ref type="bibr" target="#b8">[9]</ref> and RFiD <ref type="bibr" target="#b23">[24]</ref> for RAG tasks, which have demonstrated superior effectiveness.</p><p>MonoT5. MonoT5 <ref type="bibr" target="#b21">[22]</ref> is the first work to define a ranking task as a text generation task by leveraging the T5 <ref type="bibr" target="#b18">[19]</ref> encoder-decoder model. A query-document pair is concatenated into an input sequence Query: 𝑞 Document: 𝐷𝑛 Relevant:, and true and false are used as target tokens to represent their relevance. The model is then fine-tuned on a text generation task. After training, the ranking scores are derived from the logits of the true token, based on a softmax applied only over the logits of the true and false tokens.</p><p>RankT5. Following MonoT5 <ref type="bibr" target="#b21">[22]</ref>, the input sequence is similar except that RankT5 does not include the Relevant: postfix. The model then uses &lt;extra_id_10&gt; as the target token to learn an unnormalized ranking score, and is trained with a list-wise ranking loss directly, instead of the text generation loss used in MonoT5 <ref type="bibr" target="#b21">[22]</ref>. However, these models cannot be directly used for long document retrieval due to the maximum input length constraint shared by most PLMs, which hinders their effectiveness in the document retrieval task.</p><p>FiD. The FiD model further extends the T5 <ref type="bibr" target="#b18">[19]</ref> encoder-decoder model, taking multiple 𝑘 passages as input, encoding them separately, and then feeding the concatenated 𝑘 encoder hidden states into a T5 decoder to generate the answer. Relevance scores for passages are computed using cross-attention scores, which entails averaging the attention score across all tokens within the passage and all layers and heads within the decoder <ref type="bibr" target="#b25">[26]</ref>.</p><p>RFiD. FiD <ref type="bibr" target="#b8">[9]</ref> treats all passages equally within its encoders, solely depending on the cross-attention mechanism to establish correlations between the decoder and encoders, which may identify an incorrect answer by referring to spurious passages. RFiD <ref type="bibr" target="#b23">[24]</ref> improves FiD by identifying potential answer-containing passages (or rationales) among the candidates and guiding the decoder with the identified rationales. Afterward, cross-attention scores are directly regarded as passage relevance scores, the same as in <ref type="bibr" target="#b8">[9]</ref>. However, even with the rationale, the cross-attention mechanism still struggles to distinguish relative differences between passages, as it is implicitly guided by a rationale classifier trained solely on a point-wise binary classification loss.</p></div>
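The MonoT5-style scoring step described above, a softmax restricted to the true and false logits, can be sketched as follows; the function name and the example logit values are illustrative assumptions:

```python
import math

# Sketch (not the authors' code) of MonoT5-style scoring: the model emits
# logits for the "true" and "false" target tokens, and the ranking score is
# the softmax probability of "true" computed over just those two logits.

def monot5_score(true_logit: float, false_logit: float) -> float:
    m = max(true_logit, false_logit)       # subtract max to stabilize exp()
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)

# A more relevant pair yields a larger "true" logit, hence a higher score.
assert monot5_score(2.0, -1.0) > monot5_score(-1.0, 2.0)
print(monot5_score(2.0, -1.0))
```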
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>In this section, we briefly describe Passage Fusion-in-Decoder (PFiD), a simple but effective model for multi-granularity ranking. PFiD adopts the FiD <ref type="bibr" target="#b8">[9]</ref> architecture as its base model and further extends FiD by utilizing true and false as target tokens to model document-level relevance, enabling multi-granularity ranking simultaneously. Additionally, PFiD integrates inter-passage attention to learn relative passage relevance explicitly, which is similar to the list-wise training objective of RankT5 <ref type="bibr" target="#b20">[21]</ref>.</p><p>Fusion-in-Decoder for Document Retrieval. Formally, given a question 𝑞 and a set of 𝑘 passages within the document 𝐷𝑛 = {𝑃 𝑛 1 , 𝑃 𝑛 2 , ..., 𝑃 𝑛 𝑘 }, the FiD encoder outputs the 𝑘-th passage embeddings H k ∈ R 𝐿×𝑑 , where 𝐿 denotes the maximum token length and 𝑑 denotes the dimension of hidden states; these are then concatenated as the input of the fusion decoder [H1, H2, ..., H k ].</p><formula xml:id="formula_0">H k = FiD-Encoder(𝑞 + 𝑃 𝑛 𝑘 )<label>(1)</label></formula><p>The FiD decoder utilizes [H1, H2, ..., H k ] to generate the target token 𝑇 = true or false. Therefore, the loss function can be defined as follows:</p><formula xml:id="formula_1">ℒ𝐹 𝑖𝐷 = − 𝑇 ∑︁ 𝑖=1 log 𝑝(𝑦𝑖|𝑦1, 𝑦2, ..., 𝑦𝑖−1, [H1, H2, ..., H k ]) (2)</formula><p>Inter-passage Attention. Previous work <ref type="bibr" target="#b23">[24]</ref> tackled the issue of spurious passages by employing a binary classifier on the first token's encoder hidden states H k,1 to determine whether the passage is a rationale passage for the query. It then guides the decoder by appending additional embeddings to the end of the encoder's hidden states [H1, H2, ..., H k , H k+1 ], where H k+1 ∈ R 2×𝑑 is a trainable rationale embedding. 
However, as Table <ref type="table">2</ref> shows, it underperforms in passage ranking tasks by a large margin, as it does not explicitly model relative passage relevance.</p><p>To mitigate this, we instead utilize inter-passage attention to model interactions between passages explicitly. PFiD builds a set of input sequences by stacking the first-token hidden states of each query-passage pair as B = [H1,1, H2,1, ..., H k,1 ], where H𝑖,𝑗 denotes the 𝑗-th token embedding of the 𝑖-th passage. In a standard cross-encoder, the first token of the encoder aggregates query-passage information to compute a relevance score. We further use this token to capture the relative semantics via a self-attention mechanism. Inspired by <ref type="bibr" target="#b26">[27]</ref>, we use a single-layer transformer to model relative passage relevance as follows:</p><formula xml:id="formula_2">̃B = softmax(QK ⊺ / √ 𝑑)V, where Q = BW Q , K = BW K , V = BW V (3)</formula><p>in which the matrices W Q , W K , W V ∈ R 𝑑×𝑑 are learnable parameters. The information from different passages is fused and exchanged via the self-attention mechanism. The training loss used for inter-passage attention can be defined as follows:</p><formula xml:id="formula_3">𝑝 𝑘 = softmax( ̃B 𝑘 W𝐵) ∈ R 2 , ℒ𝑝𝑎𝑠𝑠𝑎𝑔𝑒 = −(𝑦 log(𝑝 𝑘 ) + (1 − 𝑦) log(1 − 𝑝 𝑘 ))<label>(4)</label></formula><p>where 𝑦 is the passage relevance label, and the overall training objective of PFiD is:</p><formula xml:id="formula_4">ℒ 𝑎𝑙𝑙 = ℒ𝐹 𝑖𝐷 + 𝜆ℒ𝑝𝑎𝑠𝑠𝑎𝑔𝑒,<label>(5)</label></formula><p>where 𝜆 is a hyperparameter balancing the two losses.</p></div>
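A minimal sketch of the inter-passage attention in Eq. (3) and the relevance head in Eq. (4): the first-token embedding of each query-passage pair is stacked into B (k × d), passed through one self-attention layer, and each refined row is projected to a two-way relevance distribution. The random weights and toy dimensions stand in for learned parameters and are assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 8                                 # passages per document, hidden size
B = rng.normal(size=(k, d))                 # stacked first-token states H_{i,1}

W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
W_B = rng.normal(size=(d, 2))               # relevance projection head

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = B @ W_Q, B @ W_K, B @ W_V
B_tilde = softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V   # Eq. (3): fuse across passages
p = softmax(B_tilde @ W_B, axis=-1)                    # Eq. (4): per-passage relevance

print(p.shape)                              # one 2-way distribution per passage
assert np.allclose(p.sum(axis=-1), 1.0)
```

The self-attention step lets each passage's score depend on the other passages in the document, which is what makes the relevance relative rather than point-wise.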
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets</head><p>We use the MIRACL <ref type="bibr" target="#b24">[25]</ref> passage ranking dataset for our experiments. The MIRACL <ref type="bibr" target="#b24">[25]</ref> dataset is a large-scale, open-domain, human-generated multi-document ranking dataset similar to MS MARCO <ref type="bibr" target="#b27">[28]</ref>, but MIRACL has the advantage of providing a segmented document collection, enabling both document retrieval and passage ranking. 1 For the document retrieval task, we construct the document retrieval dataset by regarding a document with at least one positive passage as a positive document. Table <ref type="table" target="#tab_1">1</ref> shows the statistics of the datasets. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Baselines</head><p>We compare PFiD against the following three types of ranking baselines. The first is Single-Passage Cross-encoder (SPC) baselines, including MonoT5 <ref type="bibr" target="#b21">[22]</ref> and RankT5 <ref type="bibr" target="#b20">[21]</ref>.</p><p>Due to the constraint of input tokens, we only take the first 𝑘 tokens in the document retrieval task. An alternative approach is to score each passage independently and take the passage with the highest score as the representative for ranking the document, or to directly perform retrieval over the segmented passages. However, we omit these approaches, as the former is inefficient and the latter does not scale to real-world RAG scenarios. The model is then trained list-wise with randomly sampled negatives from the entire passage set; The second is Multi-Passage Cross-encoder (MPC) baselines, including FiD <ref type="bibr" target="#b8">[9]</ref> and RFiD <ref type="bibr" target="#b23">[24]</ref>. For comparison in our experimental setting, both FiD and RFiD models are trained with true or false as the target token, enabling both document retrieval and passage ranking. All SPC and MPC baselines in this experiment are initialized with the T5-base model; The third is the most frequently employed lexical ranker, BM25 <ref type="bibr" target="#b15">[16]</ref>. We use the Elasticsearch engine with the default parameters 𝑘1 = 1.2 and 𝑏 = 0.75.</p><p>1 MS MARCO also provides a segmented document collection, but the segmented corpus does not align with the passages in passage ranking tasks. </p></div>
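For reference, BM25 scoring with the stated defaults 𝑘1 = 1.2 and 𝑏 = 0.75 can be sketched as below. This is a generic textbook BM25 implementation, not Elasticsearch's, whose version differs in details such as the exact IDF formula; the toy documents and query are illustrative:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized doc against a tokenized query over a small corpus."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)   # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        # Term frequency saturation (k1) and length normalization (b).
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["paris", "capital", "france"],
        ["berlin", "capital", "germany"],
        ["france", "europe"]]
q = ["capital", "france"]
scores = [bm25_score(q, d, docs) for d in docs]
print(scores)
```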
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Experimental Details</head><p>We adopt T5-base <ref type="bibr" target="#b18">[19]</ref> as our base model, using Adam <ref type="bibr" target="#b28">[29]</ref> with a learning rate of 10 −4 and a dropout rate of 0.1. For both training and inference, we use the top-100 passages and truncate them to a maximum length of 200 tokens. The hyperparameter 𝜆 is set to 0.5. For the document retrieval task, we perform ranking on the BM25 top-100 retrieved documents, whereas passage ranking ranks the passages within the given positive document. We also conduct experiments on real-world RAG scenarios, considering both document retrieval and passage ranking simultaneously. We use nDCG <ref type="bibr" target="#b29">[30]</ref>, Recall, and MRR scores to evaluate effectiveness. All experiments are conducted on a single NVIDIA A100 GPU (40GB). In this work, we do not consider other training approaches, including data augmentation, knowledge distillation, or negative sampling strategies, as delving into their effects falls outside the scope of our objectives.</p></div>
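The ranking metrics named above can be sketched under their usual binary-relevance definitions; the cutoffs and toy label list below are illustrative, not taken from the paper:

```python
import math

def mrr_at_k(ranked_labels, k=10):
    """Reciprocal rank of the first relevant item within the top k."""
    for i, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked_labels, k=10):
    """Binary-gain nDCG: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(rel / math.log2(i + 1)
              for i, rel in enumerate(ranked_labels[:k], start=1))
    ideal = sorted(ranked_labels, reverse=True)
    idcg = sum(rel / math.log2(i + 1)
               for i, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

labels = [0, 1, 0, 1]          # relevance of ranked results, top to bottom
print(mrr_at_k(labels))        # first relevant item at rank 2 -> 0.5
print(ndcg_at_k(labels, k=5))
```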
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Analysis</head><p>Retrieval and Ranking. Table <ref type="table">2</ref> presents our evaluation results on the document retrieval and passage ranking tasks.</p><p>The key observations are as follows: (i) MPC significantly outperforms SPC in the document retrieval task by aggregating multiple 𝑘 passages, alleviating the limited context size of SPC. In particular, one can see that PFiD outperforms RFiD by a large margin on both the document ranking and passage ranking tasks. This indicates that by leveraging passage-wise context to guide the decoder, we can better identify relative passage relevance. Note that compared with the existing SPC baselines, our method achieves ranking efficiency by removing the need for a separate ranking pass per granularity: PFiD directly consumes the entire document and scores the relevance of all passages and the document simultaneously. (ii) RFiD, which implicitly guides the decoder with rationale embeddings, shows improvement over FiD by a large margin; however, it still performs worse than BM25. This suggests that implicit guidance indeed benefits the model's ranking ability to some extent. However, when ranking</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The evaluation results of different baselines. For document retrieval, we rank the top 100 documents retrieved by BM25, while the passage ranking task ranks the passages within the retrieved document. 𝑁 denotes the number of documents to rank, whereas 𝑃 denotes the number of passages in the document. The best performances are marked with †. Latency indicates the total inference time from document retrieval to passage ranking, measured by averaging the time taken for each query with a single thread and a single batch on the GPU. Columns: Model, Category, Document Retrieval (top-𝑘, MRR@10, Recall@5, Recall@10), Passage Ranking (MRR@10, nDCG@5, nDCG@10), Complexity, Latency (s).</p><p>We first retrieve # documents from the candidates, and rerank # passages within the retrieved documents. Figure <ref type="figure" target="#fig_0">1</ref> represents the result of our evaluation. Notably, from Table <ref type="table">2</ref> we observed that MPC outperforms SPC in document retrieval tasks; however, the performance drastically drops in this setting, as cross-attention scores from the decoder are indistinguishable across passages from multiple documents. Additionally, despite RankT5 reaching the best effectiveness on the passage ranking task, it did not exhibit any improvement over our method in real-world RAG scenarios, suggesting the importance of multi-granularity ranking. In contrast, PFiD consistently outperforms all baselines by leveraging the complementary nature of SPC and MPC. PFiD retrieves documents and ranks passages more efficiently, and captures the relative semantic correlation between different passages, leading to superior performance.</p><p>Cross-attention vs PFiD. As discussed above, PFiD has the advantage of identifying relevant passages compared to previous models like RFiD, since it explicitly models relative passage relevance. We investigate the effects of the cross-attention scores of the decoder and our passage ranking scores on the passage ranking task. Figure <ref type="figure" target="#fig_1">2</ref> illustrates the distribution of the rank of positive passages. As depicted in Figure <ref type="figure" target="#fig_1">2</ref>, PFiD is more strongly correlated with passage relevance than cross-attention scores, suggesting that PFiD focuses more on positive passages by explicitly learning relative passage relevance. Our experimental results show that this enhanced ability to identify relevant passages contributes to the overall performance improvement. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Passage ranking results on the real-world RAG scenarios. We first retrieve # of documents and rerank # passages within the retrieved documents.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Distribution of the rank of positive passages.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Statistics of Datasets.</figDesc><table><row><cell>Task</cell><cell cols="3"># train # dev # avg judgement</cell><cell># corpus</cell></row><row><cell>Document Retrieval</cell><cell>22,548</cell><cell>6,404</cell><cell>2.22</cell><cell>5,758,285</cell></row><row><cell>Passage Ranking</cell><cell>29,416</cell><cell>8,350</cell><cell>2.75</cell><cell>32,893,221</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title/>
		<author>
			<persName><surname>Openai</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">Gpt-4 technical report</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bikel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">B</forename></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Anil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lepikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shakeri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Taropa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bailey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">C</forename></persName>
		</author>
		<idno type="arXiv">arXiv:2305.10403</idno>
		<title level="m">Palm 2 technical report</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>McCandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">When not to trust language models: Investigating effectiveness of parametric and nonparametric memories</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mallen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Asai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Khashabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.546</idno>
		<ptr target="https://aclanthology.org/2023.acl-long.546" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="9802" to="9822" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">On faithfulness and factuality in abstractive summarization</title>
		<author>
			<persName><forename type="first">J</forename><surname>Maynez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bohnet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>McDonald</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.173</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.173" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1906" to="1919" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2311.05232</idno>
		<title level="m">A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Piktus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Küttler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rocktäschel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kiela</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.11401</idno>
		<title level="m">Retrieval-augmented generation for knowledge-intensive NLP tasks</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Leveraging passage retrieval with generative models for open domain question answering</title>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.eacl-main.74</idno>
		<ptr target="https://aclanthology.org/2021.eacl-main.74" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Tsarfaty</surname></persName>
		</editor>
		<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="874" to="880" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Atlas: Few-shot learning with retrieval augmented language models</title>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lomeli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hosseini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Petroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dwivedi-Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v24/23-0037.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="page" from="1" to="43" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.10997</idno>
		<title level="m">Retrieval-augmented generation for large language models: A survey</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Rethinking with retrieval: Faithful large language model inference</title>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.00303</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bonifacio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ogundepo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamalloo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alfonso-Hermelo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rezagholizadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.11361</idno>
		<title level="m">NoMIRACL: Knowing when you don&apos;t know for robust multilingual retrieval-augmented generation</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Large language models can be easily distracted by irrelevant context</title>
		<author>
			<persName><forename type="first">F</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Scales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Dohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">H</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Schärli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
		<ptr target="https://proceedings.mlr.press/v202/shi23a.html" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th International Conference on Machine Learning</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Krause</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Brunskill</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Engelhardt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sabato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Scarlett</surname></persName>
		</editor>
		<meeting>the 40th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">202</biblScope>
			<biblScope unit="page" from="31210" to="31227" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Asai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajishirzi</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.11511</idno>
		<title level="m">Self-rag: Learning to retrieve, generate, and critique through self-reflection</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The probabilistic relevance framework: BM25 and beyond</title>
		<author>
			<persName><forename type="first">S</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zaragoza</surname></persName>
		</author>
		<idno type="DOI">10.1561/1500000019</idno>
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="333" to="389" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Karpukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Oğuz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Min</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Edunov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yih</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.04906</idno>
		<title level="m">Dense passage retrieval for open-domain question answering</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.10683</idno>
		<title level="m">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.04085</idno>
		<title level="m">Passage re-ranking with BERT</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jagerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bendersky</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.10634</idno>
		<title level="m">RankT5: Fine-tuning T5 for text ranking with ranking losses</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.06713</idno>
		<title level="m">Document ranking with a pretrained sequence-to-sequence model</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.08319</idno>
		<title level="m">Fine-tuning LLaMA for multi-stage text retrieval</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.17041</idno>
		<title level="m">RFiD: Towards rational fusion-in-decoder for open-domain question answering</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Ogundepo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamalloo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Alfonso-Hermelo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rezagholizadeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="1114" to="1131" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2012.04584</idno>
		<title level="m">Distilling knowledge from reader to retriever for question answering</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Longtriever: a pre-trained long text encoder for dense document retrieval</title>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.emnlp-main.223</idno>
		<ptr target="https://aclanthology.org/2023.emnlp-main.223" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="3655" to="3665" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Bajaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Craswell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Majumder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>McNamara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rosenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Stoica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tiwary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1611.09268</idno>
		<title level="m">MS MARCO: A human generated machine reading comprehension dataset</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<title level="m">Adam: A method for stochastic optimization</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Cumulated gain-based evaluation of IR techniques</title>
		<author>
			<persName><forename type="first">K</forename><surname>Järvelin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kekäläinen</surname></persName>
		</author>
		<idno type="DOI">10.1145/582415.582418</idno>
		<ptr target="https://doi.org/10.1145/582415.582418" />
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="422" to="446" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
