<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptive Re-Ranking as an Information-Seeking Agent</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sean MacAvaney</string-name>
          <email>sean.macavaney@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <email>nicola.tonellotto@unipi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <email>craig.macdonald@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Glasgow</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
<p>Re-ranking systems are typically limited by the recall of the initial retrieval function. A recent work proposed adaptive re-ranking, which modifies the re-ranking loop to progressively prioritise documents likely to receive high scores based on the highest scoring ones thus far. The original work framed this process as an incarnation of the well-established clustering hypothesis. In this work, we argue that the approach can also be framed as an information-seeking agent. From this perspective, we explore several variations of the graph-based adaptive re-ranking algorithm and find that there is substantial room for improvement by modifying the agent. However, the agents that we explore are more sensitive to the new parameters they introduce than the simple-yet-effective approach proposed in the original adaptive re-ranking work.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural re-ranking</kwd>
        <kwd>Clustering hypothesis</kwd>
        <kwd>Retrieval agents</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        the incremental nature of re-ranking documents, typically done in batches to make the best use
of specialised hardware like GPUs and TPUs. After each batch, information is gained about
which documents get the highest relevance score, which can be leveraged to make a more
informed decision about which documents to rank next (a process the authors call Adaptive
Re-Ranking). By making use of the clustering hypothesis [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ] and pre-computed similarity
scores between documents (via a corpus graph), documents that are similar to the highest-ranked
documents can be identified and prioritised, thereby allowing documents missed in the first
stage to be retrieved. This particular formulation of Adaptive Re-Ranking is referred to as
Graph-based Adaptive Re-ranking (G A R ). The variant of G A R proposed by MacAvaney et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
used a simple strategy for balancing (a) the scoring of batches of existing candidate documents
(identified by the first stage retriever) and (b) discovering new candidates (by scoring documents
similar to the top ranked results so far). Their approach simply alternates between (a) and (b),
batch by batch. We refer to this strategy as A l t e r n a t e .
      </p>
      <p>We argue that the approach for re-ranking can be thought of as an information-seeking agent
and explained in terms of user browsing strategies. For instance, an agent using the traditional
re-ranking approach simply scans down a list of search results looking for items that are
relevant to its information need. Meanwhile, an agent using the Alternate adaptive approach
alternates between exploring documents from the search results and documents similar to the
most relevant ones it has found so far. We observe that this may not be the process by which a
reasonable human user would traverse search results; for instance, a user may only look at
documents similar to those that are sufficiently relevant. Therefore, in this work we explore a
variety of agent-based strategies for adaptive re-ranking, inspired by behaviours that a human
may take when seeking information.</p>
      <p>In summary, we explore new agent-based strategies for the recently-proposed Graph-based
Adaptive Re-Ranking approach. We establish an upper-bound performance for smarter agents
and explore several other strategies. In doing so, we demonstrate the potential breadth of
possible strategies that can be applied to GAR.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Cascading architectures are commonly deployed in information retrieval. In principle, these
are driven by an efficiency-effectiveness tradeoff – the most effective models cannot be applied
to all documents in the index in a timely manner [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Hence, cheap ranking stages (with high
recall) are typically combined with more expensive ranking stages (to obtain higher precision).
Ranking cascades are common in web search using learning-to-rank, where an initial retrieval
stage uses a cheap ranking function, commonly BM25, to identify a candidate set of documents.
This is followed by a feature computation stage, and then application of the learned model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Limiting the number of documents retrieved in the first stage – which we call the budget in this
paper – has the effect of reducing recall, but enhances overall efficiency [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Later, Tonellotto
et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] proposed a selective approach that could adjust the budget on a per-query basis for
efficient &amp; effective retrieval.
      </p>
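      <p>The cascade described above can be sketched as follows; the <code>overlap</code> and <code>length_score</code> functions below are hypothetical stand-ins for the cheap first-stage and expensive second-stage scorers, used only to make the sketch runnable.</p>

```python
from typing import Callable, Dict, List

def cascade_rank(query: str,
                 docs: Dict[str, str],
                 first_stage: Callable[[str, str], float],
                 second_stage: Callable[[str, str], float],
                 budget: int) -> List[str]:
    # Cheap first stage selects a candidate set of size `budget`;
    # the expensive second stage re-ranks only those candidates.
    candidates = sorted(docs, key=lambda d: first_stage(query, docs[d]),
                        reverse=True)[:budget]
    return sorted(candidates, key=lambda d: second_stage(query, docs[d]),
                  reverse=True)

# Hypothetical scorers: term overlap as the cheap stage, and a dummy
# "expensive" stage that favours longer documents.
def overlap(query: str, doc: str) -> float:
    return float(len(set(query.split()) & set(doc.split())))

def length_score(query: str, doc: str) -> float:
    return float(len(doc.split()))

docs = {"d1": "cheap ranking stage",
        "d2": "expensive neural ranking model stage",
        "d3": "unrelated text"}
ranking = cascade_rank("ranking stage", docs, overlap, length_score, budget=2)
# "d3" is pruned by the first stage and never reaches the expensive scorer.
```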
      <p>
        More recently, multi-stage ranking has been commonly used for applying neural re-ranking
models. For instance, cross-encoder neural models [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] must be applied jointly on the query
and the document to compute a relevance score. Since the cross-encoder scoring function can
be computationally expensive, re-ranking is often limited to a predefined maximum number of
documents that the system is willing to re-rank for each query (i.e., a re-ranking budget, such
as 1000). Hence, the application of a first (so-called sparse) retriever to obtain a candidate
set is necessary for such models. However, this limits the recall of the candidate set, in that
relevant documents not matching the query terms (and which may obtain a high score from the
re-ranking cross-encoder) will not be retrieved.
      </p>
      <p>To address this gap, several recent neural ranking approaches have been proposed: (i)
dense retrieval models based on bi-encoders, and (ii) learned sparse models.</p>
      <p>
        In bi-encoder-based dense retrieval models [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7">4, 5, 6, 7</xref>
        ], the documents are pre-encoded into
embedded representations, and nearest neighbour lookup is performed based on an embedded
representation of the query [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. However, online nearest neighbour search is more expensive
than use of an inverted index-based sparse retrieval, hence inverted indexes are adopted for
ifrst-stage retrieval. Of note, ColBERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] uses a multi-stage dense retrieval architecture – an
approximate nearest neighbour search, followed by an exact neural scoring.
      </p>
      <p>Learned sparse models encompass the use of pre-trained language models (PLMs) on the document representation to
make a refined representation that can be indexed by existing sparse inverted indexing tools. For
instance, doc2query [18, 19] enriches the sparse document representation with possible queries
for which the document could be retrieved. DeepCT [20], uniCOIL [21, 22], DeepImpact [23, 24]
and SPLADE [25, 26] all use a PLM to generate a new sparse representation of the document.</p>
      <p>
        This work focuses on a recently-proposed complementary approach: Graph-Based Adaptive
Re-Ranking (GAR) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Rather than limiting the candidate pool of documents for re-ranking to
those from the first stage, GAR augments the re-ranking loop to identify documents that are
similar to the highest-scoring ones identified so far. In the following sections, we cover this
process in more detail (Section 3) and identify that this process can be seen as an information-seeking
agent (Section 4). We use this formulation to propose alternative adaptive re-ranking
agents and compare their effectiveness (Section 6).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Graph-based Adaptive Re-Ranking</title>
      <p>GAR is formulated as a re-ranking process. In a re-ranking process, the documents from an
initial candidate set, R0, are re-ranked by a scoring function, Score(), to obtain a final ranking
of R1 documents. The scoring function can be expensive to apply, for instance a cross-encoder
neural model such as monoT5. This means that efficiency constraints may limit the number
of applications of Score() that can be undertaken, which is called the re-ranking budget, c. In
practice, batches of documents are scored at the same time, making effective use of GPU or
TPU hardware.</p>
      <p>A typical non-adaptive re-ranking process is shown in Algorithm 1, where an initial ranking
(prioritised by rank) is scored in batches until the re-ranking budget is fully consumed.</p>
      <p>Algorithm 1 Non-Adaptive Re-Ranking
Input: Initial ranking R0, batch size b, budget c
Output: Re-Ranked pool R1
R1 ← ∅
do
 B ← Score(top b documents from R0, subject to c)
 R1 ← R1 ∪ B
 R0 ← R0 ⧵ B
while |R1| &lt; c</p>
      <p>In contrast, GAR proposed an adaptive approach to re-ranking, where highly scored
documents are used as seeds for nearest neighbour retrieval, rather than continuing down the ranking.
To perform this efficiently, a graph of document-document similarity (G) is created at index
time, which records the nearest documents for each given document. Since this graph is precomputed
and small, it can be used efficiently at ranking time to identify more documents to score.</p>
      <p>In particular, after each batch B of documents is scored, the nearest documents to those in
the batch are added to a “frontier” set of documents F. The frontier is prioritised by the scores
of the re-ranked documents, so the neighbours of the highest-scored documents are explored
first. The process then alternates between scoring the top documents from the initial ranking
R0 and scoring the top documents in the frontier F. Algorithm 2 shows this process.</p>
      <p>We call this Alternate, due to the manner in which it alternates between the initial ranking
R0 and the frontier F. In the following, we link GAR with intelligent agents, and show how
this framing allows further agent-based implementations.</p>
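      <p>As a concrete illustration, the alternating loop can be sketched as follows. This is a minimal sketch, not the authors' implementation: <code>score_batch</code> and <code>neighbours</code> are hypothetical stand-ins for the expensive scorer and the corpus-graph lookup, and the heap is one possible realisation of the score-prioritised frontier F.</p>

```python
import heapq
from typing import Callable, Dict, List, Tuple

def alternate_rerank(initial: List[str],
                     score_batch: Callable[[List[str]], Dict[str, float]],
                     neighbours: Callable[[str], List[str]],
                     batch_size: int,
                     budget: int) -> Dict[str, float]:
    # Alternate between scoring a batch from the initial ranking R0 and a
    # batch from the frontier F of corpus-graph neighbours, which is
    # prioritised by the score of the document each neighbour came from.
    r0 = list(initial)                        # remaining initial ranking
    frontier: List[Tuple[float, str]] = []    # min-heap of (-score, doc)
    r1: Dict[str, float] = {}                 # final scored pool
    use_initial = True
    while len(r1) < budget and (r0 or frontier):
        if use_initial and r0:
            batch, r0 = r0[:batch_size], r0[batch_size:]
        else:
            batch = []
            while frontier and len(batch) < batch_size:
                _, doc = heapq.heappop(frontier)
                if doc not in r1 and doc not in batch:
                    batch.append(doc)
        use_initial = not use_initial
        if not batch:
            continue
        batch_scores = score_batch(batch[: budget - len(r1)])
        r1.update(batch_scores)
        for doc, s in batch_scores.items():
            for n in neighbours(doc):          # expand the frontier
                if n not in r1:
                    heapq.heappush(frontier, (-s, n))
    return r1

# Toy run: "x" and "y" are graph neighbours absent from the initial ranking.
toy_scores = {"a": 0.9, "b": 0.2, "c": 0.1, "x": 0.8, "y": 0.3}
toy_graph = {"a": ["x"], "b": ["y"], "c": [], "x": [], "y": []}
result = alternate_rerank(["a", "b", "c"],
                          lambda batch: {d: toy_scores[d] for d in batch},
                          lambda d: toy_graph[d],
                          batch_size=1, budget=4)
# "c" is never scored: the budget went to the discovered neighbours instead.
```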
    </sec>
    <sec id="sec-4">
      <title>4. Re-Ranking Agents</title>
      <p>We argue that GAR re-ranking approaches can be interpreted as information-seeking agents,
which peruse retrieved documents, and decide how to act based on the relevance of the
documents found (such actions could be selecting more documents from the candidate set, or
selecting documents similar to those already scored highly). In doing so, they act like a user
deciding how to progress through a ranked list.</p>
      <p>Algorithm 2 Alternate Re-Ranking [8]
Input: Initial ranking R0, batch size b, budget c, corpus graph G
Output: Re-Ranked pool R1
R1 ← ∅; F ← ∅
do
 B ← Score(top b documents from R0, subject to c)
 R1 ← R1 ∪ B; R0 ← R0 ⧵ B
 F ← F ∪ (Neighbours(B, G) ⧵ R1)
 B ← Score(top b documents from F, subject to c)
 R1 ← R1 ∪ B; F ← F ⧵ B
 F ← F ∪ (Neighbours(B, G) ⧵ R1)
while |R1| &lt; c</p>
      <p>We treat existing re-ranking strategies as information-seeking agents, and in the following
we introduce new alternative agents. In all cases, we have the following: an environment, which
consists of an initial ranking of candidate documents, denoted as R0, a graph of document-document
similarities G, and the final retrieved documents list R1; and a set of available actions,
namely scoring documents from R0 and placing them in R1, and scoring documents from the frontier F and
placing them into R1. Some agents may also maintain additional state, such as a prioritised frontier F of
documents collected from the graph. We argue that such a fully-capable agent can be classified
as a model-based reflex agent within the classes of [27]. Indeed, a scoring model (e.g., monoT5)
estimates relevance, which is only partially observable. It can choose actions based on the state,
i.e., the current frontier.</p>
      <p>In the following, we list four possible agents based on the GAR formulation.</p>
      <sec id="sec-4-1">
        <title>4.1. TwoPhase</title>
        <p>A simple alternative to Alternate is to operate in two sequential phases. In the first phase,
all documents up to rank k &lt; c are scored. In the second phase, the remaining re-ranking
budget c − k is utilised by prioritising the neighbours of the highest-scoring documents. This
behaviour models a user who first investigates numerous pages of search results to find the most
relevant content, and then goes back to check for documents similar to the best ones identified.
Algorithm 3 shows this process. We consider two variations of this TwoPhase strategy: one
where the frontier continues to be updated based on the scores of the documents in the second
phase (TwoPhase-Refine), and one where the frontier only considers the first-phase documents
(TwoPhase-Fixed).</p>
        <p>Algorithm 3 TwoPhase Re-Ranking
Input: Initial ranking R0, batch size b, budget c, corpus graph G, first phase size k
Output: Re-Ranked pool R1
R1 ← ∅
do
 B ← Score(top b from R0, subject to c)
 R1 ← R1 ∪ B; R0 ← R0 ⧵ B
while |R1| &lt; k
F ← Neighbours(R1, G)
do
 B ← Score(top b from F, subject to c)
 R1 ← R1 ∪ B; F ← F ⧵ B
 F ← F ∪ (Neighbours(B, G) ⧵ R1) ▷ If “Refine” variant, otherwise skip this step
while |R1| &lt; c</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Threshold</title>
        <p>The Threshold agent only explores the corpus graph for sufficiently relevant documents. In
this setting, the initial ranking pool is updated to include the neighbours of scored documents
that exceed a relevance score threshold t. By adding these documents to the pool with infinite
priority, they are always scored before other documents in the pool. This agent is akin to a
searcher who, as they traverse a ranked list, investigates documents similar to ones that they
deem sufficiently relevant as they are encountered. Importantly, this agent considers the absolute
relevance score of documents, rather than the relative ordering of retrieved documents so far,
which is used by techniques like Alternate and TwoPhase. Algorithm 4 shows this process.</p>
        <p>Algorithm 4 Threshold Re-Ranking
Input: Initial ranking R0, batch size b, budget c, corpus graph G, threshold t
Output: Re-Ranked pool R1
R1 ← ∅
do
 B ← Score(top b from R0, subject to c)
 R1 ← R1 ∪ B; R0 ← R0 ⧵ B
 R0 ← R0 ∪ (Neighbours(documents from B that satisfy t, G) ⧵ R1), with priority ∞
while |R1| &lt; c</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Greedy</title>
        <p>We next explore a strategy that attempts to identify which pool of documents – the initial
ranking or the frontier – has recently given the most relevant document. We approach this
as a Greedy strategy, by always choosing the pool with the largest maximum score from the
most recent batch. This process is akin to a searcher who switches between searching through
a ranked list and the documents close to the best ones they found so far, based on the most
relevant document they identified recently from each pool. Algorithm 5 shows this process.</p>
        <p>Algorithm 5 Greedy Re-Ranking
Input: Initial ranking R0, batch size b, budget c, corpus graph G
Output: Re-Ranked pool R1
R1 ← ∅; F ← ∅; P ← R0
do
 B ← Score(top b documents from P, subject to c)
 R1 ← R1 ∪ B; P ← P ⧵ B
 F ← F ∪ (Neighbours(B, G) ⧵ R1)
 P ← the pool (R0 or F) whose most recent batch had the larger MaxScore(B)
while |R1| &lt; c</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Oracle</title>
        <p>Finally, to explore whether there is further room for improvement in adaptive re-ranking, we
apply an oracle strategy. This strategy makes use of the relevance assessments (qrels) to choose
which batch of documents to score. At each step, both the initial ranking and the frontier
are scored. Then, the algorithm proceeds by adding only the batch that improves the ranking
effectiveness (e.g., through a measure like nDCG) the most at that step. Naturally, this algorithm
is not useful in practice, given that it relies on ground-truth relevance knowledge.1 However, it
helps us establish whether there is further room for improvement in adaptive re-ranking agents.
Algorithm 6 shows this process.</p>
      </sec>
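      <p>To make the priority-queue mechanics of the Threshold agent concrete, the following is one plausible sketch (not the authors' implementation): <code>score</code> and <code>neighbours</code> are hypothetical stand-ins for the scoring model and the corpus-graph lookup, and "infinite priority" is realised with a negative-infinity heap key.</p>

```python
import heapq
from itertools import count
from typing import Callable, Dict, List

def threshold_rerank(initial: List[str],
                     score: Callable[[str], float],
                     neighbours: Callable[[str], List[str]],
                     budget: int,
                     t: float) -> Dict[str, float]:
    # Documents are scored in pool-priority order; neighbours of any
    # document whose score meets the threshold t are queued with
    # "infinite" priority, so they are scored before everything else.
    order = count()                # tie-breaker preserving insertion order
    pool = []                      # min-heap of (priority, order, doc)
    for rank, doc in enumerate(initial):
        heapq.heappush(pool, (float(rank), next(order), doc))
    r1: Dict[str, float] = {}
    while pool and len(r1) < budget:
        _, _, doc = heapq.heappop(pool)
        if doc in r1:
            continue
        s = score(doc)
        r1[doc] = s
        if s >= t:
            for n in neighbours(doc):
                if n not in r1:
                    heapq.heappush(pool, (float("-inf"), next(order), n))
    return r1

# Toy example: "a" passes the threshold, so its neighbour "x" jumps the queue.
graph = {"a": ["x"], "b": [], "c": [], "x": []}
scores = {"a": 0.9, "b": 0.5, "c": 0.4, "x": 0.7}
out = threshold_rerank(["a", "b", "c"], scores.__getitem__,
                       graph.__getitem__, budget=3, t=0.8)
```

<p>Note how, unlike Alternate, the decision to explore the graph here depends on an absolute score rather than relative rank.</p>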
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <p>We conduct experiments using various adaptive re-ranking agents to answer the following
research questions:
RQ1 Is there room for further improvement in adaptive re-ranking strategies?
RQ2 Can smarter adaptive re-ranking strategies outperform the Alternate approach?
RQ3 How sensitive are alternative adaptive re-ranking strategies to their parameters?</p>
      <p>We conduct experiments using the TREC Deep Learning 2019 and 2020 passage ranking
datasets [28, 29]. We use 2019 as our “development” data, leaving 2020 as a held-out test set.
For both datasets, we report nDCG and Recall@1000 (with a minimum relevance threshold
of 2) performance. We fix our study to a MonoT5 (base) scoring function [30], a re-ranking
budget c = 1000, and a batch size b = 16. We explore four initial ranking functions: BM25,
TCT-ColBERT [31, 32], Doc2Query [19], and SPLADE++ (CoCondenser-EnsembleDistil version) [26].
1Of course, a perfect agent would ignore the scoring function entirely and just return documents by their human
assessments to get a perfect ranking. We limit the usage of the assessments for this agent to the decision of which
batch to score, to help us identify whether better decisions about which batches to score can further improve
adaptive re-ranking effectiveness.</p>
      <p>Table 1: nDCG and Recall@1000 on TREC DL 2019 (dev) and TREC DL 2020 (test) for the BM25»MonoT5, TCT»MonoT5, D2Q»MonoT5 and SPLADE»MonoT5 pipelines, each with the Non-Adaptive, Oracle, Alternate, TwoPhase-Fixed, TwoPhase-Refine, Threshold and Greedy agents, using the GARBM25 and GARTCT corpus graphs. The best-performing (non-oracle) agent in each section is listed in bold. Significant differences compared to the Non-Adaptive, Oracle, and Alternate systems are marked, respectively (paired t-test, p &lt; 0.05).</p>
      <p>Following the original GAR paper, we use two corpus graphs, based on BM25 similarity and
TCT-ColBERT.</p>
      <p>As we will show in Section 6, TwoPhase and Threshold are rather sensitive to their parameters,
so the results we present use the parameters that yield the highest nDCG performance on TREC
DL 2019. For TwoPhase, we test first-stage cutoff k values ranging from 100 to 900 (by 100). For
Threshold, we first identify the distribution of scores using the MS MARCO dev (small) dataset,
re-ranking 100 BM25 results. We then pick threshold t values based on percentiles from this
distribution, from 0.022 to 0.20 (by 0.02).3</p>
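      <p>The exact mapping from percentiles of the score distribution to threshold values is not spelled out above; one plausible sketch, defining the top-p threshold as the ⌈p·n⌉-th highest observed score, is:</p>

```python
import math
from typing import Iterable, List

def percentile_thresholds(observed: Iterable[float],
                          top_fractions: Iterable[float]) -> List[float]:
    # For each fraction p (e.g. 0.02 = the top 2% of scores), return the
    # score threshold met or exceeded by roughly a fraction p of the
    # observed scores.
    desc = sorted(observed, reverse=True)
    n = len(desc)
    thresholds = []
    for p in top_fractions:
        k = max(1, math.ceil(p * n))  # how many scores fall in the top fraction
        thresholds.append(desc[k - 1])
    return thresholds

# Example: with scores 1..100, the top-2% threshold is the 2nd-highest score.
sample = list(range(1, 101))
ts = percentile_thresholds(sample, [0.02, 0.10, 0.20])
```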
    </sec>
    <sec id="sec-6">
      <title>6. Results and Analysis</title>
      <p>2Here, 0.02 represents the top 2% of scores.
3This range was chosen after an initial study of orders of magnitude from 0.001 to 0.3 identified it as the most
promising.</p>
      <p>We first explore whether there exists room for improvement by exploring alternative agents
(RQ1). In all cases, the Oracle agent, which has knowledge of the relevance assessments
when choosing which batches of documents to score, has a higher nDCG performance than
all other agents. The difference in performance can often be considerable; for instance, the
best-performing non-oracle agent over the TCT pipeline using a BM25 graph has an nDCG of
0.733, while the oracle achieves an nDCG of 0.793. Differences between non-oracle agents and
the Oracle are not always significant, however. We note that while nDCG is often improved
using either oracle technique, recall often suffers compared to alternative agents. Perhaps this
is unsurprising, given that the oracles aim to directly optimise nDCG (recall is only considered
insofar as it would improve the nDCG). Significant improvements compared to the Oracle are
found more frequently when measuring nDCG using the TCT corpus graph (GARTCT) than the BM25
graph (GARBM25).</p>
      <p>In summary, to answer RQ1, by exploring an oracle agent, we find that there is considerable
room for improvement in adaptive re-ranking. In particular, exploration of the semantic graph
(GARTCT) by non-oracle agents is likely sub-optimal and has room for considerable improvement.</p>
      <p>Next, we look to answer RQ2 and RQ3. We find each of the proposed alternative agents to
be the most effective in at least one measurement. In one case, an alternative agent significantly
outperforms the existing Alternate agent (Threshold when re-ranking BM25 results using
GARTCT). Further, there are 11 cases where alternative strategies outperform the non-adaptive baseline,
while Alternate does not. However, there are sometimes significant reductions in effectiveness
when using the alternative agents (18 cases, all of which are on the held-out DL20 data). There is
also no clear pattern in which strategy is generally most effective. These results answer RQ2:
while some strategies can outperform Alternate, we do not find a clearly superior strategy
among the agents we test.</p>
      <p>
        Given that significant reductions are present when using the held-out DL20 data, we dive
deeper into the performance of TwoPhase and Threshold by plotting their performance for the
parameters they introduce (TwoPhase adds the first stage cutoff k and Threshold adds the relevance
threshold t). Figure 1 presents the nDCG performance. We see that in all cases, performance
varies considerably as the parameters are adjusted. Further, the optimal parameters for DL19
are often different than the optimal parameters for DL20, suggesting that the alternative agents
are not very robust to parameter selection. This is in particular contrast with the
robustness of the Alternate approach to its parameters, shown by MacAvaney et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These
results answer RQ3: TwoPhase and Threshold introduce parameters that are more sensitive than
those present in the Alternate strategy.
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions &amp; Outlook</title>
      <p>In this work we investigated the recently-proposed Adaptive Re-Ranking process from the
perspective of information-seeking agents. We find that there is considerable room to improve
adaptive re-ranking by changing the agent strategy. We further proposed several alternative
strategies, and found that although they are promising, they are more sensitive to their
parameters than the simple A l t e r n a t e approach. This work sets the stage for future work that
investigates additional agent strategies, such as those inspired by multi-armed bandit models.
</p>
      <p>Figure 1: nDCG of the TwoPhase-Fixed, TwoPhase-Refine and Threshold agents as their introduced parameters vary (cutoff k from 100 to 900 for the TwoPhase variants; threshold t for Threshold), on DL19 and DL20 with the GARBM25 and GARTCT corpus graphs.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>Sean and Craig acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex,
Computationally- &amp; Data-Intensive Analytics.</p>
      <p>[18] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, Preprint: arXiv:1904.08375 (2019).
[19] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, Online preprint (2019).
[20] Z. Dai, J. Callan, Context-aware sentence/passage term importance estimation for first stage retrieval, Preprint: arXiv:1910.10687 (2019).
[21] L. Gao, Z. Dai, J. Callan, COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List, in: Proc. NAACL-HLT, 2021, pp. 3030–3042.
[22] J. Lin, X. Ma, A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques, Preprint: arXiv:2106.14807 (2021).
[23] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes, in: Proc. SIGIR, 2021.
[24] A. Mallia, J. Mackenzie, T. Suel, N. Tonellotto, Faster Learned Sparse Retrieval with Guided Traversal, in: Proc. SIGIR, 2022, p. 5.
[25] T. Formal, B. Piwowarski, S. Clinchant, SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, in: Proc. SIGIR, 2021, pp. 2288–2292.
[26] T. Formal, C. Lassance, B. Piwowarski, S. Clinchant, SPLADE v2: Sparse lexical and expansion model for information retrieval, Preprint (2021).
[27] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., Prentice Hall, 2010.
[28] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. Voorhees, Overview of the TREC 2019 deep learning track, in: TREC, 2019.
[29] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track, in: TREC, 2020.
[30] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Proc. EMNLP, 2020.
[31] S.-C. Lin, J.-H. Yang, J. Lin, Distilling dense representations for ranking using tightly-coupled teachers, Preprint: arXiv:2010.11386 (2020).
[32] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proc. RepL4NLP-2021 Workshop, 2021, pp. 163–173.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>Pretrained Transformers for Text Ranking: BERT and Beyond</article-title>
          ,
          <source>Synthesis Lectures on Human Language Technologies</source>
          , Morgan &amp; Claypool Publishers,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <article-title>Weakly supervised co-training of query rewriting andsemantic matching for e-commerce</article-title>
          ,
          <source>in: Proc. WSDM</source>
          ,
          <year>2019</year>
          , p.
          <fpage>402</fpage>
          -
          <lpage>410</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>Efficient &amp; effective selective query rewriting with efficiency predictions</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>495</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2004.04906. doi:10.48550/ARXIV.2004.04906
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-F.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          ,
          <source>in: Proceedings of ICLR</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Optimizing Dense Retrieval Model Training with Hard Negatives</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1503</fpage>
          -
          <lpage>1512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <article-title>Adaptive re-ranking with a corpus graph</article-title>
          ,
          <source>in: Proc. CIKM</source>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2208.08942. doi:10.1145/3511808.3557231.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jardine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          ,
          <article-title>The use of hierarchic clustering in information retrieval</article-title>
          ,
          <source>Information Storage and Retrieval</source>
          <volume>7</volume>
          (
          <year>1971</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <article-title>The cluster hypothesis in information retrieval</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>The cluster hypothesis revisited</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>Efficient query processing for scalable web search</article-title>
          ,
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>12</volume>
          (
          <year>2018</year>
          )
          <fpage>319</fpage>
          -
          <lpage>492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Learning to rank for information retrieval</article-title>
          ,
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>225</fpage>
          -
          <lpage>331</lpage>
          . URL: https://doi.org/10.1561/1500000016. doi:10.1561/1500000016.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L. T.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>The whens and hows of learning to rank for web search</article-title>
          ,
          <source>Information Retrieval</source>
          <volume>16</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>Efficient and effective retrieval using selective pruning</article-title>
          ,
          <source>in: Proc. WSDM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage Re-ranking with BERT</article-title>
          , arXiv preprint arXiv:1901.04085 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>IEEE Trans. Big Data</source>
          <volume>7</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>