Adaptive Re-Ranking as an Information-Seeking Agent

Sean MacAvaney¹, Nicola Tonellotto² and Craig Macdonald¹
¹ University of Glasgow, United Kingdom
² University of Pisa, Italy

First Workshop on Proactive and Agent-Supported Information Retrieval, 2022
sean.macavaney@glasgow.ac.uk (S. MacAvaney); nicola.tonellotto@unipi.it (N. Tonellotto); craig.macdonald@glasgow.ac.uk (C. Macdonald)

Abstract
Re-ranking systems are typically limited by the recall of the initial retrieval function. A recent work proposed adaptive re-ranking, which modifies the re-ranking loop to progressively prioritise documents likely to receive high scores, based on the highest-scoring ones thus far. The original work framed this process as an incarnation of the well-established clustering hypothesis. In this work, we argue that the approach can also be framed as an information-seeking agent. From this perspective, we explore several variations of the graph-based adaptive re-ranking algorithm and find that there is substantial room for improvement by modifying the agent. However, the agents that we explore are more sensitive to the new parameters they introduce than the simple-yet-effective approach proposed in the original adaptive re-ranking work.

Keywords
Neural re-ranking, Clustering hypothesis, Retrieval agents

1. Introduction

Neural techniques leveraging pre-trained language models (PLMs) have been shown to be effective at various information retrieval (IR) ranking tasks, such as question answering and ad-hoc document ranking [1]. The most effective deployment of PLMs in neural IR is through a cascading architecture, composed of a two-stage pipeline. The role of the first stage is to retrieve a candidate pool of documents from the collection using an inexpensive scoring function such as BM25, while the second stage re-ranks the documents in the first-stage sample using an expensive PLM. Typically, the first stage focuses on optimising the recall of the pool, i.e., the number of relevant documents in the pool, while the second stage optimises precision-oriented metrics such as nDCG@10.

A major limitation of this approach lies in the performance of the first stage. If the first stage fails to retrieve a relevant document, the second stage has no chance to re-rank it. Consequently, many techniques have been designed to resolve this first-stage “recall” problem, such as query rewriting [2, 3] and dense retrieval [4, 5, 6, 7]. Recently, MacAvaney et al. [8] proposed a complementary approach for overcoming the first-stage recall problem by enabling re-rankers to identify and score documents missed by the initial ranker. This is accomplished by exploiting the incremental nature of re-ranking, which is typically done in batches to make the best use of specialised hardware like GPUs and TPUs. After each batch, information is gained about which documents receive the highest relevance scores, which can be leveraged to make a more informed decision about which documents to rank next (a process the authors call Adaptive Re-Ranking).
By making use of the clustering hypothesis [9, 10, 11] and pre-computed similarity scores between documents (via a corpus graph), documents that are similar to the highest-ranked documents can be identified and prioritised, thereby allowing documents missed in the first stage to be retrieved. This particular formulation of Adaptive Re-Ranking is referred to as Graph-based Adaptive Re-ranking (GAR). The variant of GAR proposed by MacAvaney et al. [8] used a simple strategy for balancing (a) the scoring of batches of existing candidate documents (identified by the first-stage retriever) and (b) the discovery of new candidates (by scoring documents similar to the top-ranked results so far). Their approach simply alternates between (a) and (b), batch by batch. We refer to this strategy as Alternate.

We argue that this approach to re-ranking can be thought of as an information-seeking agent and explained in terms of user browsing strategies. For instance, an agent using the traditional re-ranking approach simply scans down a list of search results looking for items that are relevant to its information need. Meanwhile, an agent using the Alternate adaptive approach alternates between exploring documents from the search results and documents similar to the most relevant ones it has found so far. We observe that this may not be the process by which a reasonable human user would traverse search results; for instance, a user may only look at documents similar to those that are sufficiently relevant. Therefore, in this work we explore a variety of agent-based strategies for adaptive re-ranking, inspired by behaviours that a human may take when seeking information.

In summary, we explore new agent-based strategies for the recently-proposed Graph-based Adaptive Re-Ranking approach. We establish an upper-bound performance for smarter agents and explore several other strategies. In doing so, we demonstrate the potential breadth of possible strategies that can be applied to GAR.

2. Background and Related Work

Cascading architectures are commonly deployed in information retrieval. In principle, these are driven by an efficiency-effectiveness tradeoff – the most effective models cannot be applied to all documents in the index in a timely manner [12]. Hence, cheap ranking stages (with high recall) are typically combined with more expensive ranking stages (to obtain higher precision). Ranking cascades are common in web search using learning-to-rank, where an initial retrieval stage uses a cheap ranking function, commonly BM25, to identify a candidate set of documents. This is followed by a feature computation stage, and then the application of the learned model [13]. Limiting the number of documents retrieved in the first stage – which we call the budget in this paper – has the effect of reducing recall, but enhances overall efficiency [14]. Later, Tonellotto et al. [15] proposed a selective approach that could adjust the budget on a per-query basis for efficient and effective retrieval.

More recently, multi-stage ranking has been commonly used for applying neural re-ranking models. For instance, cross-encoder neural models [16] must be applied jointly on the query and the document to compute a relevance score. Since the cross-encoder scoring function can be computationally expensive, re-ranking is often limited to a predefined maximum number of documents that the system is willing to re-rank for each query (i.e., a re-ranking budget, such as 1000).
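To make this cascade concrete, the following is a minimal sketch of a budgeted two-stage pipeline. The `bm25_search` and `cross_encoder_score` callables are hypothetical stand-ins for a sparse retriever and a PLM-based re-ranker (e.g., monoT5), not the API of any particular library.

```python
from typing import Callable

def cascade_rank(query: str,
                 bm25_search: Callable[[str, int], list[str]],
                 cross_encoder_score: Callable[[str, str], float],
                 budget: int = 1000) -> list[str]:
    """Sketch of a two-stage cascade: a cheap, recall-oriented first stage
    retrieves a candidate pool, then an expensive PLM scores at most `budget`
    of those candidates to produce the final precision-oriented ranking."""
    candidates = bm25_search(query, budget)  # first stage, capped at the budget
    scored = {doc: cross_encoder_score(query, doc) for doc in candidates}
    return sorted(scored, key=scored.get, reverse=True)
```

Note that a document the first stage misses is never scored at all; this is the recall limitation that the adaptive approaches discussed below address.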
Hence, the application of a first (so-called sparse) retriever to obtain a candidate set is necessary for such models. However, this limits the recall of the candidate set, in that relevant documents not matching the query terms (and which may obtain a high score from the re-ranking cross-encoder) will not be retrieved. To address this gap, several other families of neural ranking models have been proposed: (i) dense retrieval models based on bi-encoders and (ii) learned sparse models. In bi-encoder-based dense retrieval models [4, 5, 6, 7], the documents are pre-encoded into embedded representations, and nearest neighbour lookup is performed based on an embedded representation of the query [17]. However, online nearest neighbour search is more expensive than retrieval using an inverted index, hence inverted indexes remain commonly adopted for first-stage retrieval. Of note, ColBERT [4] uses a multi-stage dense retrieval architecture – an approximate nearest neighbour search, followed by exact neural scoring. Learned sparse models use PLMs on the document representation to produce a refined representation that can be indexed by existing sparse inverted indexing tools. For instance, doc2query [18, 19] enriches the sparse document representation with possible queries for which the document could be retrieved. DeepCT [20], uniCOIL [21, 22], DeepImpact [23, 24] and SPLADE [25, 26] all use a PLM to generate a new sparse representation of the document.

This work focuses on a recently-proposed complementary approach: Graph-based Adaptive Re-Ranking (GAR) [8]. Rather than limiting the candidate pool of documents for re-ranking to those from the first stage, GAR augments the re-ranking loop to identify documents that are similar to the highest-scoring ones identified so far. In the following sections, we cover this process in more detail (Section 3) and identify that this process can be seen as an information-seeking agent (Section 4). We use this formulation to propose alternative adaptive re-ranking agents and compare their effectiveness (Section 6).

3. Graph-based Adaptive Re-Ranking

GAR is formulated as a re-ranking process. In a re-ranking process, the documents from an initial candidate set, 𝑅0, are re-ranked by a scoring function, Score(), to obtain a final ranking of documents, 𝑅1. The scoring function can be expensive to apply, for instance a cross-encoder neural model such as monoT5. This means that efficiency constraints may limit the number of applications of Score() that can be undertaken, which is called the re-ranking budget. In practice, batches of documents are scored at the same time, making effective use of GPU or TPU hardware. A typical non-adaptive re-ranking process is shown in Algorithm 1, where an initial ranking (prioritised by rank) is scored in batches until the re-ranking budget is fully consumed.

Algorithm 1 Non-Adaptive Re-Ranking
Input: Initial ranking 𝑅0, batch size 𝑏, budget 𝑐
Output: Re-ranked pool 𝑅1
  𝑅1 ← ∅
  do
    𝐵 ← Score(top 𝑏 documents from 𝑅0, subject to 𝑐)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝑅0 ← 𝑅0 ⧵ 𝐵
  while |𝑅1| < 𝑐

In contrast, GAR proposes an adaptive approach to re-ranking, where highly-scored documents are used as seeds for nearest neighbour retrieval, rather than continuing down the ranking. To perform this efficiently, a graph of document-document similarity (𝐺) is created at index time, which records the nearest documents for each given document. Since this graph is precomputed and small, it can be used efficiently at ranking time to identify more documents to score.
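To illustrate how such a graph might be precomputed, below is a small, brute-force sketch. The dense document vectors and the choice of cosine similarity are illustrative assumptions; the original GAR work builds graphs from BM25 and TCT-ColBERT similarities, storing only the top 𝑘 neighbours per document.

```python
import numpy as np

def build_corpus_graph(doc_vectors: np.ndarray, k: int = 8) -> dict[int, list[int]]:
    """Precompute, for every document, its k nearest neighbours.

    doc_vectors: (num_docs, dim) matrix of document embeddings, assumed
    L2-normalised so that the dot product equals cosine similarity.
    Returns a mapping from each document id to its k most similar documents.
    """
    sims = doc_vectors @ doc_vectors.T              # pairwise similarities
    np.fill_diagonal(sims, -np.inf)                 # a document is not its own neighbour
    neighbours = np.argsort(-sims, axis=1)[:, :k]   # top-k ids per row, descending
    return {doc_id: row.tolist() for doc_id, row in enumerate(neighbours)}
```

A full similarity matrix is only feasible for small collections; at scale, an approximate nearest-neighbour index (or a BM25 search seeded with each document's text) would populate the neighbour lists instead.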
In particular, after each batch 𝐵 of documents is scored, the nearest documents to those in the batch are added to a “frontier” set of documents 𝐹. The frontier is prioritised by the scores of the re-ranked documents, so the neighbours of the highest-scored documents are explored first. The process then alternates between scoring the top documents from the initial ranking 𝑅0 and scoring the top documents in the frontier 𝐹. Algorithm 2 shows this process. We call this Alternate, due to the manner in which it alternates between the initial ranking 𝑅0 and the frontier 𝐹. In the following, we link GAR with intelligent agents, and show how this framing allows further agent-based implementations.

Algorithm 2 Alternate Re-Ranking [8]
Input: Initial ranking 𝑅0, batch size 𝑏, budget 𝑐, corpus graph 𝐺
Output: Re-ranked pool 𝑅1
  𝑅1 ← ∅
  𝑃 ← 𝑅0
  𝐹 ← ∅
  do
    𝐵 ← Score(top 𝑏 documents from 𝑃, subject to 𝑐)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝑅0 ← 𝑅0 ⧵ 𝐵
    𝐹 ← 𝐹 ⧵ 𝐵
    𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1)
    𝑃 ← 𝑅0 if 𝑃 = 𝐹, else 𝐹
  while |𝑅1| < 𝑐

4. Re-Ranking Agents

We argue that GAR re-ranking approaches can be interpreted as information-seeking agents, which peruse retrieved documents and decide how to act based on the relevance of the documents found (such actions could be selecting more documents from the candidate set, or selecting documents similar to those already scored highly). In doing so, they act like a user deciding how to progress through a ranked list.

We treat existing re-ranking strategies as information-seeking agents, and in the following we introduce new alternative agents. In all cases, we have the following: an environment, which consists of an initial ranking of candidate documents, denoted as 𝑅0, a graph of document-document similarities 𝐺, and the final retrieved document list 𝑅1; and a set of available actions, namely scoring documents from 𝑅0 and placing them in 𝑅1, and scoring documents from 𝐺 and placing them in 𝑅1. Some agents may also maintain additional state, such as a prioritised frontier 𝐹 of documents collected from the graph. We argue that such a fully-capable agent can be classified as a model-based reflex agent within the classes of [27]. Indeed, a scoring model (e.g., monoT5) estimates relevance, which is only partially observable, and the agent can choose actions based on its state, i.e., the current frontier. In the following, we list four possible agents based on the GAR formulation.

4.1. TwoPhase

A simple alternative to Alternate is to operate in two sequential phases. In the first phase, all documents up to rank 𝑘 < 𝑐 are scored. In the second phase, the remaining re-ranking budget 𝑐 − 𝑘 is utilised by prioritising the neighbours of the highest-scoring documents. This behaviour models a user who first investigates numerous pages of search results to find the most relevant content, and then goes back to check for documents similar to the best ones identified. Algorithm 3 shows this process.

Algorithm 3 TwoPhase Re-Ranking
Input: Initial ranking 𝑅0, batch size 𝑏, budget 𝑐, corpus graph 𝐺, first phase size 𝑘
Output: Re-ranked pool 𝑅1
  𝑅1 ← ∅
  do
    𝐵 ← Score(top 𝑏 from 𝑅0, subject to 𝑘)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝑅0 ← 𝑅0 ⧵ 𝐵
  while |𝑅1| < 𝑘
  𝐹 ← Neighbours(𝑅1, 𝐺)
  do
    𝐵 ← Score(top 𝑏 from 𝐹, subject to 𝑐)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝐹 ← 𝐹 ⧵ 𝐵
    𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1)   ▷ If “Refine” variant, otherwise skip this step
  while |𝑅1| < 𝑐

We consider two variations of this TwoPhase strategy: one where the frontier continues to be updated based on the scores of the documents in the second phase (TwoPhase-Refine), and one where the frontier only considers the first-phase documents (TwoPhase-Fixed).
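As a concrete (if simplified) rendering of Algorithm 3, the sketch below implements both variants behind a `refine` flag. The `score` and `neighbours` callables are hypothetical stand-ins for the monoT5 scorer and the corpus-graph lookup.

```python
from typing import Callable, Iterable

def two_phase_rerank(r0: list[str],
                     score: Callable[[list[str]], dict[str, float]],
                     neighbours: Callable[[str], Iterable[str]],
                     b: int = 16, c: int = 1000, k: int = 500,
                     refine: bool = True) -> dict[str, float]:
    """Sketch of TwoPhase: spend k of the budget c on the initial ranking,
    then the remainder on graph neighbours of the best documents found."""
    r1: dict[str, float] = {}
    # Phase 1: score the top-k initial candidates, in rank order, in batches of b.
    while len(r1) < k and r0:
        batch, r0 = r0[:b], r0[b:]
        r1.update(score(batch))
    # Build the frontier from neighbours of scored documents; each neighbour
    # inherits the score of the best seed document that points to it.
    frontier: dict[str, float] = {}
    for seed in sorted(r1, key=r1.get, reverse=True):
        for doc in neighbours(seed):
            if doc not in r1 and doc not in frontier:
                frontier[doc] = r1[seed]
    # Phase 2: consume the remaining budget from the frontier, best-first.
    while len(r1) < c and frontier:
        batch = sorted(frontier, key=frontier.get, reverse=True)[:b]
        scores = score(batch)
        r1.update(scores)
        for doc in batch:
            del frontier[doc]
        if refine:  # TwoPhase-Refine: newly scored documents extend the frontier
            for seed, s in scores.items():
                for doc in neighbours(seed):
                    if doc not in r1 and s > frontier.get(doc, float("-inf")):
                        frontier[doc] = s
    return r1
```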
4.2. Threshold

The Threshold agent only explores the corpus graph for sufficiently relevant documents. In this setting, the initial ranking pool is updated to include the neighbours of scored documents that exceed a relevance score threshold 𝑟. By adding these documents to the pool with infinite priority, they are always scored before other documents in the pool. This agent is akin to a searcher who, as they traverse a ranked list, investigates documents similar to the ones they deem sufficiently relevant as they are encountered. Importantly, this agent considers the absolute relevance score of documents, rather than the relative ordering of the retrieved documents so far, which is used by techniques like Alternate and TwoPhase. Algorithm 4 shows this process.

Algorithm 4 Threshold Re-Ranking
Input: Initial ranking 𝑅0, batch size 𝑏, budget 𝑐, corpus graph 𝐺, score threshold 𝑟
Output: Re-ranked pool 𝑅1
  𝑅1 ← ∅
  do
    𝐵 ← Score(top 𝑏 from 𝑅0, subject to 𝑐)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝑅0 ← 𝑅0 ⧵ 𝐵
    𝑅0 ← 𝑅0 ∪ (Neighbours(documents from 𝐵 that satisfy 𝑟, 𝐺) ⧵ 𝑅1)
  while |𝑅1| < 𝑐

4.3. Greedy

We next explore a strategy that attempts to identify which pool of documents – the initial ranking or the frontier – has recently yielded the most relevant document. We approach this as a Greedy strategy, always choosing the pool with the largest maximum score from its most recent batch. This process is akin to a searcher who switches between searching through a ranked list and the documents close to the best ones found so far, based on the most relevant document they identified recently from each pool. Algorithm 5 shows this process.

Algorithm 5 Greedy Re-Ranking
Input: Initial ranking 𝑅0, batch size 𝑏, budget 𝑐, corpus graph 𝐺
Output: Re-ranked pool 𝑅1
  𝑅1 ← ∅
  𝑃 ← 𝑅0
  𝐹 ← ∅
  𝑏𝑒𝑠𝑡𝑅0 ← ∞
  𝑏𝑒𝑠𝑡𝐹 ← ∞
  do
    𝐵 ← Score(top 𝑏 documents from 𝑃, subject to 𝑐)
    𝑏𝑒𝑠𝑡𝑃 ← MaxScore(𝐵)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝑅0 ← 𝑅0 ⧵ 𝐵
    𝐹 ← 𝐹 ⧵ 𝐵
    𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1)
    𝑃 ← 𝑅0 if 𝑏𝑒𝑠𝑡𝑅0 ≥ 𝑏𝑒𝑠𝑡𝐹, else 𝐹
  while |𝑅1| < 𝑐
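In the same sketch style as above, the Greedy switching rule can be written as follows; again, `score` and `neighbours` are hypothetical helpers, and the optimistic ∞ initialisation mirrors Algorithm 5.

```python
import heapq
from math import inf
from typing import Callable, Iterable

def greedy_rerank(r0: list[str],
                  score: Callable[[list[str]], dict[str, float]],
                  neighbours: Callable[[str], Iterable[str]],
                  b: int = 16, c: int = 1000) -> dict[str, float]:
    """Sketch of Greedy: draw each batch from whichever pool (initial ranking
    vs. graph frontier) produced the higher maximum score most recently."""
    r1: dict[str, float] = {}
    frontier: list[tuple[float, str]] = []   # max-heap via negated priorities
    best = {"r0": inf, "f": inf}             # optimistic initialisation (Alg. 5)
    while len(r1) < c and (r0 or frontier):
        drawn = "r0" if (best["r0"] >= best["f"] and r0) or not frontier else "f"
        if drawn == "r0":
            batch, r0 = r0[:b], r0[b:]
        else:
            batch = []
            while frontier and len(batch) < b:
                _, doc = heapq.heappop(frontier)
                if doc not in r1 and doc not in batch:  # skip stale heap entries
                    batch.append(doc)
        if not batch:
            continue
        scores = score(batch)
        r1.update(scores)
        best[drawn] = max(scores.values())   # this pool's most recent best score
        for seed, s in scores.items():       # neighbours inherit the seed's score
            for doc in neighbours(seed):
                if doc not in r1:
                    heapq.heappush(frontier, (-s, doc))
    return r1
```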
4.4. Oracle

Finally, to explore whether there is further room for improvement in adaptive re-ranking, we apply an oracle strategy. This strategy makes use of the relevance assessments (qrels) to choose which batch of documents to score. At each step, both the initial ranking and the frontier are scored. Then, the algorithm proceeds by adding only the batch that improves the ranking effectiveness (e.g., through a measure like nDCG) the most at that step. Naturally, this algorithm is not useful in practice, given that it relies on ground-truth relevance knowledge.¹ However, it helps us establish whether there is further room for improvement in adaptive re-ranking agents. Algorithm 6 shows this process.

Algorithm 6 Adaptive Oracle Re-Ranking
Input: Initial ranking 𝑅0, batch size 𝑏, budget 𝑐, corpus graph 𝐺, relevance assessments 𝑞𝑟𝑒𝑙𝑠
Output: Re-ranked pool 𝑅1
  𝑅1 ← ∅
  𝐹 ← ∅
  do
    𝑃 ← 𝑅0 or 𝐹, whichever will improve the ranking the most using 𝑞𝑟𝑒𝑙𝑠
    𝐵 ← Score(top 𝑏 from 𝑃, subject to 𝑐)
    𝑅1 ← 𝑅1 ∪ 𝐵
    𝑅0 ← 𝑅0 ⧵ 𝐵
    𝐹 ← 𝐹 ⧵ 𝐵
    𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1)
  while |𝑅1| < 𝑐

¹ Of course, a perfect agent would ignore the scoring function entirely and just return documents by their human assessments to get a perfect ranking. We limit the usage of the assessments for this agent to the decision of which batch to score, to help us identify whether better decisions about which batches to score can further improve adaptive re-ranking effectiveness.

5. Experimental Setup

We conduct experiments using various adaptive re-ranking agents to answer the following research questions:

RQ1 Is there room for further improvement in adaptive re-ranking strategies?
RQ2 Can smarter adaptive re-ranking strategies outperform the Alternate approach?
RQ3 How sensitive are alternative adaptive re-ranking strategies to their parameters?

We conduct experiments using the TREC Deep Learning 2019 and 2020 passage ranking datasets [28, 29]. We use 2019 as our “development” data, leaving 2020 as a held-out test set. For both datasets, we report nDCG and Recall@1000 (with a minimum relevance threshold of 2). We fix our study to a MonoT5 (base) scoring function [30], a re-ranking budget 𝑐 = 1000, and a batch size 𝑏 = 16. We explore four initial ranking functions: BM25, TCT-ColBERT [31, 32], Doc2Query [19], and SPLADE++ (CoCondenser-EnsembleDistil version) [26]. Following the original GAR paper, we use two corpus graphs, based on BM25 similarity and TCT-ColBERT. As we will show in Section 6, TwoPhase and Threshold are rather sensitive to their parameters, so the results we present use the parameters that yield the highest nDCG performance on TREC DL 2019. For TwoPhase, we test first-stage cutoff 𝑘 values ranging from 100 to 900 (by 100). For Threshold, we first identify the distribution of scores using the MS MARCO dev (small) dataset, re-ranking 100 BM25 results. We then pick threshold 𝑟 values based on percentiles from this distribution, from 0.02² to 0.20 (by 0.02).³

² Here, 0.02 represents the top 2% of scores.
³ This range was chosen after an initial study of orders of magnitude from 0.001 to 0.3 identified it as the most promising.
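To make the threshold selection concrete, the following sketch shows how percentile-based values of 𝑟 can be derived from such a sample of scores; the score file name is a hypothetical placeholder for pre-computed monoT5 scores.

```python
import numpy as np

# monoT5 scores collected offline, e.g. by re-ranking the top-100 BM25 results
# for each MS MARCO dev (small) query (hypothetical pre-computed file).
dev_scores = np.load("msmarco_dev_monot5_scores.npy")

# A percentile p (e.g. 0.02, the top 2% of scores) maps to an absolute score
# threshold r that the Threshold agent compares against at query time.
thresholds = {round(p, 2): float(np.quantile(dev_scores, 1.0 - p))
              for p in np.arange(0.02, 0.21, 0.02)}
```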
Table 1: Re-ranking performance on TREC Deep Learning 2019 and 2020 using various agents. The best-performing (non-oracle) agent in each section is listed in bold. Significant differences compared to the Non-Adaptive, Oracle, and Alternate systems are marked with 𝑁, 𝑂, and 𝐴, respectively (paired t-test, 𝑝 < 0.05).

                                    TREC DL 2019 (dev)          TREC DL 2020 (test)
                                  GAR𝑏𝑚25       GAR𝑡𝑐𝑡         GAR𝑏𝑚25       GAR𝑡𝑐𝑡
Pipeline        Agent            nDCG  R@1k   nDCG  R@1k     nDCG  R@1k   nDCG  R@1k
BM25»MonoT5     Non-Adaptive     0.699 0.755  0.699 0.755    0.711 0.805  0.711 0.805
                Oracle           0.747 0.804  0.786 0.853    0.748 0.791  0.768 0.828
                Alternate        0.726 0.827  0.743 0.839    0.743 0.874  0.749 0.892
                TwoPhase-Fixed   0.729 0.815  0.740 0.836    0.732 0.838  0.742 0.858
                TwoPhase-Refine  0.741 0.826  0.743 0.841    0.743 0.871  0.744 0.879
                Threshold        0.742 0.829  0.751 0.849    0.744 0.874  0.744 0.874
                Greedy           0.723 0.823  0.737 0.839    0.743 0.868  0.744 0.882
TCT»MonoT5      Non-Adaptive     0.704 0.830  0.704 0.830    0.693 0.848  0.693 0.848
                Oracle           0.793 0.891  0.766 0.846    0.762 0.874  0.754 0.861
                Alternate        0.733 0.883  0.724 0.866    0.719 0.881  0.710 0.871
                TwoPhase-Fixed   0.733 0.874  0.719 0.857    0.717 0.877  0.710 0.868
                TwoPhase-Refine  0.733 0.882  0.722 0.859    0.719 0.883  0.707 0.866
                Threshold        0.731 0.886  0.720 0.866    0.711 0.871  0.705 0.862
                Greedy           0.731 0.881  0.725 0.871    0.713 0.873  0.708 0.868
D2Q»MonoT5      Non-Adaptive     0.747 0.830  0.747 0.830    0.731 0.839  0.731 0.839
                Oracle           0.797 0.867  0.798 0.867    0.791 0.884  0.793 0.889
                Alternate        0.757 0.880  0.766 0.879    0.748 0.880  0.748 0.895
                TwoPhase-Fixed   0.765 0.866  0.765 0.870    0.748 0.867  0.745 0.870
                TwoPhase-Refine  0.769 0.875  0.767 0.878    0.748 0.877  0.747 0.892
                Threshold        0.766 0.876  0.767 0.877    0.746 0.874  0.745 0.881
                Greedy           0.754 0.874  0.757 0.873    0.744 0.878  0.748 0.894
SPLADE»MonoT5   Non-Adaptive     0.737 0.872  0.737 0.872    0.731 0.899  0.731 0.899
                Oracle           0.807 0.898  0.783 0.859    0.777 0.886  0.781 0.899
                Alternate        0.745 0.893  0.737 0.875    0.737 0.909  0.734 0.908
                TwoPhase-Fixed   0.763 0.863  0.764 0.869    0.748 0.868  0.742 0.867
                TwoPhase-Refine  0.769 0.875  0.764 0.870    0.748 0.877  0.736 0.869
                Threshold        0.766 0.871  0.759 0.857    0.746 0.874  0.744 0.865
                Greedy           0.747 0.895  0.740 0.882    0.734 0.903  0.734 0.906

6. Results and Analysis

Table 1 presents the main results of our study. We first explore whether there is room for improvement through alternative agents (RQ1). In all cases, the Oracle agent, which has knowledge of the relevance assessments when choosing which batches of documents to score, achieves a higher nDCG than all other agents. The difference in performance can often be considerable; for instance, the best-performing non-oracle agent over the TCT pipeline using a BM25 graph has an nDCG of 0.733, while the Oracle achieves an nDCG of 0.793. Differences between non-oracle agents and the Oracle are not always significant, however. We note that while nDCG is often improved using either oracle technique, recall often suffers compared to alternative agents. Perhaps this is unsurprising, given that the oracles aim to directly optimise nDCG (recall is only considered insofar as it would improve the nDCG). Significant differences compared to the Oracle are found more frequently when measuring nDCG using the TCT corpus graph (GAR𝑡𝑐𝑡) than the BM25 graph (GAR𝑏𝑚25). In summary, to answer RQ1, by exploring an oracle agent, we find that there is considerable room for improvement in adaptive re-ranking. In particular, exploration of the semantic graph (GAR𝑡𝑐𝑡) by non-oracle agents is likely sub-optimal and has room for considerable improvement.

Next, we look to answer RQ2 and RQ3. We find each of the proposed alternative agents to be the most effective in at least one measurement. In one case, an alternative agent significantly outperforms the existing Alternate agent (Threshold when re-ranking BM25 results using GAR𝑡𝑐𝑡). Further, there are 11 cases where alternative strategies outperform the non-adaptive baseline while Alternate does not. However, there are sometimes significant reductions in effectiveness when using the alternative agents (18 cases, all of which are on the held-out DL20 data). There is also no clear pattern in which strategy is generally most effective. These results answer RQ2: while some strategies can outperform Alternate, we do not find a clearly superior strategy among the agents we test.
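The significance markers in Table 1 are based on paired t-tests over per-query effectiveness; a minimal sketch of that test follows, assuming arrays of per-query nDCG values for two systems over the same topics.

```python
import numpy as np
from scipy import stats

def significantly_different(per_query_a: np.ndarray,
                            per_query_b: np.ndarray,
                            alpha: float = 0.05) -> bool:
    """Paired t-test over per-query effectiveness (e.g., nDCG).

    Returns True when the two systems differ significantly at level alpha."""
    _, p_value = stats.ttest_rel(per_query_a, per_query_b)
    return p_value < alpha
```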
Given that significant reductions are present when using the held-out DL20 data, we dive deeper into the performance of TwoPhase and Threshold by plotting their performance across the parameters they introduce (TwoPhase adds the first-stage cutoff 𝑘 and Threshold adds the relevance threshold 𝑟). Figure 1 presents the nDCG performance. We see that in all cases, performance varies considerably as the parameters are adjusted. Further, the optimal parameters for DL19 are often different from the optimal parameters for DL20, suggesting that the alternative agents are not very robust to parameter selection. This is in contrast with the robustness of the Alternate approach to its parameters, shown by MacAvaney et al. [8]. These results answer RQ3: TwoPhase and Threshold introduce parameters that are more sensitive than those present in the Alternate strategy.

Figure 1: Performance of the proposed TwoPhase-Refine, TwoPhase-Fixed, and Threshold agents for various initial rankers (BM25, TCT, D2Q, SPLADE) and parameter values (cutoff 𝑘 and threshold 𝑟) on TREC DL 2019 and 2020, for the GAR𝑏𝑚25 and GAR𝑡𝑐𝑡 corpus graphs. The upper limit of each shaded area represents the Alternate strategy, and the lower limit represents the non-adaptive re-ranking performance.

7. Conclusions & Outlook

In this work, we investigated the recently-proposed Adaptive Re-Ranking process from the perspective of information-seeking agents. We find that there is considerable room to improve adaptive re-ranking by changing the agent strategy. We further proposed several alternative strategies and found that, although they are promising, they are more sensitive to their parameters than the simple Alternate approach. This work sets the stage for future work that investigates additional agent strategies, such as those inspired by multi-armed bandit models.

Acknowledgements

Sean and Craig acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.

References

[1] J. Lin, R. Nogueira, A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond, Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, 2021.
[2] R. Xiao, J. Ji, B. Cui, H. Tang, W. Ou, Y. Xiao, J. Tan, X. Ju, Weakly supervised co-training of query rewriting and semantic matching for e-commerce, in: Proc. WSDM, 2019, pp. 402–410.
[3] C. Macdonald, N. Tonellotto, I. Ounis, Efficient & effective selective query rewriting with efficiency predictions, in: Proc. SIGIR, 2017, pp. 495–504.
[4] O. Khattab, M. Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, in: Proc. SIGIR, 2020.
[5] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, 2020. URL: https://arxiv.org/abs/2004.04906. doi:10.48550/ARXIV.2004.04906.
[6] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proc. ICLR, 2021.
[7] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing Dense Retrieval Model Training with Hard Negatives, in: Proc. SIGIR, 2021, pp. 1503–1512.
[8] S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in: Proc. CIKM, 2022. URL: https://arxiv.org/abs/2208.08942. doi:10.1145/3511808.3557231.
[9] N. Jardine, C. J. van Rijsbergen, The use of hierarchic clustering in information retrieval, Information Storage and Retrieval 7 (1971).
[10] O. Kurland, The cluster hypothesis in information retrieval, in: Proc. SIGIR, 2013.
[11] E. M. Voorhees, The cluster hypothesis revisited, in: Proc. SIGIR, 1985.
[12] N. Tonellotto, C. Macdonald, I. Ounis, Efficient query processing for scalable web search, Foundations and Trends in Information Retrieval 12 (2018) 319–492.
[13] T.-Y. Liu, Learning to rank for information retrieval, Foundations and Trends in Information Retrieval 3 (2009) 225–331. URL: https://doi.org/10.1561/1500000016. doi:10.1561/1500000016.
[14] C. Macdonald, R. L. T. Santos, I. Ounis, The whens and hows of learning to rank for web search, Information Retrieval 16 (2012).
[15] N. Tonellotto, C. Macdonald, I. Ounis, Efficient and effective retrieval using selective pruning, in: Proc. WSDM, 2013.
[16] R. Nogueira, K. Cho, Passage Re-ranking with BERT, arXiv preprint arXiv:1901.04085 (2019).
[17] J. Johnson, M. Douze, H. Jegou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2021).
[18] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, arXiv preprint arXiv:1904.08375 (2019).
[19] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, Online preprint (2019).
[20] Z. Dai, J. Callan, Context-aware sentence/passage term importance estimation for first stage retrieval, arXiv preprint arXiv:1910.10687 (2019).
[21] L. Gao, Z. Dai, J. Callan, COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List, in: Proc. NAACL-HLT, 2021, pp. 3030–3042.
[22] J. Lin, X. Ma, A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques, arXiv preprint arXiv:2106.14807 (2021).
[23] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes, in: Proc. SIGIR, 2021.
[24] A. Mallia, J. Mackenzie, T. Suel, N. Tonellotto, Faster Learned Sparse Retrieval with Guided Traversal, in: Proc. SIGIR, 2022, p. 5.
[25] T. Formal, B. Piwowarski, S. Clinchant, SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, in: Proc. SIGIR, 2021, pp. 2288–2292.
[26] T. Formal, C. Lassance, B. Piwowarski, S. Clinchant, SPLADE v2: Sparse lexical and expansion model for information retrieval, 2021.
[27] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., Prentice Hall, 2010.
[28] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. Voorhees, Overview of the TREC 2019 deep learning track, in: Proc. TREC, 2019.
[29] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track, in: Proc. TREC, 2020.
[30] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Proc. EMNLP, 2020.
[31] S.-C. Lin, J.-H. Yang, J. Lin, Distilling dense representations for ranking using tightly-coupled teachers, arXiv preprint arXiv:2010.11386 (2020).
[32] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proc. RepL4NLP-2021 Workshop, 2021, pp. 163–173.