Adaptive Re-Ranking as an
Information-Seeking Agent
Sean MacAvaney1 , Nicola Tonellotto2 and Craig Macdonald1
1 University of Glasgow, United Kingdom
2 University of Pisa, Italy


Abstract
Re-ranking systems are typically limited by the recall of the initial retrieval function. Recent work proposed adaptive re-ranking, which modifies the re-ranking loop to progressively prioritise documents that are likely to receive high scores, based on the highest-scoring documents seen thus far. The original work framed this process as an incarnation of the well-established clustering hypothesis. In this work, we argue that the approach can also be framed as an information-seeking agent. From this perspective, we explore several variations of the graph-based adaptive re-ranking algorithm and find that there is substantial room for improvement by modifying the agent. However, the agents we explore are more sensitive to the new parameters they introduce than the simple-yet-effective approach proposed in the original adaptive re-ranking work.

Keywords
Neural re-ranking, Clustering hypothesis, Retrieval agents




1. Introduction
Neural techniques leveraging pre-trained language models (PLMs) have been shown to be
effective at various information retrieval (IR) ranking tasks, such as question answering and
ad-hoc document ranking [1]. The most effective deployment of PLMs in neural IR is through a
cascading architecture. The cascade is composed of a two-stage pipeline. The role of the first
stage is to retrieve a candidate pool of documents from the collection using an inexpensive
scoring function such as BM25, while the second stage re-ranks the documents in the first
stage sample using an expensive PLM. Typically, the first stage focuses on optimising the recall
of the pool, i.e., the number of relevant documents in the pool, and the second stage optimises
precision-oriented metrics such as NDCG@10.
   A major limitation of this approach lies in the performance of the first stage. If the first stage
fails to retrieve a relevant document, the second stage has no chance to re-rank it. Consequently,
many techniques have been designed to resolve this first stage “recall” problem, such as query
rewriting [2, 3] and dense retrieval [4, 5, 6, 7]. Recently, MacAvaney et al. [8] proposed a
complementary approach for overcoming the first-stage recall problem by enabling re-rankers
to identify and score documents missing from the initial ranking. This is accomplished by using

First Workshop on Proactive and Agent-Supported Information Retrieval, 2022
sean.macavaney@glasgow.ac.uk (S. MacAvaney); nicola.tonellotto@unipi.it (N. Tonellotto);
craig.macdonald@glasgow.ac.uk (C. Macdonald)
the incremental nature of re-ranking documents, typically done in batches to make the best use
of specialised hardware like GPUs and TPUs. After each batch, information is gained about
which documents get the highest relevance score, which can be leveraged to make a more
informed decision about which documents to rank next (a process the authors call Adaptive
Re-Ranking). By making use of the clustering hypothesis [9, 10, 11] and pre-computed similarity
scores between documents (via a corpus graph), documents that are similar to the highest-ranked
documents can be identified and prioritised, thereby allowing documents missed in the first
stage to be retrieved. This particular formulation of Adaptive Re-Ranking is referred to as
Graph-based Adaptive Re-ranking (GAR). The variant of GAR proposed by MacAvaney et al. [8]
used a simple strategy for balancing (a) the scoring of batches of existing candidate documents
(identified by the first stage retriever) and (b) discovering new candidates (by scoring documents
similar to the top ranked results so far). Their approach simply alternates between (a) and (b),
batch by batch. We refer to this strategy as Alternate.
   We argue that the approach for re-ranking can be thought of as an information-seeking agent
and explained in terms of user browsing strategies. For instance, an agent using the traditional
re-ranking approach simply scans down a list of search results looking for items that are
relevant to its information need. Meanwhile, an agent using the Alternate adaptive approach
alternates between exploring documents from the search results and similar documents to the
most relevant ones it found so far. We observe that this may not be the process by which a
reasonable human user would traverse search results; for instance, a user may only look at
similar documents to those that are sufficiently relevant. Therefore, in this work we explore a
variety of agent-based strategies for adaptive re-ranking, inspired by behaviours that a human
may take when seeking information.
   In summary, we explore new agent-based strategies for the recently-proposed Graph-based
Adaptive Re-Ranking approach. We establish an upper-bound performance for smarter agents
and explore several other strategies. In doing so, we demonstrate the potential breadth of
possible strategies that can be applied to GAR.


2. Background and Related Work
Cascading architectures are commonly deployed in information retrieval. In principle, these
are driven by an efficiency-effectiveness tradeoff – the most effective models cannot be applied
to all documents in the index in a timely manner [12]. Hence, cheap ranking stages (with high
recall) are typically combined with more expensive ranking stages (to obtain higher precision).
Ranking cascades are common in web search using learning-to-rank, where an initial retrieval
stage uses a cheap ranking function, commonly BM25, to identify a candidate set of documents.
This is followed by a feature computation stage, and then application of the learned model [13].
Limiting the number of documents retrieved in the first stage – which we call the budget in this
paper – has the effect of reducing recall, but enhances overall efficiency [14]. Later, Tonellotto
et al. [15] proposed a selective approach that could adjust the budget on a per-query basis for
efficient & effective retrieval.
   More recently, multi-stage ranking has been commonly used for applying neural re-ranking
models. For instance, cross-encoder neural models [16] must be applied jointly on the query
and the document to compute a relevance score. Since the cross-encoder scoring function can
be computationally expensive, re-ranking is often limited to a predefined maximum number of
documents that the system is willing to re-rank for each query (i.e., a re-ranking budget, such
as 1000). Hence, the application of a first (so-called sparse) retriever to obtain a candidate
set is necessary for such models. However, this limits the recall of the candidate set, in that
relevant documents not matching the query terms (and which may obtain a high score from the
re-ranking cross-encoder) will not be retrieved.
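   For illustration, a minimal PyTerrier sketch of such a cascade follows; the dataset identifier, index path, and rank cutoff are illustrative assumptions, not the exact configuration of any cited system.

# A sketch of a two-stage cascade: cheap BM25 retrieval, then monoT5 re-ranking.
import pyterrier as pt
from pyterrier_t5 import MonoT5ReRanker

pt.init()
dataset = pt.get_dataset("irds:msmarco-passage")          # assumed corpus
index = pt.IndexFactory.of("./msmarco-passage.terrier")   # assumed prebuilt index
bm25 = pt.BatchRetrieve(index, wmodel="BM25")             # first stage: recall-oriented
mono_t5 = MonoT5ReRanker()                                # second stage: precision-oriented
# '% 1000' caps the candidate set, i.e., the re-ranking budget.
pipeline = (bm25 % 1000) >> pt.text.get_text(dataset, "text") >> mono_t5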
   To address this gap, several types of recent neural ranking models have been proposed: (i)
dense retrieval models based on bi-encoders and (ii) learned sparse models.
   In bi-encoder-based dense retrieval models [4, 5, 6, 7], the documents are pre-encoded into
embedded representations, and nearest neighbour lookup is performed based on an embedded
representation of the query [17]. However, online nearest neighbour search is more expensive
than inverted index-based sparse retrieval, hence inverted indexes are often still adopted for
first-stage retrieval. Of note, ColBERT [4] uses a multi-stage dense retrieval architecture – an
approximate nearest neighbour search, followed by an exact neural scoring.
   Learned sparse models encompass the use of PLMs on the document representation to
make a refined representation that can be indexed by existing sparse inverted indexing tools. For
instance, doc2query [18, 19] enriches the sparse document representation with possible queries
that the document could be retrieved for. DeepCT [20], uniCOIL [21, 22], DeepImpact [23, 24]
and SPLADE [25, 26] all use a PLM to generate a new sparse representation of the document.
   This work focuses on a recently-proposed complementary approach: Graph-Based Adaptive
Re-Ranking (GAR) [8]. Rather than limiting the candidate pool of documents for re-ranking to
those from the first stage, GAR augments the re-ranking loop to identify documents that are
similar to the highest-scoring ones identified so far. In the following sections, we cover this
process in more detail (Section 3) and identify that this process can be seen as an information-
seeking agent (Section 4). We use this formulation to propose alternative adaptive re-ranking
agents and compare their effectiveness (Section 6).


3. Graph-based Adaptive Re-Ranking
GAR is formulated as a re-ranking process. In a re-ranking process, the documents from an
initial candidate set, 𝑅0 , are re-ranked by a scoring function, Score(), to obtain a final ranking
of documents, 𝑅1 . The scoring function can be expensive to apply, for instance a cross-encoder
neural model such as monoT5. This means that efficiency constraints may limit the number
of applications of Score() that can be undertaken, which is called the re-ranking budget. In
practice, batches of documents are scored at the same time, making effective use of GPU or
TPU hardware.
   A typical non-adaptive re-ranking process is shown in Algorithm 1, where an initial ranking
(prioritised by rank) is scored in batches until the re-ranking budget is fully consumed.
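   For concreteness, a minimal Python sketch of this process follows; score() is an assumed batched scoring function that maps document identifiers to neural relevance scores.

# Non-adaptive re-ranking (Algorithm 1): scan down R0 in batches until the
# budget c is exhausted; score() is an assumed batched neural scorer.
def rerank(R0, score, b=16, c=1000):
    R1 = {}                               # docno -> neural score
    while len(R1) < c and R0:
        batch = R0[:min(b, c - len(R1))]  # top b of R0, subject to budget c
        R0 = R0[len(batch):]
        R1.update(score(batch))
    return sorted(R1.items(), key=lambda kv: kv[1], reverse=True)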
   In contrast, GAR proposed an adaptive approach to re-ranking, where highly scored docu-
ments are used as seeds for nearest neighbour retrieval, rather than continuing down the ranking.
To perform this efficiently, a graph of document-document similarity (𝐺) is created at index
time, which records the nearest documents for each given document. Since this is precomputed
Algorithm 1 Non-Adaptive Re-Ranking
Input: Initial ranking 𝑅0 , batch size 𝑏, budget 𝑐
Output: Re-Ranked pool 𝑅1
  𝑅1 ← ∅
  do
     𝐵 ← Score(top 𝑏 documents from 𝑅0 , subject to 𝑐)
     𝑅1 ← 𝑅1 ∪ 𝐵
     𝑅0 ← 𝑅0 ⧵ 𝐵
  while |𝑅1| < 𝑐


and small, it can be used efficiently at ranking time to identify more documents to score.
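   As an illustrative sketch (not the exact construction used by MacAvaney et al. [8]), the corpus graph can be represented as a mapping from each document to its 𝑘 most similar documents; sim() and k below are assumptions, and in practice the graph is built with BM25 retrieval or approximate nearest neighbour search rather than this quadratic loop.

# Offline construction of a corpus graph (illustrative; O(n^2) for clarity).
def build_corpus_graph(corpus, sim, k=8):
    graph = {}
    for d in corpus:
        scored = ((o, sim(d, o)) for o in corpus if o != d)
        top = sorted(scored, key=lambda x: x[1], reverse=True)[:k]
        graph[d] = [o for o, _ in top]    # the k nearest neighbours of d
    return graph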
   In particular, after each batch 𝐵 of documents is scored, the nearest documents to those in
the batch are added to a “frontier” set of documents 𝐹. The frontier is prioritised by the scores
of the re-ranked documents, so the neighbours of the highest-scored documents are explored
first. The process then alternates between scoring the top documents from the initial ranking
𝑅0 and scoring the top documents in the frontier 𝐹. Algorithm 2 shows this process.
   We call this Alternate, due to the manner in which it alternates between the initial ranking
and the frontier 𝐹. In the following, we link GAR with intelligent agents, and show how this
framing allows further agent-based implementations.
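   A compact Python sketch of the Alternate loop follows, assuming, as before, a batched score() function and a corpus graph mapping each document to its neighbours; the frontier is a priority queue keyed by the score of the document that contributed each neighbour.

import heapq

# Alternate adaptive re-ranking (Algorithm 2); score() and graph are assumed.
def gar_alternate(R0, score, graph, b=16, c=1000):
    R1 = {}                                  # docno -> neural score
    frontier, in_frontier = [], set()        # max-heap via negated priorities
    use_initial = True
    while len(R1) < c and (R0 or frontier):
        size = min(b, c - len(R1))
        if use_initial and R0:
            batch, R0 = R0[:size], R0[size:]
        else:
            batch = []
            while frontier and len(batch) < size:
                _, d = heapq.heappop(frontier)
                in_frontier.discard(d)
                if d not in R1:
                    batch.append(d)
        if not batch:                        # chosen pool exhausted; switch
            use_initial = not use_initial
            continue
        scores = score(batch)
        R1.update(scores)
        R0 = [d for d in R0 if d not in R1]  # R0 <- R0 \ B
        for d, s in scores.items():          # neighbours inherit d's score
            for n in graph.get(d, []):
                if n not in R1 and n not in in_frontier:
                    heapq.heappush(frontier, (-s, n))
                    in_frontier.add(n)
        use_initial = not use_initial        # alternate between the two pools
    return sorted(R1.items(), key=lambda kv: kv[1], reverse=True)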


4. Re-Ranking Agents
We argue that GAR re-ranking approaches can be interpreted as information-seeking agents,
which peruse retrieved documents, and decide how to act based on the relevance of the doc-
uments found (such actions could be selecting more documents from the candidate set, or
selecting similar documents to those already scored highly). In doing so, they act like a user
deciding how to progress through a ranked list.


Algorithm 2 Alternate Re-Ranking [8]
Input: Initial ranking 𝑅0 , batch size 𝑏, budget 𝑐, corpus graph 𝐺
Output: Re-Ranked pool 𝑅1
  𝑅1 ← ∅
  𝑃 ← 𝑅0
  𝐹 ← ∅
  do
     𝐵 ← Score(top 𝑏 documents from 𝑃, subject to 𝑐)
     𝑅1 ← 𝑅1 ∪ 𝐵
     𝑅0 ← 𝑅0 ⧵ 𝐵
     𝐹 ← 𝐹 ⧵ 𝐵
     𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1 )
     𝑃 ← 𝑅0 if 𝑃 = 𝐹, otherwise 𝐹                    ▷ alternate between the two pools
  while |𝑅1| < 𝑐
   We treat existing re-ranking strategies as information-seeking agents, and in the following
we introduce new alternative agents. In all cases, we have the following: an environment which
consists of an initial ranking of candidate documents, denoted as 𝑅0 , a graph of document-
document similarities 𝐺, and the final retrieved documents list 𝑅1 ; and a set of available actions,
namely scoring documents from 𝑅0 and placing in 𝑅1 , and scoring documents from 𝐺 and
placing into 𝑅1 . Some agents may also maintain additional state, such as a prioritised frontier 𝐹 of
documents collected from the graph. We argue that such a fully-capable agent can be classified
as a model-based reflex agent within the taxonomy of Russell and Norvig [27]. Indeed, a scoring
model (e.g., monoT5) merely estimates relevance, which is therefore only partially observable,
and the agent can choose actions based on its state, i.e., the current frontier.
   In the following, we list four possible agents based on the GAR formulation.
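   Concretely, the agents below can be sketched as specialisations of a single loop that differ only in how the next pool to score is chosen; the class names here are ours, for illustration.

# Each agent below differs only in choose_pool(); the scoring loop is shared.
class ReRankingAgent:
    def choose_pool(self, R0, frontier, state):
        raise NotImplementedError

class AlternateAgent(ReRankingAgent):
    def choose_pool(self, R0, frontier, state):
        state["use_initial"] = not state.get("use_initial", False)
        return R0 if state["use_initial"] else frontier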

4.1. TwoPhase
A simple alternative to Alternate is to operate in two sequential phases. In the first phase,
all documents up to rank 𝑘 < 𝑐 are scored. In the second phase, the remaining re-ranking
budget 𝑐 − 𝑘 is utilised by prioritising the neighbours of the highest-scoring documents. This
behaviour models a user who first investigates numerous pages of search results to find the most
relevant content, and then goes back to check for documents similar to the best ones identified.
Algorithm 3 shows this process. We consider two variations of this TwoPhase strategy: one
where the frontier continues to be updated based on the scores of the documents in the second
phase (TwoPhase-Refine), and one where the frontier only considers the first-phase documents
(TwoPhase-Fixed).
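   A sketch of the phase split follows, with the same assumed score() and graph as before; k is the new first-phase parameter, and refine toggles between the two variants.

# TwoPhase (Algorithm 3): spend k of the budget on R0, the rest on the frontier.
def gar_two_phase(R0, score, graph, b=16, c=1000, k=300, refine=True):
    R1 = {}
    while len(R1) < k and R0:               # phase 1: scan the initial ranking
        size = min(b, k - len(R1))
        batch, R0 = R0[:size], R0[size:]
        R1.update(score(batch))
    frontier = {}                           # docno -> priority
    def expand(scores):
        for d, s in scores.items():
            for n in graph.get(d, []):
                if n not in R1:
                    frontier[n] = max(s, frontier.get(n, s))
    expand(R1)                              # seed the frontier from phase 1
    while len(R1) < c and frontier:         # phase 2: frontier only
        size = min(b, c - len(R1))
        batch = sorted(frontier, key=frontier.get, reverse=True)[:size]
        scores = score(batch)
        R1.update(scores)
        for d in batch:
            del frontier[d]
        if refine:                          # TwoPhase-Refine keeps updating it
            expand(scores)
    return sorted(R1.items(), key=lambda kv: kv[1], reverse=True)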

4.2. Threshold
The Threshold agent only explores the corpus graph for sufficiently relevant documents. In
this setting, the initial ranking pool is updated to include the neighbours of scored documents


Algorithm 3 TwoPhase Re-Ranking
Input: Initial ranking 𝑅0 , batch size 𝑏, budget 𝑐, corpus graph 𝐺, first phase size 𝑘
Output: Re-Ranked pool 𝑅1
  𝑅1 ← ∅
  do
     𝐵 ← Score(top 𝑏 from 𝑅0 , subject to 𝑘)
     𝑅1 ← 𝑅1 ∪ 𝐵
     𝑅0 ← 𝑅0 ⧵ 𝐵
  while |𝑅1| < 𝑘
  𝐹 ← Neighbours(𝑅1 , 𝐺)
  do
     𝐵 ← Score(top 𝑏 from 𝐹, subject to 𝑐)
     𝑅1 ← 𝑅1 ∪ 𝐵
     𝐹 ← 𝐹 ⧵ 𝐵
     𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1 )               ▷ If “Refine” variant, otherwise skip this step
  while |𝑅1| < 𝑐
Algorithm 4 Threshold Re-Ranking
Input: Initial ranking 𝑅0 , batch size 𝑏, budget 𝑐, corpus graph 𝐺
Output: Re-Ranked pool 𝑅1
  𝑅1 ← ∅
  do
     𝐵 ← Score(top 𝑏 from 𝑅0 , subject to 𝑐)
     𝑅1 ← 𝑅1 ∪ 𝐵
     𝑅0 ← 𝑅0 ⧵ 𝐵
     𝑅0 ← 𝑅0 ∪ (Neighbours(documents from 𝐵 that satisfy 𝑟, 𝐺) ⧵ 𝑅1 )
  while |𝑅1| < 𝑐


that exceed a relevance score threshold 𝑟. By adding these documents to the pool with infinite
priority, they are always scored before other documents in the pool. This agent is akin to a
searcher who, as they traverse a ranked list, investigates similar documents to ones that they
deem sufficiently relevant, as they are encountered. Importantly, this agent considers the absolute
relevance score of documents, rather than the relative ordering of retrieved documents so far,
which is used by techniques like Alternate and TwoPhase. Algorithm 4 shows this process.
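   A sketch of this agent follows; the value of the new threshold parameter r is illustrative, and in practice it is corpus- and model-dependent (as discussed in Section 5).

# Threshold (Algorithm 4): only documents scoring at least r contribute
# neighbours, and those neighbours are scored before the remaining pool.
def gar_threshold(R0, score, graph, b=16, c=1000, r=0.1):
    R1, priority = {}, []           # priority: neighbours awaiting scoring
    while len(R1) < c and (priority or R0):
        pool = priority if priority else R0
        size = min(b, c - len(R1))
        batch, remainder = pool[:size], pool[size:]
        pool[:] = remainder         # consume from whichever pool was used
        scores = score(batch)
        R1.update(scores)
        R0[:] = [d for d in R0 if d not in R1]
        for d, s in scores.items():
            if s >= r:              # sufficiently relevant: expand neighbours
                priority.extend(n for n in graph.get(d, [])
                                if n not in R1 and n not in priority)
    return sorted(R1.items(), key=lambda kv: kv[1], reverse=True)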

4.3. Greedy
We next explore a strategy that attempts to identify which pool of documents – the initial
ranking or the frontier – has most recently yielded the most relevant document. We approach this
with a Greedy strategy, always choosing the pool with the largest maximum score from its
most recent batch. This process is akin to a searcher who switches between searching through
a ranked list and the documents close to the best ones they found so far, based on the most
relevant document they identified recently from each pool. Algorithm 5 shows this process.
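   The pool-selection rule at the heart of this agent can be sketched as follows; initialising both running maxima to infinity, as in Algorithm 5, ensures each pool is sampled at least once before the comparison becomes meaningful.

import math

# Greedy pool selection (Algorithm 5): pick the pool whose most recent batch
# contained the highest-scoring document; ties favour the initial ranking.
best = {"R0": math.inf, "F": math.inf}

def choose_pool_greedy(best):
    return "R0" if best["R0"] >= best["F"] else "F"

# After scoring a batch drawn from pool p:
#     best[p] = max(scores.values())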


Algorithm 5 Greedy Re-Ranking
Input: Initial ranking 𝑅0 , batch size 𝑏, budget 𝑐, corpus graph 𝐺
Output: Re-Ranked pool 𝑅1
  𝑅1 ← ∅
  𝑃 ← 𝑅0
  𝐹 ← ∅
  𝑏𝑒𝑠𝑡𝑅0 ← ∞
  𝑏𝑒𝑠𝑡𝐹 ← ∞
  do
      𝐵 ← Score(top 𝑏 documents from 𝑃, subject to 𝑐)
      𝑏𝑒𝑠𝑡𝑃 ← MaxScore(𝐵)                 ▷ updates 𝑏𝑒𝑠𝑡𝑅0 or 𝑏𝑒𝑠𝑡𝐹 , per the current pool 𝑃
      𝑅1 ← 𝑅1 ∪ 𝐵
      𝑅0 ← 𝑅0 ⧵ 𝐵
      𝐹 ← 𝐹 ⧵ 𝐵
      𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1 )
      𝑃 ← 𝑅0 if 𝑏𝑒𝑠𝑡𝑅0 ≥ 𝑏𝑒𝑠𝑡𝐹 , otherwise 𝐹
  while |𝑅1| < 𝑐
Algorithm 6 Adaptive Oracle Re-Ranking
Input: Initial ranking 𝑅0 , batch size 𝑏, budget 𝑐, corpus graph 𝐺, relevance assessments 𝑞𝑟𝑒𝑙𝑠
Output: Re-Ranked pool 𝑅1
  𝑅1 ← ∅
  𝐹 ← ∅
  do
     𝑃 ← 𝑅0 or 𝐹, whichever will improve the ranking the most using 𝑞𝑟𝑒𝑙𝑠
     𝐵 ← Score(top 𝑏 from 𝑃, subject to 𝑐)
     𝑅1 ← 𝑅1 ∪ 𝐵
     𝑅0 ← 𝑅0 ⧵ 𝐵
     𝐹 ← 𝐹 ⧵ 𝐵
     𝐹 ← 𝐹 ∪ (Neighbours(𝐵, 𝐺) ⧵ 𝑅1 )
  while |𝑅1| < 𝑐


4.4. Oracle
Finally, to explore whether there is further room for improvement in adaptive re-ranking, we
apply an oracle strategy. This strategy makes use of the relevance assessments (qrels) to choose
which batch of documents to score. At each step, both the initial ranking and the frontier
are scored. Then, the algorithm proceeds by adding only the batch that improves the ranking
effectiveness (e.g., through a measure like nDCG) the most at that step. Naturally, this algorithm
is not useful in practice, given that it relies on ground-truth relevance knowledge.¹ However, it
helps us establish whether there is further room for improvement in adaptive re-ranking agents.
Algorithm 6 shows this process.
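   The oracle's batch choice can be sketched as follows, assuming an ndcg() helper that evaluates a ranking against the qrels; both the helper and its signature are ours, for illustration.

# Oracle (Algorithm 6): tentatively score both candidate batches and keep
# whichever one improves nDCG the most; ndcg() is an assumed evaluator.
def oracle_choose(R1, batch_R0, batch_F, score, ndcg, qrels):
    gain_R0 = ndcg({**R1, **score(batch_R0)}, qrels)
    gain_F = ndcg({**R1, **score(batch_F)}, qrels)
    return batch_R0 if gain_R0 >= gain_F else batch_F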


5. Experimental Setup
We conduct experiments using various adaptive re-ranking agents to answer the following
research questions:

RQ1 Is there room for further improvement in adaptive re-ranking strategies?
RQ2 Can smarter adaptive re-ranking strategies outperform the Alternate approach?
RQ3 How sensitive are alternative adaptive re-ranking strategies to their parameters?

   We conduct experiments using the TREC Deep Learning 2019 and 2020 passage ranking
datasets [28, 29]. We use 2019 as our “development” data, leaving 2020 as a held-out test set.
For both datasets, we report nDCG and Recall@1000 (with a minimum relevance threshold
of 2) performance. We fix our study to a MonoT5 (base) scoring function [30], a re-ranking
budget 𝑐 = 1000, and a batch size 𝑏 = 16. We explore four initial ranking functions: BM25, TCT-
ColBERT [31, 32], Doc2Query [19], and SPLADE++ (CoCondenser-EnsembleDistil version) [26].

¹ Of course, a perfect agent would ignore the scoring function entirely and just return documents by their human
  assessments to get a perfect ranking. We limit the usage of the assessments for this agent to the decision of which
  batch to score, to help us identify whether better decisions about which batches to score can further improve
  adaptive re-ranking effectiveness.
                                      TREC DL 2019 (dev)                         TREC DL 2020 (test)
                                  GAR𝑏𝑚25             GAR𝑡𝑐𝑡               GAR𝑏𝑚25             GAR𝑡𝑐𝑡
Pipeline        Agent            nDCG      R@1k      nDCG      R@1k      nDCG      R@1k      nDCG      R@1k
BM25»MonoT5     Non-Adaptive     0.699     0.755     0.699     0.755     0.711     0.805     0.711     0.805
                Oracle           0.747     0.804     0.786     0.853     0.748     0.791     0.768     0.828
                Alternate        0.726     0.827𝑁    0.743𝑁𝑂   0.839𝑁    0.743𝑁    0.874𝑁𝑂   0.749𝑁    0.892𝑁
                TwoPhase-Fixed   0.729𝑁    0.815𝑁    0.740𝑁𝑂   0.836𝑁    0.732𝑁    0.838𝑁𝐴   0.742𝑁    0.858𝑁𝐴
                TwoPhase-Refine  0.741𝑁    0.826𝑁    0.743𝑁𝑂   0.841𝑁    0.743𝑁    0.871𝑁𝑂   0.744𝑁𝐴   0.879𝑁𝐴
                Threshold        0.742𝑁    0.829𝑁    0.751𝑁𝑂𝐴  0.849𝑁    0.744𝑁    0.874𝑁𝑂   0.744𝑁    0.874𝑁𝐴
                Greedy           0.723     0.823𝑁    0.737𝑁𝑂   0.839𝑁    0.743𝑁    0.868𝑁𝑂   0.744𝑁    0.882𝑁
TCT»MonoT5      Non-Adaptive     0.704     0.830     0.704     0.830     0.693     0.848     0.693     0.848
                Oracle           0.793     0.891     0.766     0.846     0.762     0.874     0.754     0.861
                Alternate        0.733𝑁𝑂   0.883𝑁    0.724𝑁𝑂   0.866𝑁    0.719𝑁𝑂   0.881𝑁    0.710𝑁𝑂   0.871𝑁
                TwoPhase-Fixed   0.733𝑁𝑂   0.874𝑁    0.719𝑁𝑂   0.857𝑁    0.717𝑁𝑂   0.877𝑁    0.710𝑁𝑂   0.868𝑁
                TwoPhase-Refine  0.733𝑁𝑂   0.882𝑁    0.722𝑁𝑂   0.859     0.719𝑁𝑂   0.883𝑁    0.707𝑁𝑂𝐴  0.866𝐴
                Threshold        0.731𝑁𝑂   0.886𝑁    0.720𝑁𝑂   0.866𝑁    0.711𝑁𝑂𝐴  0.871     0.705𝑁𝑂𝐴  0.862
                Greedy           0.731𝑁𝑂   0.881𝑁    0.725𝑁𝑂   0.871𝑁    0.713𝑁𝑂𝐴  0.873𝑁    0.708𝑁𝑂   0.868𝑁
D2Q»MonoT5      Non-Adaptive     0.747     0.830     0.747     0.830     0.731     0.839     0.731     0.839
                Oracle           0.797     0.867     0.798     0.867     0.791     0.884     0.793     0.889
                Alternate        0.757     0.880𝑁    0.766𝑁𝑂   0.879𝑁    0.748𝑁𝑂   0.880𝑁    0.748𝑂    0.895𝑁
                TwoPhase-Fixed   0.765𝑁    0.866𝑁    0.765𝑁𝑂   0.870𝑁    0.748𝑁𝑂   0.867𝑁𝐴   0.745𝑂    0.870𝑁𝐴
                TwoPhase-Refine  0.769𝑁    0.875𝑁    0.767𝑁    0.878𝑁    0.748𝑁𝑂   0.877𝑁    0.747𝑂    0.892𝑁
                Threshold        0.766𝑁    0.876𝑁    0.767𝑁𝑂   0.877𝑁    0.746𝑂    0.874𝑁𝐴   0.745𝑂    0.881𝑁
                Greedy           0.754     0.874𝑁    0.757𝑂    0.873𝑁    0.744𝑂    0.878𝑁    0.748𝑂    0.894𝑁
SPLADE»MonoT5   Non-Adaptive     0.737     0.872     0.737     0.872     0.731     0.899     0.731     0.899
                Oracle           0.807     0.898     0.783     0.859     0.777     0.886     0.781     0.899
                Alternate        0.745𝑂    0.893     0.737𝑂    0.875     0.737𝑂    0.909     0.734𝑂    0.908
                TwoPhase-Fixed   0.763𝑂    0.863     0.764     0.869     0.748     0.868𝐴    0.742𝑂    0.867𝑁𝐴
                TwoPhase-Refine  0.769𝑂    0.875     0.764     0.870     0.748𝑂    0.877     0.736𝑂    0.869𝐴
                Threshold        0.766𝑂    0.871     0.759     0.857     0.746𝑂    0.874     0.744𝑂    0.865𝑁𝐴
                Greedy           0.747𝑁𝑂   0.895𝑁    0.740𝑂    0.882     0.734𝑂    0.903     0.734𝑂    0.906
Table 1
Re-ranking performance on TREC Deep Learning 2019 and 2020 using various agents. The best-
performing (non-oracle) agent in each section is listed in bold. Significant differences compared to the
Non-Adaptive, Oracle, and Alternate systems are marked with 𝑁, 𝑂, and 𝐴, respectively (paired t-test, 𝑝 < 0.05).


Following the original GAR paper, we use two corpus graphs, based on BM25 similarity and
TCT-ColBERT.
    As we will show in Section 6, TwoPhase and Threshold are rather sensitive to their parameters,
so the results we present use the parameters that yield the highest nDCG performance on TREC
DL 2019. For TwoPhase, we test first-phase cutoff 𝑘 values ranging from 100 to 900 (by 100). For
Threshold, we first identify the distribution of scores using the MS MARCO dev (small) dataset,
re-ranking 100 BM25 results. We then pick threshold 𝑟 values based on percentiles from this
distribution, from 0.02² to 0.20 (by 0.02).³
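   As a sketch, such percentile-derived thresholds can be computed as follows; dev_scores and its path are assumptions standing in for the monoT5 scores of the re-ranked dev results.

import numpy as np

# Map "top p% of dev scores" to absolute threshold values r.
dev_scores = np.load("dev_scores.npy")     # assumed precomputed monoT5 scores
percentiles = np.linspace(0.02, 0.20, 10)  # top 2%, 4%, ..., 20%
thresholds = np.quantile(dev_scores, 1.0 - percentiles)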


6. Results and Analysis
Table 1 presents the main results of our study.
² Here, 0.02 represents the top 2% of scores.
³ This range was chosen after an initial study of orders of magnitude from 0.001 to 0.3 identified it as the most
  promising.
   We first explore whether there exists room for improvement through alternative agents
(RQ1). In all cases, the Oracle agent, which has knowledge of the relevance assessments
when choosing which batches of documents to score, has a higher nDCG performance than
all other agents. The difference in performance can often be considerable; for instance, the
best-performing non-oracle agent over the TCT pipeline using a BM25 graph has an nDCG of
0.733, while the oracle achieves an nDCG of 0.793. Differences between non-oracle agents and
the Oracle are not always significant, however. We note that while nDCG is often improved
using either oracle technique, recall often suffers compared to alternative agents. Perhaps this
is unsurprising, given that the oracles aim to directly optimise nDCG (recall is only considered
insofar as it would improve the nDCG). Significant differences compared to the Oracle are
found more frequently when measuring nDCG using the TCT corpus graph (GAR𝑡𝑐𝑡 ) than the BM25
graph (GAR𝑏𝑚25 ).
   In summary, to answer RQ1, by exploring an oracle agent, we find that there is considerable
room for improvement in adaptive re-ranking. In particular, exploration of the semantic graph
(GAR𝑡𝑐𝑡 ) by non-oracle agents is likely sub-optimal and has room for considerable improvement.
   Next, we look to answer RQ2 and RQ3. We find each of the proposed alternative agents to
be the most effective in at least one measurement. In one case, an alternative agent significantly
outperforms the existing Alternate agent (Threshold when re-ranking BM25 results using
GAR𝑡𝑐𝑡 ). Further, there are 11 cases where an alternative strategy outperforms the non-adaptive
baseline while Alternate does not. However, there are sometimes significant reductions in effectiveness
when using the alternative agents (18 cases, all of which are on the held-out DL20 data). There is
also no clear pattern in which strategy is generally most effective. These results answer RQ2:
while some strategies can outperform Alternate, we do not find a clearly superior strategy
among the agents we test.
   Given that significant reductions are present on the held-out DL20 data, we dive
deeper into the performance of TwoPhase and Threshold by plotting their performance across the
parameters they introduce (TwoPhase adds the first-phase cutoff 𝑘 and Threshold adds the relevance
threshold 𝑟). Figure 1 presents the nDCG performance. We see that in all cases, performance
varies considerably as the parameters are adjusted. Further, the optimal parameters for DL19
are often different from the optimal parameters for DL20, suggesting that the alternative agents
are not very robust to parameter selection. This is in contrast with the
robustness of the Alternate approach to its parameters, shown by MacAvaney et al. [8]. These
results answer RQ3: TwoPhase and Threshold introduce parameters that are more sensitive than
those present in the Alternate strategy.


7. Conclusions & Outlook
In this work we investigated the recently-proposed Adaptive Re-Ranking process from the
perspective of information-seeking agents. We find that there is considerable room to improve
adaptive re-ranking by changing the agent strategy. We further proposed several alternative
strategies, and found that although they are promising, they are more sensitive to their pa-
rameters than the simple Alternate approach. This work sets the stage for future work that
investigates additional agent strategies, such as those inspired by multi-armed bandit models.
[Figure 1 comprises three groups of panels – TwoPhase-Refine, TwoPhase-Fixed, and Threshold –
each plotting nDCG against the first-phase cutoff 𝑘 (for TwoPhase) or the threshold 𝑟 (for Threshold),
for the BM25, TCT, D2Q, and SPLADE initial rankers, on DL19 and DL20, using both the GAR𝐵𝑀25
and GAR𝑇𝐶𝑇 corpus graphs.]

Figure 1: Performance of various proposed agents for various initial rankers and parameter values on
TREC DL 2019 and 2020. The upper limit of each shaded area represents the Alternate strategy, and the
lower limit represents the non-adaptive re-ranking performance.
Acknowledgements
Sean and Craig acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex,
Computationally- & Data-Intensive Analytics.


References
 [1] J. Lin, R. Nogueira, A. Yates, Pretrained Transformers for Text Ranking: BERT and Beyond,
     Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers,
     2021.
 [2] R. Xiao, J. Ji, B. Cui, H. Tang, W. Ou, Y. Xiao, J. Tan, X. Ju, Weakly supervised co-training of
     query rewriting and semantic matching for e-commerce, in: Proc. WSDM, 2019, pp. 402–410.
 [3] C. Macdonald, N. Tonellotto, I. Ounis, Efficient & effective selective query rewriting with
     efficiency predictions, in: Proc. SIGIR, 2017, pp. 495–504.
 [4] O. Khattab, M. Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized
     Late Interaction over BERT, in: Proc. SIGIR, 2020.
 [5] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage
     retrieval for open-domain question answering, 2020. URL: https://arxiv.org/abs/2004.04906.
     doi:10.48550/ARXIV.2004.04906.
 [6] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, A. Overwijk, Approximate
     nearest neighbor negative contrastive learning for dense text retrieval, in: Proceedings of
     ICLR, 2021.
 [7] J. Zhan, J. Mao, Y. Liu, J. Guo, M. Zhang, S. Ma, Optimizing Dense Retrieval Model Training
     with Hard Negatives, in: Proc. SIGIR, 2021, pp. 1503–1512.
 [8] S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in:
     31st ACM International Conference on Information and Knowledge Management, 2022.
     URL: https://arxiv.org/abs/2208.08942. doi:10.1145/3511808.3557231.
 [9] N. Jardine, C. J. van Rijsbergen, The use of hierarchic clustering in information retrieval,
     Information storage and retrieval 7 (1971).
[10] O. Kurland, The cluster hypothesis in information retrieval, in: Proc. SIGIR, 2013.
[11] E. M. Voorhees, The cluster hypothesis revisited, in: Proc. SIGIR, 1985.
[12] N. Tonellotto, C. Macdonald, I. Ounis, Efficient query processing for scalable web search,
     Foundations and Trends in Information Retrieval 12 (2018) 319–492.
[13] T.-Y. Liu, Learning to rank for information retrieval, Found. Trends Inf. Retr. 3 (2009)
     225–331. URL: https://doi.org/10.1561/1500000016. doi:10.1561/1500000016.
[14] C. Macdonald, R. L. T. Santos, I. Ounis, The whens and hows of learning to rank for web
     search, Information Retrieval 16 (2012).
[15] N. Tonellotto, C. Macdonald, I. Ounis, Efficient and effective retrieval using selective
     pruning, in: Proc. WSDM, 2013.
[16] R. Nogueira, K. Cho, Passage Re-ranking with BERT, arXiv preprint 1901.04085 (2019).
[17] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Trans. Big
     Data 7 (2021).
[18] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, Preprint:
     arXiv:1904.08375 (2019).
[19] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, Online preprint (2019).
[20] Z. Dai, J. Callan, Context-aware sentence/passage term importance estimation for first
     stage retrieval, Preprint: arXiv:1910.10687 (2019).
[21] L. Gao, Z. Dai, J. Callan, COIL: Revisit Exact Lexical Match in Information Retrieval with
     Contextualized Inverted List, in: Proc. NAACL-HLT, 2021, pp. 3030–3042.
[22] J. Lin, X. Ma, A few brief notes on DeepImpact, COIL, and a conceptual framework for
     information retrieval techniques, Preprint: arXiv:2106.14807 (2021).
[23] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes,
     in: Proc. SIGIR, 2021.
[24] A. Mallia, J. Mackenzie, T. Suel, N. Tonellotto, Faster Learned Sparse Retrieval with Guided
     Traversal, in: Proc. SIGIR, 2022, p. 5.
[25] T. Formal, B. Piwowarski, S. Clinchant, SPLADE: Sparse Lexical and Expansion Model for
     First Stage Ranking, in: Proc. SIGIR, 2021, p. 2288–2292.
[26] T. Formal, C. Lassance, B. Piwowarski, S. Clinchant, SPLADE v2: Sparse lexical and expansion
     model for information retrieval, 2021.
[27] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 3 ed., Prentice Hall, 2010.
[28] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, E. Voorhees, Overview of the TREC 2019 deep
     learning track, in: TREC 2019, 2019.
[29] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track,
     in: TREC, 2020.
[30] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-
     sequence model, in: Proc. EMNLP, 2020.
[31] S.-C. Lin, J.-H. Yang, J. J. Lin, Distilling dense representations for ranking using tightly-
     coupled teachers, ArXiv abs/2010.11386 (2020).
[32] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-
     coupled teachers for dense retrieval, in: RepL4NLP-2021 Workshop, 2021, pp. 163–173.