<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xueguang Ma</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joel Mackenzie</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Mallia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Melbourne</institution>
          ,
          <addr-line>Victoria</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the “logical” aspects of ranking (i.e., term weighting) and the “physical” aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>Text retrieval using “bag-of-words” exact match techniques can be distilled into a scoring function between a query q and a document d, S(q, d) = ∑<sub>t ∈ q ∩ d</sub> w(t, d), where w is some function of term statistics such as tf, idf, document length, etc. This formulation covers nearly all major families of retrieval models (probabilistic, vector space, language modeling, divergence from randomness, etc.) and is equivalent to the inner product of two weighted vectors of dimension |V|, where V is the vocabulary of the collection. Efficiently generating a top-k ranking of documents from an arbitrarily large collection D is performed using an inverted index that is traversed by a query evaluation algorithm—both components have been optimized over many decades of research.</p>
      <p>To borrow an analogy from database systems, such a design tightly couples the “logical” aspects of ranking—the definition of S(q, d)—with the “physical” aspects of ranking—how S(q, d) is computed for all d ∈ D to efficiently generate a top-k ranking. In the historical development of information retrieval, this tight coupling made sense because alternatives did not appear to be sufficiently compelling. That is, inverted indexes were the most sensible way to implement a ranking model, especially at scale.</p>
      <p>Nevertheless, we are not the first to explore the possibility of logical/physical decoupling in the context of information retrieval. Well over a decade ago, Héman et al. [1] demonstrated that a relational database can be adapted to perform text ranking directly. Specifically, inverted lists can be stored in tables and a ranking model can be expressed as an SQL query, thus leaving the database engine to handle physical query execution, decoupled from the logical specification of the ranking model. We can trace prototypes that attempt to integrate information retrieval and database systems back to the 1990s; see, for example, a special issue of the Bulletin of the Technical Committee on Data Engineering from March 1996, which includes a discussion by Fuhr [2] on the lack of “data independence” in information retrieval systems. Despite some follow-up work in 2014 by Mühleisen et al. [3], the idea of text ranking directly with a relational database never caught on.</p>
      <p>However, with growing recent interest in dense retrieval techniques, there are reasons to rethink this tight logical/physical coupling. As an initial exploration, we empirically show that different physical realizations of the same logical ranking model manifest different tradeoffs in terms of quality, time, and space. While this observation is certainly not novel—after all, researchers have been exploring the efficiency of query evaluation algorithms and index compression techniques for decades—we argue that our extension of this discussion to dense retrieval techniques provides a fresh perspective.</p>
      <p>Let us begin by highlighting the connections between dense and sparse (i.e., bag-of-words exact match) retrieval. Adopting the terminology of Lin et al. [4], dense retrieval can be captured by the following scoring function: S(q, d) = ϕ(η<sub>q</sub>(q), η<sub>d</sub>(d)), where the η are encoders that map queries and documents into representation vectors, typically using transformers. These learned dense representations are then compared using ϕ, which can range in complexity from a simple inner product to a (lightweight) neural network.</p>
      <p>Focusing on the case where ϕ is defined as an inner product (which encompasses many dense retrieval techniques [5, 6, 7, 8, 9, 10]), the top-k ranking problem is usually cast as nearest neighbor search. In many cases, a brute-force scan over the representation vectors suffices for latency-insensitive batch querying, e.g., with “flat” indexes in Facebook’s Faiss library [11], but in other cases, an approximate nearest neighbor (ANN) technique such as HNSW [12] is necessary for low-latency top-k ranking, e.g., using nmslib. Here, we note that the logical/physical dichotomy applies: the same logical ranking model (i.e., the definitions of the η and ϕ) can be realized physically in different ways, e.g., brute-force scans for batch querying or HNSW for low-latency retrieval.<sup>1</sup></p>
      <p>Further developing this connection, we note that the formulation of dense and sparse retrieval is mathematically equivalent. Specifically, S(q, d) in both cases is defined as the inner product between document and query vectors—the only difference lies in the characteristics of those vectors, e.g., their dimensionality, how they are computed, etc. This means that, in principle, we could “mix and match” logical and physical ranking models—and indeed, this has already been done before. For example, Teofili and Lin [13] evaluated a number of (not very efficient) techniques for performing top-k ranking on dense vectors using Lucene; Tu et al. [14] explored using HNSW for BM25 ranking.</p>
      <p>Here, we provide a case study that further investigates this idea for sparse retrieval. We began with DeepImpact [15] as the logical ranking model, on the MS MARCO passage corpus. As points of comparison, we also considered bag-of-words BM25, on both the original passages from the corpus and the results of applying document expansion with doc2query–T5 [16].</p>
      <p>We experimented with different physical ranking models: Anserini [17], PISA [18], and the HNSW [12] implementation in nmslib. Note that HNSW can perform search on real-valued DeepImpact representation vectors directly, but Anserini and PISA require quantization first (in both cases, into 8 bits). For HNSW, we use M = 40, efConstruction = 2500, and efSearch = 1000, similar to Tu et al. [14]. PISA runs using MaxScore processing after document reordering [19]. Experiments were conducted in memory on a Linux machine with two 3.50 GHz Intel Xeon Gold 6144 CPUs and 512 GiB of RAM. Results on the development queries are shown in Table 1, where we report output quality (MRR@10), query latency (ms), and index size (MB). In all cases, we set retrieval depth to k = 1000.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Experimental results on the development queries of the MS MARCO passage ranking test collection.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>Quality: MRR@10</th><th>Time: Latency (ms)</th><th>Space: Index Size (MB)</th></tr>
          </thead>
          <tbody>
            <tr><td colspan="4">Anserini (Lucene)</td></tr>
            <tr><td>(1a) Bag of words</td><td>0.187</td><td>40.1</td><td>661</td></tr>
            <tr><td>(1b) doc2query–T5</td><td>0.277</td><td>62.8</td><td>1036</td></tr>
            <tr><td>(1c) DeepImpact (quantized)</td><td>0.325</td><td>244.1</td><td>1417</td></tr>
            <tr><td colspan="4">PISA</td></tr>
            <tr><td>(2a) Bag of words</td><td>0.187</td><td>8.3</td><td>739</td></tr>
            <tr><td>(2b) doc2query–T5</td><td>0.276</td><td>11.9</td><td>1150</td></tr>
            <tr><td>(2c) DeepImpact (quantized)</td><td>0.326</td><td>19.4</td><td>1564</td></tr>
            <tr><td colspan="4">nmslib HNSW</td></tr>
            <tr><td>(3a) DeepImpact</td><td>0.299</td><td>21.9</td><td>6686</td></tr>
            <tr><td>(3b) DeepImpact (quantized)</td><td>0.298</td><td>22.5</td><td>6686</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We see that the same logical ranking model manifests a diverse range of quality/time/space tradeoffs in different physical implementations. Comparing Anserini and PISA, they achieve the same level of effectiveness (small differences due to tie-breaking effects), but PISA is quite a bit faster (although its indexes are slightly larger). HNSW quality is worse because of the approximate nature of nearest neighbor search, but its queries are much faster than Lucene and almost as fast as PISA (but require much larger indexes).</p>
      <p>What do we make of these results? In truth, the comparison between Lucene and PISA is not particularly surprising, as researchers have performed experiments along these lines for many decades—comparing alternative implementations of the same class of solutions, in this case, document-at-a-time query evaluation on inverted indexes. However, HNSW represents a fundamentally different physical realization of the logical ranking model, based on hierarchical navigable small-world graphs, an approach that is very different from inverted indexes. While it is true that HNSW currently does not provide a compelling solution—it is dominated by PISA in terms of both effectiveness and efficiency—HNSW is a relative newcomer. In contrast, PISA benefits from techniques that have been optimized and refined over decades. It is entirely possible that as HNSW receives more attention, the performance gap will close.</p>
      <p>Furthermore, dense and sparse representations are not discrete categories, but rather lie on a continuum. Currently, the size (in terms of the number of dimensions) of sparse representations equals the vocabulary size of the corpus, and dense representations typically have hundreds of dimensions. What if we “densify” sparse representations and “sparsify” dense representations—for example, to yield vectors that are ten thousand dimensions? How then will the tradeoffs manifest? We believe that separating the logical and physical aspects of ranking will enable future innovations to progress independently and provide a helpful framework for exploring quality, time, and space tradeoffs in future studies of dense and sparse retrieval, as well as hybrids and points in between.</p>
      <p><sup>1</sup>Although note that database engines are in general responsible for the faithful execution of a logical computation, which is not the case here with HNSW due to its approximations.</p>
      <p>DESIRES 2021 – 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September 15–18, 2021, Padua, Italy. jimmylin@uwaterloo.ca (J. Lin).</p>
      <p><bold>Acknowledgements</bold></p>
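      <p>To make the logical/physical separation concrete, the following toy sketch (entirely hypothetical data and code, not the implementation of any of the systems discussed above) computes the same logical model S(q, d) = ∑<sub>t ∈ q ∩ d</sub> w(t, d) under two physical realizations: document-at-a-time traversal of an inverted index, and a brute-force inner product over |V|-dimensional weight vectors. Because the logical model is identical, the two rankings agree.</p>
      <preformat>
```python
from collections import defaultdict

# Toy collection: precomputed term impacts w(t, d); the weights are
# made up for illustration.
impacts = {
    "d1": {"logical": 1.5, "physical": 0.7, "ranking": 2.1},
    "d2": {"physical": 1.9, "ranking": 0.4},
    "d3": {"logical": 0.9, "index": 1.2},
}
vocab = sorted({t for w in impacts.values() for t in w})
query = ["logical", "ranking"]

# Physical realization 1: inverted index, document-at-a-time traversal.
postings = defaultdict(list)  # term -> [(doc, weight), ...]
for doc, weights in impacts.items():
    for term, w in weights.items():
        postings[term].append((doc, w))

scores = defaultdict(float)
for term in query:
    for doc, w in postings[term]:
        scores[doc] += w
ranked_inverted = sorted(scores, key=lambda d: -scores[d])

# Physical realization 2: brute-force inner product over |V|-dim vectors.
def to_vec(weights):
    return [weights.get(t, 0.0) for t in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q_vec = to_vec({t: 1.0 for t in query})
ranked_dense = sorted(impacts, key=lambda d: -dot(q_vec, to_vec(impacts[d])))

print(ranked_inverted == ranked_dense)  # True: same logical model
```
      </preformat>
      <p>The same decoupling applies to learned sparse models such as DeepImpact: the weights change, but either physical realization (and others, such as HNSW over the same vectors) computes the same inner product.</p>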
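      <p>As noted above, Anserini and PISA require the real-valued DeepImpact weights to be quantized into 8 bits. The sketch below assumes simple linear quantization (our illustrative assumption; the actual schemes used by these systems may differ):</p>
      <preformat>
```python
def quantize(weights, bits=8):
    """Linearly map real-valued impact scores to integers in [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    w_max = max(weights)
    if w_max == 0:  # guard against an all-zero vector
        return [0] * len(weights)
    scale = levels / w_max
    return [round(w * scale) for w in weights]

# Hypothetical real-valued impact scores for one document's terms.
scores = [0.0, 0.37, 1.12, 2.96]
q = quantize(scores)  # e.g., [0, 32, 96, 255]
```
      </preformat>
      <p>Such a mapping is monotonic, so per-term orderings are preserved; summed document scores can still shift slightly, which is consistent with the small quality differences between real-valued and quantized runs in Table 1.</p>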
      <p>We’d like to thank Arjen de Vries for helpful comments on an earlier draft of this piece.</p>
      <p><bold>References</bold></p>
      <p>[1] S. Héman, M. Zukowski, A. P. de Vries, P. A. Boncz, Efficient and flexible information retrieval using MonetDB/X100, in: Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR 2007), Asilomar, California, 2007, pp. 96–101.</p>
      <p>[2] N. Fuhr, Models for integrated information retrieval and database systems, Bulletin of the Technical Committee on Data Engineering 19 (1996) 3–13.</p>
      <p>[3] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: column stores for IR prototyping, in: Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), Gold Coast, Australia, 2014, pp. 863–866.</p>
      <p>[4] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, arXiv:2010.06467 (2020).</p>
      <p>[5] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.</p>
      <p>[6] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. N. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.</p>
      <p>[7] S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, A. Hanbury, Improving efficient neural ranking models with cross-architecture knowledge distillation, arXiv:2010.02666 (2020).</p>
      <p>[8] L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, J. Callan, Complementing lexical retrieval with semantic residual embedding, in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part I, 2021, pp. 146–160.</p>
      <p>[9] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 113–122.</p>
      <p>[10] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 2021, pp. 163–173.</p>
      <p>[11] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, arXiv:1702.08734 (2017).</p>
      <p>[12] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, Transactions on Pattern Analysis and Machine Intelligence 42 (2020) 824–836.</p>
      <p>[13] T. Teofili, J. Lin, Lucene for approximate nearest-neighbors search on arbitrary dense vectors, arXiv:1910.10208 (2019).</p>
      <p>[14] Z. Tu, W. Yang, Z. Fu, Y. Xie, L. Tan, K. Xiong, M. Li, J. Lin, Approximate nearest neighbor search and lightweight dense vector reranking in multi-stage retrieval architectures, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, 2020, pp. 97–100.</p>
      <p>[15] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1723–1727.</p>
      <p>[16] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, 2019.</p>
      <p>[17] P. Yang, H. Fang, J. Lin, Anserini: Reproducible ranking baselines using Lucene, Journal of Data and Information Quality 10 (2018) Article 16.</p>
      <p>[18] A. Mallia, M. Siedlaczek, J. Mackenzie, T. Suel, PISA: performant indexes and search for academia, in: Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019), CEUR Workshop Proceedings Vol-2409, Paris, France, 2019, pp. 50–56.</p>
      <p>[19] J. Mackenzie, M. Petri, A. Moffat, Faster index reordering with bipartite graph partitioning, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1910–1914.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>