On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications

Jimmy Lin¹, Xueguang Ma¹, Joel Mackenzie², Antonio Mallia³
¹ University of Waterloo, Ontario, Canada
² The University of Melbourne, Victoria, Australia
³ New York University, New York, USA

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
jimmylin@uwaterloo.ca (J. Lin)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the "logical" aspects of ranking (i.e., term weighting) and the "physical" aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.

Text retrieval using "bag-of-words" exact match techniques can be distilled into a scoring function between a query 𝑞 and a document 𝑑, 𝑆(𝑞, 𝑑) = ∑_{𝑡 ∈ 𝑞∩𝑑} 𝑓(𝑡), where 𝑓 is some function of term statistics such as tf, idf, document length, etc. This formulation covers nearly all major families of retrieval models (probabilistic, vector space, language modeling, divergence from randomness, etc.) and is equivalent to the inner product of two weighted vectors of dimension |𝑉|, where 𝑉 is the vocabulary of the collection. Efficiently generating a top-𝑘 ranking of documents from an arbitrarily large collection 𝒞 is performed using an inverted index that is traversed by a query evaluation algorithm—both components have been optimized over many decades of research.

To borrow an analogy from database systems, such a design tightly couples the "logical" aspects of ranking—the definition of 𝑆(𝑞, 𝑑)—with the "physical" aspects of ranking—how 𝑆(𝑞, 𝑑) is computed for all 𝑑 ∈ 𝒞 to efficiently generate a top-𝑘 ranking. In the historical development of information retrieval, this tight coupling made sense because alternatives did not appear to be sufficiently compelling. That is, inverted indexes were the most sensible way to implement a ranking model, especially at scale.

Nevertheless, we are not the first to explore the possibility of logical/physical decoupling in the context of information retrieval. Well over a decade ago, Héman et al. [1] demonstrated that a relational database can be adapted to perform text ranking directly. Specifically, inverted lists can be stored in tables and a ranking model can be expressed as an SQL query, thus leaving the database engine to handle physical query execution, decoupled from the logical specification of the ranking model. We can trace prototypes that attempt to integrate information retrieval and database systems back to the 1990s; see, for example, a special issue of the Bulletin of the Technical Committee on Data Engineering from March 1996, which includes a discussion by Fuhr [2] on the lack of "data independence" in information retrieval systems. Despite some follow-up work in 2014 by Mühleisen et al. [3], the idea of text ranking directly with a relational database never caught on.

However, with growing recent interest in dense retrieval techniques, there are reasons to rethink this tight logical/physical coupling. As an initial exploration, we empirically show that different physical realizations of the same logical ranking model manifest different tradeoffs in terms of quality, time, and space. While this observation is certainly not novel—after all, researchers have been exploring the efficiency of query evaluation algorithms and index compression techniques for decades—we argue that our extension of this discussion to dense retrieval techniques provides a fresh perspective.

Let us begin by highlighting the connections between dense and sparse (i.e., bag-of-words exact match) retrieval. Adopting the terminology of Lin et al. [4], dense retrieval can be captured by the following scoring function: 𝑆(𝑞, 𝑑) = 𝜑(𝜂𝑞(𝑞), 𝜂𝑑(𝑑)), where the encoders 𝜂𝑞 and 𝜂𝑑 map queries and documents into representation vectors, typically using transformers. These learned dense representations are then compared using 𝜑, which can range in complexity from a simple inner product to a (lightweight) neural network.

Focusing on the case where 𝜑 is defined as an inner product (which encompasses many dense retrieval techniques [5, 6, 7, 8, 9, 10]), the top-𝑘 ranking problem is usually cast as nearest neighbor search. In many cases, a brute-force scan over the representation vectors suffices for latency-insensitive batch querying, e.g., with "flat" indexes in Facebook's Faiss library [11], but in other cases, an approximate nearest neighbor (ANN) technique such as HNSW [12] is necessary for low-latency top-𝑘 ranking, e.g., using nmslib. Here, we note that the logical/physical dichotomy applies: the same logical ranking model (i.e., the definitions of 𝜂𝑞, 𝜂𝑑, and 𝜑) can be realized physically in different ways, e.g., brute-force scans for batch querying or HNSW for low-latency retrieval.¹

Further developing this connection, we note that the formulations of dense and sparse retrieval are mathematically equivalent. Specifically, 𝑆(𝑞, 𝑑) in both cases is defined as the inner product between document and query vectors—the only difference lies in the characteristics of those vectors, e.g., their dimensionality, how they are computed, etc. This means that, in principle, we could "mix and match" logical and physical ranking models—and indeed, this has already been done before. For example, Teofili and Lin [13] evaluated a number of (not very efficient) techniques for performing top-𝑘 ranking on dense vectors using Lucene; Tu et al. [14] explored using HNSW for BM25 ranking.

Here, we provide a case study that further investigates this idea for sparse retrieval. We began with DeepImpact [15] as the logical ranking model, on the MS MARCO passage corpus. As points of comparison, we also considered bag-of-words BM25, on both the original passages from the corpus and the results of applying document expansion with doc2query–T5 [16].

We experimented with different physical ranking models: Anserini [17], PISA [18], and the HNSW [12] implementation in nmslib. Note that HNSW can perform search on real-valued DeepImpact representation vectors directly, but Anserini and PISA require quantization first (in both cases, into 8 bits). For HNSW, we use 𝑀 = 40, efConstruction = 2500, and efSearch = 1000, similar to Tu et al. [14]. PISA runs using MaxScore processing after document reordering [19]. Experiments were conducted in memory on a Linux machine with two 3.50 GHz Intel Xeon Gold 6144 CPUs and 512 GiB of RAM. Results on the development queries are shown in Table 1, where we report output quality (MRR@10), query latency (ms), and index size (MB). In all cases, we set the retrieval depth to 𝑘 = 1000.

Table 1: Experimental results on the development queries of the MS MARCO passage ranking test collection.

Method                          MRR@10    Latency (ms)    Index Size (MB)
Anserini (Lucene)
(1a) Bag of words               0.187     40.1            661
(1b) doc2query–T5               0.277     62.8            1036
(1c) DeepImpact (quantized)     0.325     244.1           1417
PISA
(2a) Bag of words               0.187     8.3             739
(2b) doc2query–T5               0.276     11.9            1150
(2c) DeepImpact (quantized)     0.326     19.4            1564
nmslib HNSW
(3a) DeepImpact                 0.299     21.9            6686
(3b) DeepImpact (quantized)     0.298     22.5            6686

We see that the same logical ranking model manifests a diverse range of quality/time/space tradeoffs in different physical implementations. Comparing Anserini and PISA, they achieve the same level of effectiveness (small differences are due to tie-breaking effects), but PISA is quite a bit faster (although its indexes are slightly larger). HNSW quality is worse because of the approximate nature of nearest neighbor search, but its queries are much faster than Lucene and almost as fast as PISA (but require much larger indexes).

What do we make of these results? In truth, the comparison between Lucene and PISA is not particularly surprising, as researchers have performed experiments along these lines for many decades—comparing alternative implementations of the same class of solutions, in this case, document-at-a-time query evaluation on inverted indexes. However, HNSW represents a fundamentally different physical realization of the logical ranking model, based on hierarchical navigable small-world graphs, an approach that is very different from inverted indexes. While it is true that HNSW does not currently provide a compelling solution—it is dominated by PISA in terms of both effectiveness and efficiency—HNSW is a relative newcomer. In contrast, PISA benefits from techniques that have been optimized and refined over decades. It is entirely possible that as HNSW receives more attention, the performance gap will close.

Furthermore, dense and sparse representations are not discrete categories, but rather lie on a continuum. Currently, the size (in terms of the number of dimensions) of sparse representations equals the vocabulary size of the corpus, while dense representations typically have hundreds of dimensions. What if we "densify" sparse representations and "sparsify" dense representations—for example, to yield vectors with ten thousand dimensions? How then will the tradeoffs manifest? We believe that separating the logical and physical aspects of ranking will enable future innovations to progress independently and provide a helpful framework for exploring quality, time, and space tradeoffs in future studies of dense and sparse retrieval, as well as hybrids and points in between.
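To make the logical/physical distinction concrete, the following sketch (a toy illustration with made-up term weights, not output from any of the systems discussed) fixes a single logical ranking model, 𝑆(𝑞, 𝑑) = ∑_{𝑡 ∈ 𝑞∩𝑑} 𝑓(𝑡), and realizes it physically in two different ways: a brute-force scan over document vectors and a term-at-a-time traversal of an inverted index. The two realizations return identical rankings; only how the scores are computed differs.

```python
# One logical ranking model, two physical realizations (toy sketch).
# Term weights are invented for illustration; in practice they would come
# from a model such as BM25 or DeepImpact.

# Documents as sparse vectors: term -> precomputed impact weight.
docs = {
    "d1": {"logical": 1.2, "physical": 0.8, "ranking": 2.0},
    "d2": {"ranking": 1.5, "models": 1.1},
    "d3": {"physical": 2.2, "models": 0.3, "ranking": 0.4},
}
query = ["physical", "ranking"]

def topk_brute_force(query, docs, k):
    """Physical realization 1: score every document vector directly."""
    scores = {d: sum(w.get(t, 0.0) for t in query) for d, w in docs.items()}
    return sorted(scores, key=lambda d: -scores[d])[:k]

def topk_inverted_index(query, docs, k):
    """Physical realization 2: term-at-a-time traversal of an inverted index."""
    index = {}  # term -> postings list of (doc, weight)
    for d, w in docs.items():
        for t, weight in w.items():
            index.setdefault(t, []).append((d, weight))
    acc = {}  # per-document score accumulators
    for t in query:
        for d, weight in index.get(t, []):
            acc[d] = acc.get(d, 0.0) + weight
    return sorted(acc, key=lambda d: -acc[d])[:k]

# Same logical model, so both physical realizations agree on the ranking.
assert topk_brute_force(query, docs, 2) == topk_inverted_index(query, docs, 2)
```

Real systems differ from this sketch in exactly the "physical" dimensions the case study measures: compression, traversal order, and early-termination strategies such as MaxScore.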
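The dense formulation 𝑆(𝑞, 𝑑) = 𝜑(𝜂𝑞(𝑞), 𝜂𝑑(𝑑)) has the same shape when 𝜑 is an inner product. In the sketch below, the transformer encoders are replaced by trivial hypothetical stand-ins, since only the structure of the computation matters here; an ANN structure such as HNSW could substitute for the brute-force scan, trading exactness for lower query latency.

```python
# Dense retrieval sketch: S(q, d) = inner product of encoder outputs.
# The "encoders" below are hypothetical stand-ins for learned transformers.

def encode_query(q):
    # Stand-in for a learned query encoder eta_q.
    return [1.0, 0.0, 2.0]

def encode_doc(d):
    # Stand-in for a learned document encoder eta_d.
    return {"d1": [0.5, 1.0, 1.0], "d2": [2.0, 0.0, 0.1]}[d]

def inner(u, v):
    """phi: a simple inner product."""
    return sum(a * b for a, b in zip(u, v))

def topk_dense(q, doc_ids, k):
    """Brute-force scan: exact, and fine for latency-insensitive batch
    querying; low-latency settings would swap in an ANN index instead."""
    scores = {d: inner(encode_query(q), encode_doc(d)) for d in doc_ids}
    return sorted(scores, key=lambda d: -scores[d])[:k]
```

Note that `topk_dense` is line-for-line the brute-force realization from the sparse sketch; only the provenance of the vectors changed, which is the mix-and-match point.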
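As noted, Anserini and PISA require the real-valued DeepImpact weights to be quantized into 8 bits. We do not detail the exact scheme used in the experiments; the sketch below assumes simple linear quantization onto the 256 available integer levels.

```python
def quantize_8bit(weights, num_bits=8):
    """Linearly map real-valued impact weights onto integers in
    [0, 2^num_bits - 1]. A sketch only; production systems may choose
    the range or binning scheme differently."""
    levels = (1 << num_bits) - 1
    lo, hi = min(weights), max(weights)
    scale = levels / (hi - lo) if hi > lo else 0.0
    return [round((w - lo) * scale) for w in weights]

impacts = [0.07, 1.93, 0.55, 3.12]
codes = quantize_8bit(impacts)  # minimum maps to 0, maximum to 255
```

The small MRR@10 gap between rows (3a) and (3b) of Table 1 suggests that this kind of 8-bit discretization costs little effectiveness.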
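The densify/sparsify question can also be prototyped cheaply. One possible instantiation, chosen here purely for illustration and not something evaluated in this paper, is signed feature hashing: folding a |𝑉|-dimensional sparse vector into a fixed number of dimensions while approximately preserving inner products (hash collisions introduce noise).

```python
import hashlib

def densify(sparse_vec, dims=10_000):
    """Fold a sparse term->weight vector into a fixed-width dense vector
    via signed feature hashing. A sketch: collisions perturb inner
    products, so this only approximates the sparse scoring model."""
    dense = [0.0] * dims
    for term, weight in sparse_vec.items():
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0  # sign bit reduces bias
        dense[h % dims] += sign * weight
    return dense

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

q = densify({"physical": 1.0, "ranking": 1.0})
d = densify({"physical": 0.8, "ranking": 2.0, "models": 1.1})
# Absent hash collisions among these terms, inner(q, d) recovers the
# sparse inner product 0.8 + 2.0 = 2.8.
```

Vectors produced this way could then be handed to any dense physical layer (a flat index, HNSW, etc.), which is one way the tradeoff questions above could be studied empirically.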
Acknowledgements
We'd like to thank Arjen de Vries for helpful comments on an earlier draft of this piece.

¹ Although note that database engines are in general responsible for the faithful execution of a logical computation, which is not the case here with HNSW due to its approximations.

References
[1] S. Héman, M. Zukowski, A. P. de Vries, P. A. Boncz, Efficient and flexible information retrieval using MonetDB/X100, in: Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR 2007), Asilomar, California, 2007, pp. 96–101.
[2] N. Fuhr, Models for integrated information retrieval and database systems, Bulletin of the Technical Committee on Data Engineering 19 (1996) 3–13.
[3] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: column stores for IR prototyping, in: Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), Gold Coast, Australia, 2014, pp. 863–866.
[4] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, arXiv:2010.06467 (2020).
[5] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.
[6] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. N. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.
[7] S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, A. Hanbury, Improving efficient neural ranking models with cross-architecture knowledge distillation, arXiv:2010.02666 (2020).
[8] L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, J. Callan, Complementing lexical retrieval with semantic residual embedding, in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part I, 2021, pp. 146–160.
[9] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 113–122.
[10] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 2021, pp. 163–173.
[11] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, arXiv:1702.08734 (2017).
[12] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, Transactions on Pattern Analysis and Machine Intelligence 42 (2020) 824–836.
[13] T. Teofili, J. Lin, Lucene for approximate nearest-neighbors search on arbitrary dense vectors, arXiv:1910.10208 (2019).
[14] Z. Tu, W. Yang, Z. Fu, Y. Xie, L. Tan, K. Xiong, M. Li, J. Lin, Approximate nearest neighbor search and lightweight dense vector reranking in multi-stage retrieval architectures, in: Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval, 2020, pp. 97–100.
[15] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1723–1727.
[16] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, 2019.
[17] P. Yang, H. Fang, J. Lin, Anserini: Reproducible ranking baselines using Lucene, Journal of Data and Information Quality 10 (2018) Article 16.
[18] A. Mallia, M. Siedlaczek, J. Mackenzie, T. Suel, PISA: performant indexes and search for academia, in: Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019): CEUR Workshop Proceedings Vol-2409, Paris, France, 2019, pp. 50–56.
[19] J. Mackenzie, M. Petri, A. Moffat, Faster index reordering with bipartite graph partitioning, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1910–1914.