<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jimmy Lin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xueguang Ma</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joel Mackenzie</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Mallia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The University of Melbourne</institution>
          ,
          <addr-line>Victoria</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Waterloo</institution>
          ,
          <addr-line>Ontario</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the “logical” aspects of ranking (i.e., term weighting) and the “physical” aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>Text retrieval using “bag-of-words” exact match techniques can be distilled into a scoring function between a query q and a document d, S(q, d) = ∑<sub>t ∈ q ∩ d</sub> w(t, d), where w is some function of term statistics such as tf, idf, document length, etc. This formulation covers nearly all major families of retrieval models (probabilistic, vector space, language modeling, divergence from randomness, etc.) and is equivalent to the inner product of two weighted vectors of dimension |V|, where V is the vocabulary of the collection. Efficiently generating a top-k ranking of documents from an arbitrarily large collection D is performed using an inverted index that is traversed by a query evaluation algorithm—both components have been optimized over many decades of research.</p>
      <p>To borrow an analogy from database systems, such a design tightly couples the “logical” aspects of ranking—the definition of S(q, d)—with the “physical” aspects of ranking—how S(q, d) is computed for all d ∈ D to efficiently generate a top-k ranking. In the historical development of information retrieval, this tight coupling made sense because alternatives did not appear to be sufficiently compelling. That is, inverted indexes were the most sensible way to implement a ranking model, especially at scale.</p>
      <p>Nevertheless, we are not the first to explore the possibility of logical/physical decoupling in the context of information retrieval. Well over a decade ago, Héman et al. [1] demonstrated that a relational database can be adapted to perform text ranking directly. Specifically, inverted lists can be stored in tables and a ranking model can be expressed as an SQL query, thus leaving the database engine to handle physical query execution, decoupled from the logical specification of the ranking model. We can trace prototypes that attempt to integrate information retrieval and database systems back to the 1990s; see, for example, a special issue of the Bulletin of the Technical Committee on Data Engineering from March 1996, which includes a discussion by Fuhr [2] on the lack of “data independence” in information retrieval systems. Despite some follow-up work in 2014 by Mühleisen et al. [3], the idea of text ranking directly with a relational database never caught on.</p>
      <p>However, with growing recent interest in dense retrieval techniques, there are reasons to rethink this tight logical/physical coupling. As an initial exploration, we empirically show that different physical realizations of the same logical ranking model manifest different tradeoffs in terms of quality, time, and space. While this observation is certainly not novel—after all, researchers have been exploring the efficiency of query evaluation algorithms and index compression techniques for decades—we argue that our extension of this discussion to dense retrieval techniques provides a fresh perspective.</p>
      <p>Let us begin by highlighting the connections between dense and sparse (i.e., bag-of-words exact match) retrieval. Adopting the terminology of Lin et al. [4], dense retrieval can be captured by the following scoring function: S(q, d) = ϕ(η<sub>q</sub>(q), η<sub>d</sub>(d)), where the η are encoders that map queries and documents into representation vectors, typically using transformers. These learned dense representations are then compared using ϕ, which can range in complexity from a simple inner product to a (lightweight) neural network.</p>
      <p>Focusing on the case where ϕ is defined as an inner product (which encompasses many dense retrieval techniques [5, 6, 7, 8, 9, 10]), the top-k ranking problem is usually cast as nearest neighbor search. In many cases, a brute-force scan over the representation vectors suffices for latency-insensitive batch querying, e.g., with “flat” indexes in Facebook’s Faiss library [11], but in other cases, an approximate nearest neighbor (ANN) technique such as HNSW [12] is necessary for low-latency top-k ranking, e.g., using nmslib. Here, we note that the logical/physical dichotomy applies: the same logical ranking model (i.e., the definitions of the η and ϕ) can be realized physically in different ways, e.g., brute-force scans for batch querying or HNSW for low-latency retrieval.<sup>1</sup></p>
      <p>Further developing this connection, we note that the formulation of dense and sparse retrieval is mathematically equivalent. Specifically, S(q, d) in both cases is defined as the inner product between document and query vectors—the only difference lies in the characteristics of those vectors, e.g., their dimensionality, how they are computed, etc. This means that, in principle, we could “mix and match” logical and physical ranking models—and indeed, this has already been done before. For example, Teofili and Lin [13] evaluated a number of (not very efficient) techniques for performing top-k ranking on dense vectors using Lucene; Tu et al. [14] explored using HNSW for BM25 ranking.</p>
      <p>Here, we provide a case study that further investigates this idea for sparse retrieval. We began with DeepImpact [15] as the logical ranking model, on the MS MARCO passage corpus. As points of comparison, we also considered bag-of-words BM25, on both the original passages from the corpus and the results of applying document expansion with doc2query–T5 [16].</p>
      <p>We experimented with different physical ranking models: Anserini [17], PISA [18], and the HNSW [12] implementation in nmslib. Note that HNSW can perform search on real-valued DeepImpact representation vectors directly, but Anserini and PISA require quantization first (in both cases, into 8 bits). For HNSW, we use M = 40, efConstruction = 2500, and efSearch = 1000, similar to Tu et al. [14]. PISA runs using MaxScore processing after document reordering [19]. Experiments were conducted in memory on a Linux machine with two 3.50 GHz Intel Xeon Gold 6144 CPUs and 512 GiB of RAM. Results on the development queries are shown in Table 1, where we report output quality (MRR@10), query latency (ms), and index size (MB). In all cases, we set retrieval depth to k = 1000.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Experimental results on the development queries of the MS MARCO passage ranking test collection.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Method</th><th>Quality: MRR@10</th><th>Time: Latency (ms)</th><th>Space: Index Size (MB)</th></tr>
          </thead>
          <tbody>
            <tr><td colspan="4">Anserini (Lucene)</td></tr>
            <tr><td>(1a) Bag of words</td><td>0.187</td><td>40.1</td><td>661</td></tr>
            <tr><td>(1b) doc2query–T5</td><td>0.277</td><td>62.8</td><td>1036</td></tr>
            <tr><td>(1c) DeepImpact (quantized)</td><td>0.325</td><td>244.1</td><td>1417</td></tr>
            <tr><td colspan="4">PISA</td></tr>
            <tr><td>(2a) Bag of words</td><td>0.187</td><td>8.3</td><td>739</td></tr>
            <tr><td>(2b) doc2query–T5</td><td>0.276</td><td>11.9</td><td>1150</td></tr>
            <tr><td>(2c) DeepImpact (quantized)</td><td>0.326</td><td>19.4</td><td>1564</td></tr>
            <tr><td colspan="4">nmslib HNSW</td></tr>
            <tr><td>(3a) DeepImpact</td><td>0.299</td><td>21.9</td><td>6686</td></tr>
            <tr><td>(3b) DeepImpact (quantized)</td><td>0.298</td><td>22.5</td><td>6686</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We see that the same logical ranking model manifests a diverse range of quality/time/space tradeoffs in different physical implementations. Comparing Anserini and PISA, they achieve the same level of effectiveness (small differences due to tie-breaking effects), but PISA is quite a bit faster (although its indexes are slightly larger). HNSW quality is worse because of the approximate nature of nearest neighbor search, but its queries are much faster than Lucene and almost as fast as PISA (but require much larger indexes).</p>
      <p>What do we make of these results? In truth, the comparison between Lucene and PISA is not particularly surprising, as researchers have performed experiments along these lines for many decades—comparing alternative implementations of the same class of solutions, in this case, document-at-a-time query evaluation on inverted indexes. However, HNSW represents a fundamentally different physical realization of the logical ranking model, based on hierarchical navigable small-world graphs, an approach that is very different from inverted indexes. While it is true that HNSW currently does not provide a compelling solution—it is dominated by PISA in terms of both effectiveness and efficiency—HNSW is a relative newcomer. In contrast, PISA benefits from techniques that have been optimized and refined over decades. It is entirely possible that as HNSW receives more attention, the performance gap will close.</p>
      <p>Furthermore, dense and sparse representations are not discrete categories, but rather lie on a continuum. Currently, the size (in terms of the number of dimensions) of sparse representations equals the vocabulary size of the corpus, and dense representations typically have hundreds of dimensions. What if we “densify” sparse representations and “sparsify” dense representations—for example, to yield vectors that are ten thousand dimensions? How then will the tradeoffs manifest? We believe that separating the logical and physical aspects of ranking will enable future innovations to progress independently and provide a helpful framework for exploring quality, time, and space tradeoffs in future studies of dense and sparse retrieval, as well as hybrids and points in between.</p>
      <p><sup>1</sup>Although note that database engines are in general responsible for the faithful execution of a logical computation, which is not the case here with HNSW due to its approximations.</p>
      <p>DESIRES 2021 – 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September 15–18, 2021, Padua, Italy. jimmylin@uwaterloo.ca (J. Lin).</p>
      <p><bold>Acknowledgements</bold></p>
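      <p>To make the logical/physical separation concrete, the following toy sketch (entirely hypothetical data and code, not the implementation of any of the systems discussed above) computes the same logical model S(q, d) = ∑<sub>t ∈ q ∩ d</sub> w(t, d) under two physical realizations: document-at-a-time traversal of an inverted index, and a brute-force inner product over |V|-dimensional weight vectors. Because the logical model is identical, the two rankings agree.</p>
      <preformat>
```python
from collections import defaultdict

# Toy collection: precomputed term impacts w(t, d); the weights are
# made up for illustration.
impacts = {
    "d1": {"logical": 1.5, "physical": 0.7, "ranking": 2.1},
    "d2": {"physical": 1.9, "ranking": 0.4},
    "d3": {"logical": 0.9, "index": 1.2},
}
vocab = sorted({t for w in impacts.values() for t in w})
query = ["logical", "ranking"]

# Physical realization 1: inverted index, document-at-a-time traversal.
postings = defaultdict(list)  # term -> [(doc, weight), ...]
for doc, weights in impacts.items():
    for term, w in weights.items():
        postings[term].append((doc, w))

scores = defaultdict(float)
for term in query:
    for doc, w in postings[term]:
        scores[doc] += w
ranked_inverted = sorted(scores, key=lambda d: -scores[d])

# Physical realization 2: brute-force inner product over |V|-dim vectors.
def to_vec(weights):
    return [weights.get(t, 0.0) for t in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q_vec = to_vec({t: 1.0 for t in query})
ranked_dense = sorted(impacts, key=lambda d: -dot(q_vec, to_vec(impacts[d])))

print(ranked_inverted == ranked_dense)  # True: same logical model
```
      </preformat>
      <p>The same decoupling applies to learned sparse models such as DeepImpact: the weights change, but either physical realization (and others, such as HNSW over the same vectors) computes the same inner product.</p>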
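      <p>As noted above, Anserini and PISA require the real-valued DeepImpact weights to be quantized into 8 bits. The sketch below assumes simple linear quantization (our illustrative assumption; the actual schemes used by these systems may differ):</p>
      <preformat>
```python
def quantize(weights, bits=8):
    """Linearly map real-valued impact scores to integers in [0, 2^bits - 1]."""
    levels = 2 ** bits - 1
    w_max = max(weights)
    if w_max == 0:  # guard against an all-zero vector
        return [0] * len(weights)
    scale = levels / w_max
    return [round(w * scale) for w in weights]

# Hypothetical real-valued impact scores for one document's terms.
scores = [0.0, 0.37, 1.12, 2.96]
q = quantize(scores)  # e.g., [0, 32, 96, 255]
```
      </preformat>
      <p>Such a mapping is monotonic, so per-term orderings are preserved; summed document scores can still shift slightly, which is consistent with the small quality differences between real-valued and quantized runs in Table 1.</p>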
      <p>We’d like to thank Arjen de Vries for helpful comments on an earlier draft of this piece.</p>
      <p><bold>References</bold></p>
      <p>[1] S. Héman, M. Zukowski, A. P. de Vries, P. A. Boncz, Efficient and flexible information retrieval using MonetDB/X100, in: Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR 2007), Asilomar, California, 2007, pp. 96–101.</p>
      <p>[2] N. Fuhr, Models for integrated information retrieval and database systems, Bulletin of the Technical Committee on Data Engineering 19 (1996) 3–13.</p>
      <p>[3] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: column stores for IR prototyping, in: Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), Gold Coast, Australia, 2014, pp. 863–866.</p>
      <p>[4] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, arXiv:2010.06467 (2020).</p>
      <p>[5] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.</p>
      <p>[6] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. N. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.</p>
      <p>[7] S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, A. Hanbury, Improving efficient neural ranking models with cross-architecture knowledge distillation, arXiv:2010.02666 (2020).</p>
      <p>[8] L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, J. Callan, Complementing lexical retrieval with semantic residual embedding, in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part I, 2021, pp. 146–160.</p>
      <p>[9] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 113–122.</p>
      <p>[10] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 2021, pp. 163–173.</p>
      <p>[11] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, arXiv:1702.08734 (2017).</p>
      <p>[12] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, Transactions on Pattern Analysis and Machine Intelligence 42 (2020) 824–836.</p>
      <p>[13] T. Teofili, J. Lin, Lucene for approximate nearest-neighbors search on arbitrary dense vectors, arXiv:1910.10208 (2019).</p>
      <p>[14] Z. Tu, W. Yang, Z. Fu, Y. Xie, L. Tan, K. Xiong, M. Li, J. Lin, Approximate nearest neighbor search and lightweight dense vector reranking in multi-stage retrieval architectures, in: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, 2020, pp. 97–100.</p>
      <p>[15] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1723–1727.</p>
      <p>[16] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, 2019.</p>
      <p>[17] P. Yang, H. Fang, J. Lin, Anserini: Reproducible ranking baselines using Lucene, Journal of Data and Information Quality 10 (2018) Article 16.</p>
      <p>[18] A. Mallia, M. Siedlaczek, J. Mackenzie, T. Suel, PISA: performant indexes and search for academia, in: Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019), CEUR Workshop Proceedings Vol-2409, Paris, France, 2019, pp. 50–56.</p>
      <p>[19] J. Mackenzie, M. Petri, A. Moffat, Faster index reordering with bipartite graph partitioning, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1910–1914.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>