On the Separation of Logical and Physical Ranking Models for Text Retrieval Applications

Jimmy Lin¹, Xueguang Ma¹, Joel Mackenzie², Antonio Mallia³
¹ University of Waterloo, Ontario, Canada
² The University of Melbourne, Victoria, Australia
³ New York University, New York, USA

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
jimmylin@uwaterloo.ca (J. Lin)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

Abstract
Text retrieval using bags of words is typically formulated as inner products between vector representations of queries and documents, realized in query evaluation algorithms that traverse postings in an inverted index. Viewed in database terms, this captures a tight coupling between the "logical" aspects of ranking (i.e., term weighting) and the "physical" aspects of ranking (query evaluation). We argue that explicitly decoupling these two aspects offers a framework for thinking about the relationship between sparse retrieval techniques and the rapidly growing literature on dense retrieval techniques.

Text retrieval using "bag-of-words" exact match techniques can be distilled into a scoring function between a query 𝑞 and a document 𝑑, 𝑆(𝑞, 𝑑) = ∑_{𝑡 ∈ 𝑞∩𝑑} 𝑓(𝑡), where 𝑓 is some function of term statistics such as tf, idf, document length, etc. This formulation covers nearly all major families of retrieval models (probabilistic, vector space, language modeling, divergence from randomness, etc.) and is equivalent to the inner product of two weighted vectors of dimension |𝑉|, where 𝑉 is the vocabulary of the collection. Efficiently generating a top-𝑘 ranking of documents from an arbitrarily large collection 𝒞 is performed using an inverted index that is traversed by a query evaluation algorithm—both components have been optimized over many decades of research.

To borrow an analogy from database systems, such a design tightly couples the "logical" aspects of ranking—the definition of 𝑆(𝑞, 𝑑)—with the "physical" aspects of ranking—how 𝑆(𝑞, 𝑑) is computed for all 𝑑 ∈ 𝒞 to efficiently generate a top-𝑘 ranking. In the historical development of information retrieval, this tight coupling made sense because alternatives did not appear to be sufficiently compelling. That is, inverted indexes were the most sensible way to implement a ranking model, especially at scale.

Nevertheless, we are not the first to explore the possibility of logical/physical decoupling in the context of information retrieval. Well over a decade ago, Héman et al. [1] demonstrated that a relational database can be adapted to perform text ranking directly. Specifically, inverted lists can be stored in tables and a ranking model can be expressed as an SQL query, thus leaving the database engine to handle physical query execution, decoupled from the logical specification of the ranking model. We can trace prototypes that attempt to integrate information retrieval and database systems back to the 1990s; see, for example, a special issue of the Bulletin of the Technical Committee on Data Engineering from March 1996, which includes a discussion by Fuhr [2] on the lack of "data independence" in information retrieval systems. Despite some follow-up work in 2014 by Mühleisen et al. [3], the idea of text ranking directly with a relational database never caught on.

However, with growing recent interest in dense retrieval techniques, there are reasons to rethink this tight logical/physical coupling. As an initial exploration, we empirically show that different physical realizations of the same logical ranking model manifest different tradeoffs in terms of quality, time, and space. While this observation is certainly not novel—after all, researchers have been exploring the efficiency of query evaluation algorithms and index compression techniques for decades—we argue that our extension of this discussion to dense retrieval techniques provides a fresh perspective.

Let us begin by highlighting the connections between dense and sparse (i.e., bag-of-words exact match) retrieval. Adopting the terminology of Lin et al. [4], dense retrieval can be captured by the following scoring function: 𝑆(𝑞, 𝑑) = 𝜑(𝜂𝑞(𝑞), 𝜂𝑑(𝑑)), where the encoders 𝜂𝑞 and 𝜂𝑑 map queries and documents into representation vectors, typically using transformers. These learned dense representations are then compared using 𝜑, which can range in complexity from a simple inner product to a (lightweight) neural network.

Focusing on the case where 𝜑 is defined as an inner product (which encompasses many dense retrieval techniques [5, 6, 7, 8, 9, 10]), the top-𝑘 ranking problem is usually cast as nearest neighbor search. In many cases, a brute-force scan over the representation vectors suffices for latency-insensitive batch querying, e.g., with "flat" indexes in Facebook's Faiss library [11], but in other cases, an approximate nearest neighbor (ANN) technique such as HNSW [12] is necessary for low-latency top-𝑘 ranking, e.g., using nmslib. Here, we note that the logical/physical dichotomy applies: the same logical ranking model (i.e., the definitions of 𝜂𝑞, 𝜂𝑑, and 𝜑) can be realized physically in different ways, e.g., brute-force scans for batch querying or HNSW for low-latency retrieval.¹

Further developing this connection, we note that the formulations of dense and sparse retrieval are mathematically equivalent. Specifically, 𝑆(𝑞, 𝑑) in both cases is defined as the inner product between document and query vectors—the only difference lies in the characteristics of those vectors, e.g., their dimensionality, how they are computed, etc. This means that, in principle, we could "mix and match" logical and physical ranking models—and indeed, this has already been done before. For example, Teofili and Lin [13] evaluated a number of (not very efficient) techniques for performing top-𝑘 ranking on dense vectors using Lucene; Tu et al. [14] explored using HNSW for BM25 ranking.

Here, we provide a case study that further investigates this idea for sparse retrieval. We began with DeepImpact [15] as the logical ranking model, on the MS MARCO passage corpus. As points of comparison, we also considered bag-of-words BM25, on both the original passages from the corpus and the results of applying document expansion with doc2query–T5 [16].

We experimented with different physical ranking models: Anserini [17], PISA [18], and the HNSW [12] implementation in nmslib. Note that HNSW can perform search on real-valued DeepImpact representation vectors directly, but Anserini and PISA require quantization first (in both cases, into 8 bits). For HNSW, we use 𝑀 = 40, efConstruction = 2500, and efSearch = 1000, similar to Tu et al. [14]. PISA runs using MaxScore processing after document reordering [19]. Experiments were conducted in memory on a Linux machine with two 3.50 GHz Intel Xeon Gold 6144 CPUs and 512 GiB of RAM. Results on the development queries are shown in Table 1, where we report output quality (MRR@10), query latency (ms), and index size (MB). In all cases, we set the retrieval depth to 𝑘 = 1000.

Table 1: Experimental results on the development queries of the MS MARCO passage ranking test collection.

Method                          MRR@10    Latency (ms)    Index Size (MB)
Anserini (Lucene)
(1a) Bag of words               0.187     40.1            661
(1b) doc2query–T5               0.277     62.8            1036
(1c) DeepImpact (quantized)     0.325     244.1           1417
PISA
(2a) Bag of words               0.187     8.3             739
(2b) doc2query–T5               0.276     11.9            1150
(2c) DeepImpact (quantized)     0.326     19.4            1564
nmslib HNSW
(3a) DeepImpact                 0.299     21.9            6686
(3b) DeepImpact (quantized)     0.298     22.5            6686

We see that the same logical ranking model manifests a diverse range of quality/time/space tradeoffs in different physical implementations. Comparing Anserini and PISA, they achieve the same level of effectiveness (small differences are due to tie-breaking effects), but PISA is quite a bit faster (although its indexes are slightly larger). HNSW quality is worse because of the approximate nature of nearest neighbor search, but its queries are much faster than Lucene and almost as fast as PISA (but require much larger indexes).

What do we make of these results? In truth, the comparison between Lucene and PISA is not particularly surprising, as researchers have performed experiments along these lines for many decades—comparing alternative implementations of the same class of solutions, in this case, document-at-a-time query evaluation on inverted indexes. However, HNSW represents a fundamentally different physical realization of the logical ranking model, based on hierarchical navigable small-world graphs, an approach that is very different from inverted indexes. While it is true that HNSW does not currently provide a compelling solution—it is dominated by PISA in terms of both effectiveness and efficiency—HNSW is a relative newcomer. In contrast, PISA benefits from techniques that have been optimized and refined over decades. It is entirely possible that as HNSW receives more attention, the performance gap will close.

Furthermore, dense and sparse representations are not discrete categories, but rather lie on a continuum. Currently, the size (in terms of the number of dimensions) of sparse representations equals the vocabulary size of the corpus, while dense representations typically have hundreds of dimensions. What if we "densify" sparse representations and "sparsify" dense representations—for example, to yield vectors with ten thousand dimensions? How then will the tradeoffs manifest? We believe that separating the logical and physical aspects of ranking will enable future innovations to progress independently and provide a helpful framework for exploring quality, time, and space tradeoffs in future studies of dense and sparse retrieval, as well as hybrids and points in between.
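To make the logical/physical distinction concrete, the following sketch (a toy illustration with made-up term weights, not output from any of the systems discussed) fixes a single logical ranking model, 𝑆(𝑞, 𝑑) = ∑_{𝑡 ∈ 𝑞∩𝑑} 𝑓(𝑡), and realizes it physically in two different ways: a brute-force scan over document vectors and a term-at-a-time traversal of an inverted index. The two realizations return identical rankings; only how the scores are computed differs.

```python
# One logical ranking model, two physical realizations (toy sketch).
# Term weights are invented for illustration; in practice they would come
# from a model such as BM25 or DeepImpact.

# Documents as sparse vectors: term -> precomputed impact weight.
docs = {
    "d1": {"logical": 1.2, "physical": 0.8, "ranking": 2.0},
    "d2": {"ranking": 1.5, "models": 1.1},
    "d3": {"physical": 2.2, "models": 0.3, "ranking": 0.4},
}
query = ["physical", "ranking"]

def topk_brute_force(query, docs, k):
    """Physical realization 1: score every document vector directly."""
    scores = {d: sum(w.get(t, 0.0) for t in query) for d, w in docs.items()}
    return sorted(scores, key=lambda d: -scores[d])[:k]

def topk_inverted_index(query, docs, k):
    """Physical realization 2: term-at-a-time traversal of an inverted index."""
    index = {}  # term -> postings list of (doc, weight)
    for d, w in docs.items():
        for t, weight in w.items():
            index.setdefault(t, []).append((d, weight))
    acc = {}  # per-document score accumulators
    for t in query:
        for d, weight in index.get(t, []):
            acc[d] = acc.get(d, 0.0) + weight
    return sorted(acc, key=lambda d: -acc[d])[:k]

# Same logical model, so both physical realizations agree on the ranking.
assert topk_brute_force(query, docs, 2) == topk_inverted_index(query, docs, 2)
```

Real systems differ from this sketch in exactly the "physical" dimensions the case study measures: compression, traversal order, and early-termination strategies such as MaxScore.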
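The dense formulation 𝑆(𝑞, 𝑑) = 𝜑(𝜂𝑞(𝑞), 𝜂𝑑(𝑑)) has the same shape when 𝜑 is an inner product. In the sketch below, the transformer encoders are replaced by trivial hypothetical stand-ins, since only the structure of the computation matters here; an ANN structure such as HNSW could substitute for the brute-force scan, trading exactness for lower query latency.

```python
# Dense retrieval sketch: S(q, d) = inner product of encoder outputs.
# The "encoders" below are hypothetical stand-ins for learned transformers.

def encode_query(q):
    # Stand-in for a learned query encoder eta_q.
    return [1.0, 0.0, 2.0]

def encode_doc(d):
    # Stand-in for a learned document encoder eta_d.
    return {"d1": [0.5, 1.0, 1.0], "d2": [2.0, 0.0, 0.1]}[d]

def inner(u, v):
    """phi: a simple inner product."""
    return sum(a * b for a, b in zip(u, v))

def topk_dense(q, doc_ids, k):
    """Brute-force scan: exact, and fine for latency-insensitive batch
    querying; low-latency settings would swap in an ANN index instead."""
    scores = {d: inner(encode_query(q), encode_doc(d)) for d in doc_ids}
    return sorted(scores, key=lambda d: -scores[d])[:k]
```

Note that `topk_dense` is line-for-line the brute-force realization from the sparse sketch; only the provenance of the vectors changed, which is the mix-and-match point.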
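As noted, Anserini and PISA require the real-valued DeepImpact weights to be quantized into 8 bits. We do not detail the exact scheme used in the experiments; the sketch below assumes simple linear quantization onto the 256 available integer levels.

```python
def quantize_8bit(weights, num_bits=8):
    """Linearly map real-valued impact weights onto integers in
    [0, 2^num_bits - 1]. A sketch only; production systems may choose
    the range or binning scheme differently."""
    levels = (1 << num_bits) - 1
    lo, hi = min(weights), max(weights)
    scale = levels / (hi - lo) if hi > lo else 0.0
    return [round((w - lo) * scale) for w in weights]

impacts = [0.07, 1.93, 0.55, 3.12]
codes = quantize_8bit(impacts)  # minimum maps to 0, maximum to 255
```

The small MRR@10 gap between rows (3a) and (3b) of Table 1 suggests that this kind of 8-bit discretization costs little effectiveness.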
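The densify/sparsify question can also be prototyped cheaply. One possible instantiation, chosen here purely for illustration and not something evaluated in this paper, is signed feature hashing: folding a |𝑉|-dimensional sparse vector into a fixed number of dimensions while approximately preserving inner products (hash collisions introduce noise).

```python
import hashlib

def densify(sparse_vec, dims=10_000):
    """Fold a sparse term->weight vector into a fixed-width dense vector
    via signed feature hashing. A sketch: collisions perturb inner
    products, so this only approximates the sparse scoring model."""
    dense = [0.0] * dims
    for term, weight in sparse_vec.items():
        h = int(hashlib.md5(term.encode()).hexdigest(), 16)
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0  # sign bit reduces bias
        dense[h % dims] += sign * weight
    return dense

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

q = densify({"physical": 1.0, "ranking": 1.0})
d = densify({"physical": 0.8, "ranking": 2.0, "models": 1.1})
# Absent hash collisions among these terms, inner(q, d) recovers the
# sparse inner product 0.8 + 2.0 = 2.8.
```

Vectors produced this way could then be handed to any dense physical layer (a flat index, HNSW, etc.), which is one way the tradeoff questions above could be studied empirically.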
Acknowledgements
We'd like to thank Arjen de Vries for helpful comments on an earlier draft of this piece.

¹ Although note that database engines are in general responsible for the faithful execution of a logical computation, which is not the case here with HNSW due to its approximations.

References
[1] S. Héman, M. Zukowski, A. P. de Vries, P. A. Boncz, Efficient and flexible information retrieval using MonetDB/X100, in: Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR 2007), Asilomar, California, 2007, pp. 96–101.
[2] N. Fuhr, Models for integrated information retrieval and database systems, Bulletin of the Technical Committee on Data Engineering 19 (1996) 3–13.
[3] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: column stores for IR prototyping, in: Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), Gold Coast, Australia, 2014, pp. 863–866.
[4] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, arXiv:2010.06467 (2020).
[5] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781.
[6] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. N. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), 2021.
[7] S. Hofstätter, S. Althammer, M. Schröder, M. Sertkan, A. Hanbury, Improving efficient neural ranking models with cross-architecture knowledge distillation, arXiv:2010.02666 (2020).
[8] L. Gao, Z. Dai, T. Chen, Z. Fan, B. V. Durme, J. Callan, Complementing lexical retrieval with semantic residual embedding, in: Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021), Part I, 2021, pp. 146–160.
[9] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury, Efficiently teaching an effective dense retriever with balanced topic aware sampling, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 113–122.
[10] S.-C. Lin, J.-H. Yang, J. Lin, In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), 2021, pp. 163–173.
[11] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, arXiv:1702.08734 (2017).
[12] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, Transactions on Pattern Analysis and Machine Intelligence 42 (2020) 824–836.
[13] T. Teofili, J. Lin, Lucene for approximate nearest-neighbors search on arbitrary dense vectors, arXiv:1910.10208 (2019).
[14] Z. Tu, W. Yang, Z. Fu, Y. Xie, L. Tan, K. Xiong, M. Li, J. Lin, Approximate nearest neighbor search and lightweight dense vector reranking in multi-stage retrieval architectures, in: Proceedings of the 2020 ACM SIGIR International Conference on Theory of Information Retrieval, 2020, pp. 97–100.
[15] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted indexes, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1723–1727.
[16] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, 2019.
[17] P. Yang, H. Fang, J. Lin, Anserini: Reproducible ranking baselines using Lucene, Journal of Data and Information Quality 10 (2018) Article 16.
[18] A. Mallia, M. Siedlaczek, J. Mackenzie, T. Suel, PISA: performant indexes and search for academia, in: Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019): CEUR Workshop Proceedings Vol-2409, Paris, France, 2019, pp. 50–56.
[19] J. Mackenzie, M. Petri, A. Moffat, Faster index reordering with bipartite graph partitioning, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1910–1914.