<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Seismic: Efficient and Effective Retrieval over Learned Sparse Representations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Sebastian Bruch</string-name>
        </contrib>
        <contrib contrib-type="author">
<string-name>Franco Maria Nardini</string-name>
          <email>francomaria.nardini@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
<string-name>Cosimo Rulli</string-name>
          <email>cosimo.rulli@isti.cnr.it</email>
        </contrib>
        <contrib contrib-type="author">
<string-name>Rossano Venturini</string-name>
        </contrib>
<aff id="aff0">
          <institution>Pinecone</institution>
        </aff>
        <aff id="aff1">
          <institution>ISTI-CNR</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Learned sparse representations form an attractive class of contextual embeddings for text retrieval thanks to their effectiveness and interpretability. Retrieval over sparse embeddings remains challenging, however, due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance, such as BM25. Recognizing this challenge, recent research trades off exactness for efficiency, moving to approximate retrieval systems. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we use the summaries to quickly determine if a block must be evaluated. Experiments on the Splade and E-Splade embeddings on the Ms Marco and NQ datasets show that our approach is up to 21× faster than the winning (graph-based) submissions to the BigANN Challenge.</p>
      </abstract>
      <kwd-group>
        <kwd>Learned sparse representations</kwd>
        <kwd>maximum inner product search</kwd>
        <kwd>inverted index</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Learned Sparse Retrieval (LSR) [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6">2, 3, 4, 5, 6</xref>
        ] repurposes Large Language Models to encode
an input into sparse embeddings: vectors in an inner product space where each dimension
corresponds to a term in the model’s vocabulary. LSR models are of pivotal interest as
they i) compete, in terms of effectiveness, with dense retrieval models that encode text into dense vectors [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref7 ref8 ref9">7, 8, 9, 10, 11, 12, 13</xref>
        ], ii) tend to generalize better to out-of-domain datasets [
        <xref ref-type="bibr" rid="ref14 ref6">14, 6</xref>
        ], and
iii) are interpretable by design [
        <xref ref-type="bibr" rid="ref1 ref6">6, 1</xref>
        ]. The straightforward usage of a standard inverted index
for sparse embeddings is hindered by the statistical properties of the weights learned by LSR,
which do not conform to the assumptions under which popular inverted index-based retrieval
algorithms operate [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. Hence, many recent solutions give up on exact search to boost
the efficiency of the search algorithm [
        <xref ref-type="bibr" rid="ref15 ref18">15, 18</xref>
        ], taking a leaf out of the Approximate Nearest
Neighbor (ANN) literature [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. As a clear example, the 2023 BigANN Challenge at NeurIPS
dedicated a track to learned sparse embeddings. Inspired by BigANN, we present a novel
ANN algorithm that we call Seismic (Spilled Clustering of Inverted Lists with Summaries for
Maximum Inner Product Search) and that admits effective and efficient retrieval over learned
sparse embeddings. Our solution (Section 2) uses two familiar data structures in a new way:
the inverted and the forward index. We extend the inverted index by introducing a novel
organization of inverted lists into geometrically-cohesive blocks. Each block is equipped with a
“sketch,” serving as a summary of the vectors contained in it. The summaries allow us to skip
over a large number of blocks during retrieval and save substantial compute. Our experimental
evaluation (Section 3) shows that Seismic outperforms the state-of-the-art competitors by up to
21× on the Splade and E-Splade embeddings on the Ms Marco and NQ datasets.
      </p>
      <p>This contribution is an extended abstract of Bruch et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].</p>
    </sec>
    <sec id="sec-2b">
      <title>2. Methodology</title>
      <p>Algorithms 1 and 2, reconstructed below in plain-text pseudocode, give the indexing and the query processing procedures; their line numbers are referenced throughout this section.</p>
      <preformat>Algorithm 1: Indexing.
Input:  𝒟: sparse vectors in ℝ^d;
        λ: maximum inverted list length;
        β: maximum number of blocks per inverted list;
        α: fraction of the overall importance preserved by each summary.
Result: Seismic index.
 1: for i ∈ {1, …, d} do
 2:     ℐ_i ← { j | x_j(i) ≠ 0, x_j ∈ 𝒟 }
 3:     Sort ℐ_i in decreasing order by x_j(i)
 4:     ℐ_i ← { j_1, j_2, …, j_λ }
 5:     Cluster ℐ_i into β partitions, B_{i,1}, …, B_{i,β}
 6:     for 1 ≤ k ≤ β do
 7:         S_{i,k} ← α-mass subvector of σ(B_{i,k})
 8: return ℐ_i’s, { S_{i,k} } ∀ i, k</preformat>
      <preformat>Algorithm 2: Query processing.
Input:  q: query; k: number of results;
        cut: number of query entries considered;
        heap_factor: correction factor for summary inner products;
        ℐ_i’s and S_{i,j}’s: inverted lists and summaries.
Result: A min-heap with the top-k documents.
 1: q_cut ← the top cut entries of q
 2: Heap ← ∅
 3: for i ∈ q_cut do
 4:     for B_{i,j} ∈ ℐ_i do
 5:         p ← ⟨q, S_{i,j}⟩
 6:         if p &lt; Heap.min() / heap_factor then
 7:             continue  {Skip the block}
 8:         for d ∈ B_{i,j} do
 9:             p ← ⟨q, ForwardIndex[d]⟩
10:             UpdateHeap(Heap, p, d)
11: return Heap</preformat>
          <p>
            The design of Seismic relies on both an inverted index and a forward index. Seismic uses
an organization of the inverted index that blends together static and dynamic pruning. The
documents pinpointed by the inverted index are then evaluated using the forward index. The
data structures and the indexing / query processing algorithms are described in detail below.
          </p>
          <p>Static Pruning. Seismic heavily relies on the concentration of importance property discussed
by Bruch et al. [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]: a small subset of the most important coordinates of
the sparse embeddings of a query and of a document can be used to effectively approximate
their inner product. Concretely, static pruning means that for coordinate i, we build its inverted
list by gathering all x ∈ 𝒟 whose x_i ≠ 0. We then sort the inverted list by the value of x_i in decreasing
order (breaking ties arbitrarily), so that the document whose i-th coordinate has the largest
value appears at the beginning of the list. We then prune the inverted list by keeping at most
the first λ entries, where λ is our first hyper-parameter. We denote the resulting inverted list
for coordinate i by ℐ_i.</p>
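          <p>As a concrete illustration, the following Rust sketch builds one statically-pruned inverted list over a toy representation of sparse vectors as sorted (coordinate, value) pairs. The names and types are illustrative, not the authors’ implementation:</p>
          <preformat>// Sketch of static pruning: build the inverted list for coordinate `i`
// and keep only the `lambda` entries with the largest value of that
// coordinate. Ties are broken arbitrarily, as in the text.
type SparseVec = Vec&lt;(u32, f32)&gt;;

fn build_pruned_list(docs: &amp;[(u32, SparseVec)], i: u32, lambda: usize) -&gt; Vec&lt;u32&gt; {
    // Gather (value, id) for every document whose i-th coordinate is non-zero.
    let mut postings: Vec&lt;(f32, u32)&gt; = docs
        .iter()
        .filter_map(|(id, v)| {
            v.iter().find(|(c, _)| *c == i).map(|(_, w)| (*w, *id))
        })
        .collect();
    // Sort by value, decreasing.
    postings.sort_by(|a, b| b.0.partial_cmp(&amp;a.0).unwrap());
    // Keep at most the first `lambda` entries, recording only the ids.
    postings.truncate(lambda);
    postings.into_iter().map(|(_, id)| id).collect()
}</preformat>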
<p>Blocking of Inverted Lists. Seismic also introduces a novel blocking strategy on inverted lists.
It partitions each inverted list into β small blocks, where β is our second hyper-parameter. The rationale
behind a blocked organization of an inverted list is to group together documents that are similar,
so as to facilitate a dynamic pruning strategy.</p>
<p>A clustering algorithm is used to partition the documents whose ids are present in an inverted
list into clusters. Each cluster is then turned into one block, consisting of the ids of documents
whose vectors belong to the same cluster. Conceptually, each block is “atomic” in the following
sense: if the dynamic pruning algorithm decides we must visit a block, all the documents in that
block are fully evaluated. We note that any geometrical (supervised or unsupervised) clustering
algorithm may be readily used. We use a shallow variant of K-Means [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]; see the original
paper for more details [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].</p>
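          <p>The sketch below shows one way to realize such a blocking; as a simplifying assumption it uses the first β documents of the list as cluster leaders, a cruder stand-in for the shallow K-Means variant actually used:</p>
          <preformat>// Illustrative blocking of one inverted list: take the first `beta`
// documents as leaders and assign every document to the leader whose
// vector yields the largest inner product. Not the authors' clustering.
use std::collections::HashMap;

type SparseVec = Vec&lt;(u32, f32)&gt;;

// Inner product of two sparse vectors sorted by coordinate.
fn dot(a: &amp;SparseVec, b: &amp;SparseVec) -&gt; f32 {
    let (mut i, mut j, mut s) = (0, 0, 0.0);
    while i &lt; a.len() &amp;&amp; j &lt; b.len() {
        match a[i].0.cmp(&amp;b[j].0) {
            std::cmp::Ordering::Less =&gt; i += 1,
            std::cmp::Ordering::Greater =&gt; j += 1,
            std::cmp::Ordering::Equal =&gt; { s += a[i].1 * b[j].1; i += 1; j += 1; }
        }
    }
    s
}

fn block_list(list: &amp;[u32], forward: &amp;HashMap&lt;u32, SparseVec&gt;, beta: usize) -&gt; Vec&lt;Vec&lt;u32&gt;&gt; {
    let leaders: Vec&lt;u32&gt; = list.iter().take(beta.max(1)).copied().collect();
    let mut blocks: Vec&lt;Vec&lt;u32&gt;&gt; = vec![Vec::new(); leaders.len()];
    for &amp;doc in list {
        // Index of the most similar leader for this document.
        let best = leaders
            .iter()
            .enumerate()
            .max_by(|&amp;(_, l), &amp;(_, m)| {
                dot(&amp;forward[&amp;doc], &amp;forward[l])
                    .partial_cmp(&amp;dot(&amp;forward[&amp;doc], &amp;forward[m]))
                    .unwrap()
            })
            .map(|(k, _)| k)
            .unwrap();
        blocks[best].push(doc);
    }
    blocks
}</preformat>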
<p>Per-block Summary Vectors. Seismic leverages the concept of a summary vector to determine
whether a block should be evaluated. A summary is a d-dimensional vector built with the idea
of upper-bounding the full inner product attainable by documents in a block. In other words, the
i-th coordinate of the summary vector of a block contains the maximum value of the i-th
coordinate among the documents in that block. More precisely, our summary function σ : 2^𝒟 → ℝ^d takes a block
B from the universe of all blocks, 2^𝒟, and produces a vector whose i-th coordinate is simply
σ(B)_i = max_{x ∈ B} x_i. This summary is conservative: its inner product with the query is no less
than the inner product between the query and any of its documents: ⟨q, σ(B)⟩ ≥ ⟨q, x⟩ for all
x ∈ B and an arbitrary query q.</p>
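          <p>A minimal Rust sketch of this coordinate-wise maximum, using the same toy types as above (for learned sparse embeddings such as Splade, whose weights are non-negative, the bound above holds):</p>
          <preformat>// Illustrative summary construction: the i-th coordinate of the summary
// is the maximum i-th coordinate over the block's documents.
use std::collections::HashMap;

type SparseVec = Vec&lt;(u32, f32)&gt;;

fn summary(block: &amp;[u32], forward: &amp;HashMap&lt;u32, SparseVec&gt;) -&gt; SparseVec {
    let mut max_by_coord: HashMap&lt;u32, f32&gt; = HashMap::new();
    for doc in block {
        for &amp;(coord, val) in &amp;forward[doc] {
            // sigma(B)_i = max over x in B of x_i
            let e = max_by_coord.entry(coord).or_insert(val);
            if val &gt; *e { *e = val; }
        }
    }
    let mut s: SparseVec = max_by_coord.into_iter().collect();
    s.sort_by_key(|&amp;(coord, _)| coord); // keep coordinates sorted
    s
}</preformat>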
<p>The number of non-zero entries in summary vectors grows quickly with the block size,
increasing the memory footprint and the search time of Seismic. To this end, we prune σ(B)
by keeping only its α-mass subvector; see the original work for the definition of α-mass
subvector [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. That α is our third and last indexing hyper-parameter. We further reduce the
size of summaries by applying scalar quantization after min-max scaling, employing only a
single byte for each value.</p>
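          <p>The following hedged sketch reads “α-mass subvector” as the shortest prefix of the entries, taken in decreasing order of value, that holds at least an α fraction of the summary’s total mass; see [1] for the precise definition. Names are illustrative:</p>
          <preformat>// Illustrative alpha-mass pruning of a summary followed by min-max
// scalar quantization to a single byte per value.
type SparseVec = Vec&lt;(u32, f32)&gt;;

fn prune_and_quantize(summary: &amp;SparseVec, alpha: f32) -&gt; Vec&lt;(u32, u8)&gt; {
    let total: f32 = summary.iter().map(|&amp;(_, v)| v).sum();
    let mut entries = summary.clone();
    entries.sort_by(|a, b| b.1.partial_cmp(&amp;a.1).unwrap()); // largest first
    // Keep the shortest prefix whose mass reaches alpha * total.
    let (mut kept, mut mass) = (Vec::new(), 0.0_f32);
    for (coord, val) in entries {
        if mass &gt;= alpha * total { break; }
        mass += val;
        kept.push((coord, val));
    }
    // Min-max scale the surviving values to [0, 255].
    let lo = kept.iter().map(|&amp;(_, v)| v).fold(f32::INFINITY, f32::min);
    let hi = kept.iter().map(|&amp;(_, v)| v).fold(f32::NEG_INFINITY, f32::max);
    let scale = if hi &gt; lo { 255.0 / (hi - lo) } else { 0.0 };
    kept.into_iter()
        .map(|(coord, v)| (coord, ((v - lo) * scale).round() as u8))
        .collect()
}</preformat>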
<p>Indexing. We summarize the discussion above in Algorithm 1. When indexing a collection
𝒟 ⊂ ℝ^d, for every coordinate i ∈ {1, …, d}, we form its inverted list, recording only the document
identifiers (Line 2). We then sort the list in decreasing order of values (Line 3), and apply static
pruning by keeping, for each inverted list, the λ elements with the largest value (Line 4). We
then apply clustering to the inverted list to derive at most β blocks (Line 5). Once documents
are assigned to the blocks, we build the block summaries using the procedure described
earlier (Line 7).</p>
<p>Query Processing. Algorithm 2 shows the query processing logic in Seismic. We (a) select
a subset of the query coordinates, cut (Line 1), sorted by magnitude, and (b) define a novel
dynamic pruning strategy (Lines 5-7) that allows us to skip blocks in the inverted lists of the
coordinates in cut. Seismic adopts a coordinate-at-a-time traversal (Line 3) of the inverted
index. For each coordinate i ∈ cut, it evaluates the blocks using their summary. The documents
within a block are evaluated further, using the forward index, if the approximation given by the
summary is greater than a fraction of the minimum inner product in the min-heap. A document
whose inner product is greater than the minimum score in the min-heap is inserted into the
heap (UpdateHeap).</p>
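          <p>To ground the walkthrough, the sketch below mirrors Algorithm 2 in Rust; for brevity the min-heap is replaced by a small sorted vector, summaries are assumed already dequantized, and repeated scoring of a document reached through several coordinates is not deduplicated. All names are illustrative, not the authors’ implementation:</p>
          <preformat>// Illustrative query processing (Algorithm 2): coordinate-at-a-time
// traversal with summary-based block skipping.
use std::collections::HashMap;

type SparseVec = Vec&lt;(u32, f32)&gt;;

struct Block {
    docs: Vec&lt;u32&gt;,      // document ids in the block
    summary: SparseVec,  // (dequantized) summary vector
}

// Inner product of two sparse vectors sorted by coordinate.
fn dot(a: &amp;SparseVec, b: &amp;SparseVec) -&gt; f32 {
    let (mut i, mut j, mut s) = (0, 0, 0.0);
    while i &lt; a.len() &amp;&amp; j &lt; b.len() {
        match a[i].0.cmp(&amp;b[j].0) {
            std::cmp::Ordering::Less =&gt; i += 1,
            std::cmp::Ordering::Greater =&gt; j += 1,
            std::cmp::Ordering::Equal =&gt; { s += a[i].1 * b[j].1; i += 1; j += 1; }
        }
    }
    s
}

fn query(
    q: &amp;SparseVec,
    k: usize,
    cut: usize,
    heap_factor: f32,
    lists: &amp;HashMap&lt;u32, Vec&lt;Block&gt;&gt;,
    forward: &amp;HashMap&lt;u32, SparseVec&gt;,
) -&gt; Vec&lt;(f32, u32)&gt; {
    // Line 1: the `cut` query entries with the largest magnitude.
    let mut q_cut = q.clone();
    q_cut.sort_by(|a, b| b.1.abs().partial_cmp(&amp;a.1.abs()).unwrap());
    q_cut.truncate(cut);

    let mut top: Vec&lt;(f32, u32)&gt; = Vec::new(); // (score, id), descending
    for &amp;(i, _) in &amp;q_cut {
        // Line 3: traverse the inverted list of each selected coordinate.
        for block in lists.get(&amp;i).into_iter().flatten() {
            // Lines 5-7: skip the block if even its conservative summary
            // score cannot beat a fraction of the current k-th best score.
            let approx = dot(q, &amp;block.summary);
            if top.len() == k &amp;&amp; approx &lt; top[k - 1].0 / heap_factor {
                continue;
            }
            // Lines 8-10: exact scoring against the forward index.
            for &amp;d in &amp;block.docs {
                let score = dot(q, &amp;forward[&amp;d]);
                if top.len() &lt; k || score &gt; top[k - 1].0 {
                    top.push((score, d));
                    top.sort_by(|a, b| b.0.partial_cmp(&amp;a.0).unwrap());
                    top.truncate(k);
                }
            }
        }
    }
    top
}</preformat>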
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        Experimental Setup. We experiment on two publicly-available datasets: Ms Marco v1
Passage [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and Natural Questions (NQ) from Beir [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. We evaluate Seismic on embeddings
generated using Splade [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and E-Splade [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
<p>We compare Seismic with five state-of-the-art retrieval solutions. In this manuscript, we only
report the comparison against the best competitors, namely the winning solutions of the “Sparse
Track” at the 2023 BigANN Challenge at NeurIPS: GrassRMA and PyAnn. See the original
work for the complete comparison [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We compare the methods using mean query latency
(μsec.) and accuracy, i.e., the percentage of true nearest neighbors recalled in the returned set.
We implemented Seismic in Rust; our code is publicly available at https://github.com/TusKANNy/seismic.
We conduct experiments on a server equipped with one Intel i9-9900K CPU with a clock rate
of 3.60 GHz and 64 GiB of RAM, with single-threaded execution.</p>
      <p>Results. Table 1 details retrieval performance in terms of average per-query latency at various
accuracy cuts. Seismic consistently outperforms GrassRMA and PyAnn by a substantial margin,
ranging from 2.6× (Splade on Ms Marco) to 21.6× (E-Splade on Ms Marco) depending on
the level of accuracy. In fact, as accuracy increases, the latency gap between Seismic and the
two graph-based methods widens. This gap is much larger when query vectors are sparser,
such as with E-Splade embeddings. That is because, when queries are highly sparse, inner
products between queries and documents become smaller, reducing the efficacy of a greedy graph
traversal. As one data point, PyAnn over E-Splade embeddings of Ms Marco visits roughly
40,000 documents to reach 97% accuracy, whereas Seismic evaluates just 2,198 documents.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Work</title>
<p>This paper presents Seismic, a novel approach for efficient and effective retrieval over
learned sparse representations. Our solution outperforms the state-of-the-art graph-based solutions for
efficient sparse retrieval by up to a factor of 21× on the Splade and E-Splade embeddings on
the Ms Marco dataset. As future work, we intend to explore the application of compression
techniques for inverted lists [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] to further reduce the size of the inverted and forward indexes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          ,
<article-title>Efficient inverted indexes for approximate retrieval over learned sparse representations</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
, SIGIR '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>152</fpage>
          -
          <lpage>162</lpage>
          . doi:10.1145/3626772.3657769.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Expansion via prediction of importance with contextualization</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1573</fpage>
          -
          <lpage>1576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          , Splade:
<article-title>Sparse lexical and expansion model for first stage ranking</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2288</fpage>
          -
          <lpage>2292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
, Splade v2:
          <article-title>Sparse lexical and expansion model for information retrieval</article-title>
          , 2021. arXiv:2109.10086.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
<article-title>From distillation to hard negative sampling: Making sparse neural ir models more effective</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2353</fpage>
          -
          <lpage>2359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
<article-title>An efficiency study for splade models</article-title>
          ,
          <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2220</fpage>
          -
          <lpage>2226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>Pretrained Transformers for Text Ranking: BERT and Beyond</article-title>
          ,
          <source>Synthesis Lectures on Human Language Technologies</source>
          , Morgan &amp; Claypool Publishers,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6769</fpage>
          -
          <lpage>6781</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-F.</given-names>
            <surname>Tang</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate nearest neighbor negative contrastive learning for dense text retrieval</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
<string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saad-Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
<string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERTv2: Effective and efficient retrieval via lightweight late interaction</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3715</fpage>
          -
          <lpage>3734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
<article-title>ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          ,
<article-title>Efficient multi-vector dense retrieval with bit vectors</article-title>
          ,
          <source>in: Advances in Information Retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Bruch, S. Gai, A. Ingber, An analysis of fusion functions for hybrid retrieval, ACM Transactions on Information Systems 42 (2023).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Bruch, F. M. Nardini, A. Ingber, E. Liberty, An approximate algorithm for maximum inner product search over streaming sparse vectors, ACM Transactions on Information Systems 42 (2023).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Mackenzie, A. Trotman, J. Lin, Wacky weights in learned sparse representations and the revenge of score-at-a-time query evaluation, 2021. arXiv:2110.11540.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Crane, J. S. Culpepper, J. Lin, J. Mackenzie, A. Trotman, A comparison of document-at-a-time and score-at-a-time query evaluation, in: Proceedings of the 10th ACM International Conference on Web Search and Data Mining, 2017, pp. 201-210.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Bruch, F. M. Nardini, A. Ingber, E. Liberty, Bridging dense and sparse maximum inner product search, 2023. arXiv:2309.09013.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Bruch, Foundations of Vector Retrieval, Springer Nature Switzerland, 2024.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, E. Upfal, Finding near neighbors through cluster pruning, in: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2007, pp. 103-112.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, Ms Marco: A human generated machine reading comprehension dataset (2016).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, I. Gurevych, BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models, in: 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] G. E. Pibiri, R. Venturini, Techniques for inverted index compression, ACM Computing Surveys 53 (2021) 125:1-125:36.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>