
Efficient and Effective Multi-Vector Dense Retrieval with EMVB

Franco Maria Nardini

Cosimo Rulli

Rossano Venturini

ISTI-CNR

Italy

University of Pisa, Italy

Dense retrieval techniques utilize large pre-trained language models to construct a high-dimensional representation of queries and passages. These representations are used to compute the relevance of a passage with respect to a query through efficient similarity measures. Multi-vector representations, while enhancing effectiveness, cause a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. The current state-of-the-art approach, namely PLAID, has introduced a centroid-based term representation to mitigate the memory impact of multi-vector systems. By employing a centroid interaction mechanism, PLAID filters out non-relevant documents, reducing the cost of subsequent ranking stages. This paper introduces "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. First, EMVB utilizes an optimized bit vector pre-filtering step for passages, enhancing efficiency. Second, the computation of centroid interaction occurs column-wise, leveraging SIMD instructions to reduce latency. Third, EMVB incorporates Product Quantization (PQ) to decrease the memory footprint of storing vector representations while facilitating fast late interaction. Lastly, a per-document term filtering method is introduced, further improving the efficiency of the final step. Experiments conducted on MS MARCO and LoTTE demonstrate that EMVB achieves up to a 2.8× speed improvement while reducing the memory footprint by 1.8×, without compromising retrieval accuracy compared to PLAID.

Keywords: Dense Retrieval, Multi-Vector, Efficiency, Bit Vectors

1. Introduction

The widely acknowledged capability of Large Language Models (LLMs) to model semantics and context has been extensively used in Information Retrieval. In Dense Retrieval, LLMs are used to encode documents and queries into $d$-dimensional vectors. This enables the modeling of document-query relevance using simple metrics like Euclidean distance. In this line, a successful strategy involves using multi-vector representations for documents and queries, where a $d$-dimensional vector is produced for each token in the text. In this context, the similarity between the query and the passage is measured using the so-called late interaction mechanism. This mechanism works by computing the sum of the maximum similarities between each term of the query and each term of a candidate passage. Although multi-vector representations enhance effectiveness, they come at the cost of increased computational burden, including a larger memory footprint and longer retrieval time.

This paper is an extended abstract of Nardini et al. [1].

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy * Corresponding author.

Various approaches have been proposed to enhance the efficiency and reduce the memory demands of multi-vector systems. ColBERT [2] exploits an inverted index to store all the term embeddings and retrieve the candidate passages, but it requires keeping the full-precision representation of each document term in memory, which can be substantial (e.g., 140 GB for MS MARCO). ColBERTv2 [3] introduces a centroid-based compression technique where each embedding is stored by saving the id of the closest centroid and then compressing the residual (i.e., the element-wise difference) using 1 or 2 bits per component. ColBERTv2 saves up to 10× space compared to ColBERT but sacrifices retrieval efficiency, requiring up to 3 seconds to process a query on CPU. PLAID [4] builds on the embedding compressor of ColBERTv2 and leverages the centroid-based representation to discard non-relevant passages (centroid interaction), thus performing the late interaction exclusively on a carefully selected batch of passages. PLAID allows for a massive speedup compared to ColBERTv2, but its average query latency can be up to 400 ms on CPU with single-thread execution [4].

This paper introduces EMVB, a novel framework designed for efficient query processing in multi-vector dense retrieval. The key focus is on addressing the most time-consuming steps identified in PLAID, which include: i) extracting the top-nprobe closest centroids during candidate passage selection, ii) computing the centroid interaction mechanism, and iii) decompressing quantized residuals. To address the first two steps, we propose a highly efficient passage filtering approach based on optimized bit vectors. This approach significantly reduces the cost of top-nprobe extraction by identifying a small set of crucial centroid scores. Additionally, it decreases the number of passages for which centroid interaction computation is necessary. We further enhance efficiency in the second step by introducing a highly efficient column-wise reduction leveraging SIMD instructions. For the third step, late interaction efficiency is improved by introducing Product Quantization (PQ) [5]. This method provides comparable or superior performance compared to PLAID’s bitwise compressor, while being up to 3× faster. Additionally, we introduce a dynamic passage-term-selection criterion for late interaction, reducing the cost of this step by up to 30%.

Experimental evaluations on the MS MARCO [6] passage (in-domain) and LoTTE [3] (out-of-domain) datasets demonstrate the effectiveness of EMVB compared to PLAID. On MS MARCO, EMVB achieves up to a 2.8× speed improvement while reducing the memory footprint by 1.8× without compromising retrieval accuracy. In the out-of-domain evaluation, EMVB delivers up to a 2.9× speedup compared to PLAID with minimal loss in retrieval quality.

2. PLAID

In a multi-vector dense retrieval scenario, an LLM encodes a passage $P$ into a collection of $n_t$ dense $d$-dimensional vectors, where $n_t$ is the number of tokens in the passage. Encoding each token of each passage generates a large collection of vectors, e.g., almost 600M vectors for the 8.8M passages of MS MARCO. For this reason, ColBERTv2 [3] and subsequently PLAID [4] exploit a centroid-based vector compression technique. First, the K-Means algorithm is used to identify a set of $n_c$ centroids $C = \{C_i\}_{i=1}^{n_c}$. Then, for each vector $x$, the residual $r$ between $x$ and its closest centroid $\bar{C}$ is computed as $r = x - \bar{C}$ and compressed into $\tilde{r}$ using a $b$-bit encoder that represents each dimension of $r$ using $b$ bits, with $b \in \{1, 2\}$. This way, the memory footprint of each vector is given by $\lceil \log_2 |C| \rceil$ bits for the centroid index and $d \times b$ bits for the compressed residual. At scoring time, decompressing the residual encoding is inefficient. For this reason, PLAID aims at decompressing as few candidate documents as possible by hinging on the so-called centroid interaction filtering step [4].

We now detail the PLAID retrieval system [4]. After running the K-Means algorithm, each centroid is linked with a posting list containing the ids of the candidate passages. A passage $P$ belongs to the candidate list of a centroid $C_i$ if at least one of its tokens has $C_i$ as its closest centroid. The query processing starts by computing the top-nprobe closest centroids for each query term $q_i$, with $i = 1, \ldots, n_q$, according to the dot product similarity measure. From the set of closest centroids, the candidate passages are retrieved thanks to the previously built posting lists. In the centroid interaction step, the score between the $i$-th query term $q_i$ and a token embedding $T_j$, with $j = 1, \ldots, n_t$, is approximated as

$$q_i \cdot T_j \simeq q_i \cdot \bar{C}^{T_j} = \tilde{T}_{i,j}, \qquad (1)$$

where $\bar{C}^{T_j}$ is the closest centroid to $T_j$. We estimate the score of a passage $P$ with $n_t$ terms as

$$\bar{S}_{q,P} = \sum_{i=1}^{n_q} \max_{j=1,\ldots,n_t} q_i \cdot \bar{C}^{T_j}. \qquad (2)$$

In the decompression phase, the full-precision representation of $P$ is reconstructed by combining the centroids and the residuals. Only the top-ndocs passages from the previous centroid interaction step move to this step. Finally, PLAID applies late interaction [2, 3] to compute the score of a reconstructed candidate passage $P$ against a query $q$. The late interaction measure is defined by Equation 3:

$$S_{q,P} = \sum_{i=1}^{n_q} \max_{j=1,\ldots,n_t} q_i \cdot T_j. \qquad (3)$$

Passages are then ranked according to their similarity score and the top-$k$ passages are selected.
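The difference between the centroid interaction estimate (Equation 2) and the exact late interaction score (Equation 3) can be illustrated with a minimal pure-Python sketch. The toy vectors, the helper names (`dot`, `closest_centroid`, `centroid_interaction`, `late_interaction`), and the choice of the dot product for assigning tokens to centroids are assumptions made for illustration only; this is not PLAID's implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction(query, passage):
    """Equation 3: for each query term, take the best-matching passage token, then sum."""
    return sum(max(dot(q, t) for t in passage) for q in query)


def centroid_interaction(query, centroids, passage_centroid_ids):
    """Equation 2: same sum-of-max, with each token replaced by its closest centroid."""
    return sum(
        max(dot(q, centroids[c]) for c in passage_centroid_ids)
        for q in query
    )


def closest_centroid(token, centroids):
    """Id of the centroid with the largest dot product (assumed assignment rule)."""
    return max(range(len(centroids)), key=lambda c: dot(token, centroids[c]))


# Hypothetical toy data: 2 query terms, 3 centroids, 3 passage tokens, d = 2.
query = [[1.0, 0.0], [0.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
passage = [[0.9, 0.1], [0.2, 0.8], [-0.7, 0.1]]

ids = [closest_centroid(t, centroids) for t in passage]
exact = late_interaction(query, passage)               # uses full token vectors
approx = centroid_interaction(query, centroids, ids)   # uses centroid ids only
```

The estimate needs only the centroid ids of the passage, which is why it can be computed before any residual is decompressed.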

PLAID execution time. We present a detailed analysis of PLAID’s execution time, breaking it down into distinct phases: retrieval, filtering, decompression, and late interaction. The experiments adhere to the settings outlined in Section 4. The resulting execution times are reported for various values of $k$, the number of retrieved passages.

3. EMVB

Fast Closest Centroids Selection. The retrieval phase in PLAID is time-consuming, as shown in Figure 1. This step consists of i) a matrix multiplication between the query matrix and the centroids for distance computation, and ii) the identification of the top-nprobe closest centroids for each query term. Perhaps surprisingly, the latter step is the most time-consuming (3× slower than the matrix multiplication), even when performed with asymptotically linear algorithms such as quickselect. Our pre-filtering strategy, explained in the next paragraph, effectively accelerates the selection of the top-nprobe by minimizing the number of evaluated elements. In practice, we efficiently discard centroids with scores below a predefined threshold, and then apply quickselect exclusively to the remaining ones. As a result, in EMVB, the cost of extracting the top-nprobe becomes negligible, with a speed improvement of two orders of magnitude compared to extracting the top-nprobe from the complete set of centroids.

Pre-filtering using bit vectors. Recall that $\tilde{T}_{i,j}$ represents the approximate score of the $j$-th token of passage $P$ with respect to the $i$-th term of the query, $q_i$, as defined in Equation 1. Estimating whether $\tilde{T}_{i,j}$ has a large value is a proxy for estimating the importance of a passage with respect to the query. Given a passage $P$, our pre-filtering consists in determining whether $\tilde{T}_{i,j}$, for $i = 1, \ldots, n_q$ and $j = 1, \ldots, n_t$, is large or not. Our pre-filtering approach works by checking whether the centroid associated with $T_j$, i.e., $\bar{C}^{T_j}$, belongs to the set of the closest centroids of $q_i$. We define $close_i^{th}$ as the set of centroids whose scores exceed a specified threshold $th$ with respect to a query term $q_i$. For a passage $P$, we also introduce the list of centroid ids $I_P$, where $I_P^j$ is the id of the centroid $\bar{C}^{T_j}$.

The similarity of a passage with respect to a query can be rapidly estimated with our novel filtering function $F(P, q) \in [0, n_q]$, defined as

$$F(P, q) = \sum_{i=1}^{n_q} \mathbf{1}\left(\exists\, j \ \text{s.t.}\ I_P^j \in close_i^{th}\right). \qquad (4)$$

For a passage $P$, this counts how many query terms have at least one similar passage term in $P$, where “similar” means that $I_P^j$ belongs to $close_i^{th}$.
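As a reference point before any bit-level optimization, the filtering function of Equation 4 can be sketched with plain Python sets. The score matrix `cs` (one row of query-centroid scores per query term), the helper names, and the toy values are hypothetical and chosen only to make the definition concrete.

```python
def close_sets(cs, th):
    """For each query term i, the set close_i^th of centroid ids whose score exceeds th."""
    return [{j for j, score in enumerate(row) if score > th} for row in cs]


def prefilter_score(passage_centroid_ids, close):
    """Equation 4: number of query terms with at least one passage centroid in close_i^th."""
    ids = set(passage_centroid_ids)
    return sum(1 for close_i in close if not ids.isdisjoint(close_i))


# Hypothetical scores for 3 query terms (rows) and 4 centroids (columns).
cs = [
    [0.9, 0.1, 0.5, -0.2],
    [0.0, 0.6, 0.3, 0.2],
    [0.1, 0.2, 0.1, 0.45],
]
close = close_sets(cs, th=0.4)

f_a = prefilter_score([0, 1, 1], close)  # passage whose tokens map to centroids 0, 1, 1
f_b = prefilter_score([3], close)        # single-token passage mapped to centroid 3
```

Passages can then be ranked by this count, and only the highest-scoring ones forwarded to the more expensive centroid interaction.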

In Figure 2 (left), we compare the performance of our pre-filter operating in conjunction with the centroid interaction mechanism (orange, blue, and green lines) against the centroid interaction mechanism applied to the entire set of candidate documents (red dashed line) on the MS MARCO dataset. The plot illustrates that our pre-filtering approach efficiently eliminates non-relevant passages without adversely affecting the recall of the subsequent centroid interaction phase. For instance, we can reduce the candidate passage set to just 1000 elements using a threshold of 0.4 without any loss in R@100. In the following, we detail the implementation of this pre-filter for optimal efficiency.

Building the bit vectors. Let $CS = q \cdot C^T$, with $CS \in [-1, 1]^{n_q \times |C|}$, where $n_q$ is the number of query tokens and $|C|$ is the number of centroids. For the $i$-th row of $CS$, we want to scan it and pick those $j$ such that $CS_{i,j} > th$. This conceptually trivial algorithm can be implemented by leveraging the SIMD instructions featured by modern CPUs. In particular, the AVX512 instruction set allows comparing 16 fp32 values at a time thanks to the _mm512_cmp_epi32_mask instruction and storing the comparison result in a mask variable $m$. The indexes $I = \{i \in [0, 15] \mid m_i = 1\}$ (if any) can be efficiently extracted by means of the _mm512_mask_compressstore instruction.
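The effect of scanning a row of scores against a threshold before selecting the top-nprobe can be sketched in scalar Python. Here `heapq.nlargest` stands in for quickselect, the function names and toy scores are hypothetical, and the AVX512 vectorization is not modeled; the sketch only shows that the selection runs on the survivors rather than on the full row.

```python
import heapq


def top_nprobe(row, nprobe):
    """Baseline: select the nprobe highest-scoring centroid ids from the full row."""
    return heapq.nlargest(nprobe, range(len(row)), key=row.__getitem__)


def top_nprobe_thresholded(row, nprobe, th):
    """Keep only the ids whose score exceeds th, then select among the survivors."""
    survivors = [j for j, score in enumerate(row) if score > th]
    best = heapq.nlargest(nprobe, survivors, key=row.__getitem__)
    return best, survivors


# Hypothetical scores of one query term against 8 centroids.
row = [0.05, 0.72, -0.3, 0.41, 0.9, 0.12, 0.55, 0.39]

baseline = top_nprobe(row, nprobe=2)
filtered, survivors = top_nprobe_thresholded(row, nprobe=2, th=0.4)
```

Both calls return the same ids as long as at least nprobe scores exceed the threshold; if fewer survive, the thresholded variant returns fewer ids. The surviving ids are also exactly the members of the set of close centroids for that query term.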

The effectiveness of algorithms employing if-based structures is largely contingent on the branch misprediction ratio. Contemporary CPUs speculate about the outcome of the if condition by identifying patterns in the algorithm’s execution flow. If an incorrect branch prediction occurs, a control hazard arises, leading to a pipeline flush with a delay of 15 to 20 clock cycles, i.e., approximately 10 ns. To address the inefficiency associated with branch misprediction, we introduce a branchless algorithm. For a detailed description of the algorithm and of its vectorized version, refer to the original paper [1].
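One common way to remove the branch from a threshold-filtering loop is to write the candidate index unconditionally and advance the output position by the 0/1 result of the comparison. The sketch below illustrates that general pattern next to an if-based loop; it is not claimed to be the algorithm of [1], the function names are hypothetical, and the Python interpreter itself still branches internally, so only the control structure is shown, not the speed.

```python
def filter_branchy(row, th):
    """If-based filter: the append is executed only when the comparison succeeds."""
    out = []
    for j, score in enumerate(row):
        if score > th:
            out.append(j)
    return out


def filter_branchless(row, th):
    """Branchless-style filter: always store, advance the cursor by 0 or 1."""
    out = [0] * len(row)   # pre-allocated output buffer
    n = 0                  # number of ids kept so far (n <= j at every store)
    for j, score in enumerate(row):
        out[n] = j         # unconditional store, overwritten if the id is rejected
        n += score > th    # a Python bool adds as 0 or 1
    return out[:n]


row = [0.05, 0.72, -0.3, 0.41, 0.9, 0.12, 0.55, 0.39]
kept_branchy = filter_branchy(row, 0.4)
kept_branchless = filter_branchless(row, 0.4)
```

In the second variant the amount of work per element does not depend on the data, which is consistent with an execution time that is independent of the threshold.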

Figure 2 (right) presents a comparison of our different approaches, namely "Naive IF," "Vectorized IF," "Branchless," and "VecBranchless," described above. Branchless algorithms present a constant execution time, regardless of the value of the threshold, while if-based approaches offer better performance as the value of $th$ increases. With $th \geq 0.3$, "Vectorized IF" is the most efficient approach, with a speedup of up to 3× compared to its naive counterpart.

Fast set membership. We now move to the problem of computing Equation 4, assuming $close_i^{th}$ to be known. Observe that this is an integer set membership problem, where we have to test whether at least one member of $I_P$ belongs to $close_i^{th}$, with $i = 1, \ldots, n_q$. Bit vectors (or bit arrays) are a widely adopted solution for implementing sets of integer values. A bit vector maps a set of integers up to $N$ into an array of $N$ bits, where the $e$-th bit is set to one if and only if the integer $e$ is part of the set. Operations like adding and searching for an integer can be executed in constant time using bit manipulation operators. In terms of memory occupancy, a bit vector requires $N/8$ bytes. In our scenario, given that $|C| = 2^{18}$, a bit vector only needs 32 KB (32,768 bytes) of storage.
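A plain bit vector over the centroid ids can be sketched with a `bytearray`. The class name and its methods are hypothetical; the example only illustrates constant-time insertion and lookup and the $N/8$-byte footprint for $N = 2^{18}$.

```python
class BitVector:
    """Set of integers in [0, n) stored as one bit per possible value."""

    def __init__(self, n):
        self.n = n
        self.bits = bytearray((n + 7) // 8)   # n / 8 bytes, rounded up

    def add(self, e):
        # Byte index is e // 8, bit position inside the byte is e % 8.
        self.bits[e >> 3] |= 1 << (e & 7)

    def __contains__(self, e):
        return bool(self.bits[e >> 3] & (1 << (e & 7)))


# One bit vector for one query term, with 2^18 possible centroid ids.
bv = BitVector(2 ** 18)
for centroid_id in (3, 1024, 262143):
    bv.add(centroid_id)

size_in_bytes = len(bv.bits)
```

Testing whether a passage is "close" to the query term then amounts to one lookup per centroid id of the passage.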

We further improve the efficiency of bit vectors by relying on the specific properties of our setting. As we need to search through all the $n_q$ bit vectors at a time, we rearrange their representation by stacking them vertically (Figure 3). This allows searching a centroid index through all the $close_i^{th}$ at a time. The bits corresponding to the same centroid for different query terms are consecutive and fit a 32-bit word. This way, we can simultaneously test the membership for all the query terms in constant time with a single bitwise operation. In detail, our algorithm starts by initializing a mask $m$ of $n_q = 32$ bits to zero (Step 1, Figure 3). Subsequently, for each term in the candidate document, it performs a bitwise OR between the mask and the 32-bit word representing the membership of that term's centroid to the sets of all the query terms (Step 2, Figure 3). Consequently, Equation 4 can be computed by counting the number of 1s in $m$ at the end of the execution using the popcnt operation available in modern CPUs (Step 3, Figure 3). [...] faster than the “Baseline” usage of bit vectors, and up to 30× faster than the centroid interaction.
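The three steps of the stacked layout can be sketched in Python with one integer word per centroid. The function names and toy sets are hypothetical, Python integers stand in for 32-bit machine words, and `bin(mask).count("1")` stands in for the popcnt instruction. The accumulation uses a bitwise OR so that a query term matched by several passage tokens is still counted once.

```python
def build_stacked(close, num_centroids):
    """words[c] has bit i set iff centroid c belongs to close_i^th (i < 32)."""
    words = [0] * num_centroids
    for i, close_i in enumerate(close):
        for c in close_i:
            words[c] |= 1 << i
    return words


def prefilter_score(passage_centroid_ids, words):
    """Equation 4 computed with one bitwise operation per passage token."""
    mask = 0                          # Step 1: all-zero mask
    for c in passage_centroid_ids:
        mask |= words[c]              # Step 2: accumulate membership bits
    return bin(mask).count("1")       # Step 3: population count


# Hypothetical close sets for 3 query terms over 4 centroids.
close = [{0, 2}, {1}, {3}]
words = build_stacked(close, num_centroids=4)

f_a = prefilter_score([0, 1, 1], words)
f_b = prefilter_score([3], words)
f_c = prefilter_score([0, 2], words)   # two tokens matching the same query term
```

In the last call both tokens set the same bit, so the query term contributes exactly one to the count.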

Our pre-filtering approach allows us to efficiently filter out non-relevant passages and is employed upstream of PLAID’s centroid interaction (Equation 2). The efficiency of the centroid interaction itself can be improved by using our column-wise reduction approach. For reasons of space, we do not report the description of the algorithm in this discussion paper, and we encourage the reader to refer to the original work [1]. We implement PLAID’s centroid interaction in C++ and compare its filtering time against our SIMD-based solution. The results of the comparison are reported for different numbers of candidate documents in Figure 4 (down). Thanks to the efficient read-write pattern and the highly efficient column-wise max-reduction, our method can be up to 1.8× faster than the filtering proposed in PLAID.

Late Interaction. We propose to replace the $b$-bit residual compressor [3, 4] with Product Quantization (PQ) [5]. Introducing PQ has two main advantages. On the one hand, it allows computing the dot product between an input query vector and the compressed residual without decompression. On the other hand, it allows to re-use the [...]. Consider a query $q$ and a candidate passage $P$. We decompose the computation of the dot product between the query terms and each passage term $T_j$ into a contribution of the closest centroid $\bar{C}^{T_j}$ and a contribution of the residual. Equation 3 becomes

$$S_{q,P} = \sum_{i=1}^{n_q} \max_{j=1,\ldots,n_t} \left( q_i \cdot \bar{C}^{T_j} + q_i \cdot r^{T_j} \right) \simeq \sum_{i=1}^{n_q} \max_{j=1,\ldots,n_t} \left( q_i \cdot \bar{C}^{T_j} + q_i \cdot r_{pq}^{T_j} \right), \qquad (5)$$

where $r_{pq}^{T_j} \simeq r^{T_j}$ is the PQ-compressed residual and $r^{T_j} = T_j - \bar{C}^{T_j}$. Our experimental evaluation shows that PQ is both faster (up to 3.6×) and more effective compared to the $b$-bit compressor used in previous work.

We propose to further improve the efficiency of the scoring phase by hinging on the properties of Equation 5. In many cases, we have that $q_i \cdot \bar{C}^{T_j} > q_i \cdot r_{pq}^{T_j}$; hence the max operator over $j$ is led by the score between the query term and the centroid, rather than the score between the query term and the residual. We argue that it is possible to compute the scores on the residuals only for a reduced set of document terms $\bar{J}_i$, where $i$ identifies the index of the query term. In particular, $\bar{J}_i = \{ j \mid q_i \cdot \bar{C}^{T_j} > th_r \}$, where $th_r$ is a second threshold that determines whether the score with the centroid is sufficiently large. With the introduction of this new per-term filter, Equation 5 is computed by applying the max operator only to the passage terms in $\bar{J}_i$, i.e.,

$$S_{q,P} = \sum_{i=1}^{n_q} \max_{j \in \bar{J}_i} \left( q_i \cdot \bar{C}^{T_j} + q_i \cdot r_{pq}^{T_j} \right). \qquad (6)$$

In practice, we compute the residual scores only for those document terms whose centroid score is large enough. If $\bar{J}_i = \emptyset$, we compute $S_{q,P}$ as in Equation 5. We experimentally verify that this allows saving up to 30% of the late interaction cost with no performance degradation.
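Two ideas from this step can be sketched in pure Python: scoring a PQ-compressed residual through per-sub-space lookup tables, without reconstructing the residual, and restricting the max of Equation 6 to the terms whose centroid score exceeds the second threshold. All names (`pq_lookup_tables`, `pq_dot`, `filtered_late_interaction`, `th_r`), the tiny codebooks, and the score values are hypothetical; the sketch assumes a plain PQ with equal-size sub-spaces and is not the FAISS-based implementation used in the experiments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pq_lookup_tables(q, codebooks):
    """For each sub-space s: dot products of q's s-th sub-vector with every codeword."""
    sub = len(q) // len(codebooks)
    return [
        [dot(q[s * sub:(s + 1) * sub], codeword) for codeword in book]
        for s, book in enumerate(codebooks)
    ]


def pq_dot(tables, codes):
    """Query-residual score from the codes alone: one table lookup per sub-space."""
    return sum(tables[s][code] for s, code in enumerate(codes))


def filtered_late_interaction(centroid_scores, residual_score, th_r):
    """Equation 6. centroid_scores[i][j] is the score of query term i with the
    centroid of token j; residual_score(i, j) returns the residual contribution."""
    total = 0.0
    for i, row in enumerate(centroid_scores):
        kept = [j for j, s in enumerate(row) if s > th_r]
        if not kept:                       # empty filter set: fall back to Equation 5
            kept = range(len(row))
        total += max(row[j] + residual_score(i, j) for j in kept)
    return total


# Toy PQ setup (hypothetical): d = 4, m = 2 sub-spaces, 2 codewords each.
codebooks = [[[0.1, 0.0], [0.0, 0.1]], [[0.2, 0.0], [0.0, -0.1]]]
q0 = [1.0, 0.0, 0.0, 1.0]
tables = pq_lookup_tables(q0, codebooks)
score_from_codes = pq_dot(tables, codes=[0, 1])
decoded = codebooks[0][0] + codebooks[1][1]   # explicit reconstruction, for comparison only

# Toy scores for 2 query terms and 3 passage tokens.
centroid_scores = [[0.9, 0.2, 0.6], [0.1, 0.3, 0.2]]
residuals = [[0.0, 0.5, 0.1], [0.05, 0.0, 0.3]]
evaluated = []

def residual_score(i, j):
    evaluated.append((i, j))               # track how many residual scores are computed
    return residuals[i][j]

score = filtered_late_interaction(centroid_scores, residual_score, th_r=0.5)
```

In this toy case five residual scores are evaluated instead of six and the result coincides with the unfiltered one; in general the filter is an approximation whose quality depends on the chosen threshold.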

4. Experimental Evaluation

Experimental Settings. In this section, we compare our methodology with the state-of-the-art engine for multi-vector dense retrieval, namely PLAID [4]. Our experiments are conducted on the MS MARCO passages dataset [6] for in-domain evaluation and on LoTTE [3] for out-of-domain evaluation. Embeddings for MS MARCO are generated using the ColBERTv2 model, resulting in a dataset composed of about 600 million $d$-dimensional vectors, with $d = 128$. The implementation of Product Quantization utilizes the FAISS [7] library and is optimized using the JMPQ [8] technique. The experiments are carried out on an Intel Xeon Gold 5318Y CPU clocked at 2.10 GHz, equipped with the AVX512 instruction set, and executed with single-threading. The code is compiled using GCC 11.3.0 with the -O3 compilation option on a Linux 5.15.0-72 machine.

Evaluation. Table 1 compares EMVB against PLAID on the MS MARCO dataset in terms of memory requirements (number of bytes per embedding), average query latency (in milliseconds), MRR@10, Recall@100, and Recall@1000. With $m = 16$, EMVB almost halves the per-vector memory load compared to PLAID, achieving up to 2.8× faster processing with minimal impact on retrieval effectiveness. By doubling the number of sub-partitions per vector, i.e., $m = 32$, EMVB surpasses PLAID’s performance in terms of MRR and Recall while maintaining the same memory footprint, achieving up to 2.5× speedup.
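The memory claims above can be checked with a short back-of-the-envelope computation. It rests on two assumptions that are not stated in the text: the centroid id is stored in 4 bytes, and each PQ sub-partition is encoded with a 1-byte code. PLAID is taken with $b = 2$ bits per dimension and $d = 128$.

```latex
\begin{align*}
\text{PLAID } (b = 2):\quad & \frac{d \times b}{8} + 4 = \frac{128 \times 2}{8} + 4 = 36 \text{ bytes} \\
\text{EMVB } (m = 16):\quad & 16 \times 1 + 4 = 20 \text{ bytes}, \qquad \frac{36}{20} = 1.8 \\
\text{EMVB } (m = 32):\quad & 32 \times 1 + 4 = 36 \text{ bytes}
\end{align*}
```

Under these assumptions, the 1.8× memory reduction for $m = 16$ and the equal footprint for $m = 32$ follow directly.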

Table 2 presents a comparison between EMVB and PLAID in the out-of-domain evaluation on the LoTTE dataset. As in PLAID [4], Success@5 and Success@100 are employed as retrieval quality metrics. On this dataset, EMVB exhibits slightly lower retrieval quality. It is worth noting that JMPQ [8] cannot be applied in the out-of-domain evaluation due to the absence of training queries. Instead, we utilize Optimized Product Quantization (OPQ) [9], which searches for an optimal rotation of the dataset vectors to mitigate the quality degradation associated with PQ. To limit the retrieval quality loss, PQ is used with $m = 32$, as a larger number of partitions offers a better representation of the original vector. However, EMVB provides a substantial speedup of up to 2.9× compared to PLAID in this out-of-domain evaluation. This larger speedup, compared to MS MARCO, is attributed to the larger average document length in LoTTE. In this context, filtering non-relevant documents using our bit vector-based approach significantly impacts efficiency. It is noteworthy that, for the out-of-domain evaluation, our pre-filtering method could be integrated into PLAID. This integration could maintain PLAID’s accuracy while benefiting from EMVB’s efficiency. Combinations of PLAID and EMVB are left for future exploration.

Table 1: Comparison between EMVB and PLAID on MS MARCO for 10, 100, and 1000 retrieved passages. Methods: PLAID, EMVB (m=16), EMVB (m=32). [Table values not recoverable.]

Table 2: Comparison between EMVB and PLAID on LoTTE for 10, 100, and 1000 retrieved passages. Methods: PLAID, EMVB (m=32). [Table values not recoverable.]

Acknowledgments

This work was supported by the EU - NGEU, by the PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 1 “Human-centered AI” funded by the European Commission under the NextGeneration EU program, by the PNRR ECS00000017 Tuscany Health Ecosystem Spoke 6 “Precision medicine & personalized healthcare”, by the European Commission under the NextGeneration EU programme, by the Horizon Europe RIA “Extreme Food Risk Analytics” (EFRA), grant agreement n. 101093026, by the “Algorithms, Data Structures and Combinatorics for Machine Learning” (MIUR-PRIN 2017), and by the “Algorithmic Problems and Machine Learning” (MIUR-PRIN 2022).

[1] F. M. Nardini, C. Rulli, R. Venturini, Efficient multi-vector dense retrieval with bit vectors, in: Proceedings of the 46th European Conference on Information Retrieval (ECIR 2024), 2024.

[2] Khattab, Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 39-48.

[3] Santhanam, Khattab, Saad-Falcon, Potts, M. Zaharia, ColBERTv2: Effective and efficient retrieval via lightweight late interaction, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022.

[4] Santhanam, Khattab, Potts, Zaharia, PLAID: An efficient engine for late interaction retrieval, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022.

[5] Jegou, Douze, Schmid, Product quantization for nearest neighbor search, IEEE Transactions on Pattern Analysis and Machine Intelligence (2010).

[6] Nguyen, Rosenberg, Song, Gao, Tiwary, Majumder, L. Deng, MS MARCO: A human-generated machine reading comprehension dataset.

[7] Johnson, M. Douze, Jegou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data (2021).

[8] Fang, Zhan, Liu, Mao, Zhang, S. Ma, Joint optimization of multi-vector representation with product quantization, in: Natural Language Processing and Chinese Computing, 2022.

[9] Ge, He, Ke, Sun, Optimized product quantization, IEEE Transactions on Pattern Analysis and Machine Intelligence (2013).