<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient and Effective Multi-Vector Dense Retrieval with EMVB</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Franco Maria Nardini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cosimo Rulli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rossano Venturini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff1">
          <institution>ISTI-CNR</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Dense retrieval techniques utilize large pre-trained language models to construct a high-dimensional representation of queries and passages. These representations are used to assess the relevance of a passage with respect to a query through efficient similarity measures. Multi-vector representations, while enhancing effectiveness, cause a one-order-of-magnitude increase in memory footprint and query latency by encoding queries and documents on a per-token level. The current state-of-the-art approach, namely PLAID, has introduced a centroid-based term representation to mitigate the memory impact of multi-vector systems. By employing a centroid interaction mechanism, PLAID filters out non-relevant documents, reducing the cost of subsequent ranking stages. This paper introduces "Efficient Multi-Vector dense retrieval with Bit vectors" (EMVB), a novel framework for efficient query processing in multi-vector dense retrieval. Firstly, EMVB utilizes an optimized bit vector pre-filtering step for passages, enhancing efficiency. Secondly, the computation of centroid interaction occurs column-wise, leveraging SIMD instructions to reduce latency. Thirdly, EMVB incorporates Product Quantization (PQ) to decrease the memory footprint of storing vector representations while facilitating fast late interaction. Lastly, a per-document term filtering method is introduced, further improving the efficiency of the final step. Experiments conducted on MS MARCO and LoTTE demonstrate that EMVB achieves up to a 2.8× speed improvement while reducing the memory footprint by 1.8×, without compromising retrieval accuracy compared to PLAID.</p>
      </abstract>
      <kwd-group>
        <kwd>Dense Retrieval</kwd>
        <kwd>Multi-Vector</kwd>
        <kwd>Efficiency</kwd>
        <kwd>Bit Vectors</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The widely acknowledged capability of Large Language Models (LLMs) to model semantics and
context has been extensively used in Information Retrieval. In Dense Retrieval, LLMs are used
to encode documents and queries into <inline-formula><tex-math>d</tex-math></inline-formula>-dimensional vectors. This enables the modeling of
document-query relevance using simple metrics like Euclidean distance. Along this line, a successful
strategy involves using multi-vector representations for documents and queries, where a
<inline-formula><tex-math>d</tex-math></inline-formula>-dimensional vector is produced for each token in the text. In this context, the similarity between
the query and the passage is measured using the so-called late interaction mechanism. This
mechanism works by computing, for each term of the query, the maximum similarity with the terms
of a candidate passage, and then summing these maxima. Although multi-vector representations enhance
effectiveness, they come at the cost of increased computational burden, including a larger
memory footprint and longer retrieval time.
      </p>
      <p>
        This paper is an extended abstract of Nardini et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Various approaches have been proposed to enhance the efficiency and reduce the memory
demands of multi-vector systems. ColBERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] exploits an inverted index to store all the
term embeddings and retrieve the candidate passages, but it requires keeping the
full-precision representation of each document term in memory, which can be substantial (e.g.,
140 GB for MS MARCO). ColBERTv2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduces a centroid-based compression technique
where each embedding is stored by saving the id of the closest centroid and then compressing
the residual (i.e., the element-wise difference) using 1 or 2 bits per component. ColBERTv2
uses up to 10× less space than ColBERT but sacrifices retrieval efficiency, requiring up to 3
seconds to process a query on CPU. PLAID [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] builds on the embedding compressor
of ColBERTv2 and leverages the centroid-based representation to discard non-relevant passages
(centroid interaction), thus performing the late interaction exclusively on a carefully selected
batch of passages. PLAID allows for a massive speedup compared to ColBERTv2, but its average
query latency can be up to 400 ms on CPU with single-thread execution [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        This paper introduces EMVB, a novel framework designed for efficient query processing in
multi-vector dense retrieval. The key focus is on addressing the most time-consuming steps
identified in PLAID, which include: i) extracting the top-nprobe closest centroids during
candidate passage selection, ii) computing the centroid interaction mechanism, and iii) decompressing
quantized residuals. To address the first two steps, we propose a highly efficient passage filtering
approach based on optimized bit vectors. This approach significantly reduces the cost of
top-nprobe extraction by identifying a small set of crucial centroid scores. Additionally, it decreases
the number of passages for which centroid interaction computation is necessary. We further
enhance efficiency in the second step by introducing a highly efficient column-wise reduction
leveraging SIMD instructions. For the third step, late interaction efficiency is improved by
introducing Product Quantization (PQ) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This method provides comparable or superior
performance compared to PLAID’s bitwise compressor, while being up to 3× faster. Additionally,
we introduce a dynamic passage-term-selection criterion for late interaction, reducing the cost
of this step by up to 30%.
      </p>
      <p>
        Experimental evaluations on MS MARCO [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] passage (in-domain) and LoTTE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
(out-of-domain) datasets demonstrate the effectiveness of EMVB compared to PLAID. On MS MARCO,
EMVB achieves up to a 2.8× speed improvement while reducing the memory footprint by 1.8×
without compromising retrieval accuracy. In the out-of-domain evaluation, EMVB delivers up
to a 2.9× speedup compared to PLAID with minimal loss in retrieval quality.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PLAID</title>
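      <p>
        In a multi-vector dense retrieval scenario, an LLM encodes a passage <inline-formula><tex-math>P</tex-math></inline-formula> into a collection of <inline-formula><tex-math>n_t</tex-math></inline-formula>
dense <inline-formula><tex-math>d</tex-math></inline-formula>-dimensional vectors <inline-formula><tex-math>T_j</tex-math></inline-formula>, where <inline-formula><tex-math>n_t</tex-math></inline-formula> is the number of tokens in the passage. Encoding each
token in each passage generates a large collection, e.g., almost 600M vectors for the 8.8M
passages in MS MARCO. For this reason, ColBERTv2 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and, subsequently, PLAID [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] exploit a
centroid-based vector compression technique. First, the K-Means algorithm is used to identify
a set of <inline-formula><tex-math>n_c</tex-math></inline-formula> centroids <inline-formula><tex-math>C = \{C_i\}_{i=1}^{n_c}</tex-math></inline-formula>. Then, the residual <inline-formula><tex-math>r</tex-math></inline-formula> between a vector <inline-formula><tex-math>x</tex-math></inline-formula> and its closest centroid
<inline-formula><tex-math>\bar{C}</tex-math></inline-formula> is computed as <inline-formula><tex-math>r = x - \bar{C}</tex-math></inline-formula> and compressed into <inline-formula><tex-math>\tilde{r}</tex-math></inline-formula> using a <inline-formula><tex-math>b</tex-math></inline-formula>-bit
encoder that represents each dimension of <inline-formula><tex-math>r</tex-math></inline-formula> using <inline-formula><tex-math>b</tex-math></inline-formula> bits, with <inline-formula><tex-math>b \in \{1, 2\}</tex-math></inline-formula>. This way, the memory
footprint of each vector is given by <inline-formula><tex-math>\lceil \log_2 |C| \rceil</tex-math></inline-formula> bits for the centroid index and <inline-formula><tex-math>d \times b</tex-math></inline-formula> bits for the
compressed residual. At scoring time, decompressing the residual encoding is inefficient. For
this reason, PLAID aims at decompressing as few candidate documents as possible by hinging on
the so-called centroid interaction filtering step [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        We now detail the PLAID retrieval system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
After the K-Means algorithm, each centroid is linked with a posting list containing the ids of the
candidate passages. A passage belongs to the candidate list of a centroid <inline-formula><tex-math>C_i</tex-math></inline-formula> if at least one of its tokens
has <inline-formula><tex-math>C_i</tex-math></inline-formula> as its closest centroid. The query processing starts by computing the top-nprobe closest
centroids for each query term <inline-formula><tex-math>q_i</tex-math></inline-formula>, with <inline-formula><tex-math>i = 1, \dots, n_q</tex-math></inline-formula>, according to the dot product similarity
measure. From the set of closest centroids, the candidate passages are retrieved, thanks
to the previously built posting lists. In the centroid interaction step, the similarity between the
<inline-formula><tex-math>i</tex-math></inline-formula>-th query term <inline-formula><tex-math>q_i</tex-math></inline-formula> and a token embedding <inline-formula><tex-math>T_j</tex-math></inline-formula>, with <inline-formula><tex-math>j = 1, \dots, n_t</tex-math></inline-formula>, is approximated as
        <disp-formula id="eq1"><label>(1)</label><tex-math>q_i \cdot T_j \simeq q_i \cdot \bar{C}^{T_j} = \tilde{T}_{i,j},</tex-math></disp-formula>
where <inline-formula><tex-math>\bar{C}^{T_j}</tex-math></inline-formula> is the closest centroid to <inline-formula><tex-math>T_j</tex-math></inline-formula>. We estimate the score of a passage <inline-formula><tex-math>P</tex-math></inline-formula> with <inline-formula><tex-math>n_t</tex-math></inline-formula> terms as
        <disp-formula id="eq2"><label>(2)</label><tex-math>\bar{S}_{q,P} = \sum_{i=1}^{n_q} \max_{j=1 \dots n_t} q_i \cdot \bar{C}^{T_j}.</tex-math></disp-formula>
      </p>
      <p>
        In the decompression phase, the full-precision representation of <inline-formula><tex-math>P</tex-math></inline-formula> is reconstructed by
combining the centroids and the residuals. Only the top-ndocs passages from the previous centroid
interaction step move to this step. Finally, PLAID applies late interaction [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] to compute
the score of a reconstructed candidate passage <inline-formula><tex-math>P</tex-math></inline-formula> against a query <inline-formula><tex-math>q</tex-math></inline-formula>. The late interaction measure
is defined as
        <disp-formula id="eq3"><label>(3)</label><tex-math>S_{q,P} = \sum_{i=1}^{n_q} \max_{j=1 \dots n_t} q_i \cdot T_j.</tex-math></disp-formula>
Passages are then ranked according to their similarity score and the
top-<inline-formula><tex-math>k</tex-math></inline-formula> passages are selected.
      </p>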
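      <p>As an illustration of Equations 2 and 3, the following minimal Python sketch computes the exact late interaction score and its centroid-based approximation on toy data. The function names, the toy vectors, and the token-to-centroid assignment are assumptions made for this example and are not taken from the PLAID or EMVB code.</p>
      <preformat><![CDATA[
```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction(query, passage):
    """Equation 3: for each query term, take the best-matching passage token, then sum."""
    return sum(max(dot(q, t) for t in passage) for q in query)


def centroid_interaction(query, passage_centroid_ids, centroids):
    """Equation 2: same reduction, with each token replaced by its closest centroid."""
    return sum(
        max(dot(q, centroids[c]) for c in passage_centroid_ids)
        for q in query
    )


# Toy data (hypothetical): 2 query terms, 3 passage tokens, 2 centroids, d = 2.
query = [[1.0, 0.0], [0.0, 1.0]]
passage = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
passage_centroid_ids = [0, 1, 0]  # id of the closest centroid of each token

exact = late_interaction(query, passage)
approx = centroid_interaction(query, passage_centroid_ids, centroids)
```
]]></preformat>
      <p>The approximate score needs only the centroid ids of the passage, so it can be computed without touching the compressed residuals.</p>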
      <p>PLAID execution time. We present a detailed analysis of PLAID’s execution time,
broken down into distinct phases: retrieval, filtering, decompression, and late interaction. The
experiments follow the settings outlined in Section 4. The resulting execution times are
reported for various values of <inline-formula><tex-math>k</tex-math></inline-formula>, the number of retrieved passages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. EMVB</title>
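      <p>Fast Closest Centroids Selection. The retrieval phase in PLAID is time-consuming, as
shown in Figure 1. This step consists of i) a matrix multiplication between the query matrix and the
centroids for distance computation, and ii) the identification of the top-nprobe closest centroids for each query
term. Perhaps surprisingly, the latter step is the most time-consuming (3× slower than the matrix
multiplication), even when performed with asymptotically linear algorithms such as quickselect.
Our pre-filtering strategy, explained in the next paragraph, effectively accelerates the
selection of the top-nprobe by minimizing the number of evaluated elements. In practice, we
efficiently discard centroids with scores below a predefined threshold, and then apply
quickselect exclusively to the remaining ones. As a result, in EMVB, the cost of extracting the
top-nprobe becomes negligible, with a speed improvement of two orders of magnitude
compared to extracting the top-nprobe from the complete set of centroids.</p>
      <p>Pre-filtering using bit vectors. Let us recall the definition of <inline-formula><tex-math>\tilde{T}_{i,j}</tex-math></inline-formula>, which represents the
approximate score of the <inline-formula><tex-math>j</tex-math></inline-formula>-th token of passage <inline-formula><tex-math>P</tex-math></inline-formula> with respect to the <inline-formula><tex-math>i</tex-math></inline-formula>-th term of the query <inline-formula><tex-math>q</tex-math></inline-formula>,
as defined in Equation 1. Estimating whether <inline-formula><tex-math>\tilde{T}_{i,j}</tex-math></inline-formula> has a large value is a proxy for estimating
the importance of a passage <inline-formula><tex-math>P</tex-math></inline-formula> with respect to the query. Given a passage <inline-formula><tex-math>P</tex-math></inline-formula>, our pre-filtering consists
in determining whether <inline-formula><tex-math>\tilde{T}_{i,j}</tex-math></inline-formula>, for <inline-formula><tex-math>i = 1, \dots, n_q</tex-math></inline-formula> and <inline-formula><tex-math>j = 1, \dots, n_t</tex-math></inline-formula>, is large or not.
It works by checking if the
centroid associated with <inline-formula><tex-math>T_j</tex-math></inline-formula>, i.e., <inline-formula><tex-math>\bar{C}^{T_j}</tex-math></inline-formula>, belongs to the set of the closest centroids of <inline-formula><tex-math>q_i</tex-math></inline-formula>. We define
<inline-formula><tex-math>\mathit{close}_i^{th}</tex-math></inline-formula> as the set of centroids whose scores surpass a specified threshold <inline-formula><tex-math>th</tex-math></inline-formula> in relation to a
query term <inline-formula><tex-math>q_i</tex-math></inline-formula>. For a certain passage <inline-formula><tex-math>P</tex-math></inline-formula>, we also introduce the list of centroid ids <inline-formula><tex-math>I_P</tex-math></inline-formula>, where
<inline-formula><tex-math>I_P^j</tex-math></inline-formula> is the centroid id of <inline-formula><tex-math>\bar{C}^{T_j}</tex-math></inline-formula>. The similarity of a passage with respect to a query can be rapidly
estimated with our novel filtering function <inline-formula><tex-math>F(P, q) \in [0, n_q]</tex-math></inline-formula>, defined as
        <disp-formula id="eq4"><label>(4)</label><tex-math>F(P, q) = \sum_{i=1}^{n_q} \mathbf{1}\left(\exists\, j \text{ s.t. } I_P^j \in \mathit{close}_i^{th}\right).</tex-math></disp-formula>
For a passage <inline-formula><tex-math>P</tex-math></inline-formula>, this counts how many query terms have at least one similar passage term in
<inline-formula><tex-math>P</tex-math></inline-formula>, where “similar” means that the centroid of <inline-formula><tex-math>T_j</tex-math></inline-formula> belongs to <inline-formula><tex-math>\mathit{close}_i^{th}</tex-math></inline-formula>.</p>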
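      <p>The filtering function of Equation 4 can be sketched in a few lines of Python. This is a plain set-based illustration under assumed names (<monospace>close_sets</monospace>, <monospace>prefilter_score</monospace>) and toy scores; it shows what is counted, not how EMVB implements it efficiently.</p>
      <preformat><![CDATA[
```python
def close_sets(centroid_scores, th):
    """For each query term i, the set of centroid ids whose score exceeds th."""
    return [
        {j for j, score in enumerate(row) if score > th}
        for row in centroid_scores
    ]


def prefilter_score(passage_centroid_ids, close):
    """Equation 4: number of query terms for which at least one passage token
    has its centroid id in that term's close set."""
    return sum(
        1 for close_i in close
        if any(c in close_i for c in passage_centroid_ids)
    )


# Toy data (hypothetical): 3 query terms, 4 centroids.
centroid_scores = [
    [0.9, 0.1, 0.2, 0.5],
    [0.0, 0.6, 0.1, 0.3],
    [0.2, 0.3, 0.1, 0.2],
]
close = close_sets(centroid_scores, th=0.4)

score_a = prefilter_score([0, 2], close)     # tokens mapped to centroids 0 and 2
score_b = prefilter_score([3, 1, 1], close)  # tokens mapped to centroids 3, 1, 1
```
]]></preformat>
      <p>In this toy example the third query term has an empty close set, so no passage can score more than 2.</p>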
      <p>In Figure 2 (left), we present a performance comparison of our pre-filter operating
in conjunction with the centroid interaction mechanism (depicted by orange, blue, and green
lines) against the performance of the centroid interaction mechanism applied to the entire set
of candidate documents (indicated by the red dashed line) on the MS MARCO dataset. The plot
illustrates that our pre-filtering approach efficiently eliminates non-relevant passages without
adversely affecting the recall of the subsequent centroid interaction phase. For instance, we can
reduce the candidate passage set to just 1000 elements using a threshold of 0.4
without any compromise in R@100. In the following paragraphs, we detail the implementation
of this pre-filter for optimal efficiency.</p>
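      <p>
        Building the bit vectors. Let <inline-formula><tex-math>CS = q \cdot C^{T}</tex-math></inline-formula>, with <inline-formula><tex-math>CS \in [-1, 1]^{n_q \times |C|}</tex-math></inline-formula>, where <inline-formula><tex-math>n_q</tex-math></inline-formula> is the number of
query tokens and <inline-formula><tex-math>|C|</tex-math></inline-formula> is the number of centroids. For each <inline-formula><tex-math>i</tex-math></inline-formula>-th row of <inline-formula><tex-math>CS</tex-math></inline-formula>, we want to scan it and
pick those <inline-formula><tex-math>j</tex-math></inline-formula> s.t. <inline-formula><tex-math>CS_{i,j} &gt; th</tex-math></inline-formula>. This conceptually trivial algorithm can be implemented by
leveraging the SIMD instructions featured by modern CPUs. In particular, the AVX512 instruction set allows
comparing 16 fp32 values at a time thanks to the _mm512_cmp_epi32_mask instruction and
storing the comparison result in a <inline-formula><tex-math>\mathit{mask}</tex-math></inline-formula> variable. The indexes <inline-formula><tex-math>J = \{ j \in [0, 15] \mid \mathit{mask}_j = 1 \}</tex-math></inline-formula> (if
any) can be efficiently extracted by means of the _mm512_mask_compressstore instruction.
      </p>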
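      <p>To make the scan concrete, the sketch below emulates the 16-lane compare-and-extract pattern in scalar Python and then selects the top-nprobe centroids among the survivors only. It is not AVX512 code: the helper names are hypothetical, the 16-bit integer stands in for the SIMD mask, and <monospace>heapq.nlargest</monospace> from the standard library is used here in place of quickselect.</p>
      <preformat><![CDATA[
```python
import heapq

LANES = 16  # number of fp32 values held by one 512-bit register


def compare_mask(chunk, th):
    """Emulate a 16-lane compare: bit `lane` of the mask is set iff chunk[lane] > th."""
    mask = 0
    for lane, value in enumerate(chunk):
        if value > th:
            mask |= 1 << lane
    return mask


def select_above_threshold(row, th):
    """Indices j with row[j] > th, scanning the row 16 values at a time."""
    selected = []
    for start in range(0, len(row), LANES):
        mask = compare_mask(row[start:start + LANES], th)
        if mask == 0:
            continue  # no lane passed the comparison: skip the whole chunk
        # Emulate a compress-store: append the indices of the set lanes, in order.
        for lane in range(LANES):
            if (mask >> lane) & 1:
                selected.append(start + lane)
    return selected


def top_nprobe(row, th, nprobe):
    """Top-nprobe centroid ids, computed only over the scores that exceed th."""
    candidates = select_above_threshold(row, th)
    return heapq.nlargest(nprobe, candidates, key=lambda j: row[j])


# Toy row (hypothetical): 40 centroid scores for one query term.
row = [0.0] * 40
row[3], row[17], row[18], row[39] = 0.7, 0.45, 0.2, 0.9

indices = select_above_threshold(row, th=0.4)
best = top_nprobe(row, th=0.4, nprobe=2)
```
]]></preformat>
      <p>Only three of the forty scores survive the threshold in this example, so the top-nprobe selection works on a much smaller input than the full row.</p>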
      <p>
        The efficiency of algorithms employing if-based structures is largely contingent on the
branch misprediction ratio. Contemporary CPUs speculate about the outcome of the if condition
by identifying patterns in the algorithm’s execution flow. If an incorrect branch prediction
occurs, a control hazard arises, leading to a pipeline flush with a delay of 15 to 20 clock
cycles, approximately 10 ns. To address the inefficiency associated with branch misprediction,
we introduce a branchless algorithm. For a detailed description of the algorithm and of its
vectorized version, refer to the original paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
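      <p>For intuition only, the following Python sketch contrasts an if-based filter with one common branchless formulation, in which the index is always written and the output cursor advances by the result of the comparison. This is an assumption about the general technique, not a reproduction of the algorithm in the original paper, and Python itself will not show the speed difference that compiled code exhibits.</p>
      <preformat><![CDATA[
```python
def filter_if(row, th):
    """If-based filter: an index is appended only when the comparison succeeds."""
    out = []
    for j, value in enumerate(row):
        if value > th:
            out.append(j)
    return out


def filter_branchless(row, th):
    """Branchless-style filter: always write the index, then advance the cursor
    by the comparison result (0 or 1) instead of branching on it."""
    buffer = [0] * len(row)  # the cursor never exceeds j, so writes stay in range
    count = 0
    for j, value in enumerate(row):
        buffer[count] = j           # unconditional write
        count += int(value > th)    # 1 keeps the slot, 0 lets it be overwritten
    return buffer[:count]


# Toy scores (hypothetical).
row = [0.1, 0.5, 0.3, 0.9, 0.45]
kept_if = filter_if(row, th=0.4)
kept_branchless = filter_branchless(row, th=0.4)
```
]]></preformat>
      <p>Both functions return the same indices; the second performs the same work for every element regardless of the threshold, which matches the constant execution time reported for the branchless variants.</p>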
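      <p>Figure 2 (right) presents a comparison of our different approaches, namely "Naive IF,"
"Vectorized IF," "Branchless," and "VecBranchless," described above. Branchless algorithms
present a constant execution time, regardless of the value of the threshold, while if-based
approaches offer better performance as the value of <inline-formula><tex-math>th</tex-math></inline-formula> increases. With <inline-formula><tex-math>th \geq 0.3</tex-math></inline-formula>, "Vectorized
IF" is the most efficient approach, with a speedup of up to 3× compared to its naive counterpart.</p>
      <p>Fast set membership. We now move to the problem of computing Equation 4, assuming
<inline-formula><tex-math>\mathit{close}_i^{th}</tex-math></inline-formula> to be known. Observe that this is an integer set membership problem, where we have
to test if at least one member of <inline-formula><tex-math>I_P</tex-math></inline-formula> belongs to <inline-formula><tex-math>\mathit{close}_i^{th}</tex-math></inline-formula>, with <inline-formula><tex-math>i = 1, \dots, n_q</tex-math></inline-formula>. Bit vectors (or bit
arrays) are a widely adopted solution for implementing sets of integer values. A bit vector maps
a set of integers up to <inline-formula><tex-math>N</tex-math></inline-formula> into an array of <inline-formula><tex-math>N</tex-math></inline-formula> bits, where the <inline-formula><tex-math>e</tex-math></inline-formula>-th bit is set to one if and only if
the integer <inline-formula><tex-math>e</tex-math></inline-formula> is part of the set. Operations like adding or searching for any integer <inline-formula><tex-math>e</tex-math></inline-formula> can be
executed in constant time using bit manipulation operators. In terms of memory occupancy, a
bit vector requires <inline-formula><tex-math>N/8</tex-math></inline-formula> bytes. In our scenario, given that <inline-formula><tex-math>|C| = 2^{18}</tex-math></inline-formula>, a bit vector only needs
32 KB (<inline-formula><tex-math>2^{15}</tex-math></inline-formula> bytes) of storage.</p>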
      <p>We further improve the efficiency of bit vectors by relying on the specific properties of
our setting. As we need to search through all the <inline-formula><tex-math>n_q</tex-math></inline-formula> bit vectors at a time, we rearrange the
representation of the bit vectors by stacking them vertically (Figure 3). This allows searching a
centroid index through all the <inline-formula><tex-math>\mathit{close}_i^{th}</tex-math></inline-formula> at a time. The bits corresponding to the same centroid
for different query terms are consecutive and fit a 32-bit word. This way, we can simultaneously
test the membership for all the query terms in constant time with a single bitwise operation. In
detail, our algorithm starts by initializing a mask <inline-formula><tex-math>m</tex-math></inline-formula> of <inline-formula><tex-math>n_q = 32</tex-math></inline-formula> bits to zero (Step 1, Figure 3).
Subsequently, for each term in the candidate document, it performs a bitwise or between the
mask and the 32-bit word representing membership to all the query terms (Step 2, Figure 3).
Consequently, Equation 4 can be obtained by counting the number of 1s in <inline-formula><tex-math>m</tex-math></inline-formula> at the end of the
execution using the popcnt operation available in modern CPUs (Step 3, Figure 3).
The resulting pre-filter is faster than the “Baseline” usage of bit vectors, and up to 30× faster than the centroid interaction.</p>
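      <p>The three steps can be sketched in Python as follows. The names and toy data are hypothetical, Python integers stand in for the 32-bit words, and <monospace>bin(mask).count("1")</monospace> stands in for the popcnt instruction. The mask accumulates memberships with a bitwise OR, so that a query term matched by several passage tokens is still counted once.</p>
      <preformat><![CDATA[
```python
def build_stacked_bitvectors(close, num_centroids):
    """words[c] has bit i set iff centroid c belongs to the close set of query term i."""
    words = [0] * num_centroids
    for i, close_i in enumerate(close):
        for c in close_i:
            words[c] |= 1 << i
    return words


def prefilter_score(passage_centroid_ids, words):
    """Equation 4 with one bitwise OR per passage token and a final popcount."""
    mask = 0                          # Step 1: no query term matched yet
    for c in passage_centroid_ids:
        mask |= words[c]              # Step 2: merge the memberships of this token
    return bin(mask).count("1")       # Step 3: count the matched query terms


# Toy data (hypothetical): 3 query terms, 4 centroids.
close = [{0, 3}, {1}, set()]
words = build_stacked_bitvectors(close, num_centroids=4)

score = prefilter_score([3, 1, 1], words)  # centroid 1 appears twice
```
]]></preformat>
      <p>Each passage token costs a single lookup and a single bitwise operation, independently of the number of query terms packed in the word.</p>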
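      <p>
        Our pre-filtering approach allows us to efficiently filter out non-relevant passages and is
employed upstream of PLAID’s centroid interaction (Equation 2). The efficiency of the centroid
interaction itself can be improved by using our column-wise reduction approach. For reasons
of space, we do not report the description of the algorithm in this discussion paper, and
we encourage the reader to refer to the original work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We implement PLAID’s centroid
interaction in C++ and we compare its filtering time against our SIMD-based solution. The
results of the comparison are reported for different numbers of candidate documents in Figure 4
(down). Thanks to the proficient read-write pattern and the highly efficient column-wise
max-reduction, our method can be up to 1.8× faster than the filtering proposed in PLAID.
      </p>
      <p>
        Late Interaction. We propose to replace the <inline-formula><tex-math>b</tex-math></inline-formula>-bit residual compressor [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] with Product Quantization
(PQ) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Introducing PQ has two main advantages. On the one hand, it allows computing
the dot product between an input query vector <inline-formula><tex-math>q</tex-math></inline-formula> and the compressed residual <inline-formula><tex-math>r_{pq}</tex-math></inline-formula> without
decompression. On the other hand, it allows re-using the […]. Consider a query <inline-formula><tex-math>q</tex-math></inline-formula> and a candidate
passage <inline-formula><tex-math>P</tex-math></inline-formula>. We decompose the dot product between a query term <inline-formula><tex-math>q_i</tex-math></inline-formula> and a passage token <inline-formula><tex-math>T_j</tex-math></inline-formula>
into the contribution of the closest centroid <inline-formula><tex-math>\bar{C}^{T_j}</tex-math></inline-formula> and that of the residual. Equation 3 becomes
        <disp-formula id="eq5"><label>(5)</label><tex-math>S_{q,P} = \sum_{i=1}^{n_q} \max_{j=1 \dots n_t} \left(q_i \cdot \bar{C}^{T_j} + q_i \cdot r^{T_j}\right) \simeq \sum_{i=1}^{n_q} \max_{j=1 \dots n_t} \left(q_i \cdot \bar{C}^{T_j} + q_i \cdot r_{pq}^{T_j}\right),</tex-math></disp-formula>
where <inline-formula><tex-math>r^{T_j} = T_j - \bar{C}^{T_j}</tex-math></inline-formula> is the residual of <inline-formula><tex-math>T_j</tex-math></inline-formula> and <inline-formula><tex-math>r_{pq}^{T_j}</tex-math></inline-formula> is its PQ approximation. Our experimental evaluation shows that PQ is both faster (up to
3.6×) and more effective compared to the <inline-formula><tex-math>b</tex-math></inline-formula>-bit compressor used in previous work. We propose
to further improve the efficiency of the scoring phase by hinging on the properties of Equation 5.
In many cases, we have that <inline-formula><tex-math>q_i \cdot \bar{C}^{T_j} &gt; q_i \cdot r_{pq}^{T_j}</tex-math></inline-formula>; hence the max operator on <inline-formula><tex-math>j</tex-math></inline-formula> is led by the
score between the query term and the centroid, rather than the score between the query term
and the residual. We argue that it is possible to compute the scores on the residuals only for a
reduced set of document terms <inline-formula><tex-math>\bar{J}_i</tex-math></inline-formula>, where <inline-formula><tex-math>i</tex-math></inline-formula> identifies the index of the query term. In particular,
<inline-formula><tex-math>\bar{J}_i = \{ j \mid q_i \cdot \bar{C}^{T_j} &gt; th_r \}</tex-math></inline-formula>, where <inline-formula><tex-math>th_r</tex-math></inline-formula> is a second threshold that determines whether the score with
the centroid is sufficiently large. With the introduction of this new per-term filter, Equation 5
becomes the computation of the max operator on the set of document terms in <inline-formula><tex-math>\bar{J}_i</tex-math></inline-formula>, i.e.,
        <disp-formula id="eq6"><label>(6)</label><tex-math>S_{q,P} = \sum_{i=1}^{n_q} \max_{j \in \bar{J}_i} \left(q_i \cdot \bar{C}^{T_j} + q_i \cdot r_{pq}^{T_j}\right).</tex-math></disp-formula>
In practice, we compute the residual scores only for those document terms whose centroid score
is large enough. If <inline-formula><tex-math>\bar{J}_i = \emptyset</tex-math></inline-formula>, we compute <inline-formula><tex-math>S_{q,P}</tex-math></inline-formula> as in Equation 5. We experimentally verify that
this saves up to 30% of the late interaction time with no performance degradation.</p>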
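      <p>A minimal Python sketch of this scoring scheme is given below. It assumes the usual PQ table-lookup evaluation of dot products (one table per sub-space, indexed by the stored code) and applies the per-term filter of Equation 6 with the fallback to Equation 5. The function names, the tiny codebooks, and the scores are hypothetical and serve only to illustrate the control flow.</p>
      <preformat><![CDATA[
```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pq_lookup_tables(q, codebooks):
    """tables[s][c] = dot product of the s-th slice of q with codeword c of sub-space s."""
    sub_dim = len(q) // len(codebooks)
    tables = []
    for s, codebook in enumerate(codebooks):
        q_slice = q[s * sub_dim:(s + 1) * sub_dim]
        tables.append([dot(q_slice, codeword) for codeword in codebook])
    return tables


def pq_dot(tables, codes):
    """Score of a PQ-encoded residual: one table lookup per sub-space, no decompression."""
    return sum(tables[s][c] for s, c in enumerate(codes))


def filtered_late_interaction(centroid_scores, tables, passage_codes, th_r):
    """Equation 6, falling back to Equation 5 when no token passes the filter.

    centroid_scores[i][j]: dot product of query term i with the centroid of token j.
    tables[i]: PQ lookup tables of query term i.
    passage_codes[j]: PQ codes of the residual of token j.
    """
    total = 0.0
    for c_row, q_tables in zip(centroid_scores, tables):
        kept = [j for j, c in enumerate(c_row) if c > th_r]
        if not kept:  # empty filtered set: score every token (Equation 5)
            kept = list(range(len(c_row)))
        total += max(c_row[j] + pq_dot(q_tables, passage_codes[j]) for j in kept)
    return total


# Toy data (hypothetical): d = 4, m = 2 sub-spaces with 2 codewords each.
codebooks = [
    [[0.0, 0.0], [0.1, 0.0]],
    [[0.0, 0.0], [0.0, -0.1]],
]
query = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
tables = [pq_lookup_tables(q, codebooks) for q in query]
passage_codes = [[1, 0], [0, 1], [1, 1]]  # PQ codes of 3 passage tokens
centroid_scores = [[0.8, 0.85, 0.2], [0.3, 0.1, 0.2]]

score = filtered_late_interaction(centroid_scores, tables, passage_codes, th_r=0.5)
```
]]></preformat>
      <p>For the first query term only two of the three tokens have their residual scored; for the second term no centroid score exceeds the threshold, so all tokens are scored. In this toy example the result coincides with the unfiltered score.</p>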
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <p>
        Experimental Settings. In this section, we compare our methodology with the state-of-the-art
engine for multi-vector dense retrieval, namely PLAID [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Our experiments are conducted on
the MS MARCO passages dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for in-domain evaluation and on LoTTE [
        <xref ref-type="bibr" rid="ref3">3</xref>
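        ] for
out-of-domain evaluation. Embeddings for MS MARCO are generated using the ColBERTv2 model,
resulting in a dataset composed of about 600 million <inline-formula><tex-math>d</tex-math></inline-formula>-dimensional vectors, with <inline-formula><tex-math>d = 128</tex-math></inline-formula>. The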
implementation of Product Quantization utilizes the FAISS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] library and is optimized using the
JMPQ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] technique. The experiments are carried out on an Intel Xeon Gold 5318Y CPU clocked
at 2.10 GHz, equipped with the AVX512 instruction set, and executed with single-threading. The
code is compiled using GCC 11.3.0 with -O3 compilation options on a Linux 5.15.0-72 machine.
      </p>
      <p>
        Evaluation. Table 1 compares EMVB against PLAID on the MS MARCO dataset, in terms of
memory requirements (number of bytes per embedding), average query latency (in milliseconds),
MRR@10, Recall@100, and Recall@1000. With <inline-formula><tex-math>m = 16</tex-math></inline-formula>, EMVB almost halves the per-vector
memory load compared to PLAID, achieving up to 2.8× faster processing with minimal impact
on retrieval effectiveness. Doubling the number of sub-partitions per vector, i.e., <inline-formula><tex-math>m = 32</tex-math></inline-formula>,
EMVB surpasses PLAID’s performance in terms of MRR and Recall while maintaining the same
memory footprint, achieving up to 2.5× speedup.
      </p>
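      <p>The memory figures can be checked with a short calculation. The sketch below assumes a 4-byte centroid id per embedding, a 2-bit residual encoder for PLAID, and one 8-bit code per PQ sub-vector for EMVB; these storage details are assumptions made for the example, chosen because they are consistent with the ratios reported in the text.</p>
      <preformat><![CDATA[
```python
import math


def plaid_bytes(dim, bits_per_dim, centroid_id_bytes=4):
    """Centroid id plus b bits for each of the d residual components."""
    return centroid_id_bytes + math.ceil(dim * bits_per_dim / 8)


def pq_bytes(num_subvectors, bits_per_code=8, centroid_id_bytes=4):
    """Centroid id plus one code for each of the m PQ sub-vectors."""
    return centroid_id_bytes + math.ceil(num_subvectors * bits_per_code / 8)


plaid = plaid_bytes(dim=128, bits_per_dim=2)  # assumed b = 2
emvb_16 = pq_bytes(16)
emvb_32 = pq_bytes(32)
ratio = plaid / emvb_16
```
]]></preformat>
      <p>Under these assumptions, EMVB with m = 16 uses 20 bytes per embedding against 36 for PLAID, a 1.8× reduction, while m = 32 matches PLAID at 36 bytes.</p>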
      <p>
        Table 2 presents a comparison between EMVB and PLAID in the out-of-domain evaluation on
the LoTTE dataset. Similar to PLAID [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Success@5 and Success@100 are employed as retrieval
quality metrics. On this dataset, EMVB exhibits slightly lower performance in terms of retrieval
quality. It’s worth noting that JMPQ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] cannot be applied in the out-of-domain evaluation due
to the absence of training queries. Instead, we utilize Optimized Product Quantization (OPQ) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
which searches for an optimal rotation of the dataset vectors to mitigate the quality degradation
associated with PQ. To address the retrieval quality loss, PQ is experimented with <inline-formula><tex-math>m = 32</tex-math></inline-formula>, as
an increased number of partitions offers a better representation of the original vector. However,
EMVB provides a substantial speedup of up to 2.9× compared to PLAID in this out-of-domain
evaluation. This larger speedup, compared to MS MARCO, is attributed to the larger average
document length in LoTTE. In this context, filtering non-relevant documents using our bit
vector-based approach significantly impacts efficiency. It is noteworthy that, for the
out-of-domain evaluation, our pre-filtering method could be integrated into PLAID. This integration
could maintain PLAID’s accuracy while benefiting from EMVB’s efficiency. Combinations of
PLAID and EMVB are left for future exploration.
      </p>
      <p>Table 1: PLAID, EMVB (m=16), and EMVB (m=32) on MS MARCO, for k = 10, 100, 1000.</p>
      <p>Table 2: PLAID and EMVB (m=32) on LoTTE, for k = 10, 100, 1000; bytes per embedding: 36 (PLAID) and 36 (EMVB, m=32); metrics include Success@100.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the EU - NGEU, by the PNRR - M4C2 - Investimento 1.3, Partenariato
Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 1 “Human-centered
AI” funded by the European Commission under the NextGeneration EU program, by the PNRR
ECS00000017 Tuscany Health Ecosystem Spoke 6 “Precision medicine &amp; personalized healthcare”,
by the European Commission under the NextGeneration EU programme, by the Horizon Europe
RIA “Extreme Food Risk Analytics” (EFRA), grant agreement n. 101093026, by the “Algorithms,
Data Structures and Combinatorics for Machine Learning” (MIUR-PRIN 2017), and by the
“Algorithmic Problems and Machine Learning” (MIUR-PRIN 2022).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          ,
          <article-title>Efficient multi-vector dense retrieval with bit vectors</article-title>
          ,
          <source>in: Proceedings of the 46th European Conference on Information Retrieval (ECIR 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saad-Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERTv2: Effective and efficient retrieval via lightweight late interaction</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>PLAID: An efficient engine for late interaction retrieval</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <article-title>Product quantization for nearest neighbor search</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng,
          <article-title>MS MARCO: A human-generated machine reading comprehension dataset (????).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Douze,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <article-title>Billion-scale similarity search with GPUs</article-title>
          ,
          <source>IEEE Transactions on Big Data</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ma,
          <article-title>Joint optimization of multi-vector representation with product quantization</article-title>
          ,
          <source>in: Natural Language Processing and Chinese Computing</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Optimized product quantization</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>