On Single and Multiple Representations in Dense Passage Retrieval

Craig Macdonald1, Nicola Tonellotto2 and Iadh Ounis1
1 University of Glasgow, UK
2 University of Pisa, Italy

IIR 2021: The 11th Italian Information Retrieval Workshop, September 13–15, 2021, Bari, Italy

Abstract
The advent of contextualised language models has brought gains in search effectiveness, not just when applied for re-ranking the output of classical weighting models such as BM25, but also when used directly for passage indexing and retrieval, a technique which is called dense retrieval. In the existing literature in neural ranking, two dense retrieval families have become apparent: single representation, where an entire passage is represented by a single embedding (usually that of BERT's [CLS] token, as exemplified by the recent ANCE approach), and multiple representations, where each token in a passage is represented by its own embedding (as exemplified by the recent ColBERT approach). These two families have not been directly compared. However, because of the likely importance of dense retrieval moving forward, a clear understanding of their advantages and disadvantages is paramount. To this end, this paper contributes a direct study of their comparative effectiveness, noting situations where each method under- or over-performs w.r.t. the other, and w.r.t. a BM25 baseline. We observe that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically more effective than single representations for MAP and MRR@10. We also show that multiple representations obtain larger improvements than single representations for the queries that are hardest for BM25, as well as for definitional queries and those with complex information needs.

1. Introduction

Pre-trained contextualised language models such as BERT have been shown to greatly improve retrieval effectiveness over the previous state-of-the-art methods in many information retrieval (IR) tasks [1]. These contextualised language models are able to learn semantic representations, called embeddings, from the contexts of words and, therefore, better capture the relevance of a document w.r.t. a query, with substantial improvements over the classical approach in the ranking and re-ranking of documents [2]. Most BERT-based models are computationally expensive for estimating query-document similarities in ranking, due to the complexity of the underlying transformer neural network [3, 4, 5]. As such, BERT-based ranking models have been used as second-stage rankers in retrieval cascades, in particular to re-rank candidate documents generated by classical relevance models such as BM25 [6, 7, 8]. BERT-based models are also limited in the length of text that they can process, and hence are often applied on passages rather than full documents (which we focus on in this paper); entire document rankings can be obtained by estimating relevance at a passage level, then aggregating [9].

Recently, several works have proposed investigating whether BERT-based systems are able to identify the relevant passages among all passages in a collection, rather than just among a query-dependent sample; these systems represent a new type of retrieval approach, called dense retrieval.
In dense retrieval, passages are represented by real-valued vectors, while the query-document similarity is computed by deploying efficient nearest neighbour techniques over specialised indexes, such as those provided by the FAISS toolkit [10]. Thus far, two different families of dense retrieval approaches have emerged, based on single representations and multiple representations. In particular, DPR [11] and ANCE [12] use a single representation, indexing only the embedding of BERT's [CLS] token, which is therefore assumed to represent the meaning of an entire passage within that single embedding. At retrieval time, the [CLS] embedding of the query is then used to retrieve passages by identifying nearest neighbours using a FAISS index. In contrast, ColBERT [3], which uses multiple representations, indexes an embedding for each token in each document. At retrieval time, a set of the nearest document embeddings to each query embedding is retrieved, by identifying the approximate nearest neighbours from a FAISS index. These passages must then be exactly scored, based on the maximal similarity between the query and the passage embeddings, to obtain the final ranking. These are two markedly different families of dense retrieval approaches. Indeed, as ColBERT records one embedding for every token, this makes for a large index of embeddings, which may allow a richer semantic representation of the content. On the other hand, DPR and ANCE rely on a single embedding sufficiently representing the content of each passage. However, at the time of writing, no systematic study has compared these two families of dense retrieval approaches.

For this reason, this work contributes a first investigation into the effectiveness of single and multiple representation embeddings for dense retrieval, as exemplified by ANCE and ColBERT, respectively. We perform experiments in a controlled environment using the same collection and query sets, and we report several effectiveness metrics, together with a detailed comparison of the results obtained for the two representation families w.r.t. a common baseline, namely BM25. To derive further insights, we also provide a per-query analysis of the effectiveness of single and multiple representations. We observe that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically more effective than single representations for MAP and MRR@10. We also show that multiple representations obtain larger improvements than single representations for the queries that are hardest for BM25, as well as for definitional queries and those with complex information needs.

2. Problem Statement

Embeddings. Contextualised language models such as BERT have been trained on a large corpus for language understanding, and then fine-tuned on smaller, more specific textual collections targeting a particular IR task. Through this fine-tuning, BERT learns how to map texts, either queries or documents, into a multi-dimensional space of one or more vectors, called embeddings. Both queries and documents are tokenised into terms according to a predefined vocabulary; BERT learns a function mapping the tokens in a query into multiple query embeddings, one per query term, and another, potentially different, function mapping the tokens in a document into a document embedding per term.
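To make this mapping concrete, the following is a minimal sketch using the HuggingFace transformers library with the plain bert-base-uncased checkpoint (i.e. not the fine-tuned ANCE or ColBERT encoders); the query text is only an example, and the snippet simply shows where the per-token and [CLS] embeddings come from.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Plain bert-base-uncased, used only to illustrate the token -> embedding mapping;
# ANCE and ColBERT use their own fine-tuned encoders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("who is robert gray", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state[0]  # one 768-d embedding per input token
cls_embedding = token_embeddings[0]              # the [CLS] embedding (first position)

print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
print(token_embeddings.shape)  # (num_tokens, 768), special tokens included
print(cls_embedding.shape)     # torch.Size([768]): the basis of single-representation models
```

A single-representation model keeps only the [CLS] row, while a multiple-representation model keeps one row per token (in ColBERT's case projected down to 128 dimensions, cf. Table 1).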
BERT and its derived models also make use of special tokens, such as the [CLS] (classification) token, the [SEP] (separator) token, and the [MASK] (masked) token. In particular, [CLS] is always placed at the beginning of any text given as input to BERT, both at training and inference time, and is used to let BERT learn a global representation of the input text as a single embedding. In more detail, a text composed of 𝑚 terms given as input to BERT will produce 𝑚 + 1 embeddings, one per input term plus one additional embedding for [CLS]. In single representation models, such as ANCE, the embedding corresponding to [CLS] is assumed to encode all possible information about the input text, including the possible semantic context of the composing terms. In contrast, in multiple representation models, such as ColBERT, each input term's embedding encodes its specific semantic information within the context of the entire input text.

Dense Retrieval. The embeddings produced by BERT models have recently demonstrated their promise as a suitable basis for dense retrieval. In dense retrieval, documents and queries are represented using embeddings. The embeddings of the documents in a collection can be pre-computed through the application of the learned BERT mapping, and stored into an index data structure for embeddings supporting nearest neighbour similarity search, as exemplified by the FAISS toolkit [10]. Depending on the number and dimensions of the embeddings stored in the index, advanced compression strategies, together with suitable nearest neighbour search algorithms, can be employed. In order to reduce the time required to identify the document embeddings most similar to a given input embedding, it is possible to shift from exact nearest neighbour search to approximate nearest neighbour search. While ANCE stores embeddings in an uncompressed format supporting exact search, ColBERT, given the larger number of document embeddings it has to store, resorts to compressed and quantised embeddings supporting approximate search. However, the approximate similarity scores produced by approximate search are not used by the ColBERT implementation to compute the final top documents to return for a given query [13] (indeed, in [13] we show that these approximate scores can allow a high-recall but low-precision ranking to be obtained, which can be used to apply rank cut-offs to the candidate set). Hence, ColBERT uses approximate search over compressed embeddings to identify a candidate set of documents, which are then re-scored using an index with direct lookup for retrieving the candidate documents' embeddings, to obtain the final ranking of documents returned to the user.

Research Questions. In this work, we aim to compare the single and the multiple embedding representations, leveraging the ANCE and ColBERT implementations. Indeed, the ANCE paper [12] did not include an effectiveness comparison with ColBERT. ANCE embodies a recent single representation approach, where we have a single large embedding per query/document, which can be processed with exact similarity search in a single stage. In contrast, in the multiple representation approach (ColBERT), we have a smaller-sized embedding for each term in the queries/documents, but, due to the large number of embeddings, they must be processed using approximate similarity search. Thereafter, the candidate set must be re-ranked to compute the exact similarity scores. The need for ColBERT to re-score all documents in the candidate set necessitates storing all document embeddings in memory. As noted by Lin et al. [2], this presents a significant storage overhead.
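The contrast between the two retrieval regimes can be sketched as follows. This is a toy illustration using random vectors and the FAISS Python API, not ANCE's or ColBERT's actual code; the index parameters (nlist, number of sub-quantisers, candidate depth) are illustrative assumptions.

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)

# --- Single representation (ANCE-style): one embedding per passage, exact search ---
passage_cls = rng.standard_normal((1000, 768)).astype('float32')  # stand-in [CLS] embeddings
flat = faiss.IndexFlatIP(768)               # uncompressed, exact inner-product index
flat.add(passage_cls)
query_cls = rng.standard_normal((1, 768)).astype('float32')
scores, pids = flat.search(query_cls, 10)   # final ranking obtained in a single stage

# --- Multiple representations (ColBERT-style): one embedding per token, two stages ---
def maxsim(query_emb, doc_emb):
    """Late-interaction score: sum over query embeddings of the maximum
    dot product with any of the document's token embeddings."""
    return float((query_emb @ doc_emb.T).max(axis=1).sum())

dim = 128
doc_lens = rng.integers(20, 180, size=200)
doc_embs = [rng.standard_normal((n, dim)).astype('float32') for n in doc_lens]
query_emb = rng.standard_normal((32, dim)).astype('float32')   # 32 query embeddings

all_tokens = np.vstack(doc_embs)
token_to_doc = np.repeat(np.arange(len(doc_embs)), doc_lens)

# Stage 1: approximate search over compressed/quantised token embeddings.
coarse = faiss.IndexFlatIP(dim)
ivfpq = faiss.IndexIVFPQ(coarse, dim, 64, 16, 8)   # nlist=64, 16 sub-quantisers, 8 bits each
ivfpq.train(all_tokens)                  # trained on all tokens here; ColBERT uses a 5% sample
ivfpq.add(all_tokens)
_, nn_ids = ivfpq.search(query_emb, 10)  # nearest document tokens per query embedding
candidates = set(token_to_doc[nn_ids.ravel()])

# Stage 2: exact max-sim re-scoring of the candidates, which requires direct
# access to the uncompressed document embeddings.
ranking = sorted(candidates, key=lambda d: maxsim(query_emb, doc_embs[d]), reverse=True)
print(pids[0][:5], ranking[:5])
```

With one vector per passage, the first index stays comparatively small and the search is exact; the second regime needs the full set of uncompressed token embeddings at re-scoring time, which is the storage overhead discussed above.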
This underlines the importance of an in-depth analysis of the pros and cons of both approaches, in particular:

• RQ1. What is the effectiveness of single and multiple representations in dense retrieval, in terms of MAP, NDCG@10 and MRR@10?
• RQ2. What are the relative gains and losses of single and multiple representations w.r.t. a common baseline such as BM25?
• RQ3. For which queries are single representations better than multiple representations, and vice-versa?

In Section 3, we perform comparative experiments to address these research questions.

3. Experiments

In the following, we report our experimental setup, followed by analyses for RQs 1-3.

3.1. Setup

Our experiments use the MSMARCO passage ranking dataset (8.8M passages) and build upon our PyTerrier IR experimentation platform [14, 15]. We adapt the ANCE implementation (https://github.com/microsoft/ANCE) and the ColBERT implementation (https://github.com/stanford-futuredata/ColBERT/tree/v0.2) provided by their respective authors, using integrations with PyTerrier (see https://github.com/terrierteam/pyterrier_ance and https://github.com/terrierteam/pyterrier_colbert). We use the provided ANCE model for the MSMARCO passage ranking dataset. We train ColBERT using the same MSMARCO passage ranking training triples file for 44,500 batches. In particular, we follow [12] and [3] for the settings of ANCE and ColBERT, as summarised in Table 1. Of note, while ColBERT fine-tunes the bert-base-uncased BERT model, ANCE fine-tunes a RoBERTa model [16] (specifically roberta-base), which is reported to apply more refined pre-training than BERT. To try to eliminate model choice as a confounding factor, we also trained a version of ColBERT by fine-tuning roberta-base. We found that, even after training for 300k batches (6× longer than we trained ColBERT using bert-base-uncased), this latter model had relative performance 25% lower than the BERT-based ColBERT model (around 0.533 NDCG@10). Hence, we discarded the RoBERTa-based ColBERT model from further consideration. On the other hand, all of the released ANCE models use RoBERTa; training ANCE requires multiple GPUs (e.g., 16) and has not, to the best of our knowledge, been reproduced. Hence, we argue that, as the RoBERTa-based ANCE and the BERT-based ColBERT are individually shown to be effective by their respective authors, the comparison of these representative models still allows for interesting observations.

We index the corpus using the code provided by the authors. Table 1 reports the statistics of the resulting indices. In particular, the ANCE document index is stored in FAISS using the uncompressed IndexFlatIP format. The ColBERT document index is stored in FAISS using the compressed and quantised IndexIVFPQ format, which is trained on a random 5% sample of the document embeddings. Mean response times for both ANCE and ColBERT, and their memory consumption, are also shown in Table 1.

Table 1: Salient statistics of the ANCE and ColBERT setups.

                          ANCE            ColBERT
  Representation          single          multiple
  Base model              roberta-base    bert-base-uncased
  # emb. per query        1               32
  # emb. per passage      1               up to 180
  Emb. dimensions         768             128
  FAISS index size        26GB            16GB
  Embedding index size    –               176GB
  Mean Response Time      211ms           635ms

For evaluating effectiveness, we use the publicly available query sets with relevance assessments: 5,000 queries sampled from the MSMARCO Dev set – which contain on average 1.1 judgements per query – as well as the TREC 2019 query set, which contains 43 queries with an average of 215.3 judgements per query. To measure effectiveness, we employ MRR@10 for the MSMARCO Dev set (the metric recommended by the track organisers for this query set), and MRR@10, NDCG@10 and MAP for the TREC query set. To examine gains and losses, for each query and each effectiveness metric, we examine the comparative reward (improvement) and risk (degradation) over a BM25 baseline (following [17]), as well as the number of wins & losses (improved and degraded queries).

3.2. Overall Comparison

Table 2 reports the effectiveness metrics of BM25, ANCE and ColBERT computed on the TREC 2019 and the sample of the MSMARCO Dev query sets.

Table 2: Effectiveness metrics of BM25, ANCE and ColBERT on different query sets. Points marked with △ and ▲ denote a significant increase in effectiveness compared to BM25 and ANCE, respectively, according to a paired t-test with Bonferroni correction (p-value < 0.05).

                     TREC 2019                      MSMARCO Dev
            MAP        NDCG@10    MRR@10            MRR@10
  BM25      0.2864     0.4795     0.6410            0.1836
  ANCE      0.3715△    0.6537△    0.8574△           0.3292△
  ColBERT   0.4309△▲   0.6934△    0.8527△           0.3519△▲

As expected, both the ANCE and ColBERT dense retrieval approaches are significantly better than BM25 for the NDCG@10 and MRR@10 metrics on both query sets. Comparing the two dense retrieval approaches, for MAP, ColBERT significantly outperforms ANCE; for NDCG@10, ColBERT improves over ANCE by 6% (0.6537→0.6934), but not significantly so; for MRR@10, ANCE is slightly (but not significantly) better than ColBERT on the TREC 2019 query set, while ColBERT is statistically better than ANCE on MSMARCO Dev by +7%. Overall, for RQ1, we conclude that multiple representations, as employed by ColBERT, experimentally obtain better effectiveness than single representations (as employed by ANCE), exhibiting a significant boost in effectiveness for MAP (TREC 2019) and MRR@10 (Dev). The most striking difference is for MAP on TREC 2019, where ColBERT markedly outperforms ANCE (and BM25); this observation suggests that the single representation is not sufficiently good at attaining high recall.

3.3. Comparison using a Common Baseline

Next, we investigate the comparative effectiveness of ANCE and ColBERT from the perspective of using BM25 as the reference point, going further than reporting average performances over the entire query sets as reported in Table 2. To perform this analysis, we define the difficulty of a query according to an effectiveness metric on the BM25 baseline, following Mothe et al. [18]. Due to the sparsity of the relevance judgements and the official evaluation metrics of the two query sets, we adopt a different query difficulty classification for TREC 2019 and MSMARCO Dev. For the TREC 2019 query set, a query is considered hard, resp. easy, for the BM25 baseline system if its NDCG@10 (the official TREC metric in [19]) value is in the first quartile, resp. in the fourth quartile, and medium otherwise. For the MSMARCO query set, the official metric MRR@10 per query is too sparse to allow percentile computations. Hence, we consider a Dev query to be hard if its MRR@10 is less than or equal to 0.1, and easy otherwise.
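As an illustration, the difficulty classification and the per-query reward/risk bookkeeping described above could be computed along the following lines. The per-query metric values are assumed to be available as dictionaries (e.g. exported from an evaluation tool), the quartile edge cases reflect our reading of the definition above, and the toy values at the bottom are invented.

```python
import numpy as np

def classify_trec(bm25_ndcg10):
    """Label each TREC 2019 query easy/medium/hard from its BM25 NDCG@10:
    hard = first quartile, easy = fourth quartile, medium otherwise."""
    values = np.array(list(bm25_ndcg10.values()))
    q1, q3 = np.percentile(values, [25, 75])
    return {qid: ('hard' if v <= q1 else 'easy' if v >= q3 else 'medium')
            for qid, v in bm25_ndcg10.items()}

def classify_dev(bm25_mrr10):
    """Label each MSMARCO Dev query: hard if BM25 MRR@10 <= 0.1, easy otherwise."""
    return {qid: ('hard' if v <= 0.1 else 'easy') for qid, v in bm25_mrr10.items()}

def reward_risk(baseline, system):
    """Wins/losses and average reward/risk of a system w.r.t. a baseline,
    computed over per-query effectiveness values (in the spirit of [17])."""
    deltas = {qid: system[qid] - baseline[qid] for qid in baseline}
    wins   = [d for d in deltas.values() if d > 0]
    losses = [d for d in deltas.values() if d < 0]
    return {'W': len(wins), 'L': len(losses),
            'reward': float(np.mean(wins)) if wins else 0.0,
            'risk':   float(np.mean(losses)) if losses else 0.0}

# Toy usage with hypothetical per-query values (qid -> metric value).
bm25  = {'q1': 0.10, 'q2': 0.45, 'q3': 0.80, 'q4': 0.05}
dense = {'q1': 0.30, 'q2': 0.40, 'q3': 0.85, 'q4': 0.20}
print(classify_trec(bm25))
print(reward_risk(bm25, dense))
```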
We partition the queries in each query set according to the corresponding difficulty classification, and compute for how many queries the effectiveness of ANCE and ColBERT is higher (denoted by W(in)) or lower (denoted by L(oss)) than BM25. For each partition, we also compute the average reward and risk associated with the W and L queries, following [17]. Table 3 reports the observed results.

Table 3: Comparative performances w.r.t. BM25; queries are classified based on BM25 performance (easy/medium/hard); Wins and Losses as well as Reward and Risk are calculated w.r.t. BM25 performance.

                        ANCE                         ColBERT
  Type     Num    W/L        reward/risk        W/L        reward/risk
  TREC 2019 – NDCG@10
  Easy     11     5/6        +0.1930/-0.1976    5/6        +0.1827/-0.1380
  Medium   21     17/4       +0.2978/-0.1366    18/3       +0.3053/-0.1521
  Hard     11     9/1        +0.3750/-0.1826    10/1       +0.4114/-0.0415
  MSMARCO Dev – MRR@10
  Easy     1954   828/712    +0.4735/-0.4001    854/673    +0.4778/-0.3887
  Hard     3076   1372/24    +0.4543/-0.1       1455/21    +0.4793/-0.1

For the TREC 2019 queries, both ANCE and ColBERT exhibit approximately the same number of wins/losses for each query difficulty level. However, ANCE obtains higher rewards and higher risks on the class of easy queries than ColBERT (+0.1930 vs. +0.1827 and -0.1976 vs. -0.1380). On the medium difficulty class, the situation is reversed, and ColBERT obtains both higher rewards and higher risks than ANCE (+0.3053 vs. +0.2978 and -0.1521 vs. -0.1366). On the hard difficulty class, ColBERT is markedly superior to ANCE in terms of reward (+0.4114 vs. +0.3750), and risk, even if such risk is computed over a single query. For the MSMARCO Dev queries, ColBERT is able to improve the MRR@10 of both easy and hard queries better than ANCE, and the losses are smaller for ColBERT than for ANCE.

To conclude on RQ2, we have presented experimental evidence that single and multiple representations are approximately as effective as each other on easy queries. In contrast, for hard queries, the adoption of multiple embeddings helps w.r.t. the usage of a single embedding. We explain this by noting that a single representation is learned to compress all semantic information and dependencies of the different tokens composing a query into a single embedding. On the other hand, multiple representations – using one embedding per query token together with additional masked tokens – can encode more diverse semantic information in the different embeddings, allowing more relevant documents to be retrieved for queries that are hard to answer.

Figure 1: Difference in NDCG@10 for queries in the TREC 2019 query set; differences smaller than 0.15 absolute are omitted. Positive differences are where ColBERT exceeds ANCE.

3.4. Per-query Comparison

To address RQ3, in Figure 1 we present a per-query histogram comparing the ΔNDCG@10 between ColBERT and ANCE on the TREC 2019 query set; positive deltas indicate a higher NDCG@10 for ColBERT than ANCE. In total, ColBERT outperforms ANCE for 24 queries, while the opposite was true for 17 queries; Δs less than 0.15 are omitted for clarity.
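For reference, a minimal sketch of how such a per-query comparison could be assembled, assuming per-query NDCG@10 values for the two systems are available as dictionaries keyed by query id; the example values below are invented.

```python
def per_query_deltas(colbert_ndcg10, ance_ndcg10, min_abs_delta=0.15):
    """Difference in NDCG@10 (ColBERT minus ANCE) per query, keeping only
    queries whose absolute difference reaches min_abs_delta, as in Figure 1."""
    deltas = {qid: colbert_ndcg10[qid] - ance_ndcg10[qid] for qid in ance_ndcg10}
    kept = {qid: d for qid, d in deltas.items() if abs(d) >= min_abs_delta}
    # Sort from most pro-ColBERT to most pro-ANCE, as plotted in the histogram.
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical per-query values (qid -> NDCG@10).
colbert = {'527433': 0.74, '1063750': 0.31, '443396': 0.62}
ance    = {'527433': 0.41, '1063750': 0.55, '443396': 0.60}
print(per_query_deltas(colbert, ance))   # only queries with |delta| >= 0.15 remain
```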
On analysing Figure 1, it appears that many queries requesting a definition perform well for ColBERT (e.g. 1124210, 490595). Indeed, on closer inspection of the TREC 2019 query set, out of 43 queries, we found 19 such definitional queries – of which 16 were more effective for ColBERT. To illustrate other differences between the approaches, in Table 4 we select two non-definitional queries where one approach markedly outperformed the other (but not the queries with the most extreme deltas, which may be outliers).

Table 4: Examples of passages retrieved by ANCE and ColBERT at top ranks. The Label column contains the assessment of that document for that query in the qrel file, with – denoting unjudged.

ColBERT > ANCE – query 527433: types of dysarthria from cerebral palsy

  Document 8617271 (ColBERT, Rank 1, Label 3): There are three major types of dysarthria in cerebral palsy: spastic, dyskinetic (athetosis) and ataxic. Speech impairments in spastic dysarthria involves four major abnormalities of voluntary movement: spasticity, weakness, limited range of motion and slowness of movement.

  Document 8306451 (ANCE, Rank 2, Label 0): The types of cerebral palsy are: 1 spastic: the most common type of cerebral palsy; reflexes are exaggerated and muscle movement is stiff. 2 dyskinetic: dyskinetic cerebral palsy is divided into two categories.

ANCE > ColBERT – query 1063750: why did the us volunterilay enter ww1

  Document 1300452 (ColBERT, Rank 2, Label –): The main event that led the US to entering ww2 was Japan bombing Pearl Harbor. The day after the bombing u.s. joined the war On December 7, 1941, the Japanese Navy launched a surprise attack on the naval base at Pearl Harbor, Hawaii. Although the growing peril of Britain worried many, including Roosevelt, it was not until the US was directly attacked at Pearl Harbor that public and political opinion turned in favor of war with the Axis

  Document 7952971 (ANCE, Rank 1, Label 2): The U.S entered WW1 for several reasons. The U.S entered for two main reasons: one was that the Germans had declared unlimited German submarine warfare and the Zimmermann note. The German had totally disregarded the international laws protecting neutral nation's ships by sinking neutral ships. This note was the last straw, causing Wilson to join the war. The Zimmermann note and unlimited German submarine warfare were two of the biggest cause for the U.S to join the Allies and go to war with Germany. During the war Germany...

Firstly, for query 527433 ('types of dysarthria from cerebral palsy'), ColBERT identifies a passage that clearly answers the query; in contrast, the non-relevant passage identified at rank 2 by ANCE appears to have focused solely on the 'cerebral palsy' aspect, omitting the dysarthria aspect of the query. Indeed, the Precision@10 of ANCE for this query was 3/10, compared to 6/10 for ColBERT. This suggests that ANCE's compression of a complex information need into one embedding has caused an information loss, with the model focusing on only a single aspect of the query, resulting in low effectiveness. On the other hand, for query 1063750 ('why did the us volunterilay enter ww1'), ANCE identified a relevant passage, but ColBERT identified a passage (1300452) focusing entirely on the wrong World War ('ww2' rather than 'ww1'). At least some of the reason for this conflation of meanings is that neither 'ww1' nor 'ww2' appears in BERT's fixed vocabulary; e.g., the latter is tokenised into word pieces as 'w', '##w', '##2'.
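This word-piece behaviour can be inspected directly with the tokenizer shipped with bert-base-uncased (here via the HuggingFace transformers library); the exact split depends on the vocabulary distributed with that model, so the commented output reflects the split reported in the text rather than a guarantee.

```python
from transformers import AutoTokenizer

# WordPiece tokenizer of bert-base-uncased, ColBERT's base model in our setup.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for term in ["ww1", "ww2"]:
    # Neither term is a single vocabulary entry, so each is split into word pieces,
    # e.g. 'ww2' -> ['w', '##w', '##2'] as noted above.
    print(term, "->", tokenizer.tokenize(term))
```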
Hence, distinguishing between 'ww1' and 'ww2' information needs requires context to be distributed across the three embeddings. To analyse this passage further, Figure 2 shows the ColBERT interaction between the query and document embeddings for this passage and query (this figure can be reproduced using the explain_text() function within our PyTerrier_ColBERT library). In the figure, the darker shading in the matrix is indicative of higher similarity; the highest similarity that is selected for a given query embedding by the max-sim operator is indicated by a × symbol; the histogram at the top of the figure indicates the contribution of each query embedding to the final passage score.

Figure 2: ColBERT interaction between query and document embeddings for query 1063750 and passage 1300452 (see Table 4). In the interaction matrix, darker shading is indicative of higher similarity; the document embedding (row) with the highest similarity for each query embedding (column) is indicated with a × symbol. The histogram at the top portrays the contribution of each query embedding to the final score of the passage, with shading also indicative of the magnitude of contribution.

Indeed, inspection of the max similarities for this passage shows that the highest contribution to the passage's score comes from the '##w' token, with the '##1' query embedding being highly similar to the '##2' document embedding. This suggests that the embeddings for '##1' and '##2' are not sufficiently contextualised when following '##w', or that ColBERT's max similarity computation could be adapted to better address proximity. In contrast, ANCE retrieved passage 1300452 at rank 155, showing that its single representations for the passages sufficiently distinguish between World War 1 and World War 2.

In summary, in addressing RQ3, we observed that there exist some large differences between ANCE and ColBERT for some queries. Our analysis found that ColBERT performs better than ANCE for definitional-type queries. Moreover, our analysis suggests that, in ANCE, the use of a single embedding representation risks misinterpreting complex queries with multiple aspects, as shown by the results in the previous subsection; for ColBERT, the max similarity operator can overly focus on highly similar embeddings, at the risk of misinterpreting a query.

4. Conclusions

Despite their recency, dense passage retrieval approaches have the effectiveness potential to supplant the traditional inverted index data structure. Yet, different families of dense retrieval are emerging, for which the comparative advantages and disadvantages are not yet clear. In this work, we made a systematic study of single vs. multiple representation dense retrieval approaches, namely ANCE and ColBERT. We found that, while both significantly outperformed BM25 baselines across various metrics, ColBERT significantly outperformed ANCE for MAP on TREC 2019 and MRR@10 on the MSMARCO Dev query set, was more effective for queries that BM25 found hard, and was better at definitional queries as well as queries with complex information needs.
On the other hand, ANCE has desirable qualities in terms of mean response time and memory occupancy (see Table 1). We postulate that research should be directed toward hybrid solutions, either by reducing the size of the ColBERT embedding index, e.g., through adaptations of static pruning, or by using multiple embeddings within ANCE for complex queries/passages.

Acknowledgements

Nicola Tonellotto was partially supported by the Italian Ministry of Education and Research (MIUR) in the framework of the CrossLab project (Departments of Excellence). Craig Macdonald and Iadh Ounis acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for Complex, Computationally- & Data-Intensive Analytics.

References

[1] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proc. NAACL, 2019.
[2] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, 2020. arXiv:2010.06467.
[3] O. Khattab, M. Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, in: Proc. SIGIR, 2020, pp. 39–48.
[4] S. Hofstätter, A. Hanbury, Let's measure run time! Extending the IR replicability infrastructure to include performance aspects, in: OSIRRC@SIGIR, 2019.
[5] H. Zamani, M. Dehghani, W. B. Croft, E. Learned-Miller, J. Kamps, From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing, in: Proc. CIKM, 2018, pp. 497–506.
[6] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: Contextualized embeddings for document ranking, in: Proc. SIGIR, 2019, pp. 1101–1104.
[7] S. MacAvaney, F. M. Nardini, R. Perego, N. Tonellotto, N. Goharian, O. Frieder, Efficient document re-ranking for transformers by precomputing term representations, in: Proc. SIGIR, 2020, pp. 49–58.
[8] S. MacAvaney, F. M. Nardini, R. Perego, N. Tonellotto, N. Goharian, O. Frieder, Expansion via prediction of importance with contextualization, in: Proc. SIGIR, 2020, pp. 1573–1576.
[9] Z. Dai, J. Callan, Deeper text understanding for IR with contextual neural language modeling, in: Proc. SIGIR, 2019, pp. 985–988.
[10] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, 2017. arXiv:1702.08734.
[11] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proc. EMNLP, 2020, pp. 6769–6781.
[12] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proc. ICLR, 2021.
[13] C. Macdonald, N. Tonellotto, On approximate nearest neighbour selection for multi-stage dense retrieval, in: Proc. CIKM, 2021.
[14] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using PyTerrier, in: Proc. ICTIR, 2020, pp. 161–168.
[15] C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval, in: Proc. CIKM, 2021.
[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. arXiv:1907.11692.
[17] L. Wang, P. N. Bennett, K. Collins-Thompson, Robust ranking models via risk-sensitive optimization, in: Proc. SIGIR, 2012, pp. 761–770.
[18] J. Mothe, L. Laporte, A.-G. Chifu, Predicting Query Difficulty in IR: Impact of Difficulty Definition, in: Proc. KSE, 2019, pp. 1–6.
[19] N. Craswell, B. Mitra, D. Campos, E. Yilmaz, Overview of the TREC 2019 Deep Learning Track, in: Proc. TREC 2019, 2020.