<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Information Retrieval Workshop, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On Single and Multiple Representations in Dense Passage Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Glasgow</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The advent of contextualised language models has brought gains in search effectiveness, not just when applied for re-ranking the output of classical weighting models such as BM25, but also when used directly for passage indexing and retrieval, a technique called dense retrieval. In the existing literature on neural ranking, two dense retrieval families have become apparent: single representation, where an entire passage is represented by a single embedding (usually that of BERT's [CLS] token, as exemplified by the recent ANCE approach), and multiple representations, where each token in a passage is represented by its own embedding (as exemplified by the recent ColBERT approach). These two families have not been directly compared. However, because of the likely importance of dense retrieval moving forward, a clear understanding of their advantages and disadvantages is paramount. To this end, this paper contributes a direct study of their comparative effectiveness, noting situations where each method under/over-performs w.r.t. the other, and w.r.t. a BM25 baseline. We observe that, while ANCE is more efficient than ColBERT in terms of response time and memory usage, multiple representations are statistically more effective than single representations for MAP and MRR@10. We also show that multiple representations obtain better improvements than single representations for queries that are the hardest for BM25, as well as for definitional queries, and those with complex information needs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Pre-trained contextualised language models such as BERT have been shown to greatly improve
retrieval effectiveness over the previous state-of-the-art methods in many information retrieval
(IR) tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These contextualised language models are able to learn semantic representations
called embeddings from the contexts of words and, therefore, better capture the relevance of a
document w.r.t. a query, with substantial improvements over the classical approach in the
ranking and re-ranking of documents [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Most BERT-based models are computationally expensive
for estimating query-document similarities in ranking, due to the complexity of the underlying
transformer neural network [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. As such, BERT-based ranking models have been used
as second-stage rankers in retrieval cascades, in particular to re-rank candidate documents
generated by classical relevance models such as BM25 [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. BERT-based models are also
limited in the length of text that they can process, and hence are often applied on passages
rather than full documents (which we focus on in this paper); entire document rankings can
be obtained by estimating relevance at a passage level, then aggregating [9].
      </p>
      <p>
        Recently, several works have proposed investigating whether BERT-based systems are able
to identify the relevant passages among all passages in a collection, rather than just among a
query-dependent sample; these systems represent a new type of retrieval approach called
dense retrieval. In dense retrieval, passages are represented by real-valued vectors, while the
query-document similarity is computed by deploying efficient nearest neighbour techniques
over specialised indexes, such as those provided by the FAISS toolkit [10]. Thus far, two different
families of dense retrieval approaches have emerged, based on single representation and
multiple representations. In particular, DPR [11] and ANCE [12] use a single representation, by
indexing only the embedding of BERT's [CLS] token, which is therefore assumed to represent
the meaning of an entire passage within that single embedding. At retrieval time, the [CLS]
embedding of the query is then used to retrieve passages by identifying nearest neighbours
using a FAISS index. In contrast, ColBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which uses multiple representations, indexes an
embedding for each token in each document. At retrieval time, a set of the nearest document
embeddings to each query embedding is retrieved, by identifying the approximate nearest
neighbours from a FAISS index. These passages must then be exactly scored, based on the
maximal similarity between the query and the passage embeddings, to obtain the final ranking.
      </p>
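      <p>To make the contrast concrete, the following is a minimal sketch (in NumPy, with random
vectors standing in for real embeddings; this is not the authors' code, and the dimensions are
illustrative assumptions) of the two scoring schemes: a single inner product between [CLS]
embeddings for the single representation family, and the sum of per-query-embedding maximum
similarities (the max-sim operator) for the multiple representations family.</p>
      <preformat>
import numpy as np

def single_rep_score(q_cls, d_cls):
    """Single representation (DPR/ANCE): one [CLS] embedding per text;
    relevance is the inner product of the two embeddings."""
    return float(np.dot(q_cls, d_cls))

def multi_rep_score(Q, D):
    """Multiple representations (ColBERT): Q is (|q| x dim), D is (|d| x dim).
    Each query embedding contributes its maximum similarity over all
    document embeddings; the contributions are summed."""
    sim = Q @ D.T                        # (|q| x |d|) interaction matrix
    return float(sim.max(axis=1).sum())  # max-sim per query embedding, summed

rng = np.random.default_rng(0)
# Illustrative sizes only: one 768-dim vector per text, vs. one smaller
# (here 128-dim) vector per token.
q_cls, d_cls = rng.normal(size=768), rng.normal(size=768)
Q, D = rng.normal(size=(32, 128)), rng.normal(size=(180, 128))
print(single_rep_score(q_cls, d_cls), multi_rep_score(Q, D))
      </preformat>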
      <p>These are two markedly different families of dense retrieval approaches. Indeed, as ColBERT
records one embedding for every token, this makes for a large index of embeddings, which
may allow a richer semantic representation of the content. On the other hand, DPR and ANCE
rely on a single embedding sufficiently representing the content of each passage. However, at
the time of writing, no systematic study has compared these two families of dense retrieval
approaches. For this reason, this work contributes a first investigation into the effectiveness
of single and multiple representation embeddings for dense retrieval, as exemplified by ANCE
and ColBERT, respectively. We perform experiments in a controlled environment using the
same collection and query sets, and we report several effectiveness metrics, together with a
detailed comparison of the results obtained for the two representation families w.r.t. a common
baseline, namely BM25. To derive further insights, we also provide a per-query analysis of the
effectiveness of single and multiple representations. We observe that, while ANCE is more
efficient than ColBERT in terms of response time and memory usage, multiple representations
are statistically more effective than single representations for MAP and MRR@10. We also
show that multiple representations obtain better improvements than single representations
for queries that are the hardest for BM25, as well as for definitional queries, and those with
complex information needs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Statement</title>
      <p>Embeddings. Contextualised language models such as BERT have been trained on a large
corpus for language understanding, and are then fine-tuned on smaller, more specific textual
collections targeting a particular IR task. Through this fine-tuning, BERT learns how to map
texts, either queries or documents, into a multi-dimensional space of one or more vectors, called
embeddings. Both queries and documents are tokenised into terms according to a predefined
vocabulary; BERT learns a function mapping the tokens in a query into multiple query embeddings,
one per query term, and another potentially different function mapping the tokens in a document
into a document embedding per term. BERT and its derived models also make use of special
tokens, such as the [CLS] (classification) token, the [SEP] (separator) token, and the [MASK]
(masked) token. In particular, [CLS] is always placed at the beginning of any text given as input
to BERT, both at training and inference time, and is used to let BERT learn a global representation
of the input text as a single embedding. In more detail, a text composed of n terms given as input
to BERT will produce n + 1 embeddings, one per input term plus one additional embedding for
[CLS]. In single representation models, such as ANCE, the embedding corresponding to [CLS] is
assumed to encode all possible information about the input text, including the possible semantic
context of the composing terms. In contrast, in multiple representation models, such as ColBERT,
each input term's embedding encodes its specific semantic information within the context of
the entire input text.</p>
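      <p>A minimal sketch of this behaviour, assuming the HuggingFace transformers library and
the bert-base-uncased checkpoint (note that the tokeniser also appends a [SEP] token, so a
3-term text actually yields 5 output embeddings):</p>
      <preformat>
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tokenizer("dense passage retrieval", return_tensors="pt")
out = model(**enc)

print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
# ['[CLS]', 'dense', 'passage', 'retrieval', '[SEP]']
print(out.last_hidden_state.shape)   # (1, 5, 768): one embedding per token

cls_embedding = out.last_hidden_state[:, 0]       # single representation
token_embeddings = out.last_hidden_state[:, 1:]   # multiple representations
      </preformat>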
      <p>Dense Retrieval. The embeddings produced by BERT models have recently demonstrated
their promise as a suitable basis for dense retrieval. In dense retrieval, documents and
queries are represented using embeddings. The embeddings of the documents in a
collection can be pre-computed through the application of the learned BERT mapping, and stored
in an index data structure for embeddings that supports nearest neighbour similarity search,
as exemplified by the FAISS toolkit [10]. Depending on the number and dimensions of the
embeddings stored in the index, advanced compression strategies, together with suitable
nearest neighbour search algorithms, can be employed. In order to reduce the time required
to identify the most similar document embeddings to a given input embedding, it is possible
to shift from exact nearest neighbour search to approximate nearest neighbour search. While
ANCE stores embeddings in an uncompressed format supporting exact search, ColBERT, given
the larger number of document embeddings it has to store, resorts to compressed and
quantised embeddings supporting approximate search. However, the approximate similarity scores
produced by approximate search are not used by the ColBERT implementation to compute the
final top documents to return for a given query [13] (indeed, in [13] we show that these
approximate scores can allow a high-recall but low-precision ranking to be obtained, which can
be used to apply rank cutoffs to the candidate set). Hence, ColBERT uses approximate search
over compressed embeddings to identify a candidate set of documents, which are then re-scored
using an index with direct lookup for retrieving the candidate documents' embeddings, to obtain
the final ranking of documents returned to the user.</p>
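      <p>The following is a minimal FAISS sketch of the two index layouts just described, using
random vectors and hypothetical sizes (the actual index configurations used in our experiments
are summarised in Table 1):</p>
      <preformat>
import faiss
import numpy as np

dim = 128
xb = np.random.random((100_000, dim)).astype("float32")  # document embeddings
xq = np.random.random((1, dim)).astype("float32")        # one query embedding

# Exact inner-product search over uncompressed vectors (as used for ANCE).
flat = faiss.IndexFlatIP(dim)
flat.add(xb)

# Approximate search with an inverted file and product quantisation
# (as used for ColBERT's first-pass candidate generation).
ivfpq = faiss.index_factory(dim, "IVF1024,PQ16", faiss.METRIC_INNER_PRODUCT)
ivfpq.train(xb[:5_000])   # trained on a sample of the embeddings
ivfpq.add(xb)
ivfpq.nprobe = 8          # number of inverted lists probed per query

D_exact, I_exact = flat.search(xq, 10)
# Approximate scores: used only to shortlist candidates, which are then
# re-scored exactly from the uncompressed embeddings.
D_approx, I_approx = ivfpq.search(xq, 10)
      </preformat>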
      <p>
        Research Questions. In this work, we aim to compare the single and the multiple embedding
representations, leveraging the ANCE and ColBERT implementations. Indeed, the ANCE paper [12]
did not include an effectiveness comparison with ColBERT. ANCE embodies a recent
single representation approach, where we have a single large embedding per query/document,
which can be processed with exact similarity search in a single stage. In contrast, in the multiple
representation approach (ColBERT), we have a smaller-sized embedding for each term in the
queries/documents, but, due to the large number of embeddings, they must be processed using
approximate similarity search. Thereafter, the candidate set must be re-ranked to compute
the exact similarity scores. The need for ColBERT to re-score all documents in the candidate
set necessitates storing all document embeddings in memory. As noted by Lin et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], this
presents a significant storage overhead. This underlines the importance of an in-depth analysis
of the pros and cons of both approaches, in particular:
• RQ1. What is the effectiveness of single and multiple representations in dense retrieval, in
terms of MAP, NDCG@10 and MRR@10?
• RQ2. What are the relative gains and losses of single and multiple representations w.r.t. a
common baseline such as BM25?
• RQ3. For which queries are single representations better than multiple representations, and
vice-versa?
In Section 3, we perform comparative experiments to address these research questions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Setup</title>
        <p>
          In the following, we report our experimental setup, followed by analyses for RQs 1-3.
Our experiments use the MSMARCO passage ranking dataset, a dataset of 8.8M passages,
and build upon our PyTerrier IR experimentation platform [14, 15]. We adapt the ANCE
implementation2 and the ColBERT implementation3 provided by their respective authors, using
integrations with PyTerrier4. We use the provided ANCE model for the MSMARCO passage
ranking dataset. We train ColBERT using the same MSMARCO passage ranking training triples
file for 44,500 batches. In particular, we follow [12] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for the settings of ANCE and
ColBERT, as summarised in Table 1.
      </p>
        <p>Of note, while ColBERT fine-tunes the bert-base-uncased BERT model, ANCE fine-tunes
a RoBERTa model [16] (specifically roberta-base), which is reported to apply more refined
pre-training than BERT. To try to eliminate model choice as a confounding factor, we also
trained a version of ColBERT by fine-tuning roberta-base. We found that, even after training
for 300k batches (6× longer than we trained ColBERT using bert-base-uncased), this latter
model achieved relative performance 25% lower than the BERT-based ColBERT model (around
0.533 NDCG@10). Hence, we discarded the RoBERTa-based ColBERT model from further
consideration. On the other hand, all of the released ANCE models use RoBERTa; training ANCE
requires multiple GPUs, e.g., 16, and has not, to the best of our knowledge, been reproduced.
Hence, we argue that, as the RoBERTa-based ANCE and BERT-based ColBERT are individually
shown to be effective by their respective authors, the comparison of these representative models
still allows for interesting observations.</p>
      <p>We index the corpus using the code provided by the authors. Table 1 reports the statistics
of the resulting indices. In particular, the ANCE document index is stored in FAISS using the
uncompressed IndexFlatIP format. The ColBERT document index is stored in FAISS using the
compressed and quantised IndexIVFPQ format, which is trained on a random 5% sample of the
document embeddings. Mean response times for both ANCE and ColBERT, and their memory
consumption, are also shown in Table 1.</p>
      <p>2https://github.com/microsoft/ANCE
3https://github.com/stanford-futuredata/ColBERT/tree/v0.2
4See https://github.com/terrierteam/pyterrier_ance and https://github.com/terrierteam/pyterrier_colbert.</p>
        <p>For evaluating effectiveness, we use the publicly available query sets with relevance
assessments: 5000 queries sampled from the MSMARCO Dev set – which contain on average 1.1
judgements per query – as well as the TREC 2019 query set, which contains 43 queries with
an average of 215.3 judgements per query. To measure effectiveness, we employ MRR@10 for
the MSMARCO Dev set (the metric recommended by the track organisers for this query set),
and MRR@10, NDCG@10 and MAP for the TREC query set.</p>
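        <p>For illustration, the following hedged sketch shows how such an evaluation can be
declared in PyTerrier [14, 15], using the pyterrier_ance and pyterrier_colbert integrations
footnoted above; the checkpoint and index paths are placeholders, and the exact constructor
arguments may differ across versions of those packages:</p>
        <preformat>
import pyterrier as pt
pt.init()

from pyterrier_ance import ANCERetrieval
from pyterrier_colbert.ranking import ColBERTFactory

dataset = pt.get_dataset("msmarco_passage")
bm25 = pt.BatchRetrieve.from_dataset("msmarco_passage", "terrier_stemmed",
                                     wmodel="BM25")
ance = ANCERetrieval("/path/to/ance_checkpoint", "/path/to/ance_index")
colbert = ColBERTFactory("/path/to/colbert.dnn", "/path/to/index_root",
                         "msmarco_index").end_to_end()

# Significance tests (baseline=0) are computed against the first system, BM25.
pt.Experiment(
    [bm25, ance, colbert],
    dataset.get_topics("test-2019"),
    dataset.get_qrels("test-2019"),
    eval_metrics=["map", "ndcg_cut_10", "recip_rank"],
    names=["BM25", "ANCE", "ColBERT"],
    baseline=0,
)
        </preformat>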
        <p>To examine gains and losses, for each query and each effectiveness metric, we examine the
comparative reward (improvement) and risk (degradation) over a BM25 baseline (following [17]),
as well as the number of wins &amp; losses (improved and degraded queries).</p>
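        <p>A minimal sketch of this analysis (one plausible reading of the reward/risk definitions
of [17]), assuming dicts mapping query ids to per-query effectiveness values:</p>
        <preformat>
from statistics import mean

def wins_losses_reward_risk(system, baseline):
    """system/baseline: query-id -> per-query effectiveness (e.g. NDCG@10)."""
    deltas = [system[qid] - baseline[qid] for qid in baseline]
    wins = [d for d in deltas if d > 0]
    losses = [d for d in deltas if -d > 0]   # i.e. the negative deltas
    return {
        "W": len(wins),
        "L": len(losses),
        "reward": mean(wins) if wins else 0.0,    # mean gain on improved queries
        "risk": mean(losses) if losses else 0.0,  # mean loss on degraded queries
    }

bm25 = {"q1": 0.20, "q2": 0.55, "q3": 0.10}
dense = {"q1": 0.45, "q2": 0.50, "q3": 0.60}
print(wins_losses_reward_risk(dense, bm25))  # W=2, L=1, reward~0.375, risk~-0.05
        </preformat>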
      <sec id="sec-3-1">
        <title>3.2. Overall Comparison</title>
        <p>Table 2 reports the effectiveness metrics of BM25, ANCE and ColBERT computed on the TREC
2019 and the sampled MSMARCO Dev query sets. As expected, both the ANCE and
ColBERT dense retrieval approaches are significantly better than BM25 for the NDCG@10 and
MRR@10 metrics on both query sets. Comparing the two dense retrieval approaches, for MAP,
ColBERT significantly outperforms ANCE; for NDCG@10, ColBERT improves over ANCE by 6%
(0.6537→0.6934), but not significantly so; for MRR@10, ANCE is slightly (but not significantly)
better than ColBERT on the TREC 2019 query set, while ColBERT is statistically better than
ANCE on MSMARCO Dev by +7%. Overall, for RQ1, we conclude that multiple representations,
as employed by ColBERT, experimentally obtain better effectiveness than single representations
(as employed by ANCE), exhibiting a significant boost in effectiveness for MAP (TREC 2019) and
MRR@10 (Dev). The most striking difference is for MAP on TREC 2019, where
ColBERT markedly outperforms ANCE (and BM25); this observation suggests that the single
representation is not sufficiently good at attaining high recall.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Comparison using a Common Baseline</title>
        <p>Next, we investigate the comparative effectiveness of ANCE and ColBERT from the perspective
of using BM25 as the reference point, going further than reporting average performances over
the entire query sets as in Table 2. To perform this analysis, we define the difficulty of
a query according to an effectiveness metric of the BM25 baseline, following Mothe et al. [18].
Due to the sparsity of the relevance judgements and the official evaluation metrics of the two
query sets, we adopt a different query difficulty classification for TREC 2019 and MSMARCO
Dev. For the TREC 2019 query set, a query is considered hard, resp. easy, for the BM25 baseline
system if its NDCG@10 (the official TREC metric in [19]) value is in the first quartile, resp. in
the fourth quartile, and medium otherwise. For the MSMARCO Dev query set, the official metric
MRR@10 per query is too sparse to allow percentile computations. Hence, we consider a Dev
query to be hard if its MRR@10 is less than or equal to 0.1, and easy otherwise.</p>
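        <p>A small sketch of these two difficulty classifications, assuming dicts of per-query BM25
scores (NDCG@10 for TREC 2019, MRR@10 for Dev; treating the quartile boundaries
inclusively is our assumption here):</p>
        <preformat>
import numpy as np

def trec2019_difficulty(bm25_ndcg10):
    """Hard if in the first quartile of BM25 NDCG@10, easy if in the fourth."""
    q1, q3 = np.percentile(list(bm25_ndcg10.values()), [25, 75])
    return {qid: ("hard" if q1 >= v else "easy" if v >= q3 else "medium")
            for qid, v in bm25_ndcg10.items()}

def dev_difficulty(bm25_mrr10):
    """Hard if BM25 MRR@10 is at most 0.1, easy otherwise."""
    return {qid: ("hard" if 0.1 >= v else "easy")
            for qid, v in bm25_mrr10.items()}
        </preformat>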
        <p>We partition the queries in each query set according to the corresponding difficulty
classification, and compute for how many queries the effectiveness of ANCE and ColBERT is higher
(denoted by W(in)) or lower (denoted by L(oss)) than that of BM25. For each partition, we also
compute the average reward and risk associated with the W and L queries, following [17].</p>
        <p>Table 3 reports the observed results. For the TREC 2019 queries, both ANCE and ColBERT
exhibit approximately the same number of wins/losses for each query difficulty level. However, ANCE
obtains higher rewards and higher risks on the class of easy queries than ColBERT (+0.1930
vs. +0.1827 and -0.1976 vs. -0.1380). On the medium difficulty class, the situation is reversed,
and ColBERT obtains both higher rewards and higher risks than ANCE (+0.3053 vs. +0.2978 and
-0.1521 vs. -0.1366). On the hard difficulty class, ColBERT is markedly superior to ANCE in terms
of reward (+0.4114 vs. +0.3750), and risk, even if such risk is computed over a single query. For
the MSMARCO Dev queries, ColBERT is able to improve the MRR@10 of both easy and hard
queries better than ANCE, and the losses are smaller for ColBERT than for ANCE.</p>
        <p>To conclude on RQ2, we have presented experimental evidence that single and multiple
representations are approximately equally effective on easy queries. In contrast, for hard queries,
the adoption of multiple embeddings helps w.r.t. the usage of a single embedding. We explain
this by noting that a single representation is learned so as to compress all the semantic information and
dependencies of the different tokens composing a query into a single embedding. On the other
hand, multiple representations – using one embedding per query token together with additional
masked tokens – can encode more diverse semantic information in the different embeddings,
allowing more relevant documents to be retrieved for queries that are hard to answer.</p>
        <p>Table 3 (excerpt): reward/risk w.r.t. BM25 on the TREC 2019 query set, by query
difficulty (the W/L counts are not recoverable from the extraction).
Easy: ANCE +0.1930/-0.1976, ColBERT +0.1827/-0.1380.
Medium: ANCE +0.2978/-0.1366, ColBERT +0.3053/-0.1521.
Hard: ANCE +0.3750/-0.1826, ColBERT +0.4114/-0.0415.</p>
        <p>[Figure 1: Per-query ΔNDCG@10 between ColBERT and ANCE on the TREC 2019 query set;
the per-query bars and rotated query labels are not recoverable from the text extraction.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.4. Per-query Comparison</title>
        <p>To address RQ3, in Figure 1 we present a per-query histogram comparing the ΔNDCG@10
between ColBERT and ANCE on the TREC 2019 query set; positive deltas indicate a higher
NDCG@10 for ColBERT than for ANCE. In total, ColBERT outperforms ANCE for 24 queries, while
the opposite is true for 17 queries; Δs less than 0.15 are omitted for clarity. On analysing
Figure 1, many queries requesting a definition appear to perform well for ColBERT
(e.g. 1124210, 490595). Indeed, on closer inspection of the TREC 2019 query set, out of 43 queries,
we found 19 such definitional queries – of which 16 were more effective for ColBERT.</p>
        <p>To illustrate other differences between the approaches, in Table 4 we select two non-definitional
queries where one approach markedly outperformed the other (but not the queries with the
most extreme deltas, which may be outliers).</p>
        <p>Table 4: Examples of passages retrieved by ANCE and ColBERT at top ranks. The Label column contains the
assessment of that document for that query in the qrel file, with – denoting unjudged.</p>
        <p>ColBERT &gt; ANCE – 527433: types of dysarthria from cerebral palsy.
Retrieved by ColBERT (Label 3): "There are three major types of dysarthria in cerebral palsy: spastic,
dyskinetic (athetosis) and ataxic. Speech impairments in spastic dysarthria involves
four major abnormalities of voluntary movement: spasticity, weakness, limited
range of motion and slowness of movement."
Retrieved by ANCE (Label 0): "The types of cerebral palsy are: 1 spastic: the most common type of cerebral
palsy; reflexes are exaggerated and muscle movement is stiff. 2 dyskinetic:
dyskinetic cerebral palsy is divided into two categories."</p>
        <p>ANCE &gt; ColBERT – 1063750: why did the us volunterilay enter ww1.
Retrieved by ColBERT (Label –): "The main event that led the US to entering ww2 was Japan bombing Pearl
Harbor. The day after the bombing u.s. joined the war On December 7, 1941,
the Japanese Navy lau…nched a surprise attack on the naval base at Pearl
Harbor, Hawaii.lthough the growing peril of Britain worried many, including
Roosevelt, it was not until the US was directly attacked at Pearl Harbor that
public and political opinion turned in favor of war with the Axis"
Retrieved by ANCE (Label 2): "The U.S entered WW1 for several reasons. The U.S entered for two main
reasons: one was that the Germans had declared unlimited German submarine
warfare and the Zimmermann note.The German had totally disregarded the
international laws protecting neutral nation's ships by sinking neutral ships.his
note was the last straw, causing Wilson to join the war. The Zimmermann note
and unlimited German submarine warfare were two of the biggest cause for the
U.S to join the Allies and go to war with Germany. During the war Germany..."</p>
        <p>Firstly, for query 527433 (‘types of dysarthria
from cerebral palsy’), ColBERT identifies a passage that clearly answers the query; in contrast,
the non-relevant passage identified at rank 2 by ANCE appears to have focused solely on
the ‘cerebral palsy’ aspect, omitting the dysarthria aspect of the query. Indeed, the Precision@10
of ANCE for this query was 3/10, compared to 6/10 for ColBERT. This suggests that ANCE's
compression of a complex information need into one embedding has caused an information
loss, with the model focusing on only a single aspect of the query, resulting in low effectiveness.</p>
        <p>On the other hand, for query 1063750 (‘why did the us volunterilay enter ww1’), ANCE
identified a relevant passage, but ColBERT identified a passage (1300452) focusing entirely on
the wrong World War (‘ww2’ rather than ‘ww1’). At least some of the reason for the conflation
of meanings is that neither ‘ww1’ nor ‘ww2’ appears in BERT's fixed vocabulary, e.g., the
latter is tokenised into word pieces as ‘w’, ‘##w’, ‘##2’. Hence, distinguishing between the ‘ww1’
and ‘ww2’ information needs requires context to be distributed across the three embeddings.</p>
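        <p>This tokenisation can be checked directly; a quick sketch, assuming the HuggingFace
tokenizer for bert-base-uncased:</p>
        <preformat>
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("ww2"))                       # ['w', '##w', '##2']
print(tok.tokenize("why did the us enter ww1"))  # [..., 'enter', 'w', '##w', '##1']
        </preformat>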
        <p>To analyse this passage further, Figure 2 shows the ColBERT interaction between the query
and document embeddings for this passage and query (the figure can be reproduced using the
explain_text() function within our PyTerrier_ColBERT library). In the figure, the darker shading in the
matrix is indicative of higher similarity; the highest similarity that is selected for a given query
embedding by the max-sim operator is indicated by a × symbol; and the histogram at the top of the
figure indicates the contribution of each query embedding to the final passage score.</p>
        <p>[Figure 2: ColBERT interaction matrix between the query ‘why did the us volunterilay enter ww1’
(including its [CLS], [Q], [SEP] and [MASK] tokens) and passage 1300452; the matrix shading
and the contribution histogram are not recoverable from the text extraction.]</p>
        <p>Indeed, inspection of the max similarities for this passage shows that the highest contribution to the
passage's score comes from the ‘##w’ token, with the ‘##1’ query embedding being highly similar
to the ‘##2’ document embedding. This suggests that the embeddings for ‘##1’ and ‘##2’ are not
sufficiently contextualised when following ‘##w’, or that ColBERT's max similarity computation
could be adapted to better address proximity. In contrast, ANCE retrieved passage 1300452
at rank 155, showing that the single representations for the passages sufficiently distinguish
between World War 1 vs. World War 2.</p>
        <p>In summary, in addressing RQ3, we observed that there exist some large differences between
ANCE and ColBERT for some queries. Our analysis found that ColBERT performs better than
ANCE for definitional type queries. Moreover, our analysis suggests that, in ANCE, the use of a
single embedding representation risks misinterpreting complex queries with multiple aspects,
as shown by the results in the previous subsection; for ColBERT, the max similarity operator can
overly focus on highly similar embeddings, at the risk of misinterpreting a query.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>Despite their recency, dense passage retrieval approaches have the effectiveness potential to
supplant the traditional inverted index data structure. Yet, different families of dense retrieval
are emerging, for which the comparative advantages and disadvantages are not yet clear. In
this work, we made a systematic study of single vs. multiple representation dense retrieval
approaches, namely ANCE and ColBERT. We found that, while both significantly outperformed
BM25 baselines across various metrics, ColBERT significantly outperformed ANCE for MAP on
TREC 2019 and MRR@10 on the MSMARCO Dev query set, was more effective for queries that
BM25 found hard, and was better at definitional queries as well as queries that had complex
information needs. On the other hand, ANCE has desirable qualities in terms of mean response
time and memory occupancy (see Table 1). We postulate that research should be directed
toward hybrid solutions, either reducing the size of the ColBERT embedding index, e.g., through
adaptations of static pruning, or through using multiple embeddings within ANCE for complex
queries/passages.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Nicola Tonellotto was partially supported by the Italian Ministry of Education and Research
(MIUR) in the framework of the CrossLab project (Departments of Excellence). Craig Macdonald
and Iadh Ounis acknowledge EPSRC grant EP/R018634/1: Closed-Loop Data Science for
Complex, Computationally- &amp; Data-Intensive Analytics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>in: Proc. NAACL</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <article-title>Pretrained transformers for text ranking: BERT and beyond</article-title>
          ,
          <year>2020</year>
          . arXiv:2010.06467.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2020</year>
          , p.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Let's measure run time! Extending the IR replicability infrastructure to include performance aspects</article-title>
          ,
          <source>in: OSIRRC@SIGIR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Learned-Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>From neural re-ranking to neural ranking: Learning a sparse representation for inverted indexing</article-title>
          ,
          <source>in: Proc. CIKM</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          , CEDR:
          <article-title>Contextualized embeddings for document ranking</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1101</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Efficient document re-ranking for transformers by precomputing term representations</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Expansion via prediction of importance with contextualization</article-title>
          ,
          <source>in: Proc. SIGIR</source>
          ,
          <year>2020</year>
          , p.
          <fpage>1573</fpage>
          -
          <lpage>1576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Z. Dai, J. Callan, Deeper text understanding for IR with contextual neural language modeling, in: Proc. SIGIR, 2019, pp. 985-988.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, 2017. arXiv:1702.08734.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: Proc. EMNLP, 2020, pp. 6769-6781.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Xiong, C. Xiong, Y. Li, K.-F. Tang, J. Liu, P. Bennett, J. Ahmed, A. Overwijk, Approximate nearest neighbor negative contrastive learning for dense text retrieval, in: Proc. ICLR, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. Macdonald, N. Tonellotto, On approximate nearest neighbour selection for multi-stage dense retrieval, in: Proc. CIKM, 2021.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] C. Macdonald, N. Tonellotto, Declarative experimentation in information retrieval using PyTerrier, in: Proc. ICTIR, 2020, pp. 161-168.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. Macdonald, N. Tonellotto, S. MacAvaney, I. Ounis, PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval, in: Proc. CIKM, 2021.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] L. Wang, P. N. Bennett, K. Collins-Thompson, Robust ranking models via risk-sensitive optimization, in: Proc. SIGIR, 2012, pp. 761-770.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Mothe, L. Laporte, A.-G. Chifu, Predicting Query Difficulty in IR: Impact of Difficulty Definition, in: Proc. KSE, 2019, pp. 1-6.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] N. Craswell, B. Mitra, D. Campos, E. Yilmaz, Overview of the TREC 2019 Deep Learning Track, in: Proc. TREC 2019, 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>