<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Efficient top-k document retrieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Engineering, New York University</institution>
          ,
          <addr-line>NY</addr-line>
          ,
          <country country="US">US</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Over the past few decades, the IR community has made a continuous effort to improve the efficiency of search in large collections of documents. Query processing is still one of the main bottlenecks in large-scale search systems. The top-k document retrieval problem, defined as reporting the k most relevant documents from a collection for a given query, can be extremely expensive, as it involves scoring large numbers of documents. In this work, we investigate the top-k document retrieval problem from several angles with the aim of improving the efficiency of this task in large-scale search systems. We briefly describe our initial findings and conclude by proposing future directions to follow.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 Introduction</title>
<p>In the past two decades, the amount of data being created has skyrocketed. The key to unlocking the full potential of these huge datasets is to make the most of advances in algorithms and tools capable of handling them. Many parts of the search engine architecture, including data acquisition, data analysis, and index maintenance, face critical challenges. Nevertheless, query processing remains the hardest to deal with, since its workload grows with both data size and query load. Although hardware is getting less expensive and more powerful every day, the size of the data and the number of searches are growing at an even faster rate. Much of the research and development in information retrieval is, indeed, aimed at improving retrieval efficiency.</p>
      <p>The most common structure used for text retrieval is an inverted index. For
each term in a parsed collection, it stores a list of numerical IDs of documents
containing this term, typically along with additional data, such as term
frequencies or precomputed quantized impact scores. We call all values associated with
a (term, document)-pair a posting.</p>
<p>The first problem we encounter is efficient index representation. Given the extremely large collections indexed by current search engines, even a single node of a large search cluster typically stores many billions of integers. In particular, compression of posting lists is of utmost importance, since they account for much of the data size and access costs.</p>
<p>Likewise, the choice of a retrieval algorithm is crucial to the efficiency of query processing. While query processing in search engines is a complex process, most systems appear to process a query by first evaluating a fairly simple ranking function over an inverted index. Due to the large sizes of most search collections, retrieving all potential matches is infeasible and undesirable. Most advanced query processing algorithms use dynamic pruning techniques, which allow them to skip the evaluation of many documents while still retrieving the top-k most relevant documents for a given query without any degradation of ranking effectiveness.</p>
<p>In this paper, we pose the following research questions:</p>
<p>RQ1 How does the choice of query processing algorithm depend on the compression method used?</p>
<p>RQ2 How is performance impacted by the number of results returned by the query processing algorithm?</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <sec id="sec-2-1">
        <title>2.1 Index compression</title>
        <p>Compression is an important technique employed in indexing systems to reduce
the size of an inverted index. A large amount of research has as primary objective
the inverted index space occupancy minimization, while maintaining fast query
processing speed.</p>
        <p>
Binary Packing. Binary Packing [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] groups numbers into fixed-size blocks. For each block, its selector b is the smallest number of bits required to binary-encode the largest element of the block. The selector is binary-encoded in one byte, followed by the values belonging to the block, each encoded in b bits. For better performance, we can store four selectors at a time in one 32-bit word, followed by their respective blocks. Lemire and Boytsov [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] proposed a Binary Packing method that exploits SIMD instructions. This method, called SIMD-BP, packs 128 consecutive integers into as few 128-bit words as possible. Selectors are stored in groups of 16, to fully utilize 128-bit SIMD operations.
        </p>
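<p>The block-packing scheme just described can be illustrated with a short sketch. This is an illustrative Python toy under our own (hypothetical) function names, not the paper's C++17 implementation; a bit string stands in for real bit-level output:</p>
<preformat>
```python
def bit_width(x: int) -> int:
    """Smallest number of bits needed to binary-encode x (at least 1)."""
    return max(1, x.bit_length())

def pack_block(values):
    """Encode one block: selector b is the width of the largest element;
    every value in the block is then written in exactly b bits."""
    b = max(bit_width(v) for v in values)
    payload = "".join(format(v, f"0{b}b") for v in values)
    return b, payload

def unpack_block(b, payload, count):
    """Decode a block given its selector b and the packed bit string."""
    return [int(payload[i * b:(i + 1) * b], 2) for i in range(count)]

block = [3, 7, 1, 12]           # a toy 4-integer block
b, payload = pack_block(block)  # b = 4, since max(block) = 12 needs 4 bits
assert b == 4
assert unpack_block(b, payload, len(block)) == block
```
</preformat>
<p>The cost of a block is thus one selector byte plus b bits per value, which is why a single large element inside a block inflates the width of all its neighbors.</p>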
        <p>
          Variable Byte. The encodings in the Variable Byte family are known for their
high decoding speed. Arguably the simplest and best known is VByte [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which
uses 7 bits per byte to store the binary representation of a number and one
remaining bit to indicate whether the same binary code continues in the next byte.
These continuation bits, when put together, form unary-encoded byte-lengths
of encoded numbers. To improve decoding speed, Dean [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] proposed VarintGB,
which groups these lengths together and encodes them in binary instead: one
byte is used to store four 2-bit sizes of the next four integers, followed by their
binary representations.
        </p>
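<p>The VByte idea above can be sketched in a few lines of Python (an illustrative toy with hypothetical names; conventions differ on whether the marker bit flags the last byte or a continuation byte, and this toy flags the last byte):</p>
<preformat>
```python
def vbyte_encode(n: int) -> bytes:
    """Classic VByte: 7 payload bits per byte; high bit set on the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n % 128)    # continuation byte: high bit 0
        n //= 128
    out.append(n + 128)        # terminator byte: high bit 1
    return bytes(out)

def vbyte_decode(data: bytes) -> int:
    """Decode one VByte-encoded integer from the start of data."""
    n, shift = 0, 0
    for byte in data:
        n += (byte % 128) * (1 << shift)
        shift += 7
        if byte >= 128:        # high bit set: last byte of this number
            break
    return n

assert vbyte_decode(vbyte_encode(300)) == 300
assert len(vbyte_encode(127)) == 1 and len(vbyte_encode(128)) == 2
```
</preformat>
<p>Because each number ends on a byte boundary, decoding needs no bit-level arithmetic, which is the source of the family's high decoding speed.</p>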
        <p>
Recently, several SIMD-based implementations of variable-byte encodings have been shown to be extremely efficient [
          <xref ref-type="bibr" rid="ref10 ref11 ref3">3, 10, 11</xref>
          ]. Stepanov et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] analyzed a family of SIMD-based algorithms, including a SIMD version of Group Varint, and found the fastest to be Varint-G8IU: consecutive numbers are grouped in 8-byte blocks, preceded by a 1-byte descriptor containing unary-encoded lengths (in bytes) of the integers in the block. If the next integer cannot fit in the current block, the remaining bytes are left unused.
        </p>
        <p>
PForDelta. PForDelta [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] encodes a large number of integers (say, 64 or 128) at a time by choosing a k such that most (say, 90%) of the integers can be encoded in k bits. The remaining values, called exceptions, are encoded separately using another method. More precisely, we select b and k such that most values are in the range [b, b + 2^k − 1], and thus can be encoded in k bits by shifting them to the range [0, 2^k − 1]. One variant, OptPFD [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], selects b and k to optimize for space or decoding cost, with most implementations focusing on space.
        </p>
        <p>
          Elias-Fano. Given a monotonically increasing integer sequence S of size n, such that S_{n−1} &lt; u, we can encode each of its numbers in binary using ⌈log u⌉ bits. Instead of writing them directly, Elias-Fano coding [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] splits each number into two parts: a low part consisting of the l = ⌈log(u/n)⌉ right-most bits, and a high part consisting of the remaining ⌈log u⌉ − l left-most bits. The low parts are explicitly written in binary for all numbers, in a single stream of bits. The high parts are compressed by writing, in negative-unary form (i.e., with the roles of 0 and 1 reversed), the gaps between the high parts of consecutive numbers. Sequential decoding is done by simultaneously retrieving low and high parts, and concatenating them. Random access requires finding the location of the i-th 0- or 1-bit within the unary part of the data using an auxiliary succinct data structure. Furthermore, a NextGEQ(x) operation, which returns the smallest element that is greater than or equal to x, can be implemented efficiently. Observe that h_x, the higher bits of x, is used to find the number of elements having higher bits smaller than h_x, denoted as p. Then, a linear scan against l_x, the lower bits of x, can be employed starting from position p of the lower-bits array of the encoded list.
        </p>
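<p>The low/high split can be sketched as follows. This is an illustrative Python toy under hypothetical names; real implementations operate on packed bit arrays and use the succinct rank/select structures mentioned above, and `next_geq` here only shows the reference behaviour of NextGEQ(x), not its efficient implementation:</p>
<preformat>
```python
import math

def elias_fano(seq, u):
    """Split each value into l low bits (stored verbatim) and high bits
    (stored as unary-coded gaps), with l = ceil(log2(u / n))."""
    n = len(seq)
    l = max(0, math.ceil(math.log2(u / n)))
    lows  = [x % (1 << l) for x in seq]   # low parts, written in l bits each
    highs = [x >> l for x in seq]         # high parts, gap-coded in unary
    high_bits, prev = "", 0
    for h in highs:
        high_bits += "0" * (h - prev) + "1"   # gap zeros, then a terminator
        prev = h
    return l, lows, high_bits

def next_geq(seq, x):
    """Reference behaviour of NextGEQ(x): smallest element >= x."""
    return min((v for v in seq if v >= x), default=None)

s = [3, 4, 7, 13, 14, 15, 21, 43]
l, lows, high_bits = elias_fano(s, 44)   # l = 3, since ceil(log2(44/8)) = 3
assert l == 3
assert next_geq(s, 8) == 13
```
</preformat>
<p>The high-bits stream has exactly n ones and at most u/2^l zeros, which yields the familiar near-optimal space bound of roughly 2 + ⌈log(u/n)⌉ bits per element.</p>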
        <p>
The above version of Elias-Fano coding cannot exploit clustered data distributions for better compression. This is achieved by a modification called Partitioned Elias-Fano (PEF) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that splits the sequence into b blocks, and then uses an optimal choice of the number of bits in the high and low parts for each block.
        </p>
      </sec>
      <sec id="sec-2-2">
<title>2.2 Query processing</title>
        <p>
Traversing the index structures of all the query terms and computing the scores of all the postings is the way to evaluate a user query exhaustively. Unfortunately, the processing cost of each query increases linearly with the number of documents, making it very expensive for large collections. To overcome this problem, many researchers have proposed so-called early-termination techniques for finding the top-k ranked results without computing all posting scores.
        </p>
        <p>
          MaxScore. In 1995, Turtle and Flood [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] first proposed an algorithm named MaxScore. MaxScore is an optimization technique that relies on the maximum impact score of each term. It has been applied to both term-at-a-time (TAAT) and document-at-a-time (DAAT) evaluation strategies, but the latter is definitely the most effective optimization.
        </p>
<p>Given a list of query terms q = {t1, t2, …, tm} such that max_{t_i} ≥ max_{t_{i+1}}, at any point of the algorithm the query terms (and associated posting lists) are partitioned into essential terms q+ = {t1, t2, …, tp} and nonessential terms q− = {tp+1, tp+2, …, tm}. This partition depends on the current threshold T for a document to enter the top-k results, and is defined by the smallest p such that Σ_{t ∈ q−} max_t &lt; T. Thus, no document containing only nonessential terms can make it into the top-k results. We can now perform disjunctive processing over only the essential terms, with lookups into the nonessential lists.</p>
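<p>The essential/nonessential split can be expressed compactly. The sketch below is a Python toy over per-term upper bounds only, with hypothetical names; a real implementation re-evaluates the split as the threshold T grows during query processing:</p>
<preformat>
```python
def partition_terms(max_scores, threshold):
    """Given per-term score upper bounds sorted in decreasing order, find
    the smallest split p such that the nonessential tail alone cannot
    reach the current top-k threshold."""
    m = len(max_scores)
    for p in range(m + 1):
        if sum(max_scores[p:]) < threshold:
            return max_scores[:p], max_scores[p:]  # essential, nonessential
    return max_scores, []  # unreachable for threshold > 0

ess, non = partition_terms([5.0, 3.0, 2.0, 1.0], threshold=4.5)
assert ess == [5.0, 3.0] and non == [2.0, 1.0]
```
</preformat>
<p>As the threshold rises over the course of a query, more terms migrate into the nonessential set, and their lists are touched only for targeted lookups.</p>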
        <p>
Variable BlockMax WAND. Similar to MaxScore, WAND [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] (Weighted or Weak AND) also utilizes the max_t values. One shortcoming of these techniques is that they use maximum impact scores over the entire lists. Thus, if max_t is much larger than the other scores in the list, then the impact-score upper bound will usually be a significant overestimate. BlockMax WAND (BMW) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] addresses this by introducing block-wise maximum impact scores. Variable BlockMax WAND [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ], or just VBMW, generalizes BMW by allowing variable-length blocks. More precisely, it uses a block partitioning such that the sum of differences between block maximum scores and individual scores is minimized. This results in better upper-bound estimation and more frequent document skipping.
        </p>
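<p>The partitioning objective minimized by VBMW can be illustrated by computing, for a candidate partition, the total overestimation it induces (a Python toy with hypothetical names; `boundaries` lists the indices where a new block starts):</p>
<preformat>
```python
def partition_cost(scores, boundaries):
    """Cost of a block partition: sum over all postings of
    (block maximum - posting score), i.e. the total overestimation
    incurred when the block maximum is used as an upper bound."""
    edges = [0] + boundaries + [len(scores)]
    cost = 0.0
    for a, b in zip(edges, edges[1:]):
        block = scores[a:b]
        m = max(block)
        cost += sum(m - s for s in block)
    return cost

scores = [1.0, 1.1, 9.0, 9.2, 1.2]
# Splitting before the run of large scores keeps block maxima tight,
# so this partition overestimates far less than one that mixes them:
assert partition_cost(scores, [2]) < partition_cost(scores, [3])
```
</preformat>
<p>Minimizing this cost over all partitions with a bounded number of blocks is exactly what gives VBMW its tighter per-block upper bounds, and hence more skipping.</p>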
      </sec>
    </sec>
    <sec id="sec-3">
<title>3 Results and Discussion</title>
      <p>
Testing details. All methods are implemented in C++17 and compiled with GCC 7.3.0 using the highest optimization settings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The tests are performed on a machine with an Intel Core i7-4770 quad-core 3.40GHz CPU with 32GiB of RAM, running Linux 4.15.0.
      </p>
      <p>Data Sets. We performed experiments on two standard datasets: Gov2 and
ClueWeb09. For each document in the collection, the body text was extracted,
the words lowercased and stemmed using the Porter2 stemmer; no stopwords
were removed.</p>
      <p>Queries. To evaluate query processing speed, we use the TREC 2005 and TREC 2006 Terabyte Track Efficiency Task queries, sampling 500 queries from each set.</p>
      <p>[Figure 1: Average query time (ms) on Gov2 and Clueweb09.]</p>
      <sec id="sec-3-1">
        <title>3.1 Improving Partitioned Elias-Fano</title>
        <p>Figure 1 shows the average query time of MaxScore and VBMW with indexes encoded using different encoding techniques. It is interesting to notice that although PEF is time-efficient with VBMW, it performs poorly with the MaxScore query processing algorithm. The reason lies in PEF's ability to efficiently perform skipping within a posting list, while being suboptimal in sequential decoding. In fact, the latter still represents a considerable fraction of the time spent on processing a query using MaxScore. Future research could focus on redesigning PEF to have a twofold decoding ability: while we wish to maintain its fast skipping capabilities, PEF needs to match the sequential decoding speed of modern SIMD-enabled competitors. As a result, a query algorithm could exploit this duality by switching to the most appropriate access policy while traversing the index.</p>
        <p>[Figure 2: Average query time (ms) versus number of retrieved documents (10^1 to 10^4) on Gov2 and Clueweb09, for VBMW and MaxScore with OptPFD, SIMD-BP128, VarintG8IU, and PEF.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Improving MaxScore for large k</title>
        <p>Much previous work has focused on smaller values of k, such as 10 or 100. However, when query processing algorithms are used for subsequent reranking by a complex ranker, more results are needed. The results for a range of values of k are shown in Figure 2 on a log-log scale for better readability. First, we notice a significant time increase with larger k across all encoding techniques. We find this increase to be faster for VBMW. Also, note that at k = 10,000 even the performance of PEF, which previously performed well only with VBMW, is similar for both algorithms. While we intend to gain deeper insight into the reasons behind the VBMW results in order to devise a solution that overcomes these issues, we acknowledge that MaxScore might be better suited for some cases of top-k document retrieval. We believe there is potential for further speeding up MaxScore in scenarios where k is large. As mentioned in Section 2.2, nonessential lists are only used for lookups in order to refine document scores. It would be interesting to explore redesigning this step to iterate over and test document-ID membership in inverted lists with the use of probabilistic data structures.</p>
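<p>As one illustration of that last idea, a probabilistic structure such as a Bloom filter could answer "does this document ID occur in this list?" without decoding postings. The sketch below is a toy Python example under our own naming, not a proposal from this paper; false positives are possible, so a positive answer would still need a verifying lookup in the actual list:</p>
<preformat>
```python
import hashlib

class BloomFilter:
    """Toy Bloom filter for document-ID membership tests: k hash
    positions per ID are set in an m-bit array; no false negatives."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # the m-bit array, kept as a Python big integer

    def _positions(self, doc_id):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{doc_id}".encode(), digest_size=8)
            yield int.from_bytes(h.digest(), "big") % self.m

    def add(self, doc_id):
        for p in self._positions(doc_id):
            self.bits |= 1 << p

    def might_contain(self, doc_id):
        return all(self.bits >> p & 1 for p in self._positions(doc_id))

bf = BloomFilter()
for d in [3, 17, 42]:
    bf.add(d)
assert bf.might_contain(17)   # members always answer True
```
</preformat>
<p>Such a filter is far smaller than the posting list itself, so it could make the nonessential-list lookups of MaxScore cheaper when k, and hence the number of lookups, is large.</p>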
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Conclusions</title>
      <p>Top-k document retrieval has been gaining increasing attention in recent years due to the spread of complex ranking techniques that would be impracticable to apply to the entire collection of documents. This brings researchers many opportunities for improving the efficiency of this essential phase. In this paper, we illustrate our initial results and propose different approaches that may together yield several improvements to top-k document retrieval.</p>
      <p>Acknowledgements. This research was supported by NSF Grant IIS-1718680 and a grant from Amazon.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Vo Ngoc</given-names>
            <surname>Anh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Index compression using 64-bit words</article-title>
          .
          <source>Softw. Pract. Exper.</source>
          ,
          <volume>40</volume>
          :
          <fpage>131</fpage>
          –
          <lpage>147</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Andrei Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          , David Carmel,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Herscovici</surname>
          </string-name>
          , Aya Soffer, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Zien</surname>
          </string-name>
          .
          <article-title>Efficient query evaluation using a two-level retrieval process</article-title>
          .
          <source>In Proc. CIKM</source>
          , pages
          <fpage>426</fpage>
          –
          <lpage>434</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Challenges in building large-scale information retrieval systems: invited talk</article-title>
          .
          <source>In Proc. WSDM</source>
          , page 1
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Shuai</given-names>
            <surname>Ding</surname>
          </string-name>
          and
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <article-title>Faster top-k document retrieval using block-max indexes</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>993</fpage>
          –
          <lpage>1002</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Lemire</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Boytsov</surname>
          </string-name>
          .
          <article-title>Decoding billions of integers per second through vectorization</article-title>
          .
          <source>Softw. Pract. Exper.</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>29</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Mallia</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elia</given-names>
            <surname>Porciani</surname>
          </string-name>
          .
          <article-title>Faster blockmax WAND with longer skipping</article-title>
          .
          <source>In Proc. ECIR</source>
          , pages
          <fpage>771</fpage>
          –
          <lpage>778</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Mallia</surname>
          </string-name>
          , Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and
          <string-name>
            <given-names>Rossano</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <article-title>Faster BlockMax WAND with Variable-sized blocks</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>625</fpage>
          –
          <lpage>634</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Mallia</surname>
          </string-name>
          , Michal Siedlaczek, Joel Mackenzie, and
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <article-title>PISA: performant indexes and search for academia</article-title>
          .
          <source>In Proc. OSIRRC@SIGIR</source>
          , pages
          <fpage>50</fpage>
          –
          <lpage>56</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Ottaviano</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rossano</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <article-title>Partitioned Elias-Fano indexes</article-title>
          .
          <source>In Proc. SIGIR</source>
          , pages
          <fpage>273</fpage>
          –
          <lpage>282</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Plaisance</surname>
          </string-name>
          , Nathan Kurz, and Daniel Lemire.
          <article-title>Vectorized VByte decoding</article-title>
          . CoRR, abs/1503.07387,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Alexander A.</given-names>
            <surname>Stepanov</surname>
          </string-name>
          , Anil R. Gangolli, Daniel E. Rose,
          <string-name>
            <given-names>Ryan J.</given-names>
            <surname>Ernst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paramjit S.</given-names>
            <surname>Oberoi</surname>
          </string-name>
          .
          <article-title>SIMD-based decoding of posting lists</article-title>
          .
          <source>In Proc. CIKM</source>
          , pages
          <fpage>317</fpage>
          –
          <lpage>326</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Larry H.</given-names>
            <surname>Thiel</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Heaps</surname>
          </string-name>
          .
          <article-title>Program design for retrospective searches on large data bases</article-title>
          .
          <source>Information Storage and Retrieval</source>
          ,
          <volume>8</volume>
          :
          <fpage>1</fpage>
          –
          <lpage>20</lpage>
          ,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Howard R.</given-names>
            <surname>Turtle</surname>
          </string-name>
          and
          <string-name>
            <given-names>James</given-names>
            <surname>Flood</surname>
          </string-name>
          .
          <article-title>Query evaluation: Strategies and optimizations</article-title>
          .
          <source>Inf. Process. Manage.</source>
          ,
          <volume>31</volume>
          (
          <issue>6</issue>
          ):
          <fpage>831</fpage>
          –
          <lpage>850</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Sebastiano</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <article-title>Quasi-succinct indices</article-title>
          .
          <source>In Proc. WSDM</source>
          , pages
          <fpage>83</fpage>
          –
          <lpage>92</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Hao</given-names>
            <surname>Yan</surname>
          </string-name>
          , Shuai Ding, and
          <string-name>
            <given-names>Torsten</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <article-title>Inverted index compression and query processing with optimized document ordering</article-title>
          .
          <source>In Proc. WWW</source>
          , pages
          <fpage>401</fpage>
          –
          <lpage>410</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Marcin</given-names>
            <surname>Zukowski</surname>
          </string-name>
          , Sandor Heman, Niels Nes, and
          <string-name>
            <given-names>Peter A.</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <article-title>Super-scalar RAM-CPU cache compression</article-title>
          .
          <source>In Proc. ICDE</source>
          , page
          <volume>59</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>