<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient Dynamic Pruning with Proximity Support</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Tonellotto</string-name>
          <email>nicola.tonellotto@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <email>craigm@dcs.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <email>ounis@dcs.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science, University of Glasgow</institution>
          ,
          <addr-line>Glasgow, G12 8QQ</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Science and Technologies Institute, National Research Council</institution>
          ,
          <addr-line>Via G. Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>31</fpage>
      <lpage>35</lpage>
      <abstract>
<p>Modern retrieval approaches apply not just single-term weighting models when ranking documents; instead, proximity weighting models are in common use, which highly score the co-occurrence of pairs of query terms in close proximity to each other in documents. The adoption of these proximity weighting models can cause a computational overhead when documents are scored, negatively impacting the efficiency of the retrieval process. In this paper, we discuss the integration of proximity weighting models into efficient dynamic pruning strategies. In particular, we propose to modify document-at-a-time strategies to include proximity scoring without any modifications to pre-existing index structures. Our resulting two-stage dynamic pruning strategies only consider single query terms during first-stage pruning, but can early terminate the proximity scoring of a document if it can be shown that it will never be retrieved. We empirically examine the efficiency benefits of our approach using a large Web test collection of 50 million documents and 10,000 queries from a real query log. Our results show that our proposed two-stage dynamic pruning strategies are considerably more efficient than the original strategies, particularly for queries of 3 or more terms. Copyright © 2010 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. LSDS-IR Workshop, July 2010. Geneva, Switzerland.</p>
      </abstract>
      <kwd-group>
        <kwd>Dynamic Pruning</kwd>
        <kwd>Efficient Proximity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In most information retrieval (IR) systems, the relevance
score for a document given a query follows the general
outline given by the best match strategy: a score is calculated
for each query term occurring in the document. These scores
are then aggregated by a summation to give the final
document relevance score. However, there are many queries
where the relevant documents contain the query terms in
close proximity. Hence, modern retrieval systems apply not
just single-term weighting models when ranking documents.
Instead, proximity weighting models are commonly applied,
which highly score the co-occurrence of pairs of query terms
in close proximity to each other in documents [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Dynamic pruning strategies reduce the scoring of
documents, such that efficient retrieval can be obtained,
without impacting on the retrieval effectiveness before rank K
- such strategies are safe-up-to-rank-K. However, when
additional proximity scores must be calculated for each
document, the computational overhead impacts the efficiency of
the retrieval process. While pruning techniques have been
studied to efficiently score documents without considering
term proximity [
        <xref ref-type="bibr" rid="ref20 ref23 ref4">4, 20</xref>
        ], there are very few proposals
considering efficient top K retrieval where proximity is
considered [
        <xref ref-type="bibr" rid="ref19 ref21 ref22">19, 21, 22</xref>
        ]. Moreover, these proposals require
modifications of the index structure to implement efficient scoring
strategies. Indeed, such modifications include sorting the
posting lists by frequency or impact [
        <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
        ], or using
additional index structures containing the intersection of pairs
of posting lists [
        <xref ref-type="bibr" rid="ref19 ref21 ref22">19, 21, 22</xref>
        ]. However, these can lead to
negative effects on other aspects of the IR system, such as the
compression of index structures or the impossibility to use
other existing ranking strategies.
      </p>
      <p>
        This work contributes a study into the behaviour of
dynamic pruning strategies when combined with proximity
weighting models. In particular, we analyse two existing
document-at-a-time (DAAT) dynamic pruning strategies,
namely MaxScore [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and Wand [
        <xref ref-type="bibr" rid="ref23 ref4">4</xref>
        ], that can efficiently
score documents without decreasing the retrieval
effectiveness at rank K, nor requiring impact sorted indices.
Moreover, we propose a runtime modification of these
strategies to take into account proximity scores. We generate at
runtime the posting lists of the term pairs, and
transparently include the processing of these pair posting lists in the
MaxScore and Wand strategies. Next, we propose a
reorganisation of these strategies to increase their efficiency.
Using thorough experiments on a 50 million document
corpus and 10,000 queries from a real query log, we evaluate
the proposed modifications to determine their efficiency.
      </p>
      <p>The remainder of this paper is structured as follows: In
Section 2, we describe the state-of-the-art approaches to
efficient ranking and the current existing solutions taking into
account proximity scoring. In Section 3, we describe in
detail the proposed framework to support proximity scores in
DAAT strategies, and in Section 4, we evaluate the
efficiency of the proposed modification. We provide concluding
remarks in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>In the following, we outline the state-of-the-art strategies
of dynamic pruning, followed by a discussion on proximity
weighting models.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Dynamic Pruning</title>
      <p>
        The algorithms to match and score documents for a query
fall into two main categories [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: in term-at-a-time (TAAT)
scoring, the query term posting lists are processed and scored
in sequence, so that documents containing query term ti gain
a partial score before scoring commences on term ti+1. In
contrast, in document-at-a-time (DAAT) scoring, the query
term postings lists are processed in parallel, such that all
postings of document dj are considered before scoring
commences on dj+1. Compared to TAAT, DAAT has a smaller
memory footprint, as it does not maintain intermediate
scores for many documents, and is
reportedly applied by large search engines [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An alternative
strategy to DAAT and TAAT is called score-at-a-time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
however this is suitable only for indices sorted or partially sorted
by document importance, which must be calculated before
the actual query processing. The algorithms from the family
of threshold algorithms [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] work similarly.
      </p>
      <p>
        Efficient dynamic pruning strategies do not rank every
document in the collection for each user query; they manage
to rank only the documents that have a chance to enter
the top-K results returned to the users. These
strategies are safe-up-to-rank-K [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], meaning that the ranking
of documents up to rank K retains its full
effectiveness, but with increased efficiency. Dynamic pruning
strategies rely on maintaining, at query scoring time, a threshold
score that documents must overcome to be considered in the
top-K documents. To guarantee that the dynamic pruning
strategy will provide the correct top-K documents, an upper
bound for each term on its maximal contribution to the score
of any document in its posting list is used. In this paper, we
focus on two state-of-the-art safe-up-to-rank-K DAAT
dynamic pruning strategies, namely MaxScore and Wand.
      </p>
      <p>The MaxScore strategy maintains, at query scoring time,
a sorted list containing the current top-K documents scored
so far. The list is sorted in decreasing order of score. The
score of the last top-K document is a threshold score that
documents must overcome to be considered in the top-K
documents. A new document is given a partial score while
the posting lists with that document are processed. A
document scoring can terminate early when it is possible to
guarantee that the document will never obtain a score greater
than that of the current threshold. This happens when the
current document score plus the upper bounds of terms yet
to be scored is not greater than the threshold.</p>
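      <p>This early-termination condition can be sketched as follows (a minimal illustrative sketch, not the implementation evaluated in this paper; the per-term contributions to this document and the per-term upper bounds are assumed to be given):</p>

```python
# Sketch of the MaxScore early-termination test for a single document.
# Terms are scored in a fixed order; scoring stops as soon as the partial
# score plus the upper bounds of the terms yet to be scored cannot beat
# the current threshold (the score of the K-th ranked document so far).

def score_document(doc_scores, upper_bounds, threshold):
    """doc_scores[i]: contribution of term i to this document (0 if absent).
    upper_bounds[i]: maximal contribution of term i to any document.
    Returns the document's score, or None if scoring terminated early."""
    # suffix_bounds[i] = sum of upper bounds of terms i..n-1 (yet to be scored)
    suffix_bounds = [0.0] * (len(upper_bounds) + 1)
    for i in range(len(upper_bounds) - 1, -1, -1):
        suffix_bounds[i] = suffix_bounds[i + 1] + upper_bounds[i]

    score = 0.0
    for i, s in enumerate(doc_scores):
        # terminate: even with all remaining upper bounds, the document
        # cannot exceed the threshold
        if threshold >= score + suffix_bounds[i]:
            return None
        score += s
    return score
```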
      <p>
        The Wand strategy maintains the same top-K documents
list and the threshold score, but, for any new document,
it calculates an approximate score, summing up some
upper bounds for the terms associated with the document. If
this approximate score is greater than the current threshold,
then the document is fully scored. It is then inserted in the
top-K candidate document set if this score is greater than
the current threshold, and the current threshold is updated.
If the approximate score check fails, the next document is
processed. The selection of the next document to score is
optimised [
        <xref ref-type="bibr" rid="ref23 ref4">4</xref>
        ] – however, for our purposes, it is of note that
the set of postings lists are sorted by the document identifier
(docid) they currently represent. More details on the Wand
document selection strategy, which uses the skipping [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] of
postings in the posting lists to reduce disk IO and increase
efficiency, is presented in Appendix A.
      </p>
      <p>
        The MaxScore and Wand dynamic pruning strategies
can both enhance retrieval efficiency, whilst ensuring that
the top K documents are fully scored – i.e. that the
retrieval effectiveness at rank K is not at all negatively
impacted. Generally speaking, Wand is more efficient [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
due to its ability to skip postings for unimportant query
terms. Note that both strategies examine at least one term
from each document, and hence cannot benefit efficiency for
single term queries.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Proximity</title>
      <p>
        There are many queries where the relevant documents
contain the query terms in close proximity. Hence, modern
retrieval systems apply not just single-term weighting
models when ranking documents. Instead, proximity weighting
models are commonly applied, which highly score the
co-occurrence of pairs of query terms in close proximity to each
other in documents [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Hence, some scoring proximity (or
term dependence) models have recently been proposed that
integrate single term and proximity scores for ranking
documents [
        <xref ref-type="bibr" rid="ref15 ref18 ref5">5, 15, 18</xref>
        ]. In this manner, the basic ranking model
of an IR system for a query Q can be expressed as:
score(d, Q) = ω · S(d) + κ · Σ_{t ∈ Q} score(tf_d, *_d, t) + φ · prox(d, Q)
where S(d) is the combination of some query-independent
features of document d (e.g. PageRank, URL length), and
score(tf_d, *_d, t) is the application of a weighting model to
score the tf_d occurrences of term t in document d. *_d denotes
any other document statistics required by a particular
weighting model (e.g. document length). prox(d, Q) represents
some proximity document scoring function. The influence of
the various features is controlled using the weights ω, κ and φ.
      </p>
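      <p>As an illustration only, this additive combination can be sketched as follows (S, score and prox are hypothetical stand-in functions for the query-independent features, the per-term weighting model and the proximity model; this is not the authors' implementation):</p>

```python
# Sketch of the additive ranking function: query-independent evidence,
# summed single-term scores, and a proximity score, combined with the
# weights omega, kappa and phi.

def rank_score(doc, query, S, score, prox, omega, kappa, phi):
    """Combine the three sources of evidence for one document."""
    term_part = sum(score(doc, t) for t in query)  # sum over t in Q
    return omega * S(doc) + kappa * term_part + phi * prox(doc, query)
```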
      <p>
        However, none of the proximity weighting models
proposed have been designed for efficient document scoring.
The main approaches to integrate proximity weighting
models into pruning strategies require modifications to the
index structure to include information on the proximity scores
upper bounds. In [
        <xref ref-type="bibr" rid="ref19 ref21 ref22">19, 21, 22</xref>
        ], the authors detail several
approaches to leverage early termination when proximity
scores are included in the ranking model. While these
strategies alter the index structure (e.g. by adding term-pair
inverted indices), we aim to exploit the proximity scores
without modifying the index structure (other than keeping
position occurrence information in the standard inverted
index posting list). In particular, we use the sequential term
dependence model of Markov Random Fields (MRF) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
which has been shown to be effective at modelling the
proximity of query term occurrences in documents. In MRF, the
proximity score is calculated as follows:
      </p>
      <p>
        prox(d, Q) = Σ_{p=(t_i, t_{i+1}) ∈ Q} [ score(pf(t_i, t_{i+1}, d, k_1), l_d, p) + score(pf(t_i, t_{i+1}, d, k_2), l_d, p) ]
      </p>
      <p>
where pf(t_i, t_{i+1}, d, k) represents the number of occurrences
of the pair of sequential query terms (t_i, t_{i+1}) occurring in
document d in windows of size k (abbreviated as the pair
frequency pf_d). Following [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], we set κ = 1, φ = 0.1, and
k_1 = 2 and k_2 = 8, to account for the proximity of two
terms as an exact phrase, and proximity at distance 8,
respectively. score(pf_d, l_d, p) is implemented using Dirichlet
language modelling [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], but where pair frequency takes the
role of term frequency. However, for the background
statistics of the language model, in contrast to term weighting,
when using proximity weighting, it is common to assume a
constant frequency for the pair in the collection [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] (as implemented by the authors of MRF in the Ivory
retrieval system, see www.umiacs.umd.edu/~jimmylin/ivory).
      </p>
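      <p>As a sketch, the pair frequency can be computed from the per-term position lists recorded in the postings. The following is one simple interpretation (counting ordered co-occurrences of the two terms within a window of size k), not necessarily the exact counting used by MRF:</p>

```python
# Sketch of a pair-frequency computation from per-term position lists:
# an ordered occurrence of (t_i, t_{i+1}) counts when the second term
# follows the first within a window of size k. With k = 2 this counts
# exact adjacent phrases.

def pair_frequency(positions_a, positions_b, k):
    """positions_a, positions_b: sorted token positions of t_i and
    t_{i+1} in the document. Returns the windowed pair frequency."""
    pf = 0
    for pa in positions_a:
        for pb in positions_b:
            # b strictly follows a, and both fall inside the window
            if pb > pa and pa + k > pb:
                pf += 1
    return pf
```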
    </sec>
    <sec id="sec-5">
      <title>3. FRAMEWORK</title>
      <p>The integration of proximity weighting models within
efficient dynamic pruning strategies requires the
materialisation of term pair posting lists and their integration into the
existing dynamic pruning decision mechanism. In the
following, we discuss how we propose to address both aspects.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Term pair posting lists</title>
      <p>Most dynamic pruning algorithms use posting list
iterators – object-oriented interfaces to a posting list, allowing a
posting to be read, or to be moved on to the next posting.
With a standard inverted index, one posting list’s iterator
represents the documents in which a single query term
occurs, ordered by docid.</p>
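      <p>Such a posting list iterator can be sketched as follows (an illustrative in-memory stub; real iterators read and decompress postings from disk and support skipping):</p>

```python
# Minimal docid-ordered posting-list iterator in the object-oriented
# style described above: read the current posting, or move on to the
# next one.

class PostingIterator:
    END = float('inf')  # sentinel docid once the list is exhausted

    def __init__(self, postings):
        # postings: list of (docid, term_frequency), sorted by docid
        self._postings = postings
        self._i = 0

    @property
    def docid(self):
        """Docid of the current posting, or END when exhausted."""
        if len(self._postings) > self._i:
            return self._postings[self._i][0]
        return self.END

    @property
    def freq(self):
        """Term frequency of the current posting."""
        return self._postings[self._i][1]

    def next(self):
        """Advance to the next posting."""
        self._i += 1
```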
      <p>
        Proximity weighting models require knowledge of the
occurrence of pairs of query terms in a document. The
posting list of pairs of terms can be constructed either statically
(i.e., at indexing time, calculating the intersections of all
pairs of term posting lists) or dynamically (i.e., at retrieval
time, generating term pair postings on the fly). Previous
approaches [
        <xref ref-type="bibr" rid="ref19 ref21 ref22">19, 21, 22</xref>
        ] investigated different methodologies to
statically calculate these intersections. However, the static
approach has two drawbacks. Firstly, storing new posting
lists requires additional space on disk, and secondly, the
pairs of terms whose posting lists must be intersected must
be known in advance (e.g. by identifying popular phrases
in the corpus [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]), to avoid generating a large number of
new, potentially useless posting lists. While these
drawbacks may be mitigated by caching solutions that store paired
posting lists [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], even in this case, there is always a relative
consumption of disk or memory resources.
      </p>
      <p>Instead, the pair posting lists can be built dynamically.
Given two single term iterators on postings lists, there is a
valid term pair posting each time they point to the same
docid. In order to transparently include these pair postings
in existing DAAT strategies, we must be sure that they are
ordered by docid. A pair posting list is illustrated in
Figure 1, based on the postings for terms t1 and t2. In our
proposed approach, to avoid additional I/O operations at
runtime, only the single term posting lists are responsible
for reading from disk and decompressing the single
postings, while the pair posting docid is updated each time a
new single posting is read with the minimum of the current
single term docids. The pair posting is valid only when the
docids of the underlying single term posting lists are equal
(i.e., in Figure 1, only two valid postings exist, namely
docid 1 and docid 8.). When a term posting list ends, all the
associated pair posting lists end as well. Overall, the pair
posting list is docid-sorted and cannot skip over potentially
useful term pair postings, however, a number of invalid pair
postings will occur (e.g. (8,2) and (9,14) in Figure 1).
</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Dynamic pruning with proximity</title>
      <p>
        The dynamic pair posting lists can be directly put to
work in existing DAAT strategies without modification. When
a term pair posting is selected for scoring, it is necessary to
calculate the exact value for the pair frequency at window
size k, by comparing the lists of positions stored in both term
postings. With dynamic pruning strategies (MaxScore and
Wand), this computation can be avoided if the posting is not
considered for scoring. Moreover, both the MaxScore and
Wand pruning strategies require upper bounds on the score
contributions of single terms. Hence, when using proximity,
we need also to provide upper bounds on the score
contributions of pairs as well. In [
        <xref ref-type="bibr" rid="ref23 ref4">4</xref>
        ], the authors proposed using
a dynamic estimation of the inverse document frequency of
pairs to determine the upper bound (the particular
proximity weighting model is not defined, but assumed to be similar
to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we proposed a new approximation for upper
bounds of the Markov Random Fields, requiring only the
knowledge of the maximum term frequency of the postings
in the two term posting lists.
      </p>
      <p>We now describe how proximity weighting is achieved
using the dynamic pruning strategies. In particular, the
MaxScore strategy must always know the minimum docid in the
currently processed posting lists set (which can be obtained
by maintaining a heap), while the Wand strategy must have
access to the posting lists sorted by docid (i.e., in the worst
case, every posting in each posting list must be removed and
inserted in a sorted set). However, when proximity is
considered, many extra pair postings must be considered (i.e.,
|Q| single term postings, plus an additional 2(|Q| − 1) pair
postings) – causing the efficiency of Wand to be hindered.
Moreover, both strategies must make additional checks to
ensure that only ‘valid’ pair postings are considered, which
can cause a performance bottleneck.</p>
      <p>To deal with these limitations, we propose a modification
that can be applied to both MaxScore and Wand pruning
strategies, whereby the processing of single terms is
separated from that of term pairs during each document
scoring. We refer to these two-stage strategies as MaxScoreP
and WandP. In particular, if a pair posting is updated after
each term posting update, we will generate two potentially
invalid pair postings. With the proposed modification, we
update the pair postings only after all single terms have
been moved to their respective next posting. This implies
that we can generate only one pair posting instead of two
each time both of the single term posting iterators advance.
Hence, MaxScoreP and WandP process the single term
posting lists according to their respective algorithms,
however the term pairs are subsequently processed in a second
stage using early termination, according to the MaxScore
strategy. The use of early termination of proximity scoring
is motivated by the fact that the pair frequency of a pair
posting is expensive to compute (in comparison to term
frequency, which is directly recorded in the posting) – hence
early termination can reduce the unnecessary pair frequency
and proximity score calculations.</p>
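      <p>The two-stage scoring of a single document can be sketched as follows (a minimal illustration of the MaxScore-style early termination over pairs; pair scores are modelled as callables to reflect that the pair frequency is expensive to compute, and is only computed when it can still affect the top-K):</p>

```python
# Sketch of two-stage document scoring (MaxScoreP/WandP): single terms
# are scored in a first stage; term pairs are scored in a second stage,
# with MaxScore-style early termination over the pair upper bounds.

def two_stage_score(term_scores, pair_scores, pair_bounds, threshold):
    """term_scores: per-term contributions for this document.
    pair_scores: callables returning each pair's proximity contribution.
    pair_bounds: upper bound on each pair's contribution.
    Returns the final score, or None on early termination."""
    score = sum(term_scores)          # stage 1: single terms

    remaining = sum(pair_bounds)      # stage 2: pairs, MaxScore-style
    for bound, pair_score in zip(pair_bounds, pair_scores):
        # terminate: even with all remaining pair upper bounds, the
        # document cannot beat the current threshold
        if threshold >= score + remaining:
            return None
        score += pair_score()         # expensive: computed only if needed
        remaining -= bound
    return score
```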
      <p>In summary, we propose to implement proximity scoring
using only normal index structures at retrieval time, and
in such a way to integrate directly with existing DAAT
dynamic pruning strategies. Moreover, we propose a
modification to these dynamic pruning strategies, where the single
terms are processed first according to the original dynamic
pruning strategy, while the term pairs are processed
according to the MaxScore strategy. This can significantly
reduce the performance impact of additional data structures
required at runtime, and, as will be shown in Section 4,
leads to improved retrieval efficiency. Moreover, both of the
MaxScoreP and WandP modified strategies remain
safe-up-to-rank-K.</p>
    </sec>
    <sec id="sec-8">
      <title>4. EVALUATION</title>
      <p>
        In the following experiments, we want to evaluate the
benefits of the proposed modification of the dynamic pruning
strategies. We are mainly interested in efficiency, because
all strategies are safe-up-to-rank-K – hence have no impact
on effectiveness. We tackle the following research questions:
1. How do MaxScore and Wand compare when
applying the Markov Random Fields proximity weighting
model?
2. Do the proposed modifications benefit the efficiency
when applying proximity?
3. What impact does the length of the query have?
Experiments are performed using a 230GB 50 million
English document subset of the TREC ClueWeb09 (CW09B)
corpus [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Documents are ranked using the Dirichlet LM
weighting model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] (with parameter setting μ = 4000)
and the Markov Random Fields proximity weighting (see
Section 2.2). No query-independent features are used (i.e.,
ω = 0). CW09B is indexed using the Terrier IR
platform [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]2, applying Porter’s English stemmer and removing
standard stopwords. In the posting list, docids are encoded
using Elias Gamma-encoded deltas [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and term frequencies
using Elias Unary [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The positions of occurrences of the
term within the document are also recorded in each posting,
using Elias Gamma-encoded deltas. Each posting list also
includes skip points [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], one every 10,000 postings. The
resulting size of the inverted index is 72GB.
      </p>
      <p>
        For testing retrieval efficiency, we extract a stream of user
queries from a real search engine log. In particular, we
select the first 10,000 queries of the MSN 2006 query log [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
applying Porter’s English stemmer and removing standard
stopwords (empty queries are removed). The experiments
measure the average query response time for each dynamic
pruning strategy, broken down by the number of query terms
(1, 2, 3, 4 and more than 4). The number of documents
retrieved for each query is K = 1, 000. All experiments are
made using a dual quad-core Intel Xeon 2.6GHz, with 8GB
RAM and a 2TB SATA2 disk containing the index.
      </p>
      <p>In the following, we compare five strategies, namely: an
exhaustive “Full” DAAT strategy, which fully scores every
posting for all query terms and pairs; the original
MaxScore and Wand dynamic pruning strategies without any
modification; and the proposed two-stage MaxScoreP and
WandP dynamic pruning strategies which integrate the
modification proposed in Section 3.2. Every strategy uses
dynamic pair posting lists, as discussed in Section 3.1.</p>
      <p>Table 1 details the average query response time for both
the original and two-stage strategies per number of query
terms. From this table, we note that the average response
times are reduced by approximately 22%-30% by applying
the original MaxScore, but for the original Wand the
average response times are worse than for Full DAAT scoring.
This counters the normal efficiency of Wand, and supports
our assertion that Wand is not suitable for proximity
scoring in its normal form. Indeed, when using the pair posting
lists, there is a higher number of posting lists to maintain
in the docid sorted set of posting lists used by Wand, in
addition to the check that each pair posting is valid.</p>
      <sec id="sec-8-2">
        <p>
          In contrast, the two-stage pruning strategies perform
better in comparison to their original versions. MaxScoreP
exhibits improvements of average response times varying
from 31% for two terms, up to 48% for more than four
terms, while WandP benefits vary from 39% to 58%.
Moreover, we note that WandP exhibits better efficiency than
MaxScoreP, in common with Wand versus MaxScore
for non-proximity queries [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. For single term queries, no
proximity is applied, and, as expected, all dynamic pruning
strategies are as efficient as Full DAAT scoring.
        </p>
        <p>Figure 2 summarises the percentage differences of the
dynamic pruning strategies with respect to the Full DAAT
scoring, for varying lengths of query. As already reported,
the two-stage MaxScoreP and WandP strategies
outperform their original equivalents. Moreover, their benefits
increase as the length of the queries (and hence pairs of terms)
increase, up to 40-55% improvements for queries with 3 or
more terms.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>5. CONCLUSIONS</title>
      <p>
        In this work, we examined how to efficiently score
documents using dynamic pruning strategies when using the
Markov Random Field proximity weighting model [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In
particular, we discussed how pair posting lists could be used
to allow proximity scoring without changes to the underlying
index. Moreover, the most efficient way to score documents
using DAAT dynamic pruning strategies was discussed. We
proposed that dynamic pruning should be performed in a
two-stage process (as exemplified by the MaxScoreP and
WandP strategies), whereby only single query terms are
processed and pruned in the first stage. The pair posting
lists are only considered during a second stage, since they
are omitted from consideration in the first stage.
      </p>
      <p>We performed large-scale experiments comparing the
original and proposed two-stage dynamic pruning strategies,
using a corpus of 50 million documents, and 10,000 user queries
from a real query log. Our results demonstrated the
benefit of the two-stage versions of the dynamic pruning
strategies, particularly for queries of 3 or more terms, and are a
promising start for the future efficient examination of other
proximity weighting models.</p>
      <p>
        Dynamic pruning techniques have previously been shown
to be readily applicable in distributed retrieval settings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Similarly, we infer that our strategies can also be applied in
distributed retrieval without requiring adaptation.
      </p>
    </sec>
    <sec id="sec-10">
      <title>APPENDIX</title>
      <p>A. WAND NEXT DOCUMENT SELECTION</p>
      <p>The selection of the next document performed in the Wand
strategy is explained with the help of Figure 3. The
posting lists are maintained in increasing order of docid. Then
a pivot term is computed, i.e. the first term for which the
accumulated sum of upper bounds of preceding terms and
itself exceeds the current threshold (e.g., term t3 with
accumulated score of 7). The corresponding docid identifies the
pivot document, i.e. the smallest docid having a chance to
overcome the current threshold. If the current docids of the
previous terms are equal to the pivot document docid, the
document is fully scored. Otherwise, one of the preceding
terms' posting lists is moved to the pivot document docid, and
the procedure is repeated. In the example, at the third step
a good candidate document is found (23) and it is fully
processed. This implementation can benefit from skipping within every
posting list to a particular document; however, the selection
of the right document requires some additional CPU time.</p>
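      <p>The pivot selection described above can be sketched as follows (illustrative; the cursors are modelled as (current docid, term upper bound) pairs, and the docid of the pivot term is returned):</p>

```python
# Sketch of Wand pivot selection: with the posting-list cursors kept in
# increasing order of current docid, find the first term at which the
# accumulated sum of upper bounds exceeds the threshold; its current
# docid is the pivot document, the smallest docid with a chance to
# overcome the threshold.

def find_pivot(cursors, threshold):
    """cursors: list of (current_docid, term_upper_bound) pairs.
    Returns the pivot docid, or None if no document can beat the
    threshold (i.e. processing can stop)."""
    cursors = sorted(cursors)          # increasing order of docid
    acc = 0.0
    for docid, ub in cursors:
        acc += ub
        if acc > threshold:
            return docid
    return None
```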
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          .
          <article-title>Sampling search-engine results</article-title>
          .
          <source>In Proc. of WWW</source>
          <year>2005</year>
          ,
          <volume>245</volume>
          -
          <fpage>256</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          .
          <article-title>Pruned query evaluation using pre-computed impact scores</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2006</year>
          ,
          <fpage>372</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gionis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Junqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Telloli</surname>
          </string-name>
          .
          <article-title>On the feasibility of multi-site web search engines</article-title>
          .
          <source>In Proc. of CIKM</source>
          <year>2009</year>
          ,
          <fpage>425</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Herscovici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soffer</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zien</surname>
          </string-name>
          .
          <article-title>Efficient query evaluation using a two-level retrieval process</article-title>
          .
          <source>In Proc. of CIKM</source>
          <year>2003</year>
          ,
          <fpage>426</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Büttcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Lushman</surname>
          </string-name>
          .
          <article-title>Term proximity scoring for ad-hoc retrieval on very large text collections</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2006</year>
          ,
          <fpage>621</fpage>
          -
          <lpage>622</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          .
          <article-title>Overview of the TREC 2009 Web track</article-title>
          .
          <source>In Proc. of TREC</source>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dupret</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Viegas</surname>
          </string-name>
          .
          <source>Proc. of the Web Search Click Data Workshop at WSDM</source>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohman</surname>
          </string-name>
          .
          <source>Search Engines: Information Retrieval in Practice</source>. Addison Wesley
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Universal codeword sets and representations of the integers</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ):
          <fpage>194</fpage>
          -
          <lpage>203</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lotem</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Naor</surname>
          </string-name>
          .
          <article-title>Optimal aggregation algorithms for middleware</article-title>
          .
          <source>J. Comput. Syst. Sci</source>
          .
          <volume>66</volume>
          (
          <issue>4</issue>
          ):
          <fpage>614</fpage>
          -
          <lpage>656</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <article-title>A study of smoothing methods for language models applied to information retrieval</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2001</year>
          ,
          <fpage>334</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.</given-names>
            <surname>Long</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <article-title>Three-level caching for efficient query processing in large web search engines</article-title>
          .
          <source>In Proc. of WWW</source>
          <year>2005</year>
          ,
          <fpage>257</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>Global Statistics in Proximity Weighting Models</article-title>
          .
          <source>In Proc. of Web N-gram Workshop at SIGIR</source>
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>Upper Bound Approximations for Dynamic Pruning</article-title>
          . Manuscript submitted for publication,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A Markov random field model for term dependencies</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2005</year>
          ,
          <fpage>472</fpage>
          -
          <lpage>479</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Self-indexing inverted files for fast text retrieval</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>14</volume>
          (
          <issue>4</issue>
          ):
          <fpage>349</fpage>
          -
          <lpage>379</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          .
          <article-title>Terrier: a high performance and scalable information retrieval platform</article-title>
          .
          <source>In Proc. of OSIR Workshop at SIGIR</source>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>Incorporating term dependency in the DFR framework</article-title>
          .
          <source>In Proc. of SIGIR</source>
          <year>2007</year>
          ,
          <fpage>843</fpage>
          -
          <lpage>844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Schenkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Broschart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Efficient text proximity search</article-title>
          .
          <source>In Proc. of SPIRE</source>
          <year>2007</year>
          ,
          <fpage>287</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Turtle</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Flood</surname>
          </string-name>
          .
          <article-title>Query evaluation: strategies and optimizations</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>31</volume>
          (
          <issue>6</issue>
          ):
          <fpage>831</fpage>
          -
          <lpage>850</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          .
          <article-title>Effective top-k computation in retrieving structured documents with term-proximity support</article-title>
          .
          <source>In Proc. of CIKM</source>
          <year>2007</year>
          ,
          <fpage>771</fpage>
          -
          <lpage>780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          .
          <article-title>Can phrase indexing help to process non-phrase queries?</article-title>
          <source>In Proc. of CIKM</source>
          <year>2008</year>
          ,
          <fpage>679</fpage>
          -
          <lpage>688</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>