<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>LSDS-IR Workshop 2009</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Linh Thai Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science Illinois Institute of Technology Chicago</institution>
          ,
          <addr-line>IL 60616</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <abstract>
        <p>Static index pruning methods have been proposed to reduce the size of the inverted index of information retrieval systems. The goal is to increase efficiency (in terms of query response time) while preserving effectiveness (in terms of ranking quality). Current state-of-the-art approaches include the term-centric pruning approach and the document-centric pruning approach. While term-centric pruning considers each inverted list independently and removes less important postings from each inverted list, the document-centric approach considers each document independently and removes less important terms from each document. In other words, the term-centric approach does not consider the relative importance of a posting in comparison with others in the same document, and the document-centric approach does not consider the relative importance of a posting in comparison with others in the same inverted list. The consequence is that less important postings are not pruned in some situations, and important postings are pruned in others. We propose a posting-based pruning approach, which is a generalization of both the term-centric and document-centric approaches. This approach ranks all postings and keeps only a subset of the top-ranked ones. The rank of a posting depends on several factors, such as its rank in its inverted list, its rank in its document, its weighting score, the term weight, and the document weight. The effectiveness of our approach is verified by experiments using TREC queries and TREC datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Static Index Pruning</kwd>
        <kwd>Document-centric Index Pruning</kwd>
        <kwd>Term-centric Index Pruning</kwd>
        <kwd>Posting-centric Index Pruning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Copyright © 2009 for the individual papers by the papers’ authors.
Copying permitted for private and academic purposes. Re-publication of
material from this volume requires permission by the copyright owners.
This volume is published by its editors.</p>
      <p>LSDS-IR Workshop. July 2009. Boston, USA.</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Text information retrieval systems rely on an inverted index
to process queries efficiently. The most important part of an
inverted index is its inverted file, a file that contains a posting list
for each term in the text collection [1]. In general, the posting list of
a term contains its posting entries (or index pointers), each of the
form &lt;docID, freq&gt;, where docID is the ID of a document that
contains the term, and freq is the term’s frequency in that document. For a
multi-keyword query, the posting lists of all query terms are retrieved
from the inverted file, and a score is accumulated for
each document in the union of the posting lists, based on a
specified weighting scheme. A list of documents in descending
order of rank score is presented to the user.</p>
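As a concrete illustration of this evaluation process, the following term-at-a-time sketch accumulates scores over the union of the query terms' posting lists. The toy posting data and the trivial weighting function are ours, not from the paper; real systems read these lists from disk and use a scheme such as BM25.

```python
from collections import defaultdict

# Toy inverted file: term -> posting list of (docID, freq) pairs.
inverted_file = {
    "text":   [(1, 3), (2, 1), (4, 2)],
    "search": [(1, 1), (3, 5)],
}

def rank(query_terms, weight):
    """Term-at-a-time evaluation: accumulate a score for every
    document in the union of the query terms' posting lists."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, freq in inverted_file.get(term, []):
            accumulators[doc_id] += weight(term, freq)
    # Present documents in descending order of accumulated score.
    return sorted(accumulators.items(), key=lambda x: -x[1])

# A deliberately trivial weighting scheme: score = raw frequency.
results = rank(["text", "search"], lambda t, f: float(f))
```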
      <p>For a large text corpus, the inverted file is too large to fit into
the memory of the search server. Thus, query processing involves
substantial disk access, which increases query response time. For a text
information retrieval system that has to process thousands of
queries per second, it is critical to improve query processing
performance.</p>
      <p>
        Besides the parallel query processing approach, which uses a cluster
of servers to process queries, the index compression approach is
widely used. The lossless compression approach uses data
compression techniques to compress index data, thereby reducing
the volume of data transferred from disk. The compressed index
data is then decompressed in memory, and queries are processed
based on the original index information. A common data
compression technique used in information retrieval systems is
variable-length coding [2]. In contrast, the lossy compression
approach opts to keep only important information in the
index, discarding other, less important information [4][
        <xref ref-type="bibr" rid="ref1">5</xref>
        ][
        <xref ref-type="bibr" rid="ref2">6</xref>
        ][
        <xref ref-type="bibr" rid="ref4">8</xref>
        ]
[
        <xref ref-type="bibr" rid="ref7">11</xref>
        ]. Thus the ranking quality of queries processed based on a lossy
compressed index (i.e., a pruned index) might be affected.
In practice, a lossless compression technique can be applied to a
lossy pruned index to further reduce index size. In addition, both
types of compressed/pruned index can be used by an information
retrieval system: a lossy pruned index is used to answer a large
portion of user queries, and a lossless compressed index is used
only if result quality is significantly hurt [4].
      </p>
      <p>
        In this work, we concentrate on lossy index compression. Current
state-of-the-art approaches include term-centric pruning [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ] and
document-centric pruning [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ]. While term-centric pruning method
considers each inverted list independently and removes less
important postings from each inverted list, document-centric
pruning considers each document independently and removes less
important terms from each document. In other words, the
term-centric method does not consider the relative importance of a
posting in comparison with others in the same document, and the
document-centric method does not consider the relative
importance of a posting in comparison with others in the same
inverted list. The consequence is that less important postings are not
pruned in some situations, and important postings are pruned in
other situations.
      </p>
      <p>We propose a posting-based pruning approach, which is a
generalization of both the term-centric and document-centric
generalization of both the term-centric and document-centric
approaches. Our approach ranks all postings and keeps only a
subset of the top-ranked ones, removing the rest. We consider
several factors when ranking a posting, such as its rank in its
posting list, its rank in its document, its weighting score, the
normalized weight of the term, and the normalized weight of the
document. Our experiments based on TREC queries and TREC
datasets [22] show that the posting-based pruning method
outperforms both the term-centric and document-centric methods.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RELATED WORK</title>
      <p>Lossless index compression techniques are well studied; see, for
example, Witten et al. [2][3]. Those techniques are mainly
based on the fact that the frequencies of terms in documents,
which are stored in the inverted index, follow a skewed
distribution. In that case, variable-length coding can be
used to encode index information, consuming only a few bits for
most term frequency values. In general, this helps to reduce
index size by about one-tenth. However, for large scale
information retrieval systems, the compressed index is still too big
to fit into memory. In addition, using a compressed index reduces
time to access index data from disk, but does not reduce time to
process the posting lists. Thus, using lossless index compression
alone cannot significantly improve efficiency.</p>
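As an illustration of the variable-length coding mentioned above, here is a minimal variable-byte (VByte) codec. The exact byte layout (the high bit marking the final byte of each integer) is one common convention, not necessarily the one used in [2]; under a skewed distribution, most gap and frequency values fit in a single byte.

```python
def vbyte_encode(n):
    """Encode a non-negative integer in 7-bit chunks; the high bit
    marks the final byte of the number."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # lower 7 bits, continuation byte
        n >>= 7
    out.append(n | 0x80)       # final byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of VByte-encoded integers."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:           # final byte of this number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers

# Small values (the common case) compress to one byte each.
encoded = b"".join(vbyte_encode(g) for g in [1, 5, 127, 300])
```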
      <p>Lossy index compression techniques opt for discarding postings
that are not informative. By removing a large number of postings
from the inverted index, lossy index compression techniques not
only significantly reduce index size, but also significantly reduce
the length of posting lists. Therefore, lossy index compression
techniques can reduce both time to access index data from disk
and time to process posting lists. However, as some index
information is lost, lossy index compression techniques may lead
to a drop in query ranking quality.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ], Carmel et al. introduced a term-centric approach to static
index pruning. Entries in each posting list are sorted in
descending order of a weighting score. Only entries whose
weighting scores are greater than a threshold value are kept in the
pruned index. The threshold value can be the same for all terms
(uniform pruning), or it can be different for each term (term-based
pruning).
      </p>
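A hedged sketch of this term-centric scheme. The posting scores and the threshold rule are simplified (Carmel et al. compute the per-term cutoff from each list's score distribution), and the always-keep-the-top-entry rule is our simplification reflecting the vocabulary-preservation property discussed in Section 3.1.

```python
def term_centric_prune(posting_lists, threshold):
    """Keep, in each posting list, only the entries whose weighting
    score exceeds that list's cutoff; always retain at least the
    top-scoring entry so the vocabulary is preserved.

    posting_lists: term -> list of (doc_id, score) pairs.
    threshold:     term -> cutoff (constant for uniform pruning,
                   per-term for term-based pruning).
    """
    pruned = {}
    for term, entries in posting_lists.items():
        ranked = sorted(entries, key=lambda e: -e[1])
        kept = [(d, s) for d, s in ranked if s > threshold(term)]
        pruned[term] = kept or ranked[:1]
    return pruned

lists = {"apple": [(2, 0.2), (1, 0.9), (3, 0.6)],
         "the":   [(1, 0.1), (2, 0.1)]}
pruned = term_centric_prune(lists, threshold=lambda t: 0.5)
```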
      <p>
        Buttcher and Clarke introduce a document-centric pruning
technique [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ]. Instead of posting list pruning, they propose
document pruning. For each document, they keep only a small
number of representative, highly-ranked terms in the pruned
index. Terms in each document are ranked based on their
contribution to the Kullback-Leibler divergence [
        <xref ref-type="bibr" rid="ref12">16</xref>
        ] between the
document and the text collection. The intuition behind this is that
those document-representative terms are powerful enough to
distinguish the document from others. They also show
experimentally that if the document is ranked high for a given
query, it is very likely that query terms are among its
representative terms. Thus indexing sets of representative terms is
a good method to preserve ranking quality while reducing index
size.
      </p>
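A hedged sketch of this document-centric selection: for each document, rank terms by their contribution to the Kullback-Leibler divergence from the collection and keep the top k (the "constant" variant). The function names and toy counts are ours; the scoring follows the KLD-contribution form used by Buttcher and Clarke.

```python
import math

def kld_contribution(tf, doc_len, coll_tf, coll_len):
    """One term's contribution to KLD(d || C):
    (tf/|d|) * log((tf/|d|) / (TF/|C|))."""
    p_doc = tf / doc_len
    p_coll = coll_tf / coll_len
    return p_doc * math.log(p_doc / p_coll)

def document_centric_prune(docs, coll_freq, coll_len, k):
    """"Constant" document-centric pruning: for every document
    (a dict term -> tf), keep its k highest-scoring terms."""
    kept = {}
    for doc_id, terms in docs.items():
        doc_len = sum(terms.values())
        ranked = sorted(terms, key=lambda t: -kld_contribution(
            terms[t], doc_len, coll_freq[t], coll_len))
        kept[doc_id] = ranked[:k]
    return kept

# "quantum" is rare in the collection but frequent in the document,
# so it discriminates the document well and is kept.
coll_freq = {"the": 100, "quantum": 2, "cat": 10}
kept = document_centric_prune({1: {"the": 5, "quantum": 3, "cat": 2}},
                              coll_freq, coll_len=200, k=2)
```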
      <p>
        Other index pruning techniques (some designed for distributed,
peer-to-peer information retrieval systems) belong to either the
term-centric approach or the document-centric approach. Blanco et al. [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ],
and Shokouhi et al. [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ], try to find terms whose posting lists can
be completely removed. De Moura et al. [
        <xref ref-type="bibr" rid="ref7">11</xref>
        ] propose to index a
set of representative sentences for each document. Lu and Callan
[
        <xref ref-type="bibr" rid="ref3">7</xref>
        ] propose a number of methods to identify a representative term
set for each document. Podna et al. [
        <xref ref-type="bibr" rid="ref8">12</xref>
        ] and Skobeltsyn et al. [
        <xref ref-type="bibr" rid="ref9">13</xref>
        ]
propose to index term combinations to reduce the negative effect
of posting list pruning to ranking quality. Blanco and Barreiro [
        <xref ref-type="bibr" rid="ref5">9</xref>
        ]
improve the precision of term-centric pruning by considering a
number of designs overlooked by the original work.
      </p>
      <p>
        Looking at another aspect of static index pruning, Skobeltsyn et al.
[
        <xref ref-type="bibr" rid="ref10">14</xref>
        ] point out that the use of results caching fundamentally affects
the performance of a pruned index, due to the change in query
pattern introduced by results caching. They then propose to
combine results caching and index pruning to reduce the query
workload of back-end servers.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. TERM-CENTRIC PRUNING VERSUS DOCUMENT-CENTRIC PRUNING</title>
    </sec>
    <sec id="sec-6">
      <title>3.1 Term-Centric Index Pruning</title>
      <p>
        Term-centric pruning fits very well with the inverted index
structure of information retrieval systems. As queries are
processed based on inverted lists, it is natural to truncate inverted
lists in order to reduce index size. Based on the inverted index
structure, the “idealized, term-based” pruning technique proposed
by Carmel et al. is well-formed and mathematically provable. This
clearly shows that a pruned index, even though it does not contain all
information, can still guarantee the ranking quality to some extent
[
        <xref ref-type="bibr" rid="ref1">5</xref>
        ].
      </p>
      <p>
        There are several properties that are specific to term-centric
pruning. It preserves the collection vocabulary. For every term,
there are always some entries in its inverted list in the pruned
index. (The works of Blanco et al. [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ] and Shokouhi et al. [
        <xref ref-type="bibr" rid="ref6">10</xref>
        ] are
exceptions, as their work reduces the vocabulary size.) In contrast,
term-centric pruning does not necessarily preserve the set of
documents. As posting entries are removed, it is possible that
some documents will be totally removed from the pruned index.
The fact that term-centric index pruning preserves the set of terms
demonstrates its support for the possibility of all terms appearing
in user queries. Due to this support, in order to guarantee the
quality of the top-K results for any query, term-centric pruning must
not prune any of the top-K entries in any posting list. Obviously,
pruning any of these would make the pruned index unable to guarantee
the top-K results of the query containing only that single term. In
addition, term-centric pruning assumes (implicitly) that every term
in the vocabulary is equally important. In contrast, for documents,
term-centric pruning assumes that some are more important than
others. This is inferred from the fact that term-centric pruning
might totally remove some documents from the pruned index.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Document-Centric Data Pruning</title>
      <p>Document-centric pruning does not make any assumption about
the index structure or how queries are processed. Strictly speaking,
document-centric pruning should be considered a “data”
pruning technique rather than an index pruning technique, as what
it actually does is prune the documents, not an index structure.
In contrast to term-centric pruning, document-centric pruning
preserves the set of documents, not the set of terms. While any
document in the collection is represented by a subset of its terms
(i.e., its set of representative terms), there is no guarantee that
every term will be indexed. It is likely that there are terms that are
always ranked low in any document and are removed by
document-centric pruning.</p>
      <p>The first assumption implied by document-centric pruning is that
every document can be ranked first by some queries (one such
query might be the query that contains all its representative
terms). Due to this assumption, document-centric pruning opts for
including every document in the pruned index. The second
implied assumption is that terms are not equally important, and
some terms can be totally removed from the pruned index.</p>
    </sec>
    <sec id="sec-8">
      <title>4. POSTING-BASED INDEX PRUNING</title>
      <p>As pointed out above, term-centric pruning prunes index
elements, which are posting lists; while document-centric pruning
prunes data elements, which are documents. Both approaches
assume that all elements are equally important, and thus the
pruned index should keep some information about every element,
whether they are posting lists or documents.</p>
      <p>We first find the decision to keep some amount of
information for each posting list or document to be reasonable.
Without any information about the user queries, we must assume
any term can be used by users. Thus no term can be totally
removed. Similarly, without any information about what users
will search for, we also have to assume any document can be an
answer (to some queries). Therefore, no document can be totally
removed.</p>
      <p>However, given the requirement of significantly reducing index
size, it is not affordable to keep information for all posting lists
and all documents. We believe that the pruned index should
contain neither all terms, nor all documents, but only the most
important postings, given the desired pruning level.</p>
      <p>
        We suspect that non-informative terms are common in any large
text collection. Non-informative terms are those terms that do not
help to discriminate documents. One example of non-informative
terms is a term that appears in every document, such as the term
“abstract” in a collection of scientific papers. Those terms are
expected to have similar weighting scores for every document.
Therefore, eliminating those terms will not hurt ranking quality.
Unfortunately, term-centric pruning tends not to prune any entries
from the posting lists of those terms. The reason is that, as entry
scores are nearly equal, all of them are likely to be greater than the
threshold value computed by the term-based method proposed in
[
        <xref ref-type="bibr" rid="ref1">5</xref>
        ].
      </p>
      <p>
        We also suspect that there are many “rarely asked for” documents
in any large text collection. A document is called “rarely asked
for” if it does not appear in the top-ranked results of any real
world query. In practice, users normally look at only the top 20
results, so any document that does not appear in the top-20 results
of a large number of queries can be removed. Puppin et al. [
        <xref ref-type="bibr" rid="ref13">17</xref>
        ]
observed that, for a collection of 5,939,061 documents and a set
of 190,000 unique queries, around 52% of the documents were
not returned among the first 100 top-ranked results of any query.
We propose a posting-based index pruning method. We choose
neither posting lists nor documents as our working elements.
Instead, we choose postings, i.e., tuples of the form &lt;term ID,
document ID, term frequency&gt;. Postings are contained in both
index elements (as posting entries in posting lists) and data
elements (as terms in documents). Also, by choosing posting entries
as our working elements, we open the possibility of removing any
document and any term’s posting list from the pruned index. With
this flexibility, our method is able to remove non-informative
terms as well as “rarely asked for” documents.
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.1 Term Weighting</title>
      <p>
        When building a pruned index, terms should not be treated
equally. Non-informative terms appear in a large number of
documents, resulting in long posting lists. However,
non-informative terms do not help much in ranking documents. Thus,
the pruned index should significantly prune the posting lists of
non-informative terms and reserve space for other, informative
terms. Our posting-based pruning method assigns a weight to each
term as its informativeness value. Blanco and Barreiro [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ] have
studied a number of term-weighting schemes for the purpose of
posting list pruning. Their finding is that residual inverse
document frequency (RIDF) is a good measure of the
informativeness of terms, compared with other schemes such as the classic
inverse document frequency and the term discriminative value.
We adopt RIDF to calculate term informativeness values. As
specified in [
        <xref ref-type="bibr" rid="ref4">8</xref>
        ], the RIDF value of a term is
RIDF = −log(df / N) + log(1 − e^(−tf / N))
(1)
where tf is the term’s collection frequency, df is the term’s document
frequency, and N is the number of documents in the collection. As pointed out by Blanco, RIDF
values can be computed efficiently.
      </p>
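Equation (1) is straightforward to compute. In the sketch below, tf is taken as the term's total collection frequency and the toy numbers are illustrative; RIDF measures how far a term's observed document frequency departs from a Poisson model, so evenly spread terms score near zero and "bursty" terms score high.

```python
import math

def ridf(df, tf, N):
    """Residual IDF, eq. (1): observed IDF minus the IDF expected
    under a Poisson model with rate tf/N."""
    return -math.log(df / N) + math.log(1.0 - math.exp(-tf / N))

# A term with 10 occurrences spread over 10 of 1000 documents:
# the Poisson model predicts it well, so RIDF is near zero.
spread = ridf(df=10, tf=10, N=1000)
# A "bursty" term: 10 occurrences packed into only 2 documents.
bursty = ridf(df=2, tf=10, N=1000)
```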
    </sec>
    <sec id="sec-10">
      <title>4.2 Document Weighting</title>
      <p>
        Documents in a large text collection are not equally important and
therefore, in the pruned index, more terms should be kept for
important documents. As for term weighting, we also assign a
weight to each document in the collection to reflect how
important it is. For a Web collection, the PageRank [
        <xref ref-type="bibr" rid="ref14">18</xref>
        ] or HITS
[
        <xref ref-type="bibr" rid="ref15">19</xref>
        ] algorithm can be used to compute document importance
values. However, PageRank and HITS are not applicable to
non-Web documents, as there is no link structure among the documents.
We adopt the approach of Buttcher and Clarke [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ] for this
purpose. For each document, we assign its Kullback-Leibler
divergence from the collection as its importance value. Thus,
“outstanding” documents (i.e., documents which are very different
from the collection) will be assigned high importance values, while
documents which are similar to others will be assigned low
importance values. Our importance value for a document d is defined
as
      </p>
      <p>KLD(d || C) = Σ_{t∈d} (tf(t) / |d|) · log((tf(t) / |d|) × (|C| / TF(t)))
(2)
where tf(t) is the term frequency of term t in d, TF(t) is the collection
term frequency of term t, |d| is the length of document d (i.e., the
number of terms in d), and |C| is the length of the collection (i.e.,
the sum of document lengths).</p>
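Equation (2) translates directly into code; the toy collection below is illustrative. A document whose term distribution matches the collection gets weight zero, and weight grows as the document deviates from the collection.

```python
import math

def kld_doc_weight(doc_tf, coll_tf, coll_len):
    """KLD(d || C), eq. (2): sum over the document's terms of
    (tf(t)/|d|) * log((tf(t)/|d|) * (|C|/TF(t)))."""
    d_len = sum(doc_tf.values())
    return sum((tf / d_len) * math.log((tf / d_len) * (coll_len / coll_tf[t]))
               for t, tf in doc_tf.items())

coll = {"a": 50, "b": 50}
# Same proportions as the collection -> importance value 0.
typical = kld_doc_weight({"a": 5, "b": 5}, coll, coll_len=100)
# A skewed, "outstanding" document -> higher importance value.
outstanding = kld_doc_weight({"a": 9, "b": 1}, coll, coll_len=100)
```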
    </sec>
    <sec id="sec-11">
      <title>4.3 Posting Ranking Function</title>
      <p>Our static pruning method evaluates postings and assigns each a
usefulness value. We then build the pruned index based on these
values. Given a desired level of pruning, posting entries are
selected based on their usefulness values and added into the
pruned index until the pruned index size reaches its limit.
According to term-centric pruning, we should assign a high
usefulness value to a posting entry which appears at the top of its
posting list. According to document-centric pruning, we should
not assign a high usefulness value to a posting whose term does
not belong to the set of top-ranked terms of the document. In
addition, as discussed above, we should assign low values to
non-informative terms and “rarely asked for” documents, and high
values otherwise. Also, we obviously want to assign a high usefulness value to
posting entries with high scores.</p>
      <p>To rank postings, for each posting &lt;t, d&gt;, we compute the
following quantities:</p>
      <p>S(t, d) = the score term t contributes to the rank score of
document d.</p>
      <p>RIDF(t) = the informativeness value of term t.</p>
      <p>KLD(d || C) = the importance value of document d.</p>
      <p>Rank(t) = the rank of term t relative to other terms in
document d.</p>
      <p>Rank(d) = the rank of document d relative to other
documents in the posting list of term t.</p>
      <p>
        Among the quantities above, S(t, d) is computed using a
weighting scheme, such as the classic TFIDF weighting scheme,
or the state-of-the-art BM25 weighting scheme; RIDF(t) is
calculated as specified in (1); KLD(d || C) is calculated according
to (2); Rank(d) is the position of document d in the posting list of
term t, where posting entries are sorted in descending order of their
scores S(t, di); and Rank(t) is the position of term t in document d,
where terms are sorted in descending order of their “feedback” score
[
        <xref ref-type="bibr" rid="ref2">6</xref>
        ], defined below:
      </p>
      <p>
        Score_DCP(t) = (tf(t) / |d|) · log((tf(t) / |d|) × (|C| / TF(t)))
(3)
In this work, we use the BM25 weighting scheme [
        <xref ref-type="bibr" rid="ref16">20</xref>
        ] (given
below) to calculate S(t, d) due to its wide use in other research
works.
      </p>
      <p>S BM 25 (t, d ) = log N - df + 0.5  ×
 df + 0.5 </p>
      <p>tf (k1 + 1)
tf + k1  (1 - b) + b a|vdgd| l 
where avgdl is the average length of documents, k1 is set to its
“standard” value of 1.2 and b is set to its “standard” value of
0.75.</p>
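A direct transcription of eq. (4), with the "standard" parameter values from the text; the toy statistics in the usage lines are illustrative. All else equal, a rarer term (smaller df) contributes a higher score than a common one.

```python
import math

def bm25(tf, df, doc_len, avgdl, N, k1=1.2, b=0.75):
    """BM25 term-document score S(t, d), eq. (4)."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return idf * (tf * (k1 + 1)) / (tf + k1 * ((1 - b) + b * doc_len / avgdl))

# Same tf and document length; only document frequency differs.
rare = bm25(tf=3, df=10, doc_len=100, avgdl=100, N=1000)
common = bm25(tf=3, df=500, doc_len=100, avgdl=100, N=1000)
```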
      <p>In combination, our posting entry ranking function takes as
parameters all the above quantities and returns a usefulness value.
Note that we apply normalization and transformation to the parameter
values. First, RIDF(t) values are normalized so that they sum
to one. Similar normalization is applied to the KLD(d || C) values.
The normalization step is necessary, as the ranges of RIDF(t) and
KLD(d || C) are different. Second, we use a sigmoid function to
transform the term rank values and document rank values. This
transformation is also necessary: using the raw term rank values
would make it appear that the 10th-ranked term is ten times less important
than the top-ranked term, which does not seem right. Therefore, we use
a non-linear function specified below (5) as a transform function.
The parameter x0 is used to shift the “transition” point, where the
sigmoid function switches from high value state to low value
state, and the parameter a is used to control the slope of the
transition period of the function.</p>
      <p>sigmoid(x) = 1 − 1 / (1 + e^(−(x − x0) / a))
(5)
The sigmoid function above returns a value between zero and one.
A rank value close to zero (i.e., a top-ranked element) will be
transformed to a value close to one, while other rank values will
be transformed depending on the two parameters x0 and a. In Figure 1,
we show the shape of the sigmoid function for several
combinations of parameters.</p>
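The transform of eq. (5) in code (the parameter values below are illustrative): ranks well above the transition point x0 map near 1, ranks well below it map near 0, and a controls how sharp the transition is.

```python
import math

def sigmoid(x, x0, a):
    """Rank transform, eq. (5): close to 1 for ranks better than the
    transition point x0, close to 0 beyond it; `a` sets the slope."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(x - x0) / a))

top = sigmoid(1, x0=50, a=5)      # top-ranked element -> near 1
deep = sigmoid(100, x0=50, a=5)   # far down the ranking -> near 0
```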
      <p>[Figure 1: sigmoid(x) plotted against Rank for several combinations of x0 and a.]</p>
    </sec>
    <sec id="sec-12">
      <title>4.4 Posting-Based Pruning versus Document-Centric Pruning and Term-Centric Pruning</title>
      <p>We show that our posting entry ranking function generalizes
document-centric index pruning method and term-centric pruning
method.</p>
      <p>
        If we set the value of α to zero, replace the document weighting
function KLD(d||C) with the unity function u(d) = 1, and set the
parameter a of the sigmoid function to 1 so that the sigmoid
function in (5) becomes a threshold function at x0, then for each
document d, our ranking function f() assigns a non-zero value to
the posting entry &lt;t, d&gt; if t is ranked in the top-x0 among all
unique terms in d, otherwise f() returns a zero value. In this case,
our posting-based pruning technique is equivalent to the
“constant” document-centric pruning technique proposed by
Buttcher and Clarke in [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ], which selects a fixed number of top-ranked
terms from each document. Obviously, the “relative”
document-centric pruning technique proposed in [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ] can be easily
obtained from our posting entry ranking function by adjusting x0
for each document according to its length.
      </p>
      <p>
        Similarly, if we set the value of α to one, replace the term
weighting function RIDF(t) with the unity function u(t) = 1, set
the parameter a of the sigmoid function to 1, and set the parameter
x0 to the rank of the i-th posting entry in the posting list of term t
such that S(t, di) &gt; τ (t) and S(t, di+1) ≤ τ (t), where τ (t) is the
cutoff threshold proposed by Carmel et al. in [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ], then our posting
entry ranking function assigns a non-zero value to any posting
entry selected by term-centric pruning technique, and a zero value
to any non-selected posting entry.
      </p>
      <p>Our posting-based pruning technique is different from the
term-centric and document-centric pruning techniques in several
aspects. By using a term weighting function and a document
weighting function other than the unity function, we allow the
pruned index to include more information for informative terms
and outstanding documents. By using a sigmoid function instead
of a threshold function to transform the term and document ranks,
we open the possibility that a posting entry which is ranked low in
a posting list could be selected, if it is ranked high in the
document, and vice versa.</p>
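The generalization argument above can be made concrete in code. The paper's combined ranking function (6) is not printed in this excerpt, so the combination below is only a plausible reading of the special cases in this section: an α-weighted mix of a term-side component (normalized term weight times the sigmoid-transformed rank in the posting list) and a document-side component (normalized document weight times the sigmoid-transformed rank in the document). All names and the exact mixing form are our assumptions.

```python
def usefulness(score, n_ridf, n_kld, rank_in_list, rank_in_doc,
               alpha, sigmoid):
    """Hypothetical posting usefulness combining the quantities of
    Section 4.3. alpha = 1 with n_ridf = 1 and a step-like sigmoid
    recovers term-centric pruning; alpha = 0 with n_kld = 1 recovers
    "constant" document-centric pruning."""
    term_side = n_ridf * sigmoid(rank_in_list)  # standing in its posting list
    doc_side = n_kld * sigmoid(rank_in_doc)     # standing in its document
    return score * (alpha * term_side + (1 - alpha) * doc_side)

# With alpha = 0 and a hard threshold at rank 2, only postings whose
# term is in the document's top-2 survive, as in document-centric pruning.
step = lambda r: 1.0 if r <= 2 else 0.0
kept = usefulness(0.8, 1.0, 1.0, rank_in_list=40, rank_in_doc=1,
                  alpha=0.0, sigmoid=step)
dropped = usefulness(0.8, 1.0, 1.0, rank_in_list=3, rank_in_doc=9,
                     alpha=0.0, sigmoid=step)
```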
    </sec>
    <sec id="sec-14">
      <title>5. EXPERIMENTAL SETTINGS</title>
      <p>
        We use the WT10G collection as our data set. This collection
contains 1,692,096 Web documents crawled from the Web. We
use the Terrier platform [
        <xref ref-type="bibr" rid="ref17">21</xref>
        ] to index and rank queries, and we
develop our posting-based index pruning technique based on
Terrier.
      </p>
      <p>We use the BM25 weighting scheme both for calculating
term-document scores and for query ranking. The Porter stemming algorithm
[23] and a standard list of stop-words are used for preprocessing
documents. After stemming and stop-word removal, the
vocabulary contains 3,161,488 unique terms. The un-pruned
index contains 280,632,807 posting entries.</p>
      <p>
        We use TREC [22] topics 451–500 as our query set. Precision is
measured for each pruned index using the set of relevance
judgments provided by TREC for topics 451–500. From TREC
topics 451–500, we build two sets of queries: one set of long
queries, wherein each query includes the title and the description
fields from the TREC topic, and one set of short queries, wherein
each query includes only the title field from the TREC topic.
We also implement term-centric index pruning and
document-centric index pruning exactly as specified in their original works
[
        <xref ref-type="bibr" rid="ref1">5</xref>
        ][
        <xref ref-type="bibr" rid="ref2">6</xref>
        ]. The only difference is that in [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ], Carmel et al. used the
SMART term weighting scheme for both index pruning and query
ranking. We instead use the BM25 term weighting scheme for query
ranking, but still use the SMART term weighting scheme for index
pruning.
      </p>
      <p>
        We conduct experiments to compare the effectiveness of our
proposed posting-based pruning technique with the term-based,
“score shifting” pruning technique proposed in [
        <xref ref-type="bibr" rid="ref1">5</xref>
        ], and the
“relative” document-centric pruning technique proposed in [
        <xref ref-type="bibr" rid="ref2">6</xref>
        ]. In
the next section, we report precision at 10, precision at 20, and
average precision at each pruning level for each technique.
      </p>
    </sec>
    <sec id="sec-15">
      <title>6. EXPERIMENTAL RESULTS</title>
      <p>In Figure 2, we show the effectiveness of pruned indices for the
set of short TREC queries; and in Figure 3, we show the
effectiveness of pruned indices for the set of long TREC queries.
A posting-centric pruned index is labeled “pXX_YY_ZZ”, where
XX is the value of parameter α (as a percentage), YY is the value of
parameter a, and ZZ is the value of parameter x0. Figure 2 and
Figure 3 show experimental results for a posting-centric pruned
index with α = 50%, a = 15, and x0 = 50. At pruning levels
below 70%, posting-centric pruning performs similarly to
document-centric pruning and term-centric pruning.
However, at pruning levels of 70% or more, posting-centric
pruning is outperformed by document-centric pruning.
We tune the parameters of the posting-centric index pruning
technique. Table 1, Table 2, and Table 3 show experimental results
for short queries of the document-centric pruning technique and
posting-centric pruning techniques with various combinations of
parameter values. Experiments with long queries show a similar
trend and are therefore omitted.</p>
      <p>In Table 1, Table 2, and Table 3, the values in bold are the highest
performance for a specific pruning level. The first observation is
that posting-centric pruning is better at low and moderate pruning
levels, while document-centric pruning is better at higher pruning
levels. The second observation is that, even though posting-centric
pruning is better than document-centric pruning at low and
moderate pruning levels, the differences are small. In contrast, at
higher pruning levels, the differences between the performance of
posting-centric pruning and document-centric pruning are larger.
In addition, none of the posting-centric pruning variants
outperforms the document-centric pruning technique at high pruning
levels.</p>
      <p>In all previous experiments with the posting-centric pruning technique,
the parameters of the posting entry ranking function (6) are fixed for
all posting entries. We consider the possibility of adaptively
setting the parameters for each posting entry by adapting the slope of
the sigmoid function as follows. For a posting entry &lt;t, d&gt;, we
use two sigmoid functions: one for the rank of term t in
document d, the other for the rank of document d in the posting
list of term t. We call the former the document-oriented sigmoid
function, and the latter the term-oriented sigmoid function. For the
term-oriented sigmoid function, the x0 parameter is set according
to the pruning level (e.g., if the pruning level is 90%, x0 is equal to
10% of the length of the term's posting list). Similarly, for the
document-oriented sigmoid function, if the pruning level is 90%,
its x0 parameter is set to 10% of the number of unique terms in the
document. For both sigmoid functions, the parameter a, which
controls the slope of the function, is set so that the function returns
a value close to 1 for input values less than 0.9 × x0 and
a value close to 0 for input values greater than 1.1
× x0. We also explore the performance of several alternatives,
which may or may not use term weighting and/or
document weighting (refer to the ranking function (6) in Section 4.3).
In Table 4, we report the performance of the different alternatives at
approximately the 90% pruning level. Document-centric pruning
performance is included for reference. For each column, the best
value is in bold. From the results reported in Table 4, we can see
that:</p>
      <p>(i) Term-document scores should be used in the posting
entry ranking function (6), as revealed by the
significant differences between the performance of AP1
and AP2.</p>
      <p>(ii) Term weighting is useful, as confirmed by the
differences between the performance of AP1 and AP3.</p>
      <p>(iii) Document weighting appears to be harmful: it hurts the
performance of AP4 compared to the performance of AP3
and AP1.</p>
      <p>Among all alternatives, AP3 performs best; it is only slightly
outperformed by document-centric pruning for long queries
according to MAP. Below, we report its performance in comparison with
document-centric pruning and term-centric pruning at various
pruning levels in Figure 4 and Figure 5.</p>
      <p>By adapting the sigmoid functions to posting entries, the
performance of our posting-centric pruning technique improves
substantially and matches that of the document-centric pruning
technique (posting-centric pruning is better than document-centric
pruning at some pruning levels, while the reverse is true at other
pruning levels).</p>
    </sec>
    <sec id="sec-16">
      <title>7. CONCLUSIONS AND FUTURE WORK</title>
      <p>We evaluate document-centric and term-centric static index
pruning on the WT10G corpus and TREC query sets. Based
on our experimental results, term-centric index pruning is better
than document-centric index pruning at low and moderate pruning
levels (i.e., less than 70% pruning, according to our results), while
document-centric index pruning is better at higher pruning levels.
We propose a posting-centric index pruning technique, which ranks
each posting entry (i.e., a term-document pair) based on a set of
features such as the rank of the term in the document and the rank
of the document in the inverted list of the term. We show that
posting-centric index pruning generalizes both document-centric
and term-centric pruning; therefore, the solution space of
posting-centric pruning covers the solution spaces of both
document-centric and term-centric index pruning. This implies that, by
exploring this larger solution space, better solutions can be found.
We explore the solution space of posting-centric pruning by
studying a family of posting entry ranking functions. We discover
that term weighting based on RIDF is useful, while document
weighting based on KL-divergence is harmful. We also observe that
the parameters of the sigmoid function, which we use to transform the
rank of a term/document into its score, should be adapted to each
posting entry. Fixing these parameters makes posting-centric
pruning less effective than document-centric pruning.
Other term weighting and document weighting methods are
possible. We are evaluating a method of weighting terms and
documents based on user queries and the PageRank algorithm
applied to the graph of terms and documents. Our goal is to
discover important terms and documents by analyzing the
relationships among terms and documents given the context of user
queries. Once the important terms and documents
are discovered, their information is kept in the pruned index,
while information about other, less important terms and
documents can be partially removed or discarded entirely.</p>
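      <p>The two weighting ideas above can be sketched as follows. This is a
minimal illustration using the standard formulations (RIDF as observed IDF
minus Poisson-expected IDF, and a KL-divergence document weight over simple
maximum-likelihood language models); the exact formulations used in our
experiments follow the ranking function of Section 4.3.</p>

```python
import math

def ridf(df, cf, n_docs):
    """Residual IDF: observed IDF minus the IDF expected if the
    term's occurrences followed a Poisson model. High RIDF means
    the term is more concentrated than chance would predict."""
    idf = -math.log2(df / n_docs)
    lam = cf / n_docs  # Poisson rate: occurrences per document
    expected_idf = -math.log2(1.0 - math.exp(-lam))
    return idf - expected_idf

def kl_doc_weight(doc_tf, collection_tf, collection_len):
    """KL divergence of the document's unigram language model from
    the collection model: a possible document weight. Documents
    whose term distribution matches the collection score near 0."""
    doc_len = sum(doc_tf.values())
    kl = 0.0
    for t, tf in doc_tf.items():
        p_d = tf / doc_len
        p_c = collection_tf[t] / collection_len
        kl += p_d * math.log2(p_d / p_c)
    return kl
```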
    </sec>
    <sec id="sec-17">
      <title>8. REFERENCES</title>
      <p>[22] TREC (WT10G, TREC-9)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          et al.,
          <source>“Static Index Pruning for Information Retrieval Systems,” in Proc. ACM SIGIR</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Buttcher</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          , “
          <article-title>A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems,”</article-title>
          <source>in Proc. ACM CIKM</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          , “
          <article-title>Pruning Long Documents for Distributed Information Retrieval,”</article-title>
          <source>in Proc. ACM CIKM</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Barreiro</surname>
          </string-name>
          , “Static Pruning of Terms in Inverted Files,”
          <source>in Proc. ECIR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Blanco</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Barreiro</surname>
          </string-name>
          , “Boosting Static Pruning of Inverted Files,”
          <source>in Proc. ACM SIGIR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shokouhi</surname>
          </string-name>
          et al.,
          <article-title>“Using Query Logs to Establish Vocabularies in Distributed Information Retrieval,”</article-title>
          <source>In Int'l Journal on Information Processing and Management</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>de Moura</surname>
          </string-name>
          et al., “
          <article-title>Improving Web Search Efficiency via a Locality Based Static Pruning Method,”</article-title>
          <source>in Proc. ACM WWW</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Podnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klemm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          , “
          <article-title>Scalable Peer-to-peer Web Retrieval with Highly Discriminative Keys,”</article-title>
          <source>in Proc. IEEE ICDE</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Skobeltsyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. P.</given-names>
            <surname>Zarko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rajman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          , “
          <article-title>Web Text Retrieval with a P2P Query Driven Index,”</article-title>
          <source>in Proc. ACM SIGIR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Skobeltsyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Junqueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          , and R. Baeza-Yates, “
          <article-title>ResIn: a Combination of Results Caching and Index Pruning for High-performance Web Search Engines,”</article-title>
          <source>in Proc. ACM SIGIR</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dwarkadas</surname>
          </string-name>
          , “
          <article-title>Hybrid Global-local Indexing for Efficient Peer-to-peer Information Retrieval,”</article-title>
          <source>in Proc. NSDI</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kullback</surname>
          </string-name>
          , “The Kullback-Leibler Distance,”
          <source>The American Statistician</source>
          ,
          <volume>41</volume>
          :
          <fpage>340</fpage>
          -
          <lpage>341</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Puppin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Laforenza</surname>
          </string-name>
          , “
          <article-title>Query-Driven Document Partitioning and Collection Selection,”</article-title>
          <source>in Proc. INFOSCALE</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Page</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          , and T. Winograd, “
          <article-title>The PageRank Citation Ranking: Bringing Order to the Web,”</article-title>
          <source>Technical Report</source>
          , Stanford University.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleinberg</surname>
          </string-name>
          , “
          <article-title>Authoritative Sources in a Hyperlinked Environment,”</article-title>
          <source>in Journal of the ACM</source>
          ,
          <volume>46</volume>
          (
          <issue>5</issue>
          ),
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , “Okapi at TREC-7,” in
          <source>Proc. of the Seventh Text Retrieval Conference</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [21] Terrier (TERabyte RetrIEveR), http://ir.dcs.gla.ac.uk/terrier/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          , “
          <article-title>An Algorithm for Suffix Stripping</article-title>
          ,” in Program,
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>