=Paper=
{{Paper
|id=Vol-480/paper-3
|storemode=property
|title=Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach
|pdfUrl=https://ceur-ws.org/Vol-480/paper9.pdf
|volume=Vol-480
|dblpUrl=https://dblp.org/rec/conf/sigir/Nguyen09
}}
==Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach==
Linh Thai Nguyen
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60616 USA
+1-312-567-5330
nguylin@iit.edu
ABSTRACT
Static index pruning methods have been proposed to reduce the size of the inverted index of information retrieval systems. The goal is to increase efficiency (in terms of query response time) while preserving effectiveness (in terms of ranking quality). Current state-of-the-art approaches include the term-centric pruning approach and the document-centric pruning approach. While term-centric pruning considers each inverted list independently and removes less important postings from each inverted list, the document-centric approach considers each document independently and removes less important terms from each document. In other words, the term-centric approach does not consider the relative importance of a posting in comparison with others in the same document, and the document-centric approach does not consider the relative importance of a posting in comparison with others in the same inverted list. The consequence is that less important postings are not pruned in some situations, and important postings are pruned in others. We propose a posting-based pruning approach, which is a generalization of both the term-centric and document-centric approaches. This approach ranks all postings and keeps only a subset of the top-ranked ones. The rank of a posting depends on several factors, such as its rank in its inverted list, its rank in its document, its weighting score, the term weight, and the document weight. The effectiveness of our approach is verified by experiments using TREC queries and TREC datasets.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Index pruning, Search process.

General Terms
Algorithms, Performance, Experimentation.

Keywords
Static Index Pruning, Document-centric Index Pruning, Term-centric Index Pruning, Posting-centric Index Pruning.

Copyright © 2009 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors. LSDS-IR Workshop. July 2009. Boston, USA.

1. INTRODUCTION
Text information retrieval systems are based on an inverted index to efficiently process queries. The most important part of an inverted index is its inverted file, a file that contains a posting list for each term in the text collection [1]. In general, the posting list of a term contains its posting entries (or index pointers), each in the form of ⟨docID, freq⟩, where docID is the ID of a document that contains the term, and freq is its frequency in the document. For a multi-keyword query, all posting lists of the query terms are retrieved from the inverted file, and document scores are accumulated for each document in the union of the posting lists, based on a specified weighting scheme. A list of documents in descending order of rank score is presented to the user.

For a large text corpus, the inverted file is too large to fit into the memory of the search server. Thus, query processing involves a lot of disk access, which increases query response time. For a text information retrieval system that has to process thousands of queries per second, it is critical to improve query processing performance.

Besides the parallel query processing approach, which uses a cluster of servers to process queries, the index compression approach is widely used. The lossless compression approach uses data compression techniques to compress index data, thereby reducing the volume of data transferred from disk. The compressed index data is then decompressed in memory, and queries are processed based on the original index information. A common data compression technique used in information retrieval systems is variable-length data coding [2]. In contrast, the lossy compression approach opts for keeping only important information in the index, discarding other, less important information [4][5][6][8][11]. Thus the ranking quality of queries processed based on a lossy compressed index (i.e., a pruned index) might be affected.

In practice, a lossless compression technique can be applied on a lossy pruned index to further reduce index size. In addition, both types of compressed/pruned index can be used by an information retrieval system: a lossy pruned index is used to answer a large portion of user queries, and a lossless compressed index is used only if result quality is significantly hurt [4].

In this work, we concentrate on lossy index compression. Current state-of-the-art approaches include term-centric pruning [5] and document-centric pruning [6]. While the term-centric pruning method considers each inverted list independently and removes less important postings from each inverted list, document-centric
pruning considers each document independently and removes less important terms from each document. In other words, the term-centric method does not consider the relative importance of a posting in comparison with others in the same document, and the document-centric method does not consider the relative importance of a posting in comparison with others in the same inverted list. The consequence is that less important postings are not pruned in some situations, and important postings are pruned in others.

We propose a posting-based pruning approach, which is a generalization of both the term-centric and document-centric approaches. Our approach ranks all postings and keeps only a subset of the top-ranked ones, removing the others. We consider a number of factors when ranking a posting, such as its rank in its posting list, its rank in its document, its weighting score, the normalized weight of the term, and the normalized weight of the document. Our experiments based on TREC queries and TREC datasets [22] show that the posting-based pruning method outperforms both the term-centric and document-centric methods.

2. RELATED WORK
Lossless index compression techniques are well studied; see, for example, Witten et al. [2][3]. Those techniques are mainly based on the fact that the frequencies of terms in documents, which are stored in the inverted index, follow a skewed distribution. In that case, variable-length coding techniques can be used to encode index information, consuming only a few bits for most term frequency values. In general, this helps to reduce index size by about one-tenth. However, for large-scale information retrieval systems, the compressed index is still too big to fit into memory. In addition, using a compressed index reduces the time to access index data from disk, but does not reduce the time to process the posting lists. Thus, using lossless index compression alone cannot significantly improve efficiency.

Lossy index compression techniques opt for discarding postings that are not informative. By removing a large number of postings from the inverted index, lossy index compression techniques not only significantly reduce index size, but also significantly reduce the length of posting lists. Therefore, lossy index compression techniques can reduce both the time to access index data from disk and the time to process posting lists. However, as some index information is lost, lossy index compression techniques may lead to a drop in query ranking quality.

In [5], Carmel et al. introduced a term-centric approach to static index pruning. Entries in each posting list are sorted in descending order of their weighting scores. Only entries whose weighting scores are greater than a threshold value are kept in the pruned index. The threshold value can be the same for all terms (uniform pruning), or it can be different for each term (term-based pruning).

Buttcher and Clarke introduce a document-centric pruning technique [6]. Instead of posting list pruning, they propose document pruning. For each document, they keep only a small number of representative, highly ranked terms in the pruned index. Terms in each document are ranked based on their contribution to the Kullback-Leibler divergence [16] between the document and the text collection. The intuition behind this is that those document-representative terms are powerful enough to distinguish the document from others. They also show experimentally that if a document is ranked high for a given query, it is very likely that the query terms are among its representative terms. Thus indexing sets of representative terms is a good method to preserve ranking quality while reducing index size.

Other index pruning techniques (some are for distributed, peer-to-peer information retrieval systems) belong to either the term-centric or the document-centric approach. Blanco et al. [8] and Shokouhi et al. [10] try to find terms whose posting lists can be completely removed. De Moura et al. [11] propose to index a set of representative sentences for each document. Lu and Callan [7] propose a number of methods to identify a representative term set for each document. Podna et al. [12] and Skobeltsyn et al. [13] propose to index term combinations to reduce the negative effect of posting list pruning on ranking quality. Blanco and Barreiro [9] improve the precision of term-centric pruning by considering a number of designs overlooked by the original work.

Looking at another aspect of static index pruning, Skobeltsyn et al. [14] point out that the use of results caching fundamentally affects the performance of a pruned index, due to the change in query pattern introduced by results caching. They then propose to combine results caching and index pruning to reduce the query workload of back-end servers.

3. TERM-CENTRIC PRUNING VERSUS DOCUMENT-CENTRIC PRUNING
3.1 Term-Centric Index Pruning
Term-centric pruning fits very well with the inverted index structure of information retrieval systems. As queries are processed based on inverted lists, it is natural to truncate inverted lists in order to reduce index size. Based on the inverted index structure, the “idealized, term-based” pruning technique proposed by Carmel et al. is well-formed and mathematically provable. This clearly shows that a pruned index, even though not containing all information, can still guarantee the ranking quality to some extent [5].

There are several properties that are specific to term-centric pruning. It preserves the collection vocabulary. For every term, there are always some entries in its inverted list in the pruned index. (The works of Blanco et al. [8] and Shokouhi et al. [10] are exceptions, as their work reduces the vocabulary size.) In contrast, term-centric pruning does not necessarily preserve the set of documents. As posting entries are removed, it is possible that some documents will be totally removed from the pruned index.

The fact that term-centric index pruning preserves the set of terms demonstrates its support for the possibility of any term appearing in user queries. Due to this support, in order to guarantee the quality of top-K results for any query, term-centric pruning must not prune any of the top-K entries in any posting list. Obviously, pruning any of these makes the pruned index unable to guarantee the top-K results of the query containing only that single term. In addition, term-centric pruning assumes (implicitly) that every term in the vocabulary is equally important. In contrast, for documents, term-centric pruning assumes that some are more important than others. This is inferred from the fact that term-centric pruning might totally remove some documents from the pruned index.
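As a concrete illustration of the list-truncation idea described above, the following Python sketch keeps, in each posting list, only the entries whose precomputed weighting score exceeds a per-term threshold. The scoring values and the threshold rule here are simplified stand-ins, not the exact SMART-based scores or score-shifting threshold of Carmel et al. [5].

```python
def prune_term_centric(index, threshold_for_term):
    """Term-centric pruning: each posting list is considered
    independently; only entries scoring above the term's threshold
    survive.  A constant threshold gives 'uniform' pruning, a
    per-term threshold gives 'term-based' pruning."""
    pruned = {}
    for term, postings in index.items():
        tau = threshold_for_term(term, postings)
        kept = [(doc, s) for doc, s in postings if s > tau]
        if kept:
            pruned[term] = kept
    return pruned

def kth_score_threshold(k=2, epsilon=0.5):
    """Per-term threshold: a fraction epsilon of the k-th highest
    score in the list (a simplified stand-in for the top-k based
    threshold of Carmel et al.)."""
    def tau(term, postings):
        scores = sorted((s for _, s in postings), reverse=True)
        return epsilon * scores[min(k, len(scores)) - 1]
    return tau

# Toy index: term -> list of (docID, precomputed weighting score).
index = {
    "apple":  [("d1", 2.0), ("d2", 0.3), ("d3", 0.1)],
    "banana": [("d2", 1.5), ("d3", 1.4), ("d4", 0.1)],
}
pruned = prune_term_centric(index, kth_score_threshold())
```

Note that, as discussed above, a document whose every posting falls below the thresholds (here, d4) disappears from the pruned index entirely.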
3.2 Document-Centric Data Pruning
Document-centric pruning does not make any assumption about the index structure or how queries are processed. Precisely, document-centric pruning should be considered a “data” pruning technique instead of an index pruning technique, as what it actually does is prune the documents, not an index structure. In contrast to term-centric pruning, document-centric pruning does not preserve the collection vocabulary: as each document in the collection is represented by a subset of its terms (i.e., its set of representative terms), there is no guarantee that every term will be indexed. It is likely that there are terms that are always ranked low in any document and are removed by document-centric pruning.

The first assumption implied by document-centric pruning is that every document can be ranked first by some query (one such query might be the query that contains all its representative terms). Due to this assumption, document-centric pruning opts for including every document in the pruned index. The second implied assumption is that terms are not equally important, and some terms can be totally removed from the pruned index.

4. POSTING-BASED INDEX PRUNING
As pointed out above, term-centric pruning prunes index elements, which are posting lists, while document-centric pruning prunes data elements, which are documents. Both approaches assume that all elements are equally important, and thus that the pruned index should keep some information about every element, be it a posting list or a document.

At first, we find the decision to keep some amount of information for each posting list or document to be reasonable. Without any information about the user queries, we must assume that any term can be used by users. Thus no term can be totally removed. Similarly, without any information about what users will search for, we also have to assume that any document can be an answer (to some query). Therefore, no document can be totally removed.

However, given the requirement of significantly reducing index size, it is not affordable to keep information for all posting lists and all documents. We believe that the pruned index should contain neither all terms nor all documents, but only the most important postings, given the desired pruning level.

We suspect that non-informative terms are common in any large text collection. Non-informative terms are those terms that do not help to discriminate documents. One example of a non-informative term is a term that appears in every document, such as the term “abstract” in a collection of scientific papers. Those terms are expected to have similar weighting scores for every document. Therefore, eliminating those terms will not hurt ranking quality. Unfortunately, term-centric pruning tends not to prune any entries from the posting lists of those terms. The reason is that, as the entry scores are almost similar, all scores are likely to be greater than the threshold value computed by the term-based method proposed in [5].

We also suspect that there are many “rarely asked for” documents in any large text collection. A document is called “rarely asked for” if it does not appear in the top-ranked results of any real-world query. In practice, users normally look at only the top 20 results, so any document that does not appear in the top-20 results of a large number of queries can be removed. Puppin et al. [17] observed that, for a collection of 5,939,061 documents and a set of 190,000 unique queries, around 52% of the documents were not returned among the first 100 top-ranked results of any query.

We propose a posting-based index pruning method. We choose neither posting lists nor documents as our working elements. Instead, we choose postings, i.e., tuples of the form ⟨t, d⟩. Postings are contained in both index elements (as posting entries in posting lists) and data elements (as terms in documents). Also, by choosing posting entries as working elements, we open the possibility of removing any document and any term’s posting list from the pruned index. With this flexibility, our method is able to remove non-informative terms as well as “rarely asked for” documents.

4.1 Term Weighting
When building a pruned index, terms should not be treated equally. Non-informative terms appear in a large number of documents, resulting in long posting lists. However, non-informative terms do not help much in ranking documents. Thus, the pruned index should significantly prune the posting lists of non-informative terms and reserve places for other, informative terms. Our posting-based pruning method assigns a weight to each term as its informativeness value. Blanco and Barreiro [8] have studied a number of term-weighting schemes for the purpose of posting list pruning. Their finding is that residual inverse document frequency (RIDF) is a good quantity to measure the informativeness of terms, among other schemes such as the classic inverse document frequency and the term discriminative value. We adopt RIDF to calculate term informativeness values. As specified in [8], the RIDF value of a term is

RIDF = −log(df / N) + log(1 − e^(−tf / N))    (1)

where df is the term’s document frequency, tf is its frequency in the whole collection, and N is the number of documents in the collection. As pointed out by Blanco, RIDF values can be computed efficiently.

4.2 Document Weighting
Documents in a large text collection are not equally important, and therefore, in the pruned index, more terms should be kept for important documents. As for term weighting, we also assign a weight to each document in the collection to reflect how important it is. For a Web collection, the PageRank [18] or HITS [19] algorithm can be used to compute document importance values. However, PageRank and HITS are not applicable for non-Web documents, as there is no link structure among the documents. We adopt the approach of Buttcher and Clarke [6] for this purpose. For each document, we assign its Kullback-Leibler distance to the collection as its importance value. Thus, “outstanding” documents (i.e., documents which are very different from the collection) will be assigned high importance values, while documents which are similar to others will be assigned low importance values. Our importance value for a document d is defined as

KLD(d || C) = Σ_{t∈d} (tf(t) / |d|) · log( (tf(t) / |d|) · (|C| / TF(t)) )    (2)

where tf(t) is the term frequency of term t, TF(t) is the collection term frequency of term t, |d| is the length of document d (i.e., the
number of terms in d), and |C| is the length of the collection (i.e., the sum of document lengths).

4.3 Posting Ranking Function
Our static pruning method evaluates postings and assigns each a usefulness value. We then build the pruned index based on these values. Given a desired level of pruning, posting entries are selected based on their usefulness values and added into the pruned index until the pruned index size reaches its limit.

According to term-centric pruning, we should assign a high usefulness value to a posting entry which appears at the top of its posting list. According to document-centric pruning, we should not assign a high usefulness value to a posting whose term does not belong to the set of top-ranked terms of the document. In addition, as discussed above, we should assign low values to non-informative terms and “rarely asked for” documents, and vice versa. Also, we obviously want to assign high usefulness values to posting entries with high scores.

To rank postings, for each posting ⟨t, d⟩, we compute the following quantities:
• S(t, d) = the score term t contributes to the rank score of document d.
• RIDF(t) = the informativeness value of term t.
• KLD(d || C) = the importance value of document d.
• Rank(t) = the rank of term t relative to the other terms in document d.
• Rank(d) = the rank of document d relative to the other documents in the posting list of term t.

Among the quantities above, S(t, d) is computed using a weighting scheme, such as the classic TFIDF weighting scheme or the state-of-the-art BM25 weighting scheme; RIDF(t) is calculated as specified in (1); KLD(d || C) is calculated according to (2); Rank(d) is the position of document d in the posting list of term t, where posting entries are sorted in descending order of their scores S(t, di); and Rank(t) is the position of term t in document d, where terms are sorted in descending order of their “feedback” score [6], defined below:

Score_DCP(t) = (tf / |d|) · log( (tf / |d|) · (|C| / TF) )    (3)

In this work, we use the BM25 weighting scheme [20] (given below) to calculate S(t, d) due to its wide use in other research works.

S_BM25(t, d) = log( (N − df + 0.5) / (df + 0.5) ) · ( tf · (k1 + 1) ) / ( tf + k1 · ((1 − b) + b · |d| / avgdl) )    (4)

where avgdl is the average length of documents, k1 is set to its “standard” value of 1.2, and b is set to its “standard” value of 0.75.

In combination, our posting entry ranking function takes as parameters all the above quantities and returns a usefulness value. Note that we apply normalization and transformation to parameter values. First, RIDF(t) values are normalized so that they sum up to one. Similar normalization is applied to KLD(d || C) values. The normalization step is necessary, as the ranges of RIDF(t) and KLD(d || C) are different. Second, we use a sigmoid function to transform the term rank values and document rank values. This transformation is necessary, too. Using the raw term rank values makes it appear that the 10th term is ten times less important than the top-ranked term, which does not seem right. Therefore, we use a non-linear function, specified below in (5), as a transform function. The parameter x0 is used to shift the “transition” point, where the sigmoid function switches from the high-value state to the low-value state, and the parameter a is used to control the slope of the transition period of the function.

sigmoid(x) = 1 − 1 / (1 + e^((−x + x0) / a))    (5)

The sigmoid function above returns a value between zero and one. A rank value close to zero (i.e., a top-ranked element) will be transformed to a value close to one, while how other rank values are transformed depends on the two parameters x0 and a. In Figure 1, we show the shape of the sigmoid function for several combinations of parameters.

Figure 1. Sigmoid functions (weight as a function of rank, for (a, x0) = (1, 50), (15, 50), (5, 60), and (50, 60)).

Our posting entry ranking function is given below:

f(t, d) = S_BM25(t, d) · { α · RIDF(t) · sigmoid(Rank(d)) + (1 − α) · KLD(d || C) · sigmoid(Rank(t)) }    (6)

where α is a parameter taking a value between zero and one.

4.4 Posting-Based Pruning versus Document-Centric Pruning and Term-Centric Pruning
We show that our posting entry ranking function generalizes the document-centric index pruning method and the term-centric pruning method.

If we set the value of α to zero, replace the document weighting function KLD(d||C) with the unity function u(d) = 1, and set the parameter a of the sigmoid function to 1 so that the sigmoid function in (5) becomes a threshold function at x0, then for each document d, our ranking function f() assigns a non-zero value to the posting entry ⟨t, d⟩ if t is ranked in the top x0 among all unique terms in d; otherwise f() returns a zero value. In this case, our posting-based pruning technique is equivalent to the “constant” document-centric pruning technique proposed by Buttcher and Clarke in [6], which selects a fixed number of top-ranked terms from each document. Obviously, the “relative” document-centric pruning technique proposed in [6] can be easily obtained from our posting entry ranking function by adjusting the x0 parameter for each document according to its length.

Similarly, if we set the value of α to one, replace the term weighting function RIDF(t) with the unity function u(t) = 1, set the parameter a of the sigmoid function to 1, and set the parameter x0 to the rank of the i-th posting entry in the posting list of term t
such that S(t, di) > τ(t) and S(t, di+1) ≤ τ(t), where τ(t) is the cut-off threshold proposed by Carmel et al. in [5], then our posting entry ranking function assigns a non-zero value to any posting entry selected by the term-centric pruning technique, and a zero value to any non-selected posting entry.

Our posting-based pruning technique is different from the term-centric and document-centric pruning techniques in several aspects. By using a term weighting function and a document weighting function other than the unity function, we allow the pruned index to include more information for informative terms and outstanding documents. By using a sigmoid function instead of a threshold function to transform the term and document ranks, we open the possibility that a posting entry which is ranked low in a posting list could be selected, if it is ranked high in the document, and vice versa.

5. EXPERIMENTAL SETTINGS
We use the WT10G collection as our data set. This collection contains 1,692,096 Web documents crawled from the Web. We use the Terrier platform [21] to index documents and rank queries, and we develop our posting-based index pruning technique based on Terrier.

We use the BM25 weighting scheme for both calculating term-document scores and query ranking. The Porter stemming algorithm [23] and a standard list of stop-words are used for preprocessing documents. After stemming and stop-word removal, the vocabulary contains 3,161,488 unique terms. The un-pruned index contains 280,632,807 posting entries.

We use TREC [22] topics 451–500 as our query set. Precision is measured for each pruned index using the set of relevance judgments provided by TREC for topics 451–500. From TREC topics 451–500, we build two sets of queries: one set of long queries, wherein each query includes the title and the description fields from the TREC topic, and one set of short queries, wherein each query includes only the title field from the TREC topic.

We also implement term-centric index pruning and document-centric index pruning exactly as specified in their original works [5][6]. The only difference is that in [5], Carmel et al. used the SMART term weighting scheme for both index pruning and query ranking. We instead use the BM25 term weighting scheme for query ranking, but still use the SMART term weighting scheme for index pruning.

We conduct experiments to compare the effectiveness of our proposed posting-based pruning technique with the term-based, “score shifting” pruning technique proposed in [5], and the “relative” document-centric pruning technique proposed in [6]. In the next section, we report precision at 10, precision at 20, and average precision at each pruning level for each technique.

Figure 2: Querying effectiveness for short TREC queries.
Figure 3: Querying effectiveness for long TREC queries.

6. EXPERIMENTAL RESULTS
In Figure 2, we show the effectiveness of pruned indices for the set of short TREC queries; and in Figure 3, we show the effectiveness of pruned indices for the set of long TREC queries. A posting-centric pruned index is marked as “pXX_YY_ZZ”, where XX is the value of parameter α (as a percentage), YY is the value of parameter a, and ZZ is the value of parameter x0. Figure 2 and Figure 3 show experiment results for a posting-centric pruned index with α = 50%, a = 15, and x0 = 50. With a pruning level of less than 70%, posting-centric pruning has performance similar to document-centric pruning and term-centric pruning. However, for pruning levels of 70% or more, posting-centric pruning is outperformed by document-centric pruning.

We tune our parameters for the posting-centric index pruning technique. Table 1, Table 2, and Table 3 show experiment results for short queries of the document-centric pruning technique and posting-centric pruning techniques with various combinations of
Table 1. Precision at 10 for short TREC queries of document-centric pruning technique and posting-centric pruning techniques of
various parameter combinations.
Pct #posting entries pruned   Document-centric   p50_15_50   p20_15_50   p50_35_200   p50_200_50   p80_200_1000   p50_300_1000   p80_300_1000   p80_400_1000
0 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25 0.25
10 0.2458 0.2542 0.2542 0.2542 0.2542 0.2542 0.2542 0.2563 0.2521
20 0.2396 0.2521 0.2521 0.2542 0.2375 0.2333 0.2333 0.2354 0.2354
30 0.2396 0.2458 0.2458 0.2458 0.2292 0.2271 0.2292 0.2313 0.2292
40 0.2458 0.2479 0.2479 0.2271 0.2250 0.2271 0.2271 0.2250 0.2333
50 0.2500 0.2563 0.2542 0.2375 0.2292 0.2250 0.2312 0.2312 0.2208
60 0.2458 0.2521 0.2500 0.2250 0.2208 0.2104 0.2208 0.2167 0.2146
70 0.2396 0.2354 0.2354 0.2271 0.2396 0.2229 0.2354 0.2187 0.2187
80 0.2500 0.1979 0.2167 0.2021 0.2104 0.2063 0.2333 0.1979 0.2021
90 0.2458 0.15 0.1625 0.1625 0.1854 0.1958 0.1875 0.2021 0.2000
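The tables report precision at 10, precision at 20, and MAP. As a reminder of how these measures are computed from a ranked result list, here is a minimal sketch (the ranked list is assumed given; this is not the evaluation code used for the experiments above):

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k

def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant
    document is retrieved; MAP averages this value over queries."""
    hits, total = 0, 0.0
    for i, d in enumerate(ranked_docs, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

# Hypothetical single-query example.
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p_at_2 = precision_at_k(ranking, relevant, 2)   # 0.5
ap = average_precision(ranking, relevant)
```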
Table 2. Precision at 20 for short TREC queries of document-centric pruning technique and posting-centric pruning techniques of
various parameter combinations.
Pct #posting entries pruned   Document-centric   p50_15_50   p20_15_50   p50_35_200   p50_200_50   p80_200_1000   p50_300_1000   p80_300_1000   p80_400_1000
0 0.2073 0.2073 0.2073 0.2073 0.2073 0.2073 0.2073 0.2073 0.2073
10 0.2052 0.2042 0.2042 0.2052 0.2052 0.2052 0.2052 0.2052 0.2052
20 0.201 0.2031 0.2031 0.2021 0.1990 0.1938 0.1948 0.1948 0.1958
30 0.1979 0.2052 0.2063 0.1990 0.1854 0.1844 0.1854 0.1854 0.1844
40 0.1979 0.2042 0.2042 0.1917 0.1750 0.1740 0.1740 0.1750 0.1781
50 0.1969 0.2083 0.2073 0.1875 0.1771 0.1740 0.1781 0.1792 0.1719
60 0.1958 0.2031 0.2010 0.1802 0.1813 0.1813 0.1844 0.1813 0.1833
70 0.2010 0.2031 0.1990 0.1719 0.1802 0.1740 0.1802 0.1750 0.1750
80 0.2094 0.1583 0.1729 0.1573 0.1552 0.1635 0.1729 0.1635 0.1646
90 0.1927 0.1177 0.1229 0.1208 0.1354 0.1469 0.1448 0.1531 0.1500
Table 3. MAP for short TREC queries of document-centric pruning technique and posting-centric pruning techniques of various
parameter combinations.
Pct #posting entries pruned   Document-centric   p50_15_50   p20_15_50   p50_35_200   p50_200_50   p80_200_1000   p50_300_1000   p80_300_1000   p80_400_1000
0 0.1892 0.1892 0.1892 0.1892 0.1892 0.1892 0.1892 0.1892 0.1892
10 0.1864 0.1868 0.1871 0.1869 0.1879 0.1875 0.1878 0.1873 0.1880
20 0.1838 0.1891 0.1892 0.1889 0.1835 0.1818 0.1824 0.1835 0.1826
30 0.1846 0.1914 0.1913 0.1901 0.1816 0.1806 0.1807 0.1821 0.1825
40 0.1847 0.1855 0.1854 0.1880 0.1805 0.1822 0.1820 0.1827 0.1845
50 0.1837 0.1958 0.1958 0.1815 0.1834 0.1809 0.1839 0.1840 0.1809
60 0.1835 0.1911 0.1915 0.1804 0.1755 0.1743 0.1785 0.1747 0.1747
70 0.1693 0.172 0.1739 0.1698 0.1702 0.1619 0.1657 0.1692 0.1627
80 0.1679 0.1557 0.1571 0.1476 0.1373 0.1494 0.1575 0.1486 0.1440
90 0.1533 0.0919 0.1136 0.1238 0.1338 0.1404 0.1314 0.1378 0.1421
parameter values. Experiments with long queries show similar trends and are therefore omitted.

In Table 1, Table 2, and Table 3, the values in bold are the highest performance for a specific pruning level. The first observation is that posting-centric pruning is better at low and moderate pruning levels, while document-centric pruning is better at higher pruning levels. The second observation is that, even though posting-centric pruning is better than document-centric pruning at low and moderate pruning levels, the differences are small. In contrast, at higher pruning levels, the differences between the performance of posting-centric pruning and document-centric pruning are larger. In addition, none of the posting-centric pruning techniques outperforms the document-centric pruning technique at high pruning levels.

In all previous experiments with the posting-centric pruning technique, the parameters of the posting entry ranking function (6) are fixed for all posting entries. We consider the possibility of adaptively setting the parameters for each posting entry, by adapting the slope of
Table 4. Performance at 90% pruning level of different posting-centric pruning techniques.
Pruning method Short TREC queries Long TREC queries
P@10 P@20 MAP P@10 P@20 MAP
Document-centric 0.2458 0.1927 0.1533 0.3120 0.2470 0.1731
AP1: no term weighting, no document weighting, α = 0.5, using two different sigmoid functions whose parameters vary for each posting entry   0.2375 0.1688 0.1456 0.3020 0.2100 0.1561
AP2: similar to AP1, except that term-document scores are not used 0.1277 0.1106 0.0831 0.1660 0.1250 0.0903
AP3: similar to AP1, except that term weighting is used 0.2604 0.1969 0.1592 0.3300 0.2500 0.1714
AP4: similar to AP1, except that both term weighting and document weighting are used 0.2271 0.1573 0.1354 0.2740 0.1900 0.1479
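The AP variants in Table 4 adapt the sigmoid parameters to each posting entry: the midpoint x0 follows from the pruning level, and the slope is set so the function transitions sharply around x0. The following is a minimal sketch of that idea; the function names, the logistic form, and the combination of the two scores are illustrative assumptions, not the paper's exact ranking function (6).

```python
import math

def sigmoid_score(rank, x0, width=0.1):
    """Smooth step: close to 1 for ranks below x0, close to 0 above it.
    The slope is chosen so the transition occurs between roughly
    0.9 * x0 and 1.1 * x0, as described in the text."""
    a = math.log(9) / max(width * x0, 1e-9)  # assumed logistic slope
    return 1.0 / (1.0 + math.exp(a * (rank - x0)))

def posting_score(term_rank_in_doc, doc_rank_in_list,
                  doc_unique_terms, posting_list_len, prune_level=0.9):
    """Illustrative posting-entry score mixing the document-oriented
    and term-oriented sigmoids with alpha = 0.5 (as in AP1)."""
    keep = 1.0 - prune_level
    doc_x0 = keep * doc_unique_terms    # midpoint over the document's terms
    term_x0 = keep * posting_list_len   # midpoint over the posting list
    alpha = 0.5
    return (alpha * sigmoid_score(term_rank_in_doc, doc_x0)
            + (1 - alpha) * sigmoid_score(doc_rank_in_list, term_x0))
```

Under this sketch, a posting whose term and document both rank well inside the kept fraction scores near 1, while one ranking far outside it scores near 0; postings are then pruned in ascending score order.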
Figure 4: Querying effectiveness for short TREC queries.
Figure 5: Querying effectiveness for long TREC queries.
the sigmoid function as follows. For a posting entry <t, d>, we use two sigmoid functions: one for the rank of term t in document d, the other for the rank of document d in the posting list of term t. We call the former the document-oriented sigmoid function, and the latter the term-oriented sigmoid function. For the term-oriented sigmoid function, its x0 parameter is set according to the pruning level (i.e., if the pruning level is 90%, x0 is equal to 10% of the length of the term's posting list). Similarly, for the document-oriented sigmoid function, if the pruning level is 90%, its x0 parameter is set to 10% of the number of unique terms in the document. For both sigmoid functions, the parameter a, which controls the slope of the function, is set so that the function returns a value close to 1 for input values less than 0.9 × x0 and a value close to 0 for input values greater than 1.1 × x0. We also explore the performance of several alternatives, which may or may not use term weighting and/or document weighting (refer to the ranking function (6) in Section 4.3).

In Table 4, we report the performance of the different alternatives at approximately the 90% pruning level. Document-centric pruning performance is included for reference. For each column, the best value is in bold. From the results reported in Table 4, we can see that:

(i) Term-document scores should be used in the posting entry ranking function (6), as revealed by the significant differences between the performance of AP1 and AP2.

(ii) Term weighting is useful, as confirmed by the differences between the performance of AP1 and AP3.

(iii) Document weighting seems to be harmful: it hurts the performance of AP4 compared to that of AP3 and AP1.

Among all alternatives, AP3 is the best; it is only slightly outperformed by document-centric pruning for long queries according to MAP. We therefore consider it the best among our alternatives. Below, we report its performance in comparison with document-centric pruning and term-centric pruning at various levels of pruning in Figure 4 and Figure 5.

By adapting the sigmoid functions to posting entries, the performance of our posting-centric pruning technique is much better, as good as the performance of the document-centric pruning
technique (posting-centric pruning is better than document-centric pruning at some pruning levels, while the inverse is true at others).

7. CONCLUSIONS AND FUTURE WORK
We evaluate document-centric and term-centric static index pruning based on the WT10G corpus and TREC query sets. Based on our experimental results, term-centric index pruning is better than document-centric index pruning at low and moderate pruning levels (i.e., less than 70% pruning, according to our results), while document-centric index pruning is better at higher pruning levels.

We propose the posting-centric index pruning technique, which ranks each posting entry (i.e., a term-document pair) based on a set of features such as the rank of the term in the document and the rank of the document in the inverted list of the term. We show that posting-centric index pruning generalizes both document-centric and term-centric pruning; therefore, the solution space of posting-centric pruning covers the solution spaces of both document-centric and term-centric index pruning. This implies that, by exploring this larger solution space, better solutions can be found.

We explore the solution space of posting-centric pruning by studying a family of posting entry ranking functions. We discover that term weighting based on RIDF is useful, while document weighting based on KL-divergence is harmful. We also notice that the parameters of the sigmoid function, which we use to transform the rank of a term/document to its score, should be adapted to each posting entry. Fixing these parameters makes posting-centric pruning less effective than document-centric pruning.

Other term weighting and document weighting methods are possible. We are evaluating a method of weighting terms and documents based on user queries and the PageRank algorithm applied to the graph of terms and documents. Our goal is to discover important terms and documents by analyzing the relationships among terms and documents given the context of user queries. Once the important terms and documents are discovered, their information is kept in the pruned index, while information about other, less important terms and documents can be partially removed or totally discarded.

8. REFERENCES
[1] D. Grossman and O. Frieder, "Information Retrieval: Algorithms and Heuristics," Springer, 2nd ed., 2004.
[2] I. H. Witten, A. Moffat, and T. C. Bell, "Managing Gigabytes," Morgan Kaufmann, 2nd ed., 1999.
[3] J. Zobel and A. Moffat, "Inverted Files for Text Search Engines," ACM Computing Surveys, 38(2), 2006.
[4] A. Ntoulas and J. Cho, "Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee," in Proc. ACM SIGIR, 2007.
[5] D. Carmel et al., "Static Index Pruning for Information Retrieval Systems," in Proc. ACM SIGIR, 2001.
[6] S. Buttcher and C. L. A. Clarke, "A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems," in Proc. ACM CIKM, 2006.
[7] J. Lu and J. Callan, "Pruning Long Documents for Distributed Information Retrieval," in Proc. ACM CIKM, 2002.
[8] R. Blanco and A. Barreiro, "Static Pruning of Terms in Inverted Files," in Proc. ECIR, 2007.
[9] R. Blanco and A. Barreiro, "Boosting Static Pruning of Inverted Files," in Proc. ACM SIGIR, 2007.
[10] M. Shokouhi et al., "Using Query Logs to Establish Vocabularies in Distributed Information Retrieval," Int'l Journal on Information Processing and Management, 2007.
[11] E. S. de Moura et al., "Improving Web Search Efficiency via a Locality Based Static Pruning Method," in Proc. ACM WWW, 2005.
[12] I. Podna, M. Rajman, T. Luu, F. Klemn, and K. Aberer, "Scalable Peer-to-peer Web Retrieval with Highly Discriminative Keys," in Proc. IEEE ICDE, 2007.
[13] G. Skobeltsyn, T. Luu, I. P. Zarko, M. Rajman, and K. Aberer, "Web Text Retrieval with a P2P Query Driven Index," in Proc. ACM SIGIR, 2007.
[14] G. Skobeltsyn, F. Junqueira, V. Plachouras, and R. Baeza-Yates, "ResIn: a Combination of Results Caching and Index Pruning for High-performance Web Search Engines," in Proc. ACM SIGIR, 2008.
[15] C. Tang and S. Dwarkadas, "Hybrid Global-local Indexing for Efficient Peer-to-peer Information Retrieval," in Proc. NSDI, 2004.
[16] S. Kullback, "The Kullback-Leibler Distance," The American Statistician, 41:340-341, 1987.
[17] D. Puppin, F. Silvestri, and D. Laforenza, "Query-Driven Document Partitioning and Collection Selection," in Proc. INFOSCALE, 2006.
[18] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Technical Report, Stanford University.
[19] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Journal of the ACM, 46(5), 1999.
[20] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, "Okapi at TREC-7," in Proc. of the Seventh Text REtrieval Conference, 1998.
[21] TERabyte RetrIEveR (Terrier), http://ir.dcs.gla.ac.uk/terrier/
[22] TREC (WT10G, TREC-9)
[23] M. F. Porter, "An Algorithm for Suffix Stripping," Program, 14(3), 1980.