Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach

Linh Thai Nguyen
Department of Computer Science
Illinois Institute of Technology
Chicago, IL 60616 USA
+1-312-567-5330
nguylin@iit.edu

ABSTRACT
Static index pruning methods have been proposed to reduce the size of the inverted index of information retrieval systems. The goal is to increase efficiency (in terms of query response time) while preserving effectiveness (in terms of ranking quality). Current state-of-the-art approaches include the term-centric pruning approach and the document-centric pruning approach. While term-centric pruning considers each inverted list independently and removes less important postings from each inverted list, the document-centric approach considers each document independently and removes less important terms from each document. In other words, the term-centric approach does not consider the relative importance of a posting in comparison with others in the same document, and the document-centric approach does not consider the relative importance of a posting in comparison with others in the same inverted list. The consequence is that less important postings are not pruned in some situations, while important postings are pruned in others. We propose a posting-based pruning approach, which is a generalization of both the term-centric and document-centric approaches. This approach ranks all postings and keeps only a subset of the top ranked ones. The rank of a posting depends on several factors, such as its rank in its inverted list, its rank in its document, its weighting score, the term weight, and the document weight. The effectiveness of our approach is verified by experiments using TREC queries and TREC datasets.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Index pruning, Search process.

General Terms
Algorithms, Performance, Experimentation.

Keywords
Static Index Pruning, Document-centric Index Pruning, Term-centric Index Pruning, Posting-centric Index Pruning.

Copyright © 2009 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. Re-publication of material from this volume requires permission by the copyright owners. This volume is published by its editors. LSDS-IR Workshop. July 2009. Boston, USA.

1. INTRODUCTION
Text information retrieval systems are based on an inverted index to efficiently process queries. The most important part of an inverted index is its inverted file, a file that contains a posting list for each term in the text collection [1]. In general, the posting list of a term contains its posting entries (or index pointers), each in the form <docID, freq>, where docID is the ID of a document that contains the term, and freq is its frequency in that document. For a multi-keyword query, all posting lists of the query terms are retrieved from the inverted file, and document scores are accumulated for each document in the union of the posting lists, based on a specified weighting scheme. A list of documents in descending order of rank score is presented to the user.
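To make this concrete, here is a minimal, self-contained sketch (our illustration with toy data, not the paper's implementation) of an inverted file and the score-accumulation step just described; the weight callback stands in for whatever weighting scheme the system uses.

```python
# Sketch of an inverted file and accumulator-based query processing.
from collections import defaultdict

def build_inverted_file(docs):
    """docs: {docID: list of terms}. Returns {term: [(docID, freq), ...]}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, terms in docs.items():
        for term in terms:
            index[term][doc_id] += 1
    return {t: sorted(freqs.items()) for t, freqs in index.items()}

def rank(query_terms, index, weight):
    """Accumulate weight(term, docID, freq) over the query terms' posting
    lists; return docIDs in descending order of accumulated score."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, freq in index.get(term, []):
            scores[doc_id] += weight(term, doc_id, freq)
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = {1: ["static", "index", "pruning"], 2: ["index", "compression"]}
idx = build_inverted_file(docs)
print(rank(["index", "pruning"], idx, lambda t, d, f: float(f)))
# -> [(1, 2.0), (2, 1.0)]
```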
For a large text corpus, the inverted file is too large to fit into the memory of the search server. Query processing therefore involves a lot of disk access, which increases query response time. For a text information retrieval system that has to process thousands of queries per second, it is critical to improve query processing performance.

Besides the parallel query processing approach, which uses a cluster of servers to process queries, the index compression approach is widely used. The lossless compression approach uses data compression techniques to compress index data, thereby reducing the volume of data transferred from disk. The compressed index data is then decompressed in memory, and queries are processed based on the original index information. A common data compression technique used in information retrieval systems is variable-length data coding [2].
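As a concrete example of such coding, the sketch below implements variable-byte compression of docID gaps, one common variable-length scheme; it is illustrative only, not the exact coder discussed in [2].

```python
# Variable-byte coding of posting-list gaps: each gap is stored in as few
# 7-bit groups as possible, with the high bit marking the final byte.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        chunk[-1] += 128          # set the stop bit on the last byte
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte
        else:
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers

doc_ids = [3, 7, 11, 268]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = vbyte_encode(gaps)   # 5 bytes instead of four fixed-width ints
assert vbyte_decode(encoded) == gaps
```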
In contrast, the lossy compression approach opts for keeping only the important information in the index, discarding other, less important information [4][5][6][8][11]. The ranking quality of queries processed on a lossily compressed index (i.e., a pruned index) might therefore be affected. In practice, a lossless compression technique can be applied to a lossy pruned index to further reduce index size. In addition, both types of compressed/pruned index can be used together by an information retrieval system: a lossy pruned index is used to answer a large portion of user queries, and a lossless compressed index is used only if result quality would otherwise be significantly hurt [4].

In this work, we concentrate on lossy index compression. Current state-of-the-art approaches include term-centric pruning [5] and document-centric pruning [6]. While term-centric pruning considers each inverted list independently and removes less important postings from each inverted list, document-centric pruning considers each document independently and removes less important terms from each document. In other words, the term-centric method does not consider the relative importance of a posting in comparison with others in the same document, and the document-centric method does not consider the relative importance of a posting in comparison with others in the same inverted list. The consequence is that less important postings are not pruned in some situations, while important postings are pruned in others.

We propose a posting-based pruning approach, which is a generalization of both the term-centric and document-centric approaches. Our approach ranks all postings and keeps only a subset of the top ranked ones, removing the others. We consider a number of factors when ranking a posting, such as its rank in its posting list, its rank in its document, its weighting score, the normalized weight of the term, and the normalized weight of the document. Our experiments based on TREC queries and TREC datasets [22] show that the posting-based pruning method outperforms both the term-centric and document-centric methods.

2. RELATED WORK
Lossless index compression techniques are well studied; see, for example, Witten et al. [2][3]. Those techniques are mainly based on the fact that the frequencies of terms in documents, which are stored in the inverted index, follow a skewed distribution. In that case, variable-length coding can be used to encode index information, consuming only a few bits for most term frequency values. In general, this helps to reduce the index to about one-tenth of its original size. However, for large scale information retrieval systems, the compressed index is still too big to fit into memory. In addition, using a compressed index reduces the time to access index data from disk, but does not reduce the time to process the posting lists. Thus, lossless index compression alone cannot significantly improve efficiency.

Lossy index compression techniques opt for discarding postings that are not informative. By removing a large number of postings from the inverted index, lossy techniques not only significantly reduce index size, but also significantly reduce the length of posting lists. Therefore, lossy index compression can reduce both the time to access index data from disk and the time to process posting lists. However, as some index information is lost, lossy index compression may lead to a drop in query ranking quality.

In [5], Carmel et al. introduced a term-centric approach to static index pruning. Entries in each posting list are sorted in descending order of their weighting scores. Only entries whose weighting scores are greater than a threshold value are kept in the pruned index. The threshold value can be the same for all terms (uniform pruning), or it can be different for each term (term-based pruning).

Buttcher and Clarke introduce a document-centric pruning technique [6]. Instead of posting list pruning, they propose document pruning. For each document, they keep only a small number of representative, highly-ranked terms in the pruned index. Terms in each document are ranked based on their contribution to the Kullback-Leibler divergence [16] between the document and the text collection. The intuition behind this is that these document-representative terms are powerful enough to distinguish the document from others. They also show experimentally that if a document is ranked high for a given query, it is very likely that the query terms are among its representative terms. Thus indexing sets of representative terms is a good method to preserve ranking quality while reducing index size.

Other index pruning techniques (some designed for distributed, peer-to-peer information retrieval systems) belong to either the term-centric or the document-centric approach. Blanco et al. [8] and Shokouhi et al. [10] try to find terms whose posting lists can be completely removed. De Moura et al. [11] propose to index a set of representative sentences for each document. Lu and Callan [7] propose a number of methods to identify a representative term set for each document. Podna et al. [12] and Skobeltsyn et al. [13] propose to index term combinations to reduce the negative effect of posting list pruning on ranking quality. Blanco and Barreiro [9] improve the precision of term-centric pruning by considering a number of design choices overlooked by the original work.

Looking at another aspect of static index pruning, Skobeltsyn et al. [14] point out that the use of results caching fundamentally affects the performance of a pruned index, due to the change in query pattern introduced by results caching. They then propose to combine results caching and index pruning to reduce the query workload of back-end servers.

3. TERM-CENTRIC PRUNING VERSUS DOCUMENT-CENTRIC PRUNING
3.1 Term-Centric Index Pruning
Term-centric pruning fits very well with the inverted index structure of information retrieval systems. As queries are processed based on inverted lists, it is natural to truncate inverted lists in order to reduce index size. Based on the inverted index structure, the "idealized, term-based" pruning technique proposed by Carmel et al. is well-formed and comes with mathematical guarantees. This clearly shows that a pruned index, even though it does not contain all the information, can still guarantee ranking quality to some extent [5].

There are several properties that are specific to term-centric pruning. It preserves the collection vocabulary: for every term, there are always some entries in its inverted list in the pruned index. (The works of Blanco et al. [8] and Shokouhi et al. [10] are exceptions, as they reduce the vocabulary size.) In contrast, term-centric pruning does not necessarily preserve the set of documents. As posting entries are removed, it is possible that some documents will be removed entirely from the pruned index.

The fact that term-centric index pruning preserves the set of terms demonstrates its support for the possibility of any term appearing in user queries. Due to this support, in order to guarantee the quality of the top-K results for any query, term-centric pruning must not prune any of the top-K entries in any posting list. Obviously, pruning any of these makes the pruned index unable to guarantee the top-K results of the query containing only that single term. In addition, term-centric pruning implicitly assumes that every term in the vocabulary is equally important. In contrast, for documents, term-centric pruning assumes that some are more important than others. This is inferred from the fact that term-centric pruning might remove some documents entirely from the pruned index.
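As a concrete illustration, here is a much-simplified sketch in the spirit of the term-based pruning of Carmel et al. [5]: each posting list is cut independently at a per-term threshold derived from its k-th highest score. The scoring callback and the settings of k and epsilon are placeholders, not the exact choices of [5].

```python
def prune_term_centric(index, score, k=10, epsilon=0.7):
    """index: {term: [(docID, freq), ...]}. Keep, per posting list, the
    entries whose score exceeds epsilon times the k-th highest score in
    that list; never empty a list, so the vocabulary is preserved."""
    pruned = {}
    for term, postings in index.items():
        scored = sorted(((score(term, d, f), d, f) for d, f in postings),
                        reverse=True)
        tau = epsilon * scored[min(k, len(scored)) - 1][0]  # per-term threshold
        kept = [(d, f) for s, d, f in scored if s > tau] or [scored[0][1:]]
        pruned[term] = sorted(kept)  # restore docID order
    return pruned
```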
3.2 Document-Centric Data Pruning
Document-centric pruning does not make any assumption about the index structure or how queries are processed. Strictly speaking, document-centric pruning should be considered a "data" pruning technique rather than an index pruning technique, as what it actually prunes is the documents, not an index structure. In contrast to term-centric pruning, document-centric pruning preserves the set of documents. However, as each document in the collection is represented by a subset of its terms (i.e., its set of representative terms), there is no guarantee that every term will be indexed. It is likely that there are terms that are always ranked low in any document and are removed by document-centric pruning.

The first assumption implied by document-centric pruning is that every document can be ranked first by some query (one such query might be the query that contains all of the document's representative terms). Due to this assumption, document-centric pruning opts for including every document in the pruned index. The second implied assumption is that terms are not equally important, and some terms can be removed entirely from the pruned index.
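A corresponding sketch for document-centric pruning in the spirit of Buttcher and Clarke [6]: each document keeps only its top terms, ranked by their per-term contribution to the document's KL divergence from the collection (the same quantity reappears as equation (3) in Section 4.3). The "relative" fraction lam is a placeholder value.

```python
import math

def dcp_score(tf, doc_len, coll_tf, coll_len):
    """Per-term contribution to KLD(d || C)."""
    p_d, p_c = tf / doc_len, coll_tf / coll_len
    return p_d * math.log(p_d / p_c)

def prune_document_centric(doc_tfs, coll_tfs, coll_len, lam=0.3):
    """doc_tfs: {docID: {term: tf}}. Keep the top lam-fraction of each
    document's unique terms ('relative' DCP); return surviving postings."""
    kept = []
    for doc_id, tfs in doc_tfs.items():
        doc_len = sum(tfs.values())
        ranked = sorted(tfs,
                        key=lambda t: dcp_score(tfs[t], doc_len,
                                                coll_tfs[t], coll_len),
                        reverse=True)
        n_keep = max(1, int(lam * len(ranked)))
        kept.extend((t, doc_id, tfs[t]) for t in ranked[:n_keep])
    return kept
```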
4. POSTING-BASED INDEX PRUNING
As pointed out above, term-centric pruning prunes index elements, which are posting lists, while document-centric pruning prunes data elements, which are documents. Both approaches assume that all elements are equally important, and thus that the pruned index should keep some information about every element, whether posting lists or documents.

At first sight, the decision to keep some amount of information for each posting list or document seems reasonable. Without any information about the user queries, we must assume that any term can be used by users; thus no term can be removed entirely. Similarly, without any information about what users will search for, we also have to assume that any document can be an answer (to some query); therefore, no document can be removed entirely.

However, given the requirement of significantly reducing index size, it is not affordable to keep information for all posting lists and all documents. We believe that the pruned index should contain neither all terms nor all documents, but only the most important postings, given the desired pruning level.

We suspect that non-informative terms are common in any large text collection. Non-informative terms are terms that do not help to discriminate between documents. One example is a term that appears in every document, such as the term "abstract" in a collection of scientific papers. Such terms are expected to have similar weighting scores for every document. Therefore, eliminating them will not hurt ranking quality. Unfortunately, term-centric pruning tends not to prune any entries from the posting lists of those terms. The reason is that, as the entry scores are almost similar, all scores are likely to be greater than the threshold value computed by the term-based method proposed in [5].

We also suspect that there are many "rarely asked for" documents in any large text collection. A document is called "rarely asked for" if it does not appear in the top-ranked results of any real-world query. In practice, users normally look at only the top 20 results, so any document that does not appear in the top-20 results of a large number of queries can be removed. Puppin et al. [17] observed that, for a collection of 5,939,061 documents and a set of 190,000 unique queries, around 52% of the documents were not returned among the first 100 top-ranked results of any query.

We propose a posting-based index pruning method. We choose neither posting lists nor documents as our working elements. Instead, we choose postings, i.e., tuples of the form <t, d>, where t is a term and d a document. Postings are contained in both index elements (as posting entries in posting lists) and data elements (as terms in documents). Also, by choosing posting entries as working elements, we open up the possibility of removing any document and any term's posting list from the pruned index. With this flexibility, our method is able to remove non-informative terms as well as "rarely asked for" documents.

4.1 Term Weighting
When building a pruned index, terms should not be treated equally. Non-informative terms appear in a large number of documents, resulting in long posting lists. However, non-informative terms do not help much in ranking documents. Thus, the pruned index should significantly prune the posting lists of non-informative terms and reserve space for other, informative terms. Our posting-based pruning method assigns a weight to each term as its informativeness value. Blanco and Barreiro [8] have studied a number of term-weighting schemes for the purpose of posting list pruning. Their finding is that residual inverse document frequency (RIDF) is a good quantity for measuring the informativeness of terms, among other schemes such as the classic inverse document frequency and the term discriminative value. We adopt RIDF to calculate term informativeness values. As specified in [8], the RIDF value of a term is

    \mathrm{RIDF} = -\log\frac{df}{N} + \log\left(1 - e^{-tf/N}\right)    (1)

where df is the term's document frequency, tf is its total frequency in the collection, and N is the number of documents in the collection. As pointed out by Blanco, RIDF values can be computed efficiently.
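A short sketch of the RIDF computation in equation (1); ctf denotes the term's total frequency in the collection (the tf in (1)) and df its document frequency.

```python
import math

def ridf(df, ctf, n_docs):
    """Residual IDF: observed IDF minus the IDF a Poisson model predicts."""
    expected_df_rate = 1.0 - math.exp(-ctf / n_docs)  # P(tf > 0) under Poisson
    return -math.log(df / n_docs) + math.log(expected_df_rate)

# A term concentrated in few documents (high ctf, low df) gets a high RIDF:
print(ridf(df=10, ctf=200, n_docs=100_000))   # ~3.0: bursty, informative
print(ridf(df=200, ctf=200, n_docs=100_000))  # ~0.0: evenly spread
```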
4.2 Document Weighting
Documents in a large text collection are not equally important, and therefore, in the pruned index, more terms should be kept for important documents. As with term weighting, we assign a weight to each document in the collection to reflect how important it is. For a Web collection, the PageRank [18] or HITS [19] algorithm can be used to compute document importance values. However, PageRank and HITS are not applicable to non-Web documents, as there is no link structure among the documents. We adopt the approach of Buttcher and Clarke [6] for this purpose. For each document, we assign its Kullback-Leibler distance to the collection as its importance value. Thus, "outstanding" documents (i.e., documents which are very different from the collection) will be assigned high importance values, while documents which are similar to others will be assigned low importance values. Our importance value for a document d is defined as

    \mathrm{KLD}(d \| C) = \sum_{t \in d} \frac{tf(t)}{|d|} \log\left(\frac{tf(t)}{|d|} \cdot \frac{|C|}{TF(t)}\right)    (2)

where tf(t) is the term frequency of term t in d, TF(t) is the collection term frequency of term t, |d| is the length of document d (i.e., the number of terms in d), and |C| is the length of the collection (i.e., the sum of the document lengths).
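A compact sketch of the document importance value in equation (2); the argument names mirror the notation above, and the implementation is ours, not Terrier's.

```python
import math

def kld_document_weight(doc_tfs, coll_tfs, coll_len):
    """Equation (2): KLD(d || C). doc_tfs: {term: tf(t) in d};
    coll_tfs: {term: TF(t) in the collection}; coll_len: |C|."""
    doc_len = sum(doc_tfs.values())  # |d|
    return sum((tf / doc_len) *
               math.log((tf / doc_len) * (coll_len / coll_tfs[t]))
               for t, tf in doc_tfs.items())
```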
4.3 Posting Ranking Function
Our static pruning method evaluates postings and assigns each a usefulness value. We then build the pruned index based on these values. Given a desired level of pruning, posting entries are selected based on their usefulness values and added to the pruned index until the pruned index reaches its size limit.

According to term-centric pruning, we should assign a high usefulness value to a posting entry that appears at the top of its posting list. According to document-centric pruning, we should not assign a high usefulness value to a posting whose term does not belong to the set of top-ranked terms of its document. In addition, as discussed above, we should assign low values to non-informative terms and "rarely asked for" documents, and vice versa. Also, we obviously want to assign high usefulness values to posting entries with high scores.

To rank postings, for each posting <t, d>, we compute the following quantities:

• S(t, d) = the score term t contributes to the rank score of document d.
• RIDF(t) = the informativeness value of term t.
• KLD(d || C) = the importance value of document d.
• Rank(t) = the rank of term t relative to the other terms in document d.
• Rank(d) = the rank of document d relative to the other documents in the posting list of term t.

Among the quantities above, S(t, d) is computed using a weighting scheme, such as the classic TFIDF weighting scheme or the state-of-the-art BM25 weighting scheme; RIDF(t) is calculated as specified in (1); KLD(d || C) is calculated according to (2); Rank(d) is the position of document d in the posting list of term t, where posting entries are sorted in descending order of their scores S(t, d_i); and Rank(t) is the position of term t in document d, where terms are sorted in descending order of their "feedback" score [6], defined below:

    \mathrm{Score}_{DCP}(t) = \frac{tf}{|d|} \log\left(\frac{tf}{|d|} \cdot \frac{|C|}{TF}\right)    (3)

In this work, we use the BM25 weighting scheme [20] (given below) to calculate S(t, d), due to its wide use in other research work:

    S_{BM25}(t, d) = \log\left(\frac{N - df + 0.5}{df + 0.5}\right) \times \frac{tf \cdot (k_1 + 1)}{tf + k_1\left((1 - b) + b\,\frac{|d|}{avgdl}\right)}    (4)

where avgdl is the average document length, k_1 is set to its "standard" value of 1.2, and b is set to its "standard" value of 0.75.

In combination, our posting entry ranking function takes as parameters all the above quantities and returns a usefulness value. Note that we apply normalization and transformation to the parameter values. First, the RIDF(t) values are normalized so that they sum to one; similar normalization is applied to the KLD(d || C) values. The normalization step is necessary because the ranges of RIDF(t) and KLD(d || C) are different. Second, we use a sigmoid function to transform the term rank values and document rank values. This transformation is also necessary: using the raw term rank values would make it appear that the 10th term is ten times less important than the top ranked term, which does not seem right. Therefore, we use the non-linear function specified below in (5) as a transform function. The parameter x_0 is used to shift the "transition" point, where the sigmoid function switches from its high-value state to its low-value state, and the parameter a is used to control the slope of the transition region of the function:

    \mathrm{sigmoid}(x) = 1 - \frac{1}{1 + e^{(x_0 - x)/a}}    (5)

The sigmoid function above returns a value between zero and one. A rank value close to zero (i.e., a top ranked element) will be transformed to a value close to one, while other rank values will be transformed depending on the two parameters x_0 and a. In Figure 1, we show the shape of the sigmoid function for several combinations of parameters.

[Figure 1. Sigmoid functions: weight versus rank for the parameter combinations (a=1, x0=50), (a=15, x0=50), (a=5, x0=60), and (a=50, x0=60).]

Our posting entry ranking function is given below:

    f(t, d) = S_{BM25}(t, d) \cdot \{\alpha \cdot \mathrm{RIDF}(t) \cdot \mathrm{sigmoid}(\mathrm{Rank}(d)) + (1 - \alpha) \cdot \mathrm{KLD}(d \| C) \cdot \mathrm{sigmoid}(\mathrm{Rank}(t))\}    (6)

where α is a parameter taking values between zero and one.
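The following sketch puts equations (5) and (6) together with the budget-driven selection described at the start of this subsection: score every posting, then keep the highest-usefulness entries until the pruned index reaches its size limit. The dictionary-based posting representation and the default parameter values are our illustrative choices; RIDF and KLD values are assumed to be pre-normalized as described above.

```python
import math

def sigmoid(x, x0, a):
    """Equation (5): ~1 for ranks well above the transition point x0, ~0 below it."""
    return 1.0 - 1.0 / (1.0 + math.exp((x0 - x) / a))

def usefulness(s_bm25, ridf_t, kld_d, rank_d, rank_t, alpha, x0, a):
    """Equation (6): combine term-side and document-side evidence."""
    return s_bm25 * (alpha * ridf_t * sigmoid(rank_d, x0, a) +
                     (1 - alpha) * kld_d * sigmoid(rank_t, x0, a))

def prune_posting_based(postings, budget, alpha=0.5, x0=50, a=15):
    """postings: iterable of dicts holding the quantities of Section 4.3;
    keep the `budget` highest-usefulness posting entries."""
    scored = sorted(postings,
                    key=lambda p: usefulness(p["s_bm25"], p["ridf"], p["kld"],
                                             p["rank_d"], p["rank_t"],
                                             alpha, x0, a),
                    reverse=True)
    return scored[:budget]
```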
4.4 Posting-Based Pruning versus Document-Centric Pruning and Term-Centric Pruning
We show that our posting entry ranking function generalizes both the document-centric index pruning method and the term-centric pruning method.

If we set the value of α to zero, replace the document weighting function KLD(d || C) with the unity function u(d) = 1, and set the parameter a of the sigmoid function to a value close to zero, so that the sigmoid in (5) effectively becomes a threshold function at x_0, then for each document d, our ranking function f() assigns a non-zero value to the posting entry <t, d> if t is ranked in the top x_0 among all unique terms in d; otherwise f() returns a zero value. In this case, our posting-based pruning technique is equivalent to the "constant" document-centric pruning technique proposed by Buttcher and Clarke in [6], which selects a fixed number of top ranked terms from each document. Obviously, the "relative" document-centric pruning technique proposed in [6] can also be obtained from our posting entry ranking function, by adjusting the x_0 parameter for each document according to its length.

Similarly, if we set the value of α to one, replace the term weighting function RIDF(t) with the unity function u(t) = 1, make the sigmoid function a threshold function as above, and set the parameter x_0 to the rank of the i-th posting entry in the posting list of term t such that S(t, d_i) > τ(t) and S(t, d_{i+1}) ≤ τ(t), where τ(t) is the cut-off threshold proposed by Carmel et al. in [5], then our posting entry ranking function assigns a non-zero value to every posting entry selected by the term-centric pruning technique, and a zero value to every non-selected posting entry.

Our posting-based pruning technique differs from the term-centric and document-centric pruning techniques in several aspects. By using term and document weighting functions other than the unity function, we allow the pruned index to include more information for informative terms and outstanding documents. By using a sigmoid function instead of a threshold function to transform the term and document ranks, we open up the possibility that a posting entry which is ranked low in a posting list can still be selected if it is ranked high in its document, and vice versa.

5. EXPERIMENTAL SETTINGS
We use the WT10G collection as our data set. This collection contains 1,692,096 Web documents crawled from the Web. We use the Terrier platform [21] to index documents and rank queries, and we develop our posting-based index pruning technique on top of Terrier.

We use the BM25 weighting scheme both for calculating term-document scores and for query ranking. The Porter stemming algorithm [23] and a standard list of stop-words are used for preprocessing documents. After stemming and stop-word removal, the vocabulary contains 3,161,488 unique terms. The un-pruned index contains 280,632,807 posting entries.

We use TREC [22] topics 451-500 as our query set. Precision is measured for each pruned index using the set of relevance judgments provided by TREC for topics 451-500. From TREC topics 451-500, we build two sets of queries: one set of long queries, wherein each query includes the title and description fields of the TREC topic, and one set of short queries, wherein each query includes only the title field of the TREC topic.

We also implement term-centric index pruning and document-centric index pruning exactly as specified in their original works [5][6]. The only difference is that in [5], Carmel et al. used the SMART term weighting scheme for both index pruning and query ranking. We instead use the BM25 term weighting scheme for query ranking, but still use the SMART term weighting scheme for index pruning.

We conduct experiments to compare the effectiveness of our proposed posting-based pruning technique with the term-based, "score shifting" pruning technique proposed in [5] and the "relative" document-centric pruning technique proposed in [6]. In the next section, we report precision at 10, precision at 20, and average precision at each pruning level for each technique.

6. EXPERIMENTAL RESULTS
In Figure 2, we show the effectiveness of the pruned indices for the set of short TREC queries; in Figure 3, we show the effectiveness of the pruned indices for the set of long TREC queries. A posting-centric pruned index is marked as "pXX_YY_ZZ", where XX is the value of parameter α (as a percentage), YY is the value of parameter a, and ZZ is the value of parameter x_0. Figure 2 and Figure 3 show experimental results for a posting-centric pruned index with α = 50%, a = 15, and x_0 = 50. At pruning levels below 70%, posting-centric pruning performs similarly to document-centric pruning and term-centric pruning. However, at pruning levels of 70% or more, posting-centric pruning is outperformed by document-centric pruning.

[Figure 2. Querying effectiveness for short TREC queries.]
[Figure 3. Querying effectiveness for long TREC queries.]

We tune the parameters of our posting-centric index pruning technique. Table 1, Table 2, and Table 3 show experimental results for short queries of the document-centric pruning technique and of posting-centric pruning techniques with various combinations of parameter values. Experiments with long queries show similar trends and are therefore omitted.
Table 1. Precision at 10 for short TREC queries of the document-centric pruning technique and posting-centric pruning techniques with various parameter combinations.

% pruned | Doc-centric | p50_15_50 | p20_15_50 | p50_35_200 | p50_200_50 | p80_200_1000 | p50_300_1000 | p80_300_1000 | p80_400_1000
0  | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 0.2500
10 | 0.2458 | 0.2542 | 0.2542 | 0.2542 | 0.2542 | 0.2542 | 0.2542 | 0.2563 | 0.2521
20 | 0.2396 | 0.2521 | 0.2521 | 0.2542 | 0.2375 | 0.2333 | 0.2333 | 0.2354 | 0.2354
30 | 0.2396 | 0.2458 | 0.2458 | 0.2458 | 0.2292 | 0.2271 | 0.2292 | 0.2313 | 0.2292
40 | 0.2458 | 0.2479 | 0.2479 | 0.2271 | 0.2250 | 0.2271 | 0.2271 | 0.2250 | 0.2333
50 | 0.2500 | 0.2563 | 0.2542 | 0.2375 | 0.2292 | 0.2250 | 0.2312 | 0.2312 | 0.2208
60 | 0.2458 | 0.2521 | 0.2500 | 0.2250 | 0.2208 | 0.2104 | 0.2208 | 0.2167 | 0.2146
70 | 0.2396 | 0.2354 | 0.2354 | 0.2271 | 0.2396 | 0.2229 | 0.2354 | 0.2187 | 0.2187
80 | 0.2500 | 0.1979 | 0.2167 | 0.2021 | 0.2104 | 0.2063 | 0.2333 | 0.1979 | 0.2021
90 | 0.2458 | 0.1500 | 0.1625 | 0.1625 | 0.1854 | 0.1958 | 0.1875 | 0.2021 | 0.2000

Table 2. Precision at 20 for short TREC queries of the document-centric pruning technique and posting-centric pruning techniques with various parameter combinations.

% pruned | Doc-centric | p50_15_50 | p20_15_50 | p50_35_200 | p50_200_50 | p80_200_1000 | p50_300_1000 | p80_300_1000 | p80_400_1000
0  | 0.2073 | 0.2073 | 0.2073 | 0.2073 | 0.2073 | 0.2073 | 0.2073 | 0.2073 | 0.2073
10 | 0.2052 | 0.2042 | 0.2042 | 0.2052 | 0.2052 | 0.2052 | 0.2052 | 0.2052 | 0.2052
20 | 0.2010 | 0.2031 | 0.2031 | 0.2021 | 0.1990 | 0.1938 | 0.1948 | 0.1948 | 0.1958
30 | 0.1979 | 0.2052 | 0.2063 | 0.1990 | 0.1854 | 0.1844 | 0.1854 | 0.1854 | 0.1844
40 | 0.1979 | 0.2042 | 0.2042 | 0.1917 | 0.1750 | 0.1740 | 0.1740 | 0.1750 | 0.1781
50 | 0.1969 | 0.2083 | 0.2073 | 0.1875 | 0.1771 | 0.1740 | 0.1781 | 0.1792 | 0.1719
60 | 0.1958 | 0.2031 | 0.2010 | 0.1802 | 0.1813 | 0.1813 | 0.1844 | 0.1813 | 0.1833
70 | 0.2010 | 0.2031 | 0.1990 | 0.1719 | 0.1802 | 0.1740 | 0.1802 | 0.1750 | 0.1750
80 | 0.2094 | 0.1583 | 0.1729 | 0.1573 | 0.1552 | 0.1635 | 0.1729 | 0.1635 | 0.1646
90 | 0.1927 | 0.1177 | 0.1229 | 0.1208 | 0.1354 | 0.1469 | 0.1448 | 0.1531 | 0.1500

Table 3. MAP for short TREC queries of the document-centric pruning technique and posting-centric pruning techniques with various parameter combinations.

% pruned | Doc-centric | p50_15_50 | p20_15_50 | p50_35_200 | p50_200_50 | p80_200_1000 | p50_300_1000 | p80_300_1000 | p80_400_1000
0  | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.1892 | 0.1892
10 | 0.1864 | 0.1868 | 0.1871 | 0.1869 | 0.1879 | 0.1875 | 0.1878 | 0.1873 | 0.1880
20 | 0.1838 | 0.1891 | 0.1892 | 0.1889 | 0.1835 | 0.1818 | 0.1824 | 0.1835 | 0.1826
30 | 0.1846 | 0.1914 | 0.1913 | 0.1901 | 0.1816 | 0.1806 | 0.1807 | 0.1821 | 0.1825
40 | 0.1847 | 0.1855 | 0.1854 | 0.1880 | 0.1805 | 0.1822 | 0.1820 | 0.1827 | 0.1845
50 | 0.1837 | 0.1958 | 0.1958 | 0.1815 | 0.1834 | 0.1809 | 0.1839 | 0.1840 | 0.1809
60 | 0.1835 | 0.1911 | 0.1915 | 0.1804 | 0.1755 | 0.1743 | 0.1785 | 0.1747 | 0.1747
70 | 0.1693 | 0.1720 | 0.1739 | 0.1698 | 0.1702 | 0.1619 | 0.1657 | 0.1692 | 0.1627
80 | 0.1679 | 0.1557 | 0.1571 | 0.1476 | 0.1373 | 0.1494 | 0.1575 | 0.1486 | 0.1440
90 | 0.1533 | 0.0919 | 0.1136 | 0.1238 | 0.1338 | 0.1404 | 0.1314 | 0.1378 | 0.1421

In Table 1, Table 2, and Table 3, the values in bold mark the highest performance at each pruning level. The first observation is that posting-centric pruning is better at low and moderate pruning levels, while document-centric pruning is better at higher pruning levels.
The second observation is that, even though posting-centric pruning is better than document-centric pruning at low and moderate pruning levels, the differences are small. In contrast, at higher pruning levels, the differences between the performance of posting-centric pruning and document-centric pruning are larger. In addition, none of the posting-centric pruning configurations outperforms the document-centric pruning technique at high pruning levels.

In all the previous experiments with the posting-centric pruning technique, the parameters of the posting entry ranking function (6) are fixed for all posting entries. We consider the possibility of setting the parameters adaptively for each posting entry, by adapting the slope of the sigmoid function as follows. For a posting entry <t, d>, we use two sigmoid functions: one for the rank of term t in document d, and the other for the rank of document d in the posting list of term t. We call the former the document-oriented sigmoid function and the latter the term-oriented sigmoid function. For the term-oriented sigmoid function, the x_0 parameter is set according to the pruning level (i.e., if the pruning level is 90%, x_0 is equal to 10% of the length of the term's posting list). Similarly, for the document-oriented sigmoid function, if the pruning level is 90%, its x_0 parameter is set to 10% of the number of unique terms in the document. For both sigmoid functions, the parameter a, which controls the slope of the function, is set so that the function returns a value close to 1 for input values less than 0.9 x_0 and a value close to 0 for input values greater than 1.1 x_0. We also explore the performance of several alternatives, which may or may not use term weighting and/or document weighting (refer to the ranking function (6) in Section 4.3).

In Table 4, we report the performance of the different alternatives at approximately the 90% pruning level. Document-centric pruning performance is included for reference. For each column, the best value is in bold.

Table 4. Performance at the 90% pruning level of different posting-centric pruning techniques.

Pruning method | Short TREC queries (P@10 / P@20 / MAP) | Long TREC queries (P@10 / P@20 / MAP)
Document-centric | 0.2458 / 0.1927 / 0.1533 | 0.3120 / 0.2470 / 0.1731
AP1: no term weighting, no document weighting, α = 0.5, two sigmoid functions with parameters varying per posting entry | 0.2375 / 0.1688 / 0.1456 | 0.3020 / 0.2100 / 0.1561
AP2: as AP1, except that term-document scores are not used | 0.1277 / 0.1106 / 0.0831 | 0.1660 / 0.1250 / 0.0903
AP3: as AP1, except that term weighting is used | 0.2604 / 0.1969 / 0.1592 | 0.3300 / 0.2500 / 0.1714
AP4: as AP1, except that both term weighting and document weighting are used | 0.2271 / 0.1573 / 0.1354 | 0.2740 / 0.1900 / 0.1479

From the results reported in Table 4, we can see that:
(i) Term-document scores should be used in the posting entry ranking function (6), as revealed by the significant difference between the performance of AP1 and AP2.
(ii) Term weighting is useful, as confirmed by the difference between the performance of AP1 and AP3.
(iii) Document weighting seems to be harmful: it hurts the performance of AP4 compared to the performance of AP3 and AP1.

Among all the alternatives, AP3 is the best; it is only slightly outperformed by document-centric pruning for long queries according to MAP. We therefore consider it the best among our alternatives. Below, we report its performance in comparison with document-centric pruning and term-centric pruning at various levels of pruning, in Figure 4 and Figure 5.

[Figure 4. Querying effectiveness for short TREC queries.]
[Figure 5. Querying effectiveness for long TREC queries.]

By adapting the sigmoid functions to posting entries, the performance of our posting-centric pruning technique is much better, as good as the performance of the document-centric pruning technique (posting-centric pruning is better than document-centric pruning at some pruning levels, while the inverse is true at others).
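As an illustration of this adaptive scheme, the slope can be solved for in closed form from equation (5): requiring sigmoid(0.9 x_0) = 1 - ε and, by symmetry, sigmoid(1.1 x_0) = ε yields a = 0.1 x_0 / ln((1 - ε)/ε). The tolerance ε is our assumption; the text above only requires values "close to" 0 and 1.

```python
import math

def adaptive_slope(x0, eps=0.01):
    """Solve sigmoid(0.9*x0; x0, a) = 1 - eps for a; eps is our assumption."""
    return 0.1 * x0 / math.log((1 - eps) / eps)

def sigmoid(x, x0, a):
    return 1.0 - 1.0 / (1.0 + math.exp((x0 - x) / a))

x0 = 100                    # e.g. 10% of a 1000-entry posting list
a = adaptive_slope(x0)      # ~2.2 for x0 = 100
print(sigmoid(90, x0, a), sigmoid(110, x0, a))  # ~0.99 and ~0.01
```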
7. CONCLUSIONS AND FUTURE WORK
We evaluate document-centric and term-centric static index pruning on the WT10G corpus and TREC query sets. Based on our experimental results, term-centric index pruning is better than document-centric index pruning at low and moderate pruning levels (i.e., less than 70% pruning, according to our results), while document-centric index pruning is better at higher pruning levels.

We propose the posting-centric index pruning technique, which ranks each posting entry (i.e., a term-document pair) based on a set of features such as the rank of the term in the document and the rank of the document in the inverted list of the term. We show that posting-centric index pruning generalizes both document-centric and term-centric pruning, and therefore, the solution space of posting-centric pruning covers the solution spaces of both document-centric and term-centric index pruning. This implies that, by exploring this larger solution space, better solutions can be found.

We explore the solution space of posting-centric pruning by studying a family of posting entry ranking functions. We discover that term weighting based on RIDF is useful, while document weighting based on KL-divergence is harmful. We also notice that the parameters of the sigmoid function, which we use to transform the rank of a term/document to its score, should be adapted to each posting entry. Fixing these parameters makes posting-centric pruning less effective than document-centric pruning.

Other term weighting and document weighting methods are possible. We are evaluating a method of weighting terms and documents based on user queries and the PageRank algorithm applied to the graph of terms and documents. Our goal is to discover important terms and documents by analyzing the relationships among terms and documents in the context of user queries. Once the important terms and documents are discovered, their information is kept in the pruned index, while information about other, less important terms and documents can be partially removed or totally discarded.
8. REFERENCES
[1] D. Grossman and O. Frieder, "Information Retrieval: Algorithms and Heuristics," Springer, 2nd ed., 2004.
[2] I. H. Witten, A. Moffat, and T. C. Bell, "Managing Gigabytes," Morgan Kaufmann, 2nd ed., 1999.
[3] J. Zobel and A. Moffat, "Inverted Files for Text Search Engines," ACM Computing Surveys, 38(2), 2006.
[4] A. Ntoulas and J. Cho, "Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee," in Proc. ACM SIGIR, 2007.
[5] D. Carmel et al., "Static Index Pruning for Information Retrieval Systems," in Proc. ACM SIGIR, 2001.
[6] S. Buttcher and C. L. A. Clarke, "A Document-Centric Approach to Static Index Pruning in Text Retrieval Systems," in Proc. ACM CIKM, 2006.
[7] J. Lu and J. Callan, "Pruning Long Documents for Distributed Information Retrieval," in Proc. ACM CIKM, 2002.
[8] R. Blanco and A. Barreiro, "Static Pruning of Terms in Inverted Files," in Proc. ECIR, 2007.
[9] R. Blanco and A. Barreiro, "Boosting Static Pruning of Inverted Files," in Proc. ACM SIGIR, 2007.
[10] M. Shokouhi et al., "Using Query Logs to Establish Vocabularies in Distributed Information Retrieval," Information Processing and Management, 2007.
[11] E. S. de Moura et al., "Improving Web Search Efficiency via a Locality Based Static Pruning Method," in Proc. WWW, 2005.
[12] I. Podna, M. Rajman, T. Luu, F. Klemn, and K. Aberer, "Scalable Peer-to-peer Web Retrieval with Highly Discriminative Keys," in Proc. IEEE ICDE, 2007.
[13] G. Skobeltsyn, T. Luu, I. P. Zarko, M. Rajman, and K. Aberer, "Web Text Retrieval with a P2P Query Driven Index," in Proc. ACM SIGIR, 2007.
[14] G. Skobeltsyn, F. Junqueira, V. Plachouras, and R. Baeza-Yates, "ResIn: a Combination of Results Caching and Index Pruning for High-performance Web Search Engines," in Proc. ACM SIGIR, 2008.
[15] C. Tang and S. Dwarkadas, "Hybrid Global-local Indexing for Efficient Peer-to-peer Information Retrieval," in Proc. NSDI, 2004.
[16] S. Kullback, "The Kullback-Leibler Distance," The American Statistician, 41:340-341, 1987.
[17] D. Puppin, F. Silvestri, and D. Laforenza, "Query-Driven Document Partitioning and Collection Selection," in Proc. INFOSCALE, 2006.
[18] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Technical Report, Stanford University.
[19] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Journal of the ACM, 46(5), 1999.
[20] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, "Okapi at TREC-7," in Proc. of the Seventh Text REtrieval Conference, 1998.
[21] Terrier (TERabyte RetrIEveR) platform, http://ir.dcs.gla.ac.uk/terrier/
[22] TREC (WT10G collection, TREC-9 topics).
[23] M. F. Porter, "An Algorithm for Suffix Stripping," Program, 14(3), 1980.