=Paper=
{{Paper
|id=None
|storemode=property
|title=Experimental Comparison of Set Intersection Algorithms for Inverted Indexing
|pdfUrl=https://ceur-ws.org/Vol-1003/58.pdf
|volume=Vol-1003
|dblpUrl=https://dblp.org/rec/conf/itat/Boza13
}}
==Experimental Comparison of Set Intersection Algorithms for Inverted Indexing==
ITAT 2013 Proceedings, CEUR Workshop Proceedings Vol. 1003, pp. 58–64, http://ceur-ws.org/Vol-1003, Series ISSN 1613-0073, © 2013 V. Boža

Vladimír Boža
Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Mlynská dolina, 842 48 Bratislava, Slovakia

Abstract: The set intersection problem is one of the main problems in document retrieval. A query consists of two keywords, and for each keyword we have a sorted set of IDs of the documents containing it. The goal is to retrieve the set of IDs of documents containing both keywords. We perform an experimental comparison of galloping search and a new algorithm by Cohen and Porat (LATIN 2010), which has a better theoretical time complexity. We show that on real data the new algorithm often performs worse than the simpler one. We also propose a variant of the Cohen and Porat algorithm with a similar complexity but better empirical performance. Finally, we investigate the influence of document ordering on query time.

1 Introduction

In the set intersection problem, we are given a collection of sets A1, A2, ..., Ad. Our goal is to preprocess them and then answer queries of the following type: for given i, j, find the intersection of sets Ai and Aj. This problem appears in various areas. For example, set intersection is needed for conjunctive queries in relational databases, and Ng, Amir and Pevzner [7] use set intersection to match mass spectra against a protein database. However, perhaps the most important application is in document retrieval, where the goal is to maintain data structures over a set of documents. The most typical data structure is the inverted index, which stores for each word the set of documents containing that word. Such an index allows us to easily retrieve documents containing a particular query word. When we want to retrieve documents which contain two or more given words, we can intersect the corresponding document sets from the inverted index.

Classical algorithms for set intersection are merging and binary search. Merging identifies common elements by iterating through both sorted lists, as in the final phase of the merge sort algorithm. If we denote the length of the smaller set as m and the length of the larger set as n, then the time complexity of merging is O(m + n), which is good when the lengths of the two sets are almost the same. If m is much smaller than n, it is better to search for each element of the smaller set in the larger set by binary search, in O(m lg n) total time.

There is a better algorithm, originally introduced by Bentley and Yao [8], called galloping search, with time complexity O(m lg(n/m)). This algorithm has good time complexity both when the lengths of the sets are similar and when the shorter set is much shorter than the longer one. We describe this algorithm in the next section.

All of the previous algorithms receive the two sorted sets on input without any additional preprocessing. However, in inverted indexing the sets for all keywords are known in advance, and perhaps some preprocessing of these sets could speed up query processing. In particular, the length of the output can be much smaller than the length of the shorter set, and it would be desirable to have an algorithm whose running time does not depend linearly on m. The first step in this direction is a recent algorithm by Cohen and Porat [1], which uses linear memory to store all sets and processes each intersection query in time O(√(N·o) + o), where o is the length of the output and N is the sum of the sizes of all sets. We will refer to this algorithm as the fast set intersection algorithm.

However, it is not clear whether this theoretical improvement is really useful in practice. In this article, we compare the query times of the fast set intersection algorithm and galloping search on a dataset consisting of a sample of English Wikipedia articles with a set of two-word queries from the TREC Terabyte 2006 query stream. We also present a variant of the fast set intersection algorithm with a similar time and memory complexity but better empirical performance.

Previous experimental comparisons of set intersection algorithms [5, 6] consider only different variants of galloping search or other algorithms that do not use set preprocessing. Yan et al. [2] observe that better query times can be achieved via better document ordering. In our experiments, we also compare query times under random document ordering and under a document ordering based on a simple clustering scheme, for all tested algorithms.
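For reference, the two classical baselines described above can be sketched in a few lines of C++. This is only an illustration of the textbook approach (the function names are ours); it is not the code used in the experiments reported below.

 #include <algorithm>
 #include <iterator>
 #include <vector>
 
 // Merge-based intersection: O(m + n), suitable when the two lists have similar lengths.
 std::vector<int> intersect_merge(const std::vector<int>& a, const std::vector<int>& b) {
     std::vector<int> out;
     std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
     return out;
 }
 
 // Binary-search-based intersection: O(m lg n), suitable when |b| is much smaller than |a|.
 std::vector<int> intersect_binary(const std::vector<int>& a, const std::vector<int>& b) {
     std::vector<int> out;
     for (int x : b)
         if (std::binary_search(a.begin(), a.end(), x))
             out.push_back(x);
     return out;
 }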
2 Algorithms

In this section we describe the three algorithms that we will compare. We denote the two input sets as A and B, where |A| = n, |B| = m and m ≤ n. We denote the total number of sets as s and the total size of all sets as N.

2.1 Galloping search

Galloping search ([8]) is a simple modification of binary search. We try to find each element of B in A; formally, for each B[i] we find an index k_i such that A[k_i] ≤ B[i] and A[k_i + 1] > B[i]. We use two improvements. First, if j > i, then k_j ≥ k_i; this means that when searching for k_{i+1}, we only need to search in the range k_i, k_i + 1, ..., n. Secondly, before doing the binary search we find the smallest p ∈ {1, 2, 4, 8, ...} such that A[k_i + p] ≥ B[i + 1]. This limits the range of the binary search when the difference between k_i and k_{i+1} is small. With these modifications the algorithm achieves time complexity O(m lg(n/m)). Pseudocode of the algorithm is given below:

 low := 1
 for i := 1 to m:
     diff := 1
     while low + diff <= n and A[low + diff] < B[i]:
         diff *= 2
     high := min(n, low + diff)
     k := binary_search(A, low, high)
     if A[k] == B[i]:
         output B[i]
     low := k
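A direct C++ rendering of this pseudocode might look as follows. This is a minimal sketch with 0-based indices; the function name, the argument order (elements of b are searched for in a) and the empty-input guard are ours.

 #include <algorithm>
 #include <cstddef>
 #include <vector>
 
 // Galloping (exponential + binary) search intersection of two sorted lists.
 // For each element x of b we find the largest position `low` with a[low] <= x,
 // starting the exponential probe from the position found for the previous element.
 std::vector<int> intersect_galloping(const std::vector<int>& a, const std::vector<int>& b) {
     std::vector<int> out;
     if (a.empty() || b.empty()) return out;
     std::size_t low = 0;
     for (int x : b) {
         // Exponential probe: find a range [low, high] that must contain the position of x.
         std::size_t diff = 1;
         while (low + diff < a.size() && a[low + diff] < x)
             diff *= 2;
         std::size_t high = std::min(a.size() - 1, low + diff);
         // Binary search for the largest position in [low, high] whose value is <= x.
         while (low < high) {
             std::size_t mid = (low + high + 1) / 2;
             if (a[mid] <= x) low = mid; else high = mid - 1;
         }
         if (a[low] == x) out.push_back(x);   // x occurs in both lists
     }
     return out;
 }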
2.2 The fast set intersection algorithm

We now briefly describe the data structure used by the fast set intersection algorithm [1]. We build a tree where each node handles some subsets of the original sets. The cost of a node is the sum of the sizes of all the subsets it handles. The root handles all original sets, so its cost is N.

Definition 1. Let d be a node with cost x. A large set in d is one of the √x biggest sets in node d.

Note that the original article defined a large set in a slightly different way: their large set is a set with at least √x elements. Clearly, every set with at least √x elements is also a large set by our definition.

A set intersection matrix is a matrix that stores, for each pair of sets, a bit indicating whether their intersection is non-empty. For x sets this matrix needs O(x²) bits of space, and therefore a node with cost c needs O(c) bits of memory. For each node of the tree we construct a set intersection matrix over all the large sets in that node.

Now we describe how to build the tree. We use a top-down approach: we start with the root node, and in each node we divide the sets and propagate them to its children. Only large sets are propagated down to the node's children; we call these sets the propagated group. Let d be a node with cost x and G its propagated group. G costs at most x. Let E be the set of all elements occurring in the sets of G. We try to split E into two disjoint sets E1 and E2; for a given set S ∈ G, the left child handles S ∩ E1 and the right child handles S ∩ E2. We want each child to cost at most x/2. This is sometimes impossible to achieve, and we fix it by keeping one element of E in d: we add elements to E1 until adding another element would make the left child cost more than x/2, the next element is kept in d, and all remaining elements go to E2, so the right child also costs at most x/2.

This tree has at most O(lg N) levels. At each level we need O(N) bits for the intersection matrices, so we need O(N lg N) bits in total, which is O(N) in terms of words.

During query answering we traverse the tree starting from the root. In each node we check whether both sets are large. If not, we answer the query using galloping search. If both sets are large, we look into the intersection matrix; if the sets do not intersect in this node, we stop the traversal in this node. Otherwise, we propagate the query to the children of the node. We also need to check whether the element kept in the node belongs to the intersection.

It can be shown that the time complexity of a set intersection query is at most O((√(N·o) + o) lg n), and that it is never worse than the time complexity of galloping search [1]. Note that the original article used hash tables instead of galloping search.
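To make the traversal more concrete, the following sketch shows one possible in-memory representation of a tree node together with the recursive query. It is our simplified reading of the structure from [1]: the type, field and function names are ours, the construction of the tree is omitted, and we assume that every visited node stores a (possibly empty) subset for both queried set IDs.

 #include <algorithm>
 #include <map>
 #include <vector>
 
 // Galloping intersection from the previous sketch (elements of the second list
 // are searched for in the first one).
 std::vector<int> intersect_galloping(const std::vector<int>& a, const std::vector<int>& b);
 
 // One node of the fast-set-intersection tree (simplified sketch).
 struct FsiNode {
     std::map<int, std::vector<int>> subsets;   // sorted subset of every set handled here, keyed by set ID
     std::map<int, int> largeIndex;             // large sets only: row/column in the matrix below
     std::vector<std::vector<bool>> intersects; // intersects[i][j]: do large subsets i and j share an element?
     int keptElement = -1;                      // element kept in this node during the split (-1 = none)
     FsiNode* left = nullptr;
     FsiNode* right = nullptr;
 };
 
 // Collect the intersection of the sets with IDs idA and idB into `out`.
 void fsiQuery(const FsiNode* node, int idA, int idB, std::vector<int>& out) {
     if (node == nullptr) return;
     const std::vector<int>& a = node->subsets.at(idA);
     const std::vector<int>& b = node->subsets.at(idB);
     auto ia = node->largeIndex.find(idA);
     auto ib = node->largeIndex.find(idB);
     if (ia == node->largeIndex.end() || ib == node->largeIndex.end()) {
         // At least one subset is small in this node: finish with galloping search.
         const std::vector<int>& longer  = (a.size() >= b.size()) ? a : b;
         const std::vector<int>& shorter = (a.size() >= b.size()) ? b : a;
         std::vector<int> rest = intersect_galloping(longer, shorter);
         out.insert(out.end(), rest.begin(), rest.end());
         return;
     }
     if (!node->intersects[ia->second][ib->second])
         return;                                // provably disjoint in this node: prune the traversal
     if (node->keptElement >= 0 &&
         std::binary_search(a.begin(), a.end(), node->keptElement) &&
         std::binary_search(b.begin(), b.end(), node->keptElement))
         out.push_back(node->keptElement);      // the element kept in this node may be common
     fsiQuery(node->left, idA, idB, out);       // otherwise descend into both children
     fsiQuery(node->right, idA, idB, out);
 }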
2.3 Our set intersection

The set intersection matrices usually contain many ones and only a few zeroes. We can use the space better by instead storing intervals on which the intersection of two sets is empty.

We take the √N biggest sets and call them large sets; the remaining sets are called small sets. Note that the size of a small set is at most √N. We then preprocess the intersections of the large sets.

Definition 2. Let A, B be two large sets, where |B| ≤ |A|. An empty interval is a sequence B[i], B[i + 1], ..., B[j] of elements of set B such that:
• for each k with i ≤ k ≤ j: B[k] ∉ A,
• i = 1 or B[i − 1] ∈ A,
• j = |B| or B[j + 1] ∈ A.
The size of this interval is j − i + 1.

During preprocessing we find and store the k largest empty intervals over all intersections of large sets (note that we may store zero, one, or more than one interval for a given pair of sets). Note that if k > N, then the smallest stored empty interval has size at most √N.

We answer a set intersection query as follows. If either of the sets is small, we use galloping search. This gives query time O(m lg(n/m)), but since m ≤ √N, the query time can be written as O(√N lg(n/m)). If both sets are large, we again use galloping search, but we additionally use the empty intervals stored for this pair of sets to skip the corresponding parts of the smaller set.

We show two things about the query time complexity of our algorithm. First, if k = N, then the query time complexity is bounded by O(o·√N), and the memory complexity in this case is O(N). Secondly, if we set k = N lg N, the average query time complexity of this approach is not worse than the time complexity of the fast set intersection algorithm. This is because the number of stored empty intervals is then the same as the number of bits in the set intersection matrices of the fast set intersection algorithm, and our empty intervals allow us to skip the search for more elements than the zeroes in the set intersection matrices do. Thus the average query time complexity of our approach is O((√(N·o) + o) lg n). The memory complexity is O(N lg N), which is a little higher.

The complexity of our algorithm does not look as promising as the complexity of the fast set intersection algorithm, but our algorithm allows a time-memory tradeoff: we can store as many empty intervals as we want. In our experiments we set this number so that the memory consumption matches that of the fast set intersection algorithm.
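The following sketch illustrates how the stored empty intervals can be used during the galloping scan of the smaller set. It is a minimal illustration of the idea: the representation of intervals as index pairs into B and all names are our own choices for the example.

 #include <algorithm>
 #include <cstddef>
 #include <utility>
 #include <vector>
 
 // Intersect two large sets a and b (both sorted) by galloping search, jumping
 // over the precomputed empty intervals of b, i.e. ranges of positions in b that
 // are known to contain no element of a. `empty` holds (first, last) position
 // pairs, sorted by `first` and pairwise disjoint.
 std::vector<int> intersect_with_empty_intervals(
         const std::vector<int>& a, const std::vector<int>& b,
         const std::vector<std::pair<std::size_t, std::size_t>>& empty) {
     std::vector<int> out;
     if (a.empty()) return out;
     std::size_t low = 0, next = 0;               // next = index of the next unused empty interval
     for (std::size_t i = 0; i < b.size(); ) {
         if (next < empty.size() && empty[next].first == i) {
             i = empty[next++].second + 1;        // the whole interval is disjoint from a: skip it
             continue;
         }
         // Ordinary galloping step for b[i], as in intersect_galloping().
         std::size_t diff = 1;
         while (low + diff < a.size() && a[low + diff] < b[i])
             diff *= 2;
         std::size_t high = std::min(a.size() - 1, low + diff);
         while (low < high) {
             std::size_t mid = (low + high + 1) / 2;
             if (a[mid] <= b[i]) low = mid; else high = mid - 1;
         }
         if (a[low] == b[i]) out.push_back(b[i]);
         ++i;
     }
     return out;
 }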
3 Document ordering

In document retrieval we can choose arbitrary IDs for individual documents, and thus influence the ordering of elements in the input sets. There are several proposed heuristics for document ordering; most of them try to order documents so as to achieve better index compression [3, 4]. However, a good document ordering also improves query time [2]. This happens because similar documents end up close together in the sets, so during galloping search we make smaller jumps. In our work we use the k-scan algorithm [3]. To describe this algorithm, we first need to define the similarity of documents.

Definition 3. The Jaccard similarity of two sets A, B is given by J(A, B) = |A ∩ B| / |A ∪ B|.

A document is viewed as the set of its terms. For calculating similarity we only consider the √N terms occurring in the largest number of documents (the large sets from the previous sections). The similarity of two documents is the Jaccard similarity of their term sets.

The k-scan algorithm tries to find an ordering of the documents by partitioning them into k clusters. Let d be the number of documents. The algorithm has k iterations. In each iteration it first picks a cluster center and then chooses, among the unassigned documents, the d/k − 1 ones most similar to the cluster center. It also picks the cluster center for the next iteration, which is the d/k-th most similar document. The cluster center for the first iteration is picked randomly. If we assume that document similarity can be computed in time s, then the time complexity of this algorithm is O(kds). In our experiments we use k = 1000.
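As an illustration, the Jaccard similarity of two documents represented as sorted vectors of frequent-term IDs can be computed with a single linear merge. This is a small self-contained sketch; the representation of documents as sorted ID vectors is an assumption made for the example.

 #include <cstddef>
 #include <vector>
 
 // Jaccard similarity of two documents, each given as a sorted vector of the IDs
 // of the frequent terms it contains (only the most frequent terms are considered,
 // as described above).
 double jaccard(const std::vector<int>& a, const std::vector<int>& b) {
     std::size_t i = 0, j = 0, common = 0;
     while (i < a.size() && j < b.size()) {
         if (a[i] == b[j]) { ++common; ++i; ++j; }
         else if (a[i] < b[j]) ++i;
         else ++j;
     }
     std::size_t unionSize = a.size() + b.size() - common;
     return unionSize == 0 ? 0.0 : static_cast<double>(common) / unionSize;
 }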
4 Experimental setup

We implemented all algorithms in C++ and ran our experiments on a computer with an Intel i7 920 CPU, 12 GB of RAM and Ubuntu Linux. We compiled our code with g++ 4.7.3 using -O3 optimizations.

4.1 Documents

We took all articles from the English Wikipedia [9] and divided each article into paragraphs. We used paragraphs instead of whole articles because otherwise we would only have a few big documents and the differences between the algorithms would be hard to measure. We then sampled 6.5 million paragraphs and treated them as documents. The total size of the index N (the sum of the sizes of all sets) was 313 million word-document pairs.

4.2 Queries

We used the query log from the TREC Terabyte 2006 query stream [10]. We only consider two-word queries from the log, which gives us approximately 14000 queries. We ran each query 100 times and measured the average time in seconds.
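A measurement loop of this kind can be approximated with a few lines of std::chrono code. This is an illustrative sketch, not the exact harness used in the experiments.

 #include <chrono>
 
 // Run one query `runs` times and return the average wall-clock time in seconds.
 // `Query` is any callable performing a single intersection query.
 template <typename Query>
 double average_query_time(Query&& query, int runs = 100) {
     auto start = std::chrono::steady_clock::now();
     for (int r = 0; r < runs; ++r)
         query();                                 // e.g. intersect_galloping(setA, setB)
     std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
     return elapsed.count() / runs;
 }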
5 Experimental results

5.1 Galloping search vs. fast set intersection vs. our algorithm

We first show a comparison of all algorithms using random document ordering. The results are shown in Figures 1 and 2. As we can see, the fast set intersection algorithm introduces a significant overhead in query processing time for large queries. Our hypothesis is that this is because the algorithm does not use caching in an optimal way. On the other hand, our set intersection algorithm yields a small improvement in query processing time; the average improvement is around 14%.

Figure 1: Comparison of query times for galloping search (x-axis) and fast set intersection (y-axis) with random document ordering.

Figure 2: Comparison of query times for galloping search (x-axis) and our set intersection (y-axis) with random document ordering.

We also explored the individual algorithms using more detailed statistics. The set intersection matrices of the fast set intersection algorithm contained 15% zeroes. There are 5400 (39%) queries where both sets are large in the root node; only in 709 of these queries did we find a zero in the set intersection matrices. We also measured how many elements of the smaller set we can skip due to zeroes in the set intersection matrices. As we can see from the histogram in Figure 3, there are some queries where the output size is zero and all elements are skipped, but overall there are only a few queries where we skipped more than half of the smaller set. Most of the time we skip only a few percent of the smaller set.

Figure 3: Histogram of the fraction of skipped elements from the smaller set for the fast intersection algorithm with random document ordering. Zero is omitted due to its large count (12780 queries).

In our algorithm, there are 825 queries where we encounter an empty interval stored for the two sets. Again we measured the fraction of skipped elements due to empty intervals with respect to the size of the smaller set and plotted a histogram of these fractions (see Figure 4). This histogram looks considerably better than the previous one.

Figure 4: Histogram of the fraction of skipped elements from the smaller set for our intersection algorithm with random document ordering. Zero is omitted due to its large count (12400 queries).

5.2 Document ordering effects

The overall effect of document ordering on query time is shown in Figures 5, 6 and 7. The average improvement of query time is 18% for galloping search, 27% for fast set intersection and 22% for our set intersection.

Figure 5: Comparison of query times for galloping search using random document order (x-axis) and document order based on the k-scan algorithm (y-axis).

Figure 6: Comparison of query times for fast set intersection using random document order (x-axis) and document order based on the k-scan algorithm (y-axis).

Figure 7: Comparison of query times for our set intersection using random document order (x-axis) and document order based on the k-scan algorithm (y-axis).

It is worth noting that the set intersection matrices contain 25% zeroes when using document ordering based on k-scans, compared to 15% with random document order. There were 2100 queries which encountered a zero in some set intersection matrix in the fast set intersection algorithm; this is approximately three times more than with random document ordering. A histogram of the fraction of skipped elements is shown in Figure 8. In this histogram we see the same problem as with random document ordering – the number of queries in the 80–90 percent range is zero. On the other hand, we gained a lot of queries where we eliminated around 10% of the work.

Figure 8: Histogram of the fraction of skipped elements from the smaller set for the fast intersection algorithm with k-scan document ordering. Zero is omitted due to its large count (12390 queries).

In our algorithm, there are 1500 queries where we encounter an empty interval. A histogram of the fraction of skipped elements is shown in Figure 9; its shape is similar to the histogram for random document ordering.

Figure 9: Histogram of the fraction of skipped elements from the smaller set for our intersection algorithm with k-scan document ordering. Zero is omitted due to its large count (12200 queries).

Finally, Figures 10 and 11 compare the running times of the algorithms when using document ordering based on k-scans. We still see a significant slowdown for fast set intersection. The speedup for our set intersection was 19%, which is similar to the speedup with random document ordering. Figure 12 compares galloping search using random document ordering with our algorithm using the better document ordering; the combination of these two factors leads to an average improvement of around 35%.

Figure 10: Comparison of query times for galloping search (x-axis) and fast set intersection (y-axis) with document ordering based on k-scans.

Figure 11: Comparison of query times for galloping search (x-axis) and our set intersection (y-axis) with document ordering based on k-scans.

Figure 12: Comparison of query times for galloping search with random document ordering (x-axis) and our set intersection with document ordering based on k-scans (y-axis).

5.3 Preprocessing time and memory consumption

We now briefly summarize the preprocessing time and memory consumption of the algorithms. Using only the inverted index and galloping search took 2 GB of memory and 4 minutes of preprocessing. The fast set intersection algorithm required 7 GB of memory and 90 minutes of preprocessing. Our algorithm required 7 GB of memory and 2.5 hours of preprocessing.

6 Conclusion

We examined three different algorithms for set intersection. The experimental results can be summarized as follows:
• The fast set intersection algorithm does not lead to better empirical performance on real data.
• We can achieve some speedup using our algorithm, but this speedup is not big.
• Our algorithm is slightly better at eliminating useless work than fast set intersection; the fast set intersection algorithm in most cases eliminates less than 10% of the work.

It is an interesting question whether a more careful implementation of the fast set intersection algorithm can lead to better query times.

We also investigated the effect of document ordering on query times. We showed that a better document ordering leads to a greater improvement than using a different algorithm.

7 Acknowledgements

This research was supported by VEGA grant 1/1085/12. The author would also like to thank Brona Brejova for many useful comments.

References

[1] Cohen, H., Porat, E.: Fast set intersection and two-patterns matching. In: LATIN 2010: Theoretical Informatics, Springer Berlin Heidelberg, 2010, pp. 234–242.
[2] Yan, H., Ding, S., Suel, T.: Inverted index compression and query processing with optimized document ordering. In: Proceedings of the 18th International Conference on World Wide Web, ACM, 2009.
[3] Silvestri, F., Orlando, S., Perego, R.: Assigning identifiers to documents to enhance the clustering property of fulltext indexes. In: Proceedings of the 27th Annual International ACM SIGIR Conference.
[4] Shieh, W.-Y., Chen, T.-F., Shann, J. J.-J., Chung, C.-P.: Inverted file compression through document identifier reassignment. Information Processing & Management 39(1), 2003, 117–131.
[5] Culpepper, J. S., Moffat, A.: Efficient set intersection for inverted indexing. ACM Transactions on Information Systems (TOIS) 29(1), 2010, Article 1.
[6] Barbay, J., López-Ortiz, A., Lu, T., Salinger, A.: An experimental investigation of set intersection algorithms for text searching. Journal of Experimental Algorithmics (JEA) 14, 2009, Article 7.
[7] Ng, J., Amir, A., Pevzner, P. A.: Blocked pattern matching problem and its applications in proteomics. In: Research in Computational Molecular Biology, Springer Berlin Heidelberg, 2011.
[8] Bentley, J. L., Yao, A. C.-C.: An almost optimal algorithm for unbounded searching. Information Processing Letters 5(3), 1976, 82–87.
[9] http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
[10] http://trec.nist.gov/data/terabyte06.html