=Paper= {{Paper |id=None |storemode=property |title=Experimental Comparison of Set Intersection Algorithms for Inverted Indexing |pdfUrl=https://ceur-ws.org/Vol-1003/58.pdf |volume=Vol-1003 |dblpUrl=https://dblp.org/rec/conf/itat/Boza13 }} ==Experimental Comparison of Set Intersection Algorithms for Inverted Indexing== https://ceur-ws.org/Vol-1003/58.pdf
ITAT 2013 Proceedings, CEUR Workshop Proceedings Vol. 1003, pp. 58–64
http://ceur-ws.org/Vol-1003, Series ISSN 1613-0073, c 2013 V. Boža



       Experimental Comparison of Set Intersection Algorithms for Inverted
                                   Indexing

                                                           Vladimír Boža

              Department of Computer Science, Faculty of Mathematics, Physics and Informatics Comenius University
                                         Mlynská dolina, 842 48 Bratislava, Slovakia

Abstract: The set intersection problem is one of the main           also when the shorter set is much shorter than longer one.
problems in document retrieval. Query consists of two               We will describe this algorithm in the next section.
keywords, and for each of keyword we have a sorted set                 All previous algorithms get the two sorted sets on input
of document IDs containing it. The goal is to retrieve the          without any additional preprocessing. However in inverted
set of document IDs containing both keywords. We per-               indexing, sets for all keywords are known in advance, and
form an experimental comparison of Galloping search and             perhaps some preprocessing of these sets could speed up
a new algorithm by Cohen and Porat (LATIN2010), which               query processing. In particular, the length of the output
has a better theoretical time complexity. We show that the          can be much smaller than the length of the shorter set, and
new algorithm has often worse performance than the trivial          it would be desirable to have an algorithm, which would
one on real data. We also propose a variant of the Cohen            not depend linearly on m. The first step in this direction is a
and Porat algorithm with a similar complexity but better            recent algorithm by Cohen and Porat [1], which uses linear
empirical performance. Finally, we investigate influence            memory to store√all sets and processes each intersection
of document ordering on query time.                                 query in time O( No + o), where o is the length of output
                                                                    and N is the sum of the sizes of all sets. We will denote
                                                                    this algorithm as the fast set intersection algorithm.
1 Introduction                                                         However, it is not clear, whether this theoretical im-
In the set intersection problem, we are given a collection          provement is really useful in practice. In this article, we
of sets A1 , A2 , . . . , Ad . Our goal is to preprocess them and   compare the query times of the fast set intersection algo-
then answer queries of the following type: For given i,             rithm and the galloping search on a dataset consisting of
 j, find the intersection of sets Ai and A j . This problem         a sample of English Wikipedia articles with a set of two-
appears in various areas. For example, set intersection is          words queries from TREC Terabyte 2006 query stream.
needed in conjuctive queries in relational databases, and           We also present a variant of the fast set intersection al-
Ng, Amir, Pevzner [7] use set intersection to match mass            gorithm with a similar time and memory complexity but
spectra against a protein database. However, perhaps the            better empirical performance.
most important application is in document retrieval, where             Previous experimental comparisons of set intersection
the goal is to maintain data structures over a set of doc-          algorithms [5, 6] consider only different variants of gal-
uments. The most typical data structure is the inverted             loping search or other algorithms that do not use set pre-
index, which stores for each word the set of documents              processing. Yan et al. [2] observe that better query times
containing that word. Such an index allows us to eas-               can be achieved via better document ordering. In our ex-
ily retrieve documents containing a particular query word.          periments, we also compare query times of random doc-
When we want to retrieve documents which contain two                ument ordering and document ordering based on simple
or more given words, we can do set intersection of corre-           clustering scheme for all tested algorithms.
sponding document sets from the inverted index.
    Classical algorithms for set intersection are merging and       2   Algorithms
binary search. Merging identifies common elements by it-
erating through both sorted lists, as in the final phase of         In this section we descibe the three algorithms, which we
the merge sort algorithm. If we denote the length of the            will compare. We will denote the two input sets as A, B,
smaller set as m and the larger set as n, then the time com-        where |A| = n, |B| = m, m ≤ n. We will denote the total
plexity of merging is O(m + n), which is good when the              number of sets as s and the total size of sets as N.
lengths of the two sets are almost same. If m is much
smaller than n, it is better to search for each element of the
                                                                    2.1 Galloping search
smaller set in the larger set by binary search, in O(m lg n)
total time.                                                         The Galloping search ([8]) is a simple modification of
    There is a better algorithm originally introduced by            the binary search. We try to find each element of B in
Bentley and Yao [8], called galloping search with time              A; formally for each B[i] we will find index ki such that
complexity O(m lg(n/m)). This algorithm has good time               A[ki ] ≤ B[i] and A[ki + 1] > B[i]. We will use several im-
complexity when the lengths of the sets are similar and             provements. First, if j > i, then k j ≥ ki . This means that
Experimental Comparison of Set Intersection Algorithms                                                                           59


when we are searching for ki+1 , we need to search only
                                                   This tree will have at most O(lg N) levels. At each level,
in range ki , ki + 1, . . . , n. Secondly before doing binary
                                                we need O(N) bits for intersection matrices. This means
search we will find the smallest p ∈ {1, 2, 4, 8, . . .} such
                                                we need O(N lg N) bits, which is O(N) in term of words.
that A[ki + p] ≥ B[i + 1]. This limits the range of the bi-
                                                   During query answering we will start traversing the tree
nary search when difference between ki and ki+1 is small.
                                                starting in the root node. In each node we will check
When using this modifications the algorithm achieves time
                                                whether both sets are large. If not, we will answer query
complexity O(m lg(n/m)). Pseudocode of this algorithm
                                                using the galloping search. If both sets are large, we will
is given below:                                 look into intersection matrix. If sets do not have intersec-
                                                tion in this node, we stop the traversal in this node. Oth-
low:=1                                          erwise, we will propagate the query to the children of that
for i := 1 to m:                                node. We also need to check whether the element kept in
  diff := 1                                     the node belongs to the intersection.
  while low + diff <= n and A[low + diff] < B[i]: It can be shown that the time complexity of a set inter-
                                                                              √
    diff *= 2                                   section query is at most O(( No + o) lg n). It can be also
  high := min(n, low + diff)                    shown that it is never worse than time complexity of the
  k = binary_search(A, low, high)               galloping search [1].
  if A[k] == B[i]:                                 Note that the original article used hash tables instead of
    output B[i]                                 the galloping search.
  low = k
                                                                  2.3 Our set intersection
2.2   The fast set intersection algorithm
                                                                  The set intersection matrix usually contains many ones and
We now briefly describe the data structure used by fast set       only few zeroes. We can use the space better by instead
intersection algorithm [1].                                       storing intervals in with intersection of two sets is empty.
   We build tree where each node handles some subsets of                    √
                                                                  We take N biggest sets and call them large sets. We will
the original sets. The cost of a node is the sum of the sizes     call other sets√as small sets. Note that the size of a small
of all the subsets it handles. The root handles all original      set is at most N.
sets, so it costs N.                                                 Now we will do a preprocessing for intersections of the
Definition 1. Let                                                 large sets.
               √ d be a node with costs x. A large set in
d is one of the x biggest sets in node d.                         Definition 2. Let A, B be two large sets, where |B| ≤ |A|.
   Note the the original article defined large set in a slighly   The empty interval is a sequence B[i], B[i + 1], . . . , B[ j] of
                                                        √         elements of set B such that:
different way. Their large set was a set with at√least x el-
ements. It is clear, that every set with at least x elements        • For each k such that i ≤ k ≤ j: B[k] ∈
                                                                                                           / A.
is a large set by our definition.
   A set intersection matrix is a matrix that stores for each       • i = 1 or B[i − 1] ∈ A.
pair of sets a bit indicating whether their intersection is
non-empty. For x sets this matrix needs O(x2 ) bits of space        • j = n or B[ j + 1] ∈ A.
and therefore node with cost c needs O(c) bits of memory.         The size of this interval is j − i + 1.
For each node of tree we will construct a set intersection
matrix for all the large sets in that node.                          Now we will find and store k largest empty intervals
   Now we need to describe how to build the tree. We will         from all intersections of large sets (note that we can store
use a top-down approach. We will start with root node and         zero, one or more than one intervals for some pairs of sets).
in each node we will divide the sets and propagate them           Note that if k > N,
                                                                                    √ then the smallest stored empty interval
to its children. Only large sets are propagated down to the       has size at most N.
node children. We will call this sets a propagated group.            We will answer set a intersection query as follows. If
Let d be a node with cost x and G its propagated group.           any of the sets is small, we will use the galloping search.
                                                                                                                          √
The G costs at most x. Let E be the set of all elements in        This gives us query time O(m lg(n/m)),√ but since m ≤ N,
the sets of G. We will try to split E into two disjoint sets      the query time can be written as O( N lg(n/m)). If both
E1 , E2 . For a given set S ∈ G the child will handle S ∩ E1      sets are large, then we will again use the galloping search,
and the right child will handle S ∩ E2 . We want the each         but we will ignore empty intervals found for the given in-
child to cost at most x/2. This is sometimes impossible           tersection.
to achieve. We will fix this by keeping one element of E             We will show two things about time complexity of the
in d. We add elements to E1 until adding another element          query in our algorithm.
would make the left child cost more than x/2. The next               First, if k =√N, then the query time complexity is
element will be kept in d. All other elements will go to E2       bounded by O(o N). The memory complexity in this case
and the right child will cost at most x/2.                        is O(N). Secondly, if we put k = N lg N, the average query
60                                                                                                                                  V. Boža


time complexity of this approach is not worse than the time
                                                                                               0.02
complexity of the fast set intersection. This is because the
number of empty intervals is the same as the number of
bits in the set intersection matrices of the fast set intersec-
tion algorithm and our empty intervals allows us to skip                                      0.015




                                                                      Fast set intersection
search for more elements than the zeroes in the set inter-
section matrices. Thus√ average query time complexity of
our approach is O(( No+o) lg n). The memory complex-                                           0.01
ity is O(N lg N), which is little bit higher.
   The complexity of our algorithm does not look as
promising as complexity of the fast set intersection algo-                                    0.005
rithm, but this algorithm allows time-memory tradeoff. We
can store any number of empty intervals as we want. In our
experiements we set this number to achieve same memory                                           0
consumption as fast set intersection algorithm.                                                       0   0.005    0.01    0.015   0.02
                                                                                                             Galloping search
3    Document ordering
                                                                  Figure 1: Comparison of query times for galloping search
In the document retrieval we can choose arbitrary IDs for         (x-axis) and fast set intersection (y-axis) with random doc-
individual documents, and thus influence the ordering of          ument ordering
elements in the input sets. There are several proposed
heuristics for document ordering; most of them try to order
documents for achieving better index compression ([3],            4.1 Documents
[4]). But good document ordering improves query time
[2]. This happens because similar document are closer to-         We took all articles from the English Wikipedia [9] and
gether in the sets and during the galloping search we will        divided each article into paragraphs. We have used para-
make smaller jumps. In our work, we will use the k-scan           graphs instead of documents, because otherwise we would
algorithm [3]. To describe this algorithm, we first need to       only have a few big documents and the difference between
define similarity of documents.                                   algorithms would be hard to measure. Then we sampled
                                                                  6.5 millions of paragraphs and took them as documents.
Definition 3. The Jaccard similarity of two sets A, B is
                                                                  The total size of index N (the sum of size of all sets) was
given by:
                             |A ∩ B|                              313 millions word-document pairs.
                   J(A, B) =
                             |A ∪ B|
   The document is a set√of terms. For calculating distance       4.2 Queries
we will only consider N terms occuring in the largest
                                                                  We have used query log from TREC Terabyte 2006 query
number of documents (the large sets from the previous
                                                                  stream [10]. We only consider two word queries from the
sections). The similarity of two documents is the Jaccard
                                                                  log. This gives us approximately 14000 queries. We ran
similarity of their sets.
                                                                  each query 100 times and measured the average time in
   The k-scan algorithm tries to find an ordering of docu-
                                                                  seconds.
ments by partioning them into k clusters. Let d be the num-
ber of documents. This algorithm has k iterations. In each
iteration, it first picks a cluster center and then chooses       5               Experimental results
among the unassigned documents the d/k − 1 ones most
similar to the cluster center. Also it picks the cluster cen-     5.1 Galloping search vs. fast set interserction vs. our
ter for the next iteration, which is the d/k-th most simi-            algorithm
lar document. The cluster center for the first iteration is
picked randomly. If we assume that document similarity            We will first show comparision between all algorithms us-
can be computed in time s, then time complexity of this           ing random document order. Results are shown in Figures
algorithm is O(kds). In our experiments we use k = 1000.          1, 2.
                                                                     As we can see the fast set intersection algorithm intro-
4 Experimental setup                                              duces significant overhead in query processing time for
                                                                  large queries. Our hypothesis is that this is because this
We implemented all algorithms in C++ and run our exper-           algorithm does not use the caching in an optimal way. On
iment on a computer with Intel i7 920 CPU, 12 GB RAM              the other hand our set intersection algorithm introduces a
and Ubuntu Linux. We compiled our code with g++ 4.7.3             small improvement in query processing time. The average
using -O3 optimizations.                                          improvement is around 14%.
Experimental Comparison of Set Intersection Algorithms                                                                                    61



                           0.02                                                            160
                                                                                           140
                          0.015                                                            120
   Our set intersection




                                                                       Number of queries
                                                                                           100
                           0.01                                                             80
                                                                                            60
                          0.005                                                             40
                                                                                            20
                             0                                                               0
                                  0   0.005    0.01    0.015   0.02                              0   20      40     60      80      100
                                         Galloping search                                            Fraction of skipped elements

Figure 2: Comparison of query times for galloping search              Figure 3: Histogram of fraction of skipped elements from
(x-axis) and ours set intersection (y-axis) with random               the smaller set in queries for fast intersection algorithm
document ordering                                                     with random document ordering. Zero is ommited due to
                                                                      its big size (12780 queries).
   We also explored individual algorithm using more de-
tailed statistics. The set intersection matrices of fast set                               160
intersection algorithms contained 15% of zeroes. There
                                                                                           140
are 5400 (39%) queries where both sets are large in root
node. Only in 709 of these queries we found zero in the set                                120
                                                                       Number of queries




intersection matrices. We have also measured how many
elements of the smaller set we can skip due to zeroes in set                               100
intersection matrices. As we can see from the histogram                                     80
in Figure 3, there are some queries where the output size
is zero and all elements are skipped. But overall there are                                 60
only few queries where we skipped more than half of the                                     40
smaller set. Most of the time we skip only few percent of
the smaller set.                                                                            20
   In our algorithm, there are 825 queries where we en-
                                                                                             0
counter an empty interval stored for the two sets. Again we
                                                                                                 0   20      40     60      80      100
measured fraction of skipped elements due to empty inter-
vals with respect to the size of the smaller set and plotted                                         Fraction of skipped elements
histogram of this fractions (see Figure 4). We see that this
histogram looks quite better than the previous one.                   Figure 4: Histogram of fraction of skipped elements from
                                                                      the smaller set in queries for our intersection algorithm
5.2                   Document ordering effects                       with random document ordering. Zero is ommited due to
                                                                      its big size (12400 queries).
The overall effect of document ordering on the query time
is shown in Figures 5, 6, 7.
   The average improvement of query time for galloping                the same problems as in random document ordering – the
search is 18%, for fast set intersection 27% and for our set          number of queries with 80 − 90 percent fraction is zero.
intersection 22%.                                                     On the other hand, we gained a lot of queries where we
   It is worth noting that the set intersections matrices con-        eliminated around 10% of work.
tain 25% zeroes when using document ordering based on                    In our algorithm, there are 1500 queries where we en-
k-scans, compared to 15% with random document order.                  counter an empty interval. Histogram of the fraction of
We had 2100 queries which encoutered zero in some set                 skipped elements is in Figure 9. Its shape is similar to his-
intersection matrix in the fast set intersection algorithm.           togram when using random document ordering.
This approximatelly three times more than when we used                   Finally, in Figures 10, 11 we see a comparision of run-
random document ordering. Histogram of the fraction of                ning times of algorithms when using document ordering
skipped elements is in Figure 8. In this histogram we see             based on k-scans. We still see significant slowdown for
62                                                                                                                                                                    V. Boža



                                    0.02                                                                                  0.02
     Clustered document ordering




                                                                                           Clustered document ordering
                                   0.015                                                                                 0.015


                                    0.01                                                                                  0.01


                                   0.005                                                                                 0.005


                                      0                                                                                      0
                                           0    0.005    0.01   0.015     0.02                                                   0        0.005   0.01     0.015     0.02
                                               Random document ordering                                                                Random document ordering

Figure 5: Comparison of query times for galloping search                         Figure 7: Comparison of query times for our set intersec-
using random document order(x-axis) and document order                           tion using random document order(x-axis) and document
based on k-scan algorithm (y-axis)                                               order based on k-scan algorithm (y-axis)

                                    0.02                                                                          600
     Clustered document ordering




                                                                                                                  500
                                   0.015
                                                                                  Number of queries




                                                                                                                  400

                                    0.01                                                                          300

                                                                                                                  200
                                   0.005
                                                                                                                  100

                                      0                                                                                  0
                                           0    0.005    0.01   0.015     0.02                                               0       20      40     60      80      100
                                               Random document ordering                                                              Fraction of skipped elements

Figure 6: Comparison of query times for fast set intersec-                       Figure 8: Histogram of fraction of skipped elements from
tion using random document order(x-axis) and document                            the smaller set in queries for fast intersection algorithm
order based on k-scan algorithm (y-axis)                                         with k-scan document ordering. Zero is ommited due to
                                                                                 its big size (12390 queries).

fast set intersection. The speedup for our set intersection
was 19% which is similar to speedup for random docu-                             minutes for preprocessing. The fast set intersection algo-
ment ordering. Finally, in Figure 12 we see a compari-                           rithm required 7 GB of memory and 90 minutes of prepro-
son of galloping search using random document ordering                           cessing. Our algorithm required 7 GB of memory and 2.5
and our algorithm using better document ordering. Com-                           hours of preprocessing.
bination of these two factors leads to average improvement
around 35%.
                                                                                 6                         Conclusion
5.3                            Preprocessing time and memory consumption         We examined three different algorithms for set intersec-
                                                                                 tion. The experimental result can be summarized as fol-
We now briefly sumarize proprocessing time and memory                            lows:
consumption of our algorithms. Using only inverted index
and galloping search took 2 GB of memory and needed 4                                     • Fast set intersection algorithm does not lead to better
Experimental Comparison of Set Intersection Algorithms                                                                                                       63



                            300                                                                                   0.02

                            250
                                                                                                                 0.015




                                                                                          Our set intersection
 Number of queries




                            200

                            150                                                                                   0.01

                            100
                                                                                                                 0.005
                                  50

                                   0                                                                                0
                                       0       20      40      60     80      100                                        0   0.005    0.01    0.015   0.02
                                               Fraction of skipped elements                                                     Galloping search

Figure 9: Histogram of fraction of skipped elements from                              Figure 11: Comparison of query times for galloping search
the smaller set in queries for our intersection algorithm                             (x-axis) and ours set intersection (y-axis) with document
with k-scan document ordering. Zero is ommited due to                                 ordering based on k-scans
its big size (12200 queries).

                                                                                                                  0.02
                                   0.02

                                                                                                                 0.015
                                                                                          Our set intersection




                                  0.015
          Fast set intersection




                                                                                                                  0.01
                                   0.01

                                                                                                                 0.005
                                  0.005

                                                                                                                    0
                                       0                                                                                 0   0.005    0.01    0.015   0.02
                                           0        0.005    0.01    0.015     0.02
                                                                                                                                Galloping search
                                                       Galloping search
                                                                                      Figure 12: Comparison of query times for galloping search
Figure 10: Comparison of query times for galloping search                             with random document ordering (x-axis) and ours set in-
(x-axis) and fast set intersection (y-axis) with document                             tersection with document ordering based on k-scans (y-
ordering based on k-scans                                                             axis)


                         empirical performance on real data.                             We also investigated effect of document ordering on
         • We can achieve some speedup using our algorithm                            query times. We showed that better document ordering
           but this speed up is not big.                                              leads to greater improvement than using a different algo-
                                                                                      rithm.
         • Our algorithm is slighly better at eliminating useless
           work than fast set intersection. Fast set intersection
           algorithm in most cases eliminates less then 10% of                        7              Acknowledgements
           work.

   It is interesting question whether more careful imple-                             This research was supported by VEGA grant 1/1085/12.
mentation of fast set intersection algorithm can lead to bet-                         Author would also like to thank Brona Brejova for many
ter query times.                                                                      usefull comments.
64                                                                    V. Boža


References
[1] Cohen, Hagai, and Ely Porat: Fast set intersection and two-
    patterns matching. LATIN 2010: Theoretical Informatics.
    Springer Berlin Heidelberg, 2010. 234-242.
[2] Yan, Hao, Shuai Ding, and Torsten Suel: Inverted index
    compression and query processing with optimized document
    ordering. Proceedings of the 18th international conference
    on World wide web. ACM, 2009.
[3] F. Silvestri, S. Orlando, and R. Perego.: Assigning identifiers
    to documents to enhance the clustering property of fulltext
    indexes. In Proc. of the 27th Annual Int. ACM SIGIR Conf.
[4] Shieh, W. Y., Chen, T. F., Shann, J. J. J., and Chung, C. P.
    (2003).: Inverted file compression through document iden-
    tifier reassignment. Information processing & management,
    39(1), 117-131.
[5] Culpepper, J. Shane, and Alistair Moffat: Efficient set inter-
    section for inverted indexing. ACM Transactions on Infor-
    mation Systems (TOIS) 29.1 (2010): 1.
[6] Barbay, Jérémy, Alejandro López-Ortiz, Tyler Lu, and Ale-
    jandro Salinger: An experimental investigation of set inter-
    section algorithms for text searching. Journal of Experimen-
    tal Algorithmics (JEA) 14 (2009): 7.
[7] Ng, Julio, Amihood Amir, and Pavel A. Pevzner.: Blocked
    pattern matching problem and its applications in proteomics.
    Research in Computational Molecular Biology. Springer
    Berlin Heidelberg, 2011.
[8] Jon Louis Bentley and Andrew Chi-chih Yao: An almost op-
    timal algorithm for unbounded searching. Information Pro-
    cessing Letters - IPL , vol. 5, no. 3, pp. 82-87, 1976.
[9] http://download.wikimedia.org/enwiki/latest/
    enwiki-latest-pages-articles.xml.bz2
[10] http://trec.nist.gov/data/terabyte06.html