PEL: Position-Enhanced Length Filter for Set Similarity Joins Willi Mann Nikolaus Augsten Department of Computer Sciences Department of Computer Sciences Jakob-Haringer-Str. 2 Jakob-Haringer-Str. 2 Salzburg, Austria Salzburg, Austria wmann@cosy.sbg.ac.at nikolaus.augsten@sbg.ac.at ABSTRACT opments in the field leverage the position of the matching Set similarity joins compute all pairs of similar sets from two tokens between two prefixes (positional filter) and the num- collections of sets. Set similarity joins are typically imple- ber of remaining tokens in the overall set (length filter) to mented in a filter-verify framework: a filter generates candi- further reduce the number of candidates. date pairs, possibly including false positives, which must be This paper proposes a new filter, the position-enhanced verified to produce the final join result. Good filters produce length filter (PEL), which tightens the length filter based a small number of false positives, while they reduce the time on the position of the current token match. In previous they spend on hopeless candidates. The best known algo- work, position and length information have only partially rithms generate candidates using the so-called prefix filter been exploited. PEL fully leverages position and length to in conjunction with length- and position-based filters. achieve additional pruning power. As a key feature, PEL In this paper we show that the potential of length and po- does not require changes in the prefix filter index, but is sition have only partially been leveraged. We propose a new applied on top of previous algorithms at almost no cost. In filter, the position-enhanced length filter, which exploits the our experiments we show that PEL is particularly effective matching position to incrementally tighten the length filter; for foreign joins. PEL also performs well for self joins over our filter identifies hopeless candidates and avoids process- large collections of small sets.1 ing them. The filter is very efficient, requires no change in The remaining paper is organized as follows: Section 2 the data structures of most prefix filter algorithms, and is introduces the set similarity join, provides background ma- particularly effective for foreign joins, i.e., joins between two terial, and an in-depth analysis of filtering techniques based different collections of sets. on position and length. Our novel position-enhanced length filter (PEL) is introduced in Section 3. We empirically eval- uate our technique and demonstrate its effectiveness on real 1. INTRODUCTION world data in Section 4. In Section 5 we survey related work The set similarity join computes all pairs of similar sets and finally draw conclusions in Section 6. from two collections of sets. The similarity is assessed using a set similarity function, e.g., set overlap, Jaccard, or Cosine 2. BACKGROUND similarity, together with a threshold. A pair of sets is in the join result if the similarity exceeds the threshold. We revisit candidate generation, candidate reduction, and Set similarity joins have many interesting applications efficient verification techniques discussed in literature. The ranging from near duplicate detection of Web documents to main concepts of this section are summarized in Figure 1. community mining in social networks [9]. The set elements are called tokens [3] and are often used to represent complex 2.1 Candidate Generation objects, e.g., strings (q-grams [11]) or trees (pq-grams [2]). We shorty explain the prefix filter, translate normal- The best algorithms for set similarity joins are based on ized thresholds to overlap thresholds, revisit length- and an inverted list index and the so-called prefix filter [5]. The position-based filter conditions, and discuss prefixes in prefix filter operates on sorted sets and rejects candidate index implementations of set similarity joins. pairs that have no overlap in a (short) prefix. Only the pre- Prefix Filter. The fastest set similarity joins are based fix must be indexed, which leads to substantial savings in on the prefix filter principle [5]: A pair of sorted sets s0 , s1 space and runtime. However, for frequent tokens, the prefix can only meet an overlap threshold tO , i.e., |s0 ∩ s1 | ≥ tO , filter produces a large number of candidates. Recent devel- if there is a non-zero overlap in the prefixes of the sets. The prefixes are of length |s0 | − tO + 1 for s0 and |s1 | − tO + 1 for s1 . The set similarity join proceeds in three steps: (a) index the prefixes of one join partner, (b) probe the prefixes of the other join partner against the index to generate candidates, (c) verify the candidates by inspecting the full sets. Copyright c by the paper’s authors. Copying permitted only Threshold for normalized set overlap. Normalized for private and academic purposes. set similarity measures take the set sizes into account. For In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 1 Workshop on Foundations of Databases (Grundlagen von Datenbanken), In a self join, both input relations are identical, which can- 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. not be assumed in a foreign join. Table 1: Set similarity functions and related definitions, extending [7, Table 1] by new pmaxsize. Similarity function minoverlap(t, s0 , s1 ) minsize(t, s0 ) pmaxsize(t, s0 , p0 ) maxsize(t, s0 ) |s0 ∩s1 | t |s0 |−(1+t)·p0 |s0 | Jaccard J(s0 , s1 ) = |s 0 ∪s1 | 1+t (|s0 | + |s1 |) t |s0 | t t (|s0 |−p0 )2 C(s0 , s1 ) = √|s0 ∩s1 | t2 |s0 | |s0 | p Cosine t |s0 | · |s1 | |s0 |·t2 t2 |s0 |·|s1 | Dice D(s0 , s1 ) = 2·(|s 0 ∩s1 |) |s0 |+|s1 | t(|s0 |+|s1 |) 2 t |s0 | 2−t (2−t)·|s0 |−2p0 t (2−t)|s0 | t Overlap O(s0 , s1 ) = |s0 ∩ s1 | t t ∞ ∞ Previous work: the purpose of set similarity joins, the normalized threshold is translated into an overlap threshold (called minoverlap). • minoverlap(t, s0 , s1 ): equivalent overlap threshold for The overlap threshold may be different for each pair of sets. Jaccard, Cosine, or Dice threshold t for a pair of sets s0 , s1 . For example, • minsize(t, s0 ), maxsize(t, s0 ): minimum and maximum p for the Cosine threshold tC (join predicate size of any set that can satisfy threshold t w.r.t. set s0 . |s0 ∩ s1 |/ |s0 | · |s1 | ≥ tC ), minoverlap(tC , s0 , s1 ) := tC · p • maxprefix(t, s0 ) = |s0 | − minsize(t, s0 ) + 1: length of |s0 | · |s1 |. Table 1 lists the definitions of well-known set probing prefix similarity functions with the respective overlap thresholds. • midprefix(t, s0 ) = |s0 | − minoverlap(t, s0 , s0 ) + 1: length Length filter. For a given threshold t and a reference of indexing prefix for self joins set s0 , the size of the matching partners must be within • minprefix(t, s0 , s1 ) = |s0 | − minoverlap(t, s0 , s1 ) + 1: the interval [minsize(t, s0 ), maxsize(t, s0 )] (see Table 1). length of optimal prefix for a particular pair of sets This was first observed for the PartEnum algorithm [1] Position-enhanced length filter (PEL): and was later called the length filter. Example: |s0 | = 10, • pmaxsize(t, s0 , p0 ): new tighter limit for maximum set size |s1 | = 6, |s2 | = 16, Cosine threshold tC = 0.8. Since based on the probing set position minoverlap(tC , s0 , s1 ) = 6.1 > |s1 | we conclude (without inspecting any set element) that s0 cannot reach threshold Figure 1: Overview of functions. tC with s1 . Similarly, minoverlap(tC , s0 , s2 ) = 10.1, thus s2 is too large to meet the threshold with s0 . In fact, minsize(tC , s0 ) = 6.4 and maxsize(tC , s0 ) = 15.6. The positional filter is stricter than the prefix filter and Prefix length. The prefix length is |s0 | − tO + 1 for is applied on top of it. The pruning power of the positional a given overlap threshold tO and set s0 . For normalized filter is larger for prefix matches further to right (i.e., when thresholds t the prefix length does not only depend on s0 , p0 , p1 increase). Since the prefix filter may produce the same but also on the sets we compare to. If we compare to s1 , the candidate pair multiple times (for each match in the prefix), minimum prefix size of |s0 | is minprefix(t, s0 , s1 ) = |s0 | − an interesting situation arises: a pair that passes the posi- minoverlap(t, s0 , s1 ) + 1. When we index one of the join tional filter for the first match may not pass the filter for partners, we do not know the size of the matching partners later matches. Thus, the positional filter is applied to pairs upfront and need to cover the worst case; this results in the that are already in the candidate set whenever a new match prefix length maxprefix(t, s0 ) = |s0 |−minsize(t, s0 )+1 [7], is found. To correctly apply the positional filter we need which does not depend on s1 . For typical Jaccard thresholds to maintain the overlap value for each pair in the candidate t ≥ 0.8, this reduces the number of tokens to be processed set. We illustrate the positional filter with examples. during the candidate generation phase by 80 % or more. Example 1. Set s0 in Figure 2 is the probing set (prefix For self joins we can further reduce the prefix length [12] length maxprefix = 4), s1 is the indexed set (prefix length w.r.t. maxprefix: when the index is built on-the-fly in in- midprefix = 2, assuming self join). Set s1 is returned from creasing order of the sets, then the indexed prefix of s0 will the index due to the match on g (first match between s0 and never be compared to any set s1 with |s1 | < |s0 |. This al- s1 ). The required overlap is dminoverlapC (0.8, s0 , s1 )e = lows us to reduce the prefix length to midprefix(t, s0 ) = 8. Since there are only 6 tokens left in s1 after the match, |s0 | − minoverlap(t, s0 , s0 ) + 1. the maximum overlap we can get is 7, and the pair is pruned. Positional filter. The minimum prefix length for a pair This is also confirmed by the positional filter condition (1) of sets is often smaller than the worst case length, which we (o = 0, p0 = 3, p1 = 1). use to build and probe the index. When we probe the index Example 2. Assume a situation similar to Figure 2, but with a token from the prefix of s0 and find a match in the the match on g is the second match (i.e., o = 1, p0 = 3, prefix of set s1 , then the matching token may be outside the p1 = 1). Condition (1) holds and the pair can not be pruned, optimal prefix. If this is the first matching token between i.e., it remains in the candidate set. s0 and s1 , we do not need to consider the pair. In general, Example 3. Consider Figure 3 with probing set s0 and a candidate pair s0 , s1 must be considered only if indexed set s1 . The match on token a adds pair (s0 , s1 ) to the candidate set. Condition (1) holds for the match on a minoverlap(t, s0 , s1 ) ≤ o + min{|s0 | − p0 , |s1 | − p1 }, (1) (o = 0, p0 = 0, p1 = 0), and the pair is not pruned by where o is the current overlap (i.e., number of matching the positional filter. For the next match (on e), however, tokens so far excluding the current match) and p0 (p1 ) is condition (1) does not hold (o = 1, p0 = 1, p1 = 4) and the position of the current match in the prefix of s0 (s1 ); the positional filter removes the pair from the candidate set. positions start at 0. Thus, the positional filter does not only avoid pairs to enter pred: C(s0 , s1 ) ≥ 0.8 s0 : b c e f g h ? ? ? pr ⇒ dminoverlap(s0 , s1 , 0.8)e = 8 s1 : a e h ? ? ? ? ? ? idx 7 s0 : c e f g ? ? ? ? ? ? probing set (pr) Figure 4: Verification: where to start? s1 : a g ? ? ? ? ? ? indexed set (idx) pred: J(s0 , s1 ) ≥ 0.7 pred: J(s0 , s1 ) ≥ 0.7 7 ⇒ dminoverlap(. . .)e = 6 ⇒ dminoverlap(. . .)e = 5 s0 : c d e ? ? ? ? pr s0 : c d e ? ? ? ? pr Figure 2: Sets with matching token In prefix: match impossible due to positions of matching tokens and s1 : e ? ? ? ? ? idx s1 : e ? ? ? ? idx remaining tokens. (a) Match impossible (b) Match possible pred: C(s0 , s1 ) ≥ 0.6 Figure 5: Impossible and possible set sizes based on ⇒ dminoverlap(s0 , s1 , 0.8)e = 8 position in s0 and the size-dependent minoverlap. 14 s0 : a e ? ? ? ? ? ? ? ? ? ? ? ? ? ? pr midprefix (indexing set) as discussed in earlier sections. s1 : a b c d e ? ? ? ? ? idx Since the sets are sorted, we compute the overlap in a +1 +1 5 =7<8 merge fashion. At each merge step, we verify if the current overlap and the remaining set size are sufficient to achieve Figure 3: Sets with two matching tokens: pruning the threshold, i.e., we check positional filter condition (1). of candidate pair by second match. (A) Prefix overlap [12] : At verification time we already know the overlap between the two prefixes of a candidate pair. This piece of information should be leveraged. Note the candidate set, but may remove them later. that we cannot simply continue verification after the two prefixes. This is illustrated in Figure 4: there is 1 match in 2.2 Improving the Prefix Filter the prefixes of s0 and s1 ; when we start verification after the The prefix filter often produces candidates that will be prefixes, we miss token h. Token h occurs after the prefix removed immediately in the next filter stage, the positional of s0 but inside the prefix of s1 . Instead, we compare the filter (see Example 1). Ideally, such candidates are not pro- last element of the prefixes: for the set with the smaller duced at all. This issue is addressed in the mpjoin algo- element (s0 ), we start verification after the prefix (g). For rithm [7] as outlined below. the other set (s1 ) we leverage the number of matches in the Consider condition (1) for the positional filter. We split prefix (overlap o). Since the leftmost positions where these the condition into two new conditions by expanding the min- matches can appear are the first o elements, we skip o tokens imum such that the conjunction of the new conditions is and start at position o (token e in s1 ). There is no risk of equivalent to the positional filter condition: double-counting tokens w.r.t. overlap o since we start after the end of the prefix in s0 . minoverlap(t, s0 , s1 ) ≤ o + |s0 | − p0 (2) (B) Position of last match [7] : A further improvement is minoverlap(t, s0 , s1 ) ≤ o + |s1 | − p1 (3) to store the position of the last match. Then we start the verification in set s1 after this position (h in s1 , Figure 4). The mpjoin algorithm leverages condition (2) as follows. The probing sets s0 are processed in increasing size order, so Small candidate set vs. fast verification. The po- |s0 | grows monotonically during the execution of the algo- sitional filter is applied on each candidate pair returned by rithm. Hence, for a specific set s1 , minoverlap grows mono- the prefix filter. The same candidate pair may be returned tonically. We assume o = 0 (and justify this assumption multiple times for different matches in the prefix. The po- later). For a given index entry (s1 , p1 ), the right side of con- sitional filter potentially removes existing candidate pairs dition (2) is constant, while the left side can only grow. Af- when they appear again (cf. Section 2.1). This reduces the ter the condition fails to hold for the first time, it will never size of the candidate set, but comes at the cost of (a) lookups hold again, and the index list entry is removed. For a given in the candidate set, (b) deletions from the candidate set, index set s1 , this improvement changes the effective length and (c) book keeping of the overlaps for each candidate pair. of the prefix (i.e., the part of the sets where we may detect Overall, it might be more efficient to batch-verify a larger matches) w.r.t. a probing set s0 to minprefix(t, s0 , s1 ) = candidate set than to incrementally maintain the candidates; |s1 | − minoverlap(t, s0 , s1 ) + 1, which is optimal. On the Ribeiro and Härder [7] empirically analyze this trade-off. downside, a shorter prefix may require more work in the verification phase: in some cases, the verification can start 3. POSITION-ENHANCED LENGTH FIL- after the prefix as will be discussed in Section 2.3. TERING 2.3 Verification In this section, we motivate the position-enhanced length Efficient verification techniques are crucial for fast set sim- filter (PEL), derive the new filter function pmaxsize, dis- ilarity joins. We revisit a baseline algorithm and two im- cuss the effect of PEL on self vs. foreign joins, and show how provements, which affect the verification speed of both false to apply PEL to previous algorithms. and true positives. Unless explicitly mentioned, the term Motivation. The introduction of the position-enhanced prefix subsequently refers to maxprefix (probing set) resp. length filter is inspired by examples for positional filtering 1250 base region. The base region is partitioned into four regions maxsize (A, B, C, and D) by the probing set size and pmaxsize. For C D probing set size foreign joins, our filter reduces the base region to A+C. If we set size assume that all set sizes occur equally likely in the individual 1000 inverted lists of the index, our filter cuts the number of index B list entries that must be processed by 50%. Since the tokens A pmaxsize are typically ordered by their frequency, the list length will minsize increase with increasing matching position. Thus the gain of 800 PEL in practical settings can be expected to be even higher. 0 100 maxprefix 200 This analysis holds for all parameters of Jaccard and Dice. position in prefix For Cosine, the situation is more tricky since pmaxsize is quadratic and describes a parabola. Again, this is in our Figure 6: Illustrating possible set sizes. favor since the parabola is open to the top, and the curve that splits the base region is below the diagonal. For self joins, the only relevant regions are A and B since like Figure 5(a). In set s1 , the only match in the prefix oc- the size of the sets is bounded by the probing set size. Our curs at the leftmost position. Despite this being the leftmost filter reduces the relevant region from A + B to A. As Fig- match in s1 , the positional filter removes s1 : the overlap ure 6 illustrates, this reduction is smaller than the reduction threshold cannot be reached due the position of the match for foreign joins. For the similarity functions in Table 1, B in s0 . Apparently, the position of the token in the probing is always less than a quarter of the full region A + B. In the set can render a match of the index sets impossible, inde- example, region B covers about 0.22 of A + B. pendently of the matching position in the index set. Let us analyze how we need to modify the example such that it passes the positional filter: the solution is to shorten index Algorithm 1: AllPairs-PEL(Sp , I, t) set s1 , as shown in Figure 5(b). This suggests that some Version using pmaxsize for foreign join; tighter limit on the set size can be derived from the position input : Sp collection of outer sets, I inverted list index of the matching token. covering maxprefix of inner sets, t similarity Deriving the PEL filter. For the example in threshold Figure 5(a) the first part of the positional filter, i.e., output: res set of result pairs (similarity at least t) condition (2), does not hold. We solve the equation 1 foreach s0 in Sp do minoverlap(t, s0 , s1 ) ≤ |s0 | − p0 to |s1 | by replacing 2 M = {}; /* Hashmap: candidate set → count */ minoverlap with its definition for the different similarity 3 for p0 ← 0 to maxprefix(t, s0 ) − 1 do functions. The result is pmaxsize(t, s0 , p0 ), an upper 4 for s1 in Is0 [p] do bound on the size of eligible sets in the index. This bound 5 if |s1 | < minsize(t, s0 ) then is at the core of the PEL filter, and definitions of pmaxsize 6 remove index entry with s1 from Is0 [p] ; for various similarity measures are listed in Table 1. 7 else if |s1 | > pmaxsize(t, s0 , p0 ) then Application of PEL. We integrate the pmaxsize 8 break; upper bound into the prefix filter. The basic prefix filter 9 else algorithm processes a probing set as follows: loop over 10 if M [s1 ] = ∅ then the tokens of the probing set from position p0 = 0 to 11 M = M ∪ (s1 , 0); maxprefix(t, s0 ) − 1 and probe each token against the 12 M [s1 ] = M [s1 ] + 1; index. The index returns a list of sets (their IDs) which 13 end contain this token. The sets in these lists are ordered by 14 end increasing size, so we stop processing a list when we hit a 15 /* Verify() verifies the candidates in M */ set that is larger than pmaxsize(t, s0 , p0 ). 16 res = res ∪ V erif y(s0 , M, t); Intuitively, we move half of the positional filter to the 17 end prefix filter, where we can evaluate it at lower cost: (a) the value of pmaxsize needs to be computed only once for each probing token; (b) we check pmaxsize against the size of Algorithm. Algorithm 1 shows AllPairs-PEL2 , a ver- each index list entry, which is a simple integer comparison. sion of AllPairs enhanced with our PEL filter. AllPairs- Overall, this is much cheaper than the candidate lookup that PEL is designed for foreign joins, i.e., the index is con- the positional filter must do for each index match. structed in a preprocessing step before the join is executed. Self Joins vs. Foreign Joins. The PEL filter is more The only difference w.r.t. AllPairs is that AllPairs-PEL uses powerful on foreign joins than on self joins. In self joins, pmaxsize(t, s0 , p0 ) instead of maxsize(t, s0 ) in the condi- the size of the probing set is an upper bound for the set tion on line 7. The extensions of the algorithms ppjoin and size in the index. For all the similarity functions in Table 1, mpjoin with PEL are similar. pmaxsize is below the probing set size in less than 50% An enhancement that is limited to ppjoin and mpjoin is to of the prefix positions. Figure 6 gives an example: The simplify the positional filter: PEL ensures that no candidate probing set size is 1000, the Jaccard threshold is 0.8, so set can fail on the first condition (Equation 2) of the split minsize(0.8, 1000) = 800, maxsize(0.8, 1000) = 1250, and positional filter. Therefore, we remove the first part of the the prefix size is 201. The x-axis represents the position in the prefix, the y-axis represents bounds for the set size of the 2 We use the -PEL suffix for algorithm variants that make other set. The region between minsize and maxsize is the use of our PEL filter. collections are identical. Figures 7(a) and 7(b) show the per- Table 2: Input set characteristics. formance on DBLP with Jaccard similarity threshold 0.75 #sets in set size # of diff. and Cosine similarity 0.85. These thresholds produce result collection min max avg tokens sets of similar size. We observe a speedup of factor 3.5 for DBLP 3.9 · 106 2 283 12 1.34 · 106 AllPairs-PEL over AllPairs with Jaccard, and a speedup of 5 TREC 3.5 · 10 2 628 134 3.4 · 105 3.8 with Cosine. For mpjoin to mpjoin-PEL we observe a 5 ENRON 5 · 10 1 192 000 298 7.3 · 106 speedup of 4.0 with Jaccard and 4.2 with Cosine. Thus, the PEL filter provides a substantial speed advantage on these data points. For other Jaccard thresholds and mpjoin vs. minimum in the original positional filter (Equation 1), such mpjoin-PEL, the maximum speedup is 4.1 and the minimum that the minimum is no longer needed. speedup is 1.02. For threshold 0.5, only mpjoin-PEL finishes Note that the removal of index entries on line 6 is the eas- within the time limit of one hour. Among all Cosine thresh- iest solution to apply minsize, but in real-world scenarios, olds and mpjoin vs. mpjoin-PEL, the maximum speedup is it only makes sense for a single join to be executed. For 4.2 (tC = 0.85), the minimum speedup is 1.14 (tC = 0.95). a similarity search scenario, we recommend to apply binary We only consider Cosine thresholds tC ≥ 0.75, because the search on the lists. For multiple joins with the same indexed non-PEL variants exceed the time limit for smaller thresh- sets in a row, we suggest to use an overlay over the index olds. There is no data point where PEL slows down an that stores the pointer for each list where to start. algorithm. It is also worth noting that AllPairs-PEL beats mpjoin by a factor of 2.7 with Jaccard threshold tJ = 0.75 4. EXPERIMENTS and 3.3 on Cosine threshold tC = 0.85; we observe such speedups also on other thresholds. We compare the algorithms AllPairs [4] and mpjoin [7] Figure 7(c) shows the performance on TREC with Jac- with and without our PEL extension on both self and for- card threshold tJ = 0.75. The speedup for AllPairs-PEL eign joins. Our implementation works on integers, which we compared to AllPairs is 1.64, and for mpjoin-PEL compared order by the frequency of appearance in the collection. The to mpjoin 2.3. The minimum speedup of mpjoin over all time to generate integers from tokens is not measured in our thresholds is 1.26 (tJ = 0.95), the maximum speedup is experiments since it is the same for all algorithms. We also 2.3 (tJ = 0.75). Performance gains on ENRON are slightly do not consider the indexing time for foreign joins, which smaller – we observe speedups of 1.15 (AllPairs-PEL over is considered a preprocessing step. The use of PEL has no AllPairs), and 1.85 (mpjoin-PEL over mpjoin) on Jaccard impact on the index construction. The prefix sizes are max- threshold tJ = 0.75 as illustrated in Figure 7(d). The mini- prefix for foreign joins and midprefix for self joins. For self mum speedup of mpjoin over mpjoin-PEL is 1.24 (tJ = 0.9 joins, we include the indexing time in the overall runtime and 0.95), the maximum speedup is 2.0 (tJ = 0.6). since the index is built incrementally on-the-fly. We report Figure 8(a) shows the number of processed index entries results for Jaccard and Cosine similarity, the results for Dice (i.e., the overall length of the inverted lists that must be show similar behavior. Our experiments are executed on the scanned) for Jaccard threshold tJ = 0.75 on TREC. The following real-world data sets: number of index entries increases by a factor of 1.67 for AllPairs w.r.t. AllPairs-PEL, and a factor of 4.0 for mpjoin • DBLP3 : Snapshot (February 2014) of the DBLP bib- w.r.t. mpjoin-PEL. liographic database. We concatenate authors and ti- Figure 8(b) shows the number of candidates that must tle of each entry and generate tokens by splitting on be verified for Jaccard threshold tJ = 0.75 on TREC. On whitespace. AllPairs, PEL decreases the number of candidates. This is because AllPairs does not apply any further filters before • TREC4 : References from the MEDLINE database, verification. On mpjoin, the number of candidates increases years 1987–1991. We concatenate author, title, and by 20%. This is due to the smaller number of matches from abstract, remove punctuation, and split on whitespace. the prefix index in the case of PEL: later matches can remove • ENRON5 : Real e-mail messages published by FERC pairs from the candidate set (using the positional filter) and after the ENRON bankruptcy. We concatenate sub- thus decrease its size. However, the larger candidate set ject and body fields, remove punctuation, and split on for PEL does not seriously impact the overall performance: whitespace. the positional filter is also applied in the verification phase, where the extra candidate pairs are pruned immediately. Table 2 lists basic characteristics of the input sets. We Self joins. Due to space constraints, we only show re- conduct our experiments on an Intel Xeon 2.60GHz machine sults for DBLP and ENRON, i.e., the input sets with the with 128 GB RAM running Debian 7.6 ’wheezy’. We com- smallest and the largest average set sizes, respectively. Fig- pile our code with gcc -O3. Claims about results on “all” ure 7(e) and 7(f) show the performance of the algorithms on thresholds for a particular data set refer to the thresholds DBLP and ENRON with Jaccard threshold tJ = 0.75. Our {0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. We stop tests whose PEL filter provides a speed up of about 1.22 for AllPairs, runtime exceeds one hour. and 1.17 for mpjoin on DBLP. The maximum speedup we Foreign Joins. For foreign joins, we join a collection of observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for sets with a copy of itself, but do not leverage the fact that the tJ = 0.95 there is no speed difference between mpjoin and mpjoin-PEL. On the large sets of ENRON, the performance 3 http://www.informatik.uni-trier.de/~Ley/db/ is worse for AllPairs-PEL because verification takes more 4 http://trec.nist.gov/data/t9_filtering.html time than PEL can save in the probing phase (by reducing 5 https://www.cs.cmu.edu/~enron/ the number of processed index entries). There is almost no sec sec sec 400 sec sec sec 500 500 100 150 400 300 30 400 80 AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL AllPairs-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL mpjoin-PEL 300 300 100 20 60 200 AllPairs AllPairs AllPairs AllPairs AllPairs AllPairs 200 mpjoin 200 40 mpjoin mpjoin mpjoin mpjoin mpjoin 50 100 10 100 100 20 0 0 0 0 0 0 (a) Foreign join, (b) Foreign join, (c) Foreign join, (d) Foreign j., EN- (e) Self join, (f) Self join, EN- DBLP, tJ = 0.75. DBLP, tC = 0.85. TREC, tJ = 0.75 RON, tJ = 0.75 DBLP, tJ = 0.75 RON, tJ = 0.75 Figure 7: Join times. 8.0e10 the PEL filter improves performance in almost any foreign 1.5e10 join and also in some self join scenarios, despite the fact that 6.0e10 it may increase the number of candidates to be verified. AllPairs-PEL AllPairs-PEL mpjoin-PEL mpjoin-PEL 1.0e10 4.0e10 7. REFERENCES AllPairs AllPairs mpjoin mpjoin 2.0e10 5.0e9 [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proc. VLDB, pages 918 – 929, 0 0 2006. (a) Number of pro- (b) Number of candi- [2] N. Augsten, M. H. Böhlen, and J. Gamper. The cessed index entries. dates to be verify. pq-gram distance between ordered labeled trees. ACM TODS, 35(1), 2010. Figure 8: TREC (foreign join): tJ = 0.75 [3] N. Augsten, A. Miraglia, T. Neumann, and A. Kemper. On-the-fly token similarity joins in relational databases. In Proc. SIGMOD, pages 1495 – difference between mpjoin and mpjoin-PEL. The maximum 1506. ACM, 2014. increase in speed is 9% (threshold 0.8, mpjoin), the maxi- [4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all mum slowdown is 30% (threshold 0.6, AllPairs). pairs similarity search. WWW, 7:131 – 140, 2007. Summarizing, PEL substantially improves the runtime in [5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive foreign join scenarios. For self joins, PEL is less effective operator for similarity joins in data cleaning. In Proc. and, in some cases, may even slightly increase the runtime. ICDE, page 5. IEEE, 2006. [6] A. Gionis, P. Indyk, and R. Motwani. Similarity 5. RELATED WORK search in high dimensions via hashing. In Proc. VLDB, pages 518–529, 1999. Sarawagi and Kirpal [8] first discuss efficient algorithms for exact set similarity joins. Chaudhuri et al. [5] propose [7] L. A. Ribeiro and T. Härder. Generalizing prefix SSJoin as an in-database operator for set similarity joins filtering to improve set similarity joins. Information and introduce the prefix filter. AllPairs [4] uses the prefix Systems, 36(1):62 – 78, 2011. filter with an inverted list index. The ppjoin algorithm [12] [8] S. Sarawagi and A. Kirpal. Efficient set joins on extends AllPairs by the positional filter and introduces the similarity predicates. In Proc. SIGMOD, pages 743 – suffix filter, which reduces the candidate set before the final 754. ACM, 2004. verification. The mpjoin algorithm [7] improves over ppjoin [9] E. Spertus, M. Sahami, and O. Buyukkokten. by reducing the number of entries returned from the index. Evaluating similarity measures: A large-scale study in AdaptJoin [10] takes the opposite approach and drastically the orkut social network. In Proc. SIGKDD, pages 678 reduces the number of candidates at the expense of longer – 684. ACM, 2005. prefixes. Gionis et al. [6] propose an approximate algorithm [10] J. Wang, G. Li, and J. Feng. Can we beat the prefix based on LSH for set similarity joins. Recently, an SQL op- filtering?: An adaptive framework for similarity join erator for the token generation problem was introduced [3]. and search. In Proc. SIGMOD, pages 85 – 96. ACM, 2012. [11] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient 6. CONCLUSIONS algorithm for similarity joins with edit distance We presented PEL, a new filter based on the pmaxsize constraints. In Proc. VLDB, 2008. upper bound derived in this paper. PEL can be easily [12] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. plugged into algorithms that store prefixes in an inverted Efficient similarity joins for near-duplicate detection. list index (e.g., AllPairs, ppjoin, or mpjoin). For these algo- ACM TODS, 36(3):15, 2011. rithms, PEL will effectively reduce the number of list entries that must be processed. This reduces the overall lookup time in the inverted list index at the cost of a potentially larger candidate set. We analyzed this trade-off for foreign joins and self joins. Our empirical evaluation demonstrated that