=Paper= {{Paper |id=Vol-1313/paper_16 |storemode=property |title=PEL: Position-Enhanced Length Filter for Set Similarity Joins |pdfUrl=https://ceur-ws.org/Vol-1313/paper_16.pdf |volume=Vol-1313 |dblpUrl=https://dblp.org/rec/conf/gvd/MannA14 }} ==PEL: Position-Enhanced Length Filter for Set Similarity Joins== https://ceur-ws.org/Vol-1313/paper_16.pdf

PEL: Position-Enhanced Length Filter for Set Similarity
Joins

Willi Mann Nikolaus Augsten
Department of Computer Sciences Department of Computer Sciences
Jakob-Haringer-Str. 2 Jakob-Haringer-Str. 2
Salzburg, Austria Salzburg, Austria
wmann@cosy.sbg.ac.at nikolaus.augsten@sbg.ac.at

ABSTRACT opments in the field leverage the position of the matching
Set similarity joins compute all pairs of similar sets from two tokens between two prefixes (positional filter) and the num-
collections of sets. Set similarity joins are typically imple- ber of remaining tokens in the overall set (length filter) to
mented in a filter-verify framework: a filter generates candi- further reduce the number of candidates.
date pairs, possibly including false positives, which must be This paper proposes a new filter, the position-enhanced
verified to produce the final join result. Good filters produce length filter (PEL), which tightens the length filter based
a small number of false positives, while they reduce the time on the position of the current token match. In previous
they spend on hopeless candidates. The best known algo- work, position and length information have only partially
rithms generate candidates using the so-called prefix filter been exploited. PEL fully leverages position and length to
in conjunction with length- and position-based filters. achieve additional pruning power. As a key feature, PEL
In this paper we show that the potential of length and po- does not require changes in the prefix filter index, but is
sition have only partially been leveraged. We propose a new applied on top of previous algorithms at almost no cost. In
filter, the position-enhanced length filter, which exploits the our experiments we show that PEL is particularly effective
matching position to incrementally tighten the length filter; for foreign joins. PEL also performs well for self joins over
our filter identifies hopeless candidates and avoids process- large collections of small sets.1
ing them. The filter is very efficient, requires no change in The remaining paper is organized as follows: Section 2
the data structures of most prefix filter algorithms, and is introduces the set similarity join, provides background ma-
particularly effective for foreign joins, i.e., joins between two terial, and an in-depth analysis of filtering techniques based
different collections of sets. on position and length. Our novel position-enhanced length
filter (PEL) is introduced in Section 3. We empirically eval-
uate our technique and demonstrate its effectiveness on real
1. INTRODUCTION world data in Section 4. In Section 5 we survey related work
The set similarity join computes all pairs of similar sets and finally draw conclusions in Section 6.
from two collections of sets. The similarity is assessed using
a set similarity function, e.g., set overlap, Jaccard, or Cosine 2. BACKGROUND
similarity, together with a threshold. A pair of sets is in the
join result if the similarity exceeds the threshold. We revisit candidate generation, candidate reduction, and
Set similarity joins have many interesting applications efficient verification techniques discussed in literature. The
ranging from near duplicate detection of Web documents to main concepts of this section are summarized in Figure 1.
community mining in social networks [9]. The set elements
are called tokens [3] and are often used to represent complex
2.1 Candidate Generation
objects, e.g., strings (q-grams [11]) or trees (pq-grams [2]). We shorty explain the prefix filter, translate normal-
The best algorithms for set similarity joins are based on ized thresholds to overlap thresholds, revisit length- and
an inverted list index and the so-called prefix filter [5]. The position-based filter conditions, and discuss prefixes in
prefix filter operates on sorted sets and rejects candidate index implementations of set similarity joins.
pairs that have no overlap in a (short) prefix. Only the pre- Prefix Filter. The fastest set similarity joins are based
fix must be indexed, which leads to substantial savings in on the prefix filter principle [5]: A pair of sorted sets s0 , s1
space and runtime. However, for frequent tokens, the prefix can only meet an overlap threshold tO , i.e., |s0 ∩ s1 | ≥ tO ,
filter produces a large number of candidates. Recent devel- if there is a non-zero overlap in the prefixes of the sets. The
prefixes are of length |s0 | − tO + 1 for s0 and |s1 | − tO + 1 for
s1 . The set similarity join proceeds in three steps: (a) index
the prefixes of one join partner, (b) probe the prefixes of the
other join partner against the index to generate candidates,
(c) verify the candidates by inspecting the full sets.
Copyright c by the paper’s authors. Copying permitted only Threshold for normalized set overlap. Normalized
for private and academic purposes. set similarity measures take the set sizes into account. For
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI- 1
Workshop on Foundations of Databases (Grundlagen von Datenbanken), In a self join, both input relations are identical, which can-
21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org. not be assumed in a foreign join.
Table 1: Set similarity functions and related definitions, extending [7, Table 1] by new pmaxsize.
Similarity function minoverlap(t, s0 , s1 ) minsize(t, s0 ) pmaxsize(t, s0 , p0 ) maxsize(t, s0 )
|s0 ∩s1 | t |s0 |−(1+t)·p0 |s0 |
Jaccard J(s0 , s1 ) = |s 0 ∪s1 | 1+t
(|s0 | + |s1 |) t |s0 | t t

(|s0 |−p0 )2
C(s0 , s1 ) = √|s0 ∩s1 | t2 |s0 | |s0 |
p
Cosine t |s0 | · |s1 | |s0 |·t2 t2
|s0 |·|s1 |

Dice D(s0 , s1 ) = 2·(|s 0 ∩s1 |)
|s0 |+|s1 |
t(|s0 |+|s1 |)
2
t |s0 |
2−t
(2−t)·|s0 |−2p0
t
(2−t)|s0 |
t

Overlap O(s0 , s1 ) = |s0 ∩ s1 | t t ∞ ∞

Previous work:
the purpose of set similarity joins, the normalized threshold
is translated into an overlap threshold (called minoverlap). • minoverlap(t, s0 , s1 ): equivalent overlap threshold for
The overlap threshold may be different for each pair of sets. Jaccard, Cosine, or Dice threshold t for a pair of sets s0 , s1 .
For example, • minsize(t, s0 ), maxsize(t, s0 ): minimum and maximum
p for the Cosine threshold tC (join predicate size of any set that can satisfy threshold t w.r.t. set s0 .
|s0 ∩ s1 |/ |s0 | · |s1 | ≥ tC ), minoverlap(tC , s0 , s1 ) := tC ·
p • maxprefix(t, s0 ) = |s0 | − minsize(t, s0 ) + 1: length of
|s0 | · |s1 |. Table 1 lists the definitions of well-known set probing prefix
similarity functions with the respective overlap thresholds. • midprefix(t, s0 ) = |s0 | − minoverlap(t, s0 , s0 ) + 1: length
Length filter. For a given threshold t and a reference of indexing prefix for self joins
set s0 , the size of the matching partners must be within • minprefix(t, s0 , s1 ) = |s0 | − minoverlap(t, s0 , s1 ) + 1:
the interval [minsize(t, s0 ), maxsize(t, s0 )] (see Table 1). length of optimal prefix for a particular pair of sets
This was first observed for the PartEnum algorithm [1] Position-enhanced length filter (PEL):
and was later called the length filter. Example: |s0 | = 10, • pmaxsize(t, s0 , p0 ): new tighter limit for maximum set size
|s1 | = 6, |s2 | = 16, Cosine threshold tC = 0.8. Since based on the probing set position
minoverlap(tC , s0 , s1 ) = 6.1 > |s1 | we conclude (without
inspecting any set element) that s0 cannot reach threshold
Figure 1: Overview of functions.
tC with s1 . Similarly, minoverlap(tC , s0 , s2 ) = 10.1, thus
s2 is too large to meet the threshold with s0 . In fact,
minsize(tC , s0 ) = 6.4 and maxsize(tC , s0 ) = 15.6.
The positional filter is stricter than the prefix filter and
Prefix length. The prefix length is |s0 | − tO + 1 for is applied on top of it. The pruning power of the positional
a given overlap threshold tO and set s0 . For normalized filter is larger for prefix matches further to right (i.e., when
thresholds t the prefix length does not only depend on s0 , p0 , p1 increase). Since the prefix filter may produce the same
but also on the sets we compare to. If we compare to s1 , the candidate pair multiple times (for each match in the prefix),
minimum prefix size of |s0 | is minprefix(t, s0 , s1 ) = |s0 | − an interesting situation arises: a pair that passes the posi-
minoverlap(t, s0 , s1 ) + 1. When we index one of the join tional filter for the first match may not pass the filter for
partners, we do not know the size of the matching partners later matches. Thus, the positional filter is applied to pairs
upfront and need to cover the worst case; this results in the that are already in the candidate set whenever a new match
prefix length maxprefix(t, s0 ) = |s0 |−minsize(t, s0 )+1 [7], is found. To correctly apply the positional filter we need
which does not depend on s1 . For typical Jaccard thresholds to maintain the overlap value for each pair in the candidate
t ≥ 0.8, this reduces the number of tokens to be processed set. We illustrate the positional filter with examples.
during the candidate generation phase by 80 % or more. Example 1. Set s0 in Figure 2 is the probing set (prefix
For self joins we can further reduce the prefix length [12] length maxprefix = 4), s1 is the indexed set (prefix length
w.r.t. maxprefix: when the index is built on-the-fly in in- midprefix = 2, assuming self join). Set s1 is returned from
creasing order of the sets, then the indexed prefix of s0 will the index due to the match on g (first match between s0 and
never be compared to any set s1 with |s1 | < |s0 |. This al- s1 ). The required overlap is dminoverlapC (0.8, s0 , s1 )e =
lows us to reduce the prefix length to midprefix(t, s0 ) = 8. Since there are only 6 tokens left in s1 after the match,
|s0 | − minoverlap(t, s0 , s0 ) + 1. the maximum overlap we can get is 7, and the pair is pruned.
Positional filter. The minimum prefix length for a pair This is also confirmed by the positional filter condition (1)
of sets is often smaller than the worst case length, which we (o = 0, p0 = 3, p1 = 1).
use to build and probe the index. When we probe the index Example 2. Assume a situation similar to Figure 2, but
with a token from the prefix of s0 and find a match in the the match on g is the second match (i.e., o = 1, p0 = 3,
prefix of set s1 , then the matching token may be outside the p1 = 1). Condition (1) holds and the pair can not be pruned,
optimal prefix. If this is the first matching token between i.e., it remains in the candidate set.
s0 and s1 , we do not need to consider the pair. In general, Example 3. Consider Figure 3 with probing set s0 and
a candidate pair s0 , s1 must be considered only if indexed set s1 . The match on token a adds pair (s0 , s1 ) to
the candidate set. Condition (1) holds for the match on a
minoverlap(t, s0 , s1 ) ≤ o + min{|s0 | − p0 , |s1 | − p1 }, (1)
(o = 0, p0 = 0, p1 = 0), and the pair is not pruned by
where o is the current overlap (i.e., number of matching the positional filter. For the next match (on e), however,
tokens so far excluding the current match) and p0 (p1 ) is condition (1) does not hold (o = 1, p0 = 1, p1 = 4) and
the position of the current match in the prefix of s0 (s1 ); the positional filter removes the pair from the candidate set.
positions start at 0. Thus, the positional filter does not only avoid pairs to enter
pred: C(s0 , s1 ) ≥ 0.8 s0 : b c e f g h ? ? ? pr
⇒ dminoverlap(s0 , s1 , 0.8)e = 8
s1 : a e h ? ? ? ? ? ? idx
7
s0 : c e f g ? ? ? ? ? ? probing set (pr) Figure 4: Verification: where to start?
s1 : a g ? ? ? ? ? ? indexed set (idx)
pred: J(s0 , s1 ) ≥ 0.7 pred: J(s0 , s1 ) ≥ 0.7
7 ⇒ dminoverlap(. . .)e = 6 ⇒ dminoverlap(. . .)e = 5

s0 : c d e ? ? ? ? pr s0 : c d e ? ? ? ? pr
Figure 2: Sets with matching token In prefix: match
impossible due to positions of matching tokens and s1 : e ? ? ? ? ? idx s1 : e ? ? ? ? idx
remaining tokens. (a) Match impossible (b) Match possible

pred: C(s0 , s1 ) ≥ 0.6
Figure 5: Impossible and possible set sizes based on
⇒ dminoverlap(s0 , s1 , 0.8)e = 8 position in s0 and the size-dependent minoverlap.
14
s0 : a e ? ? ? ? ? ? ? ? ? ? ? ? ? ? pr
midprefix (indexing set) as discussed in earlier sections.
s1 : a b c d e ? ? ? ? ? idx Since the sets are sorted, we compute the overlap in a
+1 +1 5 =7<8 merge fashion. At each merge step, we verify if the current
overlap and the remaining set size are sufficient to achieve
Figure 3: Sets with two matching tokens: pruning the threshold, i.e., we check positional filter condition (1).
of candidate pair by second match. (A) Prefix overlap [12] : At verification time we already
know the overlap between the two prefixes of a candidate
pair. This piece of information should be leveraged. Note
the candidate set, but may remove them later. that we cannot simply continue verification after the two
prefixes. This is illustrated in Figure 4: there is 1 match in
2.2 Improving the Prefix Filter the prefixes of s0 and s1 ; when we start verification after the
The prefix filter often produces candidates that will be prefixes, we miss token h. Token h occurs after the prefix
removed immediately in the next filter stage, the positional of s0 but inside the prefix of s1 . Instead, we compare the
filter (see Example 1). Ideally, such candidates are not pro- last element of the prefixes: for the set with the smaller
duced at all. This issue is addressed in the mpjoin algo- element (s0 ), we start verification after the prefix (g). For
rithm [7] as outlined below. the other set (s1 ) we leverage the number of matches in the
Consider condition (1) for the positional filter. We split prefix (overlap o). Since the leftmost positions where these
the condition into two new conditions by expanding the min- matches can appear are the first o elements, we skip o tokens
imum such that the conjunction of the new conditions is and start at position o (token e in s1 ). There is no risk of
equivalent to the positional filter condition: double-counting tokens w.r.t. overlap o since we start after
the end of the prefix in s0 .
minoverlap(t, s0 , s1 ) ≤ o + |s0 | − p0 (2)
(B) Position of last match [7] : A further improvement is
minoverlap(t, s0 , s1 ) ≤ o + |s1 | − p1 (3) to store the position of the last match. Then we start the
verification in set s1 after this position (h in s1 , Figure 4).
The mpjoin algorithm leverages condition (2) as follows.
The probing sets s0 are processed in increasing size order, so Small candidate set vs. fast verification. The po-
|s0 | grows monotonically during the execution of the algo- sitional filter is applied on each candidate pair returned by
rithm. Hence, for a specific set s1 , minoverlap grows mono- the prefix filter. The same candidate pair may be returned
tonically. We assume o = 0 (and justify this assumption multiple times for different matches in the prefix. The po-
later). For a given index entry (s1 , p1 ), the right side of con- sitional filter potentially removes existing candidate pairs
dition (2) is constant, while the left side can only grow. Af- when they appear again (cf. Section 2.1). This reduces the
ter the condition fails to hold for the first time, it will never size of the candidate set, but comes at the cost of (a) lookups
hold again, and the index list entry is removed. For a given in the candidate set, (b) deletions from the candidate set,
index set s1 , this improvement changes the effective length and (c) book keeping of the overlaps for each candidate pair.
of the prefix (i.e., the part of the sets where we may detect Overall, it might be more efficient to batch-verify a larger
matches) w.r.t. a probing set s0 to minprefix(t, s0 , s1 ) = candidate set than to incrementally maintain the candidates;
|s1 | − minoverlap(t, s0 , s1 ) + 1, which is optimal. On the Ribeiro and Härder [7] empirically analyze this trade-off.
downside, a shorter prefix may require more work in the
verification phase: in some cases, the verification can start 3. POSITION-ENHANCED LENGTH FIL-
after the prefix as will be discussed in Section 2.3.
TERING
2.3 Verification In this section, we motivate the position-enhanced length
Efficient verification techniques are crucial for fast set sim- filter (PEL), derive the new filter function pmaxsize, dis-
ilarity joins. We revisit a baseline algorithm and two im- cuss the effect of PEL on self vs. foreign joins, and show how
provements, which affect the verification speed of both false to apply PEL to previous algorithms.
and true positives. Unless explicitly mentioned, the term Motivation. The introduction of the position-enhanced
prefix subsequently refers to maxprefix (probing set) resp. length filter is inspired by examples for positional filtering
1250 base region. The base region is partitioned into four regions
maxsize (A, B, C, and D) by the probing set size and pmaxsize. For
C D probing set size foreign joins, our filter reduces the base region to A+C. If we
set size

assume that all set sizes occur equally likely in the individual
1000 inverted lists of the index, our filter cuts the number of index
B list entries that must be processed by 50%. Since the tokens
A pmaxsize are typically ordered by their frequency, the list length will
minsize increase with increasing matching position. Thus the gain of
800
PEL in practical settings can be expected to be even higher.
0 100 maxprefix 200 This analysis holds for all parameters of Jaccard and Dice.
position in prefix For Cosine, the situation is more tricky since pmaxsize is
quadratic and describes a parabola. Again, this is in our
Figure 6: Illustrating possible set sizes. favor since the parabola is open to the top, and the curve
that splits the base region is below the diagonal.
For self joins, the only relevant regions are A and B since
like Figure 5(a). In set s1 , the only match in the prefix oc- the size of the sets is bounded by the probing set size. Our
curs at the leftmost position. Despite this being the leftmost filter reduces the relevant region from A + B to A. As Fig-
match in s1 , the positional filter removes s1 : the overlap ure 6 illustrates, this reduction is smaller than the reduction
threshold cannot be reached due the position of the match for foreign joins. For the similarity functions in Table 1, B
in s0 . Apparently, the position of the token in the probing is always less than a quarter of the full region A + B. In the
set can render a match of the index sets impossible, inde- example, region B covers about 0.22 of A + B.
pendently of the matching position in the index set. Let us
analyze how we need to modify the example such that it
passes the positional filter: the solution is to shorten index Algorithm 1: AllPairs-PEL(Sp , I, t)
set s1 , as shown in Figure 5(b). This suggests that some Version using pmaxsize for foreign join;
tighter limit on the set size can be derived from the position input : Sp collection of outer sets, I inverted list index
of the matching token. covering maxprefix of inner sets, t similarity
Deriving the PEL filter. For the example in threshold
Figure 5(a) the first part of the positional filter, i.e., output: res set of result pairs (similarity at least t)
condition (2), does not hold. We solve the equation 1 foreach s0 in Sp do
minoverlap(t, s0 , s1 ) ≤ |s0 | − p0 to |s1 | by replacing 2 M = {}; /* Hashmap: candidate set → count */
minoverlap with its definition for the different similarity 3 for p0 ← 0 to maxprefix(t, s0 ) − 1 do
functions. The result is pmaxsize(t, s0 , p0 ), an upper 4 for s1 in Is0 [p] do
bound on the size of eligible sets in the index. This bound 5 if |s1 | < minsize(t, s0 ) then
is at the core of the PEL filter, and definitions of pmaxsize 6 remove index entry with s1 from Is0 [p] ;
for various similarity measures are listed in Table 1. 7 else if |s1 | > pmaxsize(t, s0 , p0 ) then
Application of PEL. We integrate the pmaxsize 8 break;
upper bound into the prefix filter. The basic prefix filter 9 else
algorithm processes a probing set as follows: loop over 10 if M [s1 ] = ∅ then
the tokens of the probing set from position p0 = 0 to 11 M = M ∪ (s1 , 0);
maxprefix(t, s0 ) − 1 and probe each token against the 12 M [s1 ] = M [s1 ] + 1;
index. The index returns a list of sets (their IDs) which 13 end
contain this token. The sets in these lists are ordered by 14 end
increasing size, so we stop processing a list when we hit a 15 /* Verify() verifies the candidates in M */
set that is larger than pmaxsize(t, s0 , p0 ). 16 res = res ∪ V erif y(s0 , M, t);
Intuitively, we move half of the positional filter to the 17 end
prefix filter, where we can evaluate it at lower cost: (a) the
value of pmaxsize needs to be computed only once for each
probing token; (b) we check pmaxsize against the size of
Algorithm. Algorithm 1 shows AllPairs-PEL2 , a ver-
each index list entry, which is a simple integer comparison.
sion of AllPairs enhanced with our PEL filter. AllPairs-
Overall, this is much cheaper than the candidate lookup that
PEL is designed for foreign joins, i.e., the index is con-
the positional filter must do for each index match.
structed in a preprocessing step before the join is executed.
Self Joins vs. Foreign Joins. The PEL filter is more The only difference w.r.t. AllPairs is that AllPairs-PEL uses
powerful on foreign joins than on self joins. In self joins, pmaxsize(t, s0 , p0 ) instead of maxsize(t, s0 ) in the condi-
the size of the probing set is an upper bound for the set tion on line 7. The extensions of the algorithms ppjoin and
size in the index. For all the similarity functions in Table 1, mpjoin with PEL are similar.
pmaxsize is below the probing set size in less than 50% An enhancement that is limited to ppjoin and mpjoin is to
of the prefix positions. Figure 6 gives an example: The simplify the positional filter: PEL ensures that no candidate
probing set size is 1000, the Jaccard threshold is 0.8, so set can fail on the first condition (Equation 2) of the split
minsize(0.8, 1000) = 800, maxsize(0.8, 1000) = 1250, and positional filter. Therefore, we remove the first part of the
the prefix size is 201. The x-axis represents the position in
the prefix, the y-axis represents bounds for the set size of the 2
We use the -PEL suffix for algorithm variants that make
other set. The region between minsize and maxsize is the use of our PEL filter.
collections are identical. Figures 7(a) and 7(b) show the per-
Table 2: Input set characteristics. formance on DBLP with Jaccard similarity threshold 0.75
#sets in set size # of diff.
and Cosine similarity 0.85. These thresholds produce result
collection min max avg tokens
sets of similar size. We observe a speedup of factor 3.5 for
DBLP 3.9 · 106 2 283 12 1.34 · 106 AllPairs-PEL over AllPairs with Jaccard, and a speedup of
5
TREC 3.5 · 10 2 628 134 3.4 · 105 3.8 with Cosine. For mpjoin to mpjoin-PEL we observe a
5
ENRON 5 · 10 1 192 000 298 7.3 · 106 speedup of 4.0 with Jaccard and 4.2 with Cosine. Thus, the
PEL filter provides a substantial speed advantage on these
data points. For other Jaccard thresholds and mpjoin vs.
minimum in the original positional filter (Equation 1), such mpjoin-PEL, the maximum speedup is 4.1 and the minimum
that the minimum is no longer needed. speedup is 1.02. For threshold 0.5, only mpjoin-PEL finishes
Note that the removal of index entries on line 6 is the eas- within the time limit of one hour. Among all Cosine thresh-
iest solution to apply minsize, but in real-world scenarios, olds and mpjoin vs. mpjoin-PEL, the maximum speedup is
it only makes sense for a single join to be executed. For 4.2 (tC = 0.85), the minimum speedup is 1.14 (tC = 0.95).
a similarity search scenario, we recommend to apply binary We only consider Cosine thresholds tC ≥ 0.75, because the
search on the lists. For multiple joins with the same indexed non-PEL variants exceed the time limit for smaller thresh-
sets in a row, we suggest to use an overlay over the index olds. There is no data point where PEL slows down an
that stores the pointer for each list where to start. algorithm. It is also worth noting that AllPairs-PEL beats
mpjoin by a factor of 2.7 with Jaccard threshold tJ = 0.75
4. EXPERIMENTS and 3.3 on Cosine threshold tC = 0.85; we observe such
speedups also on other thresholds.
We compare the algorithms AllPairs [4] and mpjoin [7] Figure 7(c) shows the performance on TREC with Jac-
with and without our PEL extension on both self and for- card threshold tJ = 0.75. The speedup for AllPairs-PEL
eign joins. Our implementation works on integers, which we compared to AllPairs is 1.64, and for mpjoin-PEL compared
order by the frequency of appearance in the collection. The to mpjoin 2.3. The minimum speedup of mpjoin over all
time to generate integers from tokens is not measured in our thresholds is 1.26 (tJ = 0.95), the maximum speedup is
experiments since it is the same for all algorithms. We also 2.3 (tJ = 0.75). Performance gains on ENRON are slightly
do not consider the indexing time for foreign joins, which smaller – we observe speedups of 1.15 (AllPairs-PEL over
is considered a preprocessing step. The use of PEL has no AllPairs), and 1.85 (mpjoin-PEL over mpjoin) on Jaccard
impact on the index construction. The prefix sizes are max- threshold tJ = 0.75 as illustrated in Figure 7(d). The mini-
prefix for foreign joins and midprefix for self joins. For self mum speedup of mpjoin over mpjoin-PEL is 1.24 (tJ = 0.9
joins, we include the indexing time in the overall runtime and 0.95), the maximum speedup is 2.0 (tJ = 0.6).
since the index is built incrementally on-the-fly. We report Figure 8(a) shows the number of processed index entries
results for Jaccard and Cosine similarity, the results for Dice (i.e., the overall length of the inverted lists that must be
show similar behavior. Our experiments are executed on the scanned) for Jaccard threshold tJ = 0.75 on TREC. The
following real-world data sets: number of index entries increases by a factor of 1.67 for
AllPairs w.r.t. AllPairs-PEL, and a factor of 4.0 for mpjoin
• DBLP3 : Snapshot (February 2014) of the DBLP bib- w.r.t. mpjoin-PEL.
liographic database. We concatenate authors and ti- Figure 8(b) shows the number of candidates that must
tle of each entry and generate tokens by splitting on be verified for Jaccard threshold tJ = 0.75 on TREC. On
whitespace. AllPairs, PEL decreases the number of candidates. This is
because AllPairs does not apply any further filters before
• TREC4 : References from the MEDLINE database,
verification. On mpjoin, the number of candidates increases
years 1987–1991. We concatenate author, title, and
by 20%. This is due to the smaller number of matches from
abstract, remove punctuation, and split on whitespace.
the prefix index in the case of PEL: later matches can remove
• ENRON5 : Real e-mail messages published by FERC pairs from the candidate set (using the positional filter) and
after the ENRON bankruptcy. We concatenate sub- thus decrease its size. However, the larger candidate set
ject and body fields, remove punctuation, and split on for PEL does not seriously impact the overall performance:
whitespace. the positional filter is also applied in the verification phase,
where the extra candidate pairs are pruned immediately.
Table 2 lists basic characteristics of the input sets. We Self joins. Due to space constraints, we only show re-
conduct our experiments on an Intel Xeon 2.60GHz machine sults for DBLP and ENRON, i.e., the input sets with the
with 128 GB RAM running Debian 7.6 ’wheezy’. We com- smallest and the largest average set sizes, respectively. Fig-
pile our code with gcc -O3. Claims about results on “all” ure 7(e) and 7(f) show the performance of the algorithms on
thresholds for a particular data set refer to the thresholds DBLP and ENRON with Jaccard threshold tJ = 0.75. Our
{0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. We stop tests whose PEL filter provides a speed up of about 1.22 for AllPairs,
runtime exceeds one hour. and 1.17 for mpjoin on DBLP. The maximum speedup we
Foreign Joins. For foreign joins, we join a collection of observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for
sets with a copy of itself, but do not leverage the fact that the tJ = 0.95 there is no speed difference between mpjoin and
mpjoin-PEL. On the large sets of ENRON, the performance
3
http://www.informatik.uni-trier.de/~Ley/db/ is worse for AllPairs-PEL because verification takes more
4
http://trec.nist.gov/data/t9_filtering.html time than PEL can save in the probing phase (by reducing
5
https://www.cs.cmu.edu/~enron/ the number of processed index entries). There is almost no
sec sec sec 400 sec sec sec
500 500 100
150
400 300 30
400 80

AllPairs-PEL

AllPairs-PEL
mpjoin-PEL

mpjoin-PEL

mpjoin-PEL
300 300 100 20 60
200
AllPairs

AllPairs

AllPairs
200 mpjoin 200 40

mpjoin

mpjoin
50 100 10
100 100 20
0 0 0 0 0 0
(a) Foreign join, (b) Foreign join, (c) Foreign join, (d) Foreign j., EN- (e) Self join, (f) Self join, EN-
DBLP, tJ = 0.75. DBLP, tC = 0.85. TREC, tJ = 0.75 RON, tJ = 0.75 DBLP, tJ = 0.75 RON, tJ = 0.75

Figure 7: Join times.

8.0e10 the PEL filter improves performance in almost any foreign
1.5e10 join and also in some self join scenarios, despite the fact that
6.0e10 it may increase the number of candidates to be verified.
AllPairs-PEL

AllPairs-PEL
mpjoin-PEL

mpjoin-PEL
1.0e10
4.0e10
7. REFERENCES
AllPairs

AllPairs
mpjoin

mpjoin

2.0e10 5.0e9
[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
set-similarity joins. In Proc. VLDB, pages 918 – 929,
0 0
2006.
(a) Number of pro- (b) Number of candi- [2] N. Augsten, M. H. Böhlen, and J. Gamper. The
cessed index entries. dates to be verify.
pq-gram distance between ordered labeled trees. ACM
TODS, 35(1), 2010.
Figure 8: TREC (foreign join): tJ = 0.75 [3] N. Augsten, A. Miraglia, T. Neumann, and
A. Kemper. On-the-fly token similarity joins in
relational databases. In Proc. SIGMOD, pages 1495 –
difference between mpjoin and mpjoin-PEL. The maximum 1506. ACM, 2014.
increase in speed is 9% (threshold 0.8, mpjoin), the maxi- [4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all
mum slowdown is 30% (threshold 0.6, AllPairs). pairs similarity search. WWW, 7:131 – 140, 2007.
Summarizing, PEL substantially improves the runtime in [5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
foreign join scenarios. For self joins, PEL is less effective operator for similarity joins in data cleaning. In Proc.
and, in some cases, may even slightly increase the runtime. ICDE, page 5. IEEE, 2006.
[6] A. Gionis, P. Indyk, and R. Motwani. Similarity
5. RELATED WORK search in high dimensions via hashing. In Proc.
VLDB, pages 518–529, 1999.
Sarawagi and Kirpal [8] first discuss efficient algorithms
for exact set similarity joins. Chaudhuri et al. [5] propose [7] L. A. Ribeiro and T. Härder. Generalizing prefix
SSJoin as an in-database operator for set similarity joins filtering to improve set similarity joins. Information
and introduce the prefix filter. AllPairs [4] uses the prefix Systems, 36(1):62 – 78, 2011.
filter with an inverted list index. The ppjoin algorithm [12] [8] S. Sarawagi and A. Kirpal. Efficient set joins on
extends AllPairs by the positional filter and introduces the similarity predicates. In Proc. SIGMOD, pages 743 –
suffix filter, which reduces the candidate set before the final 754. ACM, 2004.
verification. The mpjoin algorithm [7] improves over ppjoin [9] E. Spertus, M. Sahami, and O. Buyukkokten.
by reducing the number of entries returned from the index. Evaluating similarity measures: A large-scale study in
AdaptJoin [10] takes the opposite approach and drastically the orkut social network. In Proc. SIGKDD, pages 678
reduces the number of candidates at the expense of longer – 684. ACM, 2005.
prefixes. Gionis et al. [6] propose an approximate algorithm [10] J. Wang, G. Li, and J. Feng. Can we beat the prefix
based on LSH for set similarity joins. Recently, an SQL op- filtering?: An adaptive framework for similarity join
erator for the token generation problem was introduced [3]. and search. In Proc. SIGMOD, pages 85 – 96. ACM,
2012.
[11] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient
6. CONCLUSIONS algorithm for similarity joins with edit distance
We presented PEL, a new filter based on the pmaxsize constraints. In Proc. VLDB, 2008.
upper bound derived in this paper. PEL can be easily [12] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang.
plugged into algorithms that store prefixes in an inverted Efficient similarity joins for near-duplicate detection.
list index (e.g., AllPairs, ppjoin, or mpjoin). For these algo- ACM TODS, 36(3):15, 2011.
rithms, PEL will effectively reduce the number of list entries
that must be processed. This reduces the overall lookup time
in the inverted list index at the cost of a potentially larger
candidate set. We analyzed this trade-off for foreign joins
and self joins. Our empirical evaluation demonstrated that