PEL: Position-Enhanced Length Filter for Set Similarity
                            Joins

                                Willi Mann                                               Nikolaus Augsten
                  Department of Computer Sciences                               Department of Computer Sciences
                        Jakob-Haringer-Str. 2                                         Jakob-Haringer-Str. 2
                          Salzburg, Austria                                             Salzburg, Austria
                      wmann@cosy.sbg.ac.at                                       nikolaus.augsten@sbg.ac.at

ABSTRACT                                                                  opments in the field leverage the position of the matching
Set similarity joins compute all pairs of similar sets from two           tokens between two prefixes (positional filter) and the num-
collections of sets. Set similarity joins are typically imple-            ber of remaining tokens in the overall set (length filter) to
mented in a filter-verify framework: a filter generates candi-            further reduce the number of candidates.
date pairs, possibly including false positives, which must be                This paper proposes a new filter, the position-enhanced
verified to produce the final join result. Good filters produce           length filter (PEL), which tightens the length filter based
a small number of false positives, while they reduce the time             on the position of the current token match. In previous
they spend on hopeless candidates. The best known algo-                   work, position and length information have only partially
rithms generate candidates using the so-called prefix filter              been exploited. PEL fully leverages position and length to
in conjunction with length- and position-based filters.                   achieve additional pruning power. As a key feature, PEL
   In this paper we show that the potential of length and po-             does not require changes in the prefix filter index, but is
sition have only partially been leveraged. We propose a new               applied on top of previous algorithms at almost no cost. In
filter, the position-enhanced length filter, which exploits the           our experiments we show that PEL is particularly effective
matching position to incrementally tighten the length filter;             for foreign joins. PEL also performs well for self joins over
our filter identifies hopeless candidates and avoids process-             large collections of small sets.1
ing them. The filter is very efficient, requires no change in                The remaining paper is organized as follows: Section 2
the data structures of most prefix filter algorithms, and is              introduces the set similarity join, provides background ma-
particularly effective for foreign joins, i.e., joins between two         terial, and an in-depth analysis of filtering techniques based
different collections of sets.                                            on position and length. Our novel position-enhanced length
                                                                          filter (PEL) is introduced in Section 3. We empirically eval-
                                                                          uate our technique and demonstrate its effectiveness on real
1.    INTRODUCTION                                                        world data in Section 4. In Section 5 we survey related work
   The set similarity join computes all pairs of similar sets             and finally draw conclusions in Section 6.
from two collections of sets. The similarity is assessed using
a set similarity function, e.g., set overlap, Jaccard, or Cosine          2.    BACKGROUND
similarity, together with a threshold. A pair of sets is in the
join result if the similarity exceeds the threshold.                         We revisit candidate generation, candidate reduction, and
   Set similarity joins have many interesting applications                efficient verification techniques discussed in literature. The
ranging from near duplicate detection of Web documents to                 main concepts of this section are summarized in Figure 1.
community mining in social networks [9]. The set elements
are called tokens [3] and are often used to represent complex
                                                                          2.1    Candidate Generation
objects, e.g., strings (q-grams [11]) or trees (pq-grams [2]).               We shorty explain the prefix filter, translate normal-
   The best algorithms for set similarity joins are based on              ized thresholds to overlap thresholds, revisit length- and
an inverted list index and the so-called prefix filter [5]. The           position-based filter conditions, and discuss prefixes in
prefix filter operates on sorted sets and rejects candidate               index implementations of set similarity joins.
pairs that have no overlap in a (short) prefix. Only the pre-                Prefix Filter. The fastest set similarity joins are based
fix must be indexed, which leads to substantial savings in                on the prefix filter principle [5]: A pair of sorted sets s0 , s1
space and runtime. However, for frequent tokens, the prefix               can only meet an overlap threshold tO , i.e., |s0 ∩ s1 | ≥ tO ,
filter produces a large number of candidates. Recent devel-               if there is a non-zero overlap in the prefixes of the sets. The
                                                                          prefixes are of length |s0 | − tO + 1 for s0 and |s1 | − tO + 1 for
                                                                          s1 . The set similarity join proceeds in three steps: (a) index
                                                                          the prefixes of one join partner, (b) probe the prefixes of the
                                                                          other join partner against the index to generate candidates,
                                                                          (c) verify the candidates by inspecting the full sets.
Copyright c by the paper’s authors. Copying permitted only                  Threshold for normalized set overlap. Normalized
for private and academic purposes.                                        set similarity measures take the set sizes into account. For
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-     1
Workshop on Foundations of Databases (Grundlagen von Datenbanken),          In a self join, both input relations are identical, which can-
21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.   not be assumed in a foreign join.
       Table 1: Set similarity functions and related definitions, extending [7, Table 1] by new pmaxsize.
           Similarity function   minoverlap(t, s0 , s1 ) minsize(t, s0 ) pmaxsize(t, s0 , p0 ) maxsize(t, s0 )
                            |s0 ∩s1 |          t                                                  |s0 |−(1+t)·p0            |s0 |
 Jaccard      J(s0 , s1 ) = |s 0 ∪s1 |        1+t
                                                  (|s0 | + |s1 |)          t |s0 |                       t                    t

                                                                                                  (|s0 |−p0 )2
              C(s0 , s1 ) = √|s0 ∩s1 |                                     t2 |s0 |                                         |s0 |
                                                  p
   Cosine                                     t       |s0 | · |s1 |                                  |s0 |·t2                t2
                                |s0 |·|s1 |

     Dice     D(s0 , s1 ) = 2·(|s 0 ∩s1 |)
                             |s0 |+|s1 |
                                              t(|s0 |+|s1 |)
                                                     2
                                                                           t |s0 |
                                                                            2−t
                                                                                                  (2−t)·|s0 |−2p0
                                                                                                         t
                                                                                                                            (2−t)|s0 |
                                                                                                                               t

 Overlap      O(s0 , s1 ) = |s0 ∩ s1 |        t                            t                      ∞                         ∞


                                                                                     Previous work:
the purpose of set similarity joins, the normalized threshold
is translated into an overlap threshold (called minoverlap).                            • minoverlap(t, s0 , s1 ): equivalent overlap threshold for
The overlap threshold may be different for each pair of sets.                             Jaccard, Cosine, or Dice threshold t for a pair of sets s0 , s1 .
For example,                                                                            • minsize(t, s0 ), maxsize(t, s0 ): minimum and maximum
             p for the Cosine threshold tC (join predicate                                size of any set that can satisfy threshold t w.r.t. set s0 .
|s0 ∩ s1 |/      |s0 | · |s1 | ≥ tC ), minoverlap(tC , s0 , s1 ) := tC ·
p                                                                                       • maxprefix(t, s0 ) = |s0 | − minsize(t, s0 ) + 1: length of
   |s0 | · |s1 |. Table 1 lists the definitions of well-known set                         probing prefix
similarity functions with the respective overlap thresholds.                            • midprefix(t, s0 ) = |s0 | − minoverlap(t, s0 , s0 ) + 1: length
   Length filter. For a given threshold t and a reference                                 of indexing prefix for self joins
set s0 , the size of the matching partners must be within                               • minprefix(t, s0 , s1 ) = |s0 | − minoverlap(t, s0 , s1 ) + 1:
the interval [minsize(t, s0 ), maxsize(t, s0 )] (see Table 1).                            length of optimal prefix for a particular pair of sets
This was first observed for the PartEnum algorithm [1]                               Position-enhanced length filter (PEL):
and was later called the length filter. Example: |s0 | = 10,                            • pmaxsize(t, s0 , p0 ): new tighter limit for maximum set size
|s1 | = 6, |s2 | = 16, Cosine threshold tC = 0.8. Since                                   based on the probing set position
minoverlap(tC , s0 , s1 ) = 6.1 > |s1 | we conclude (without
inspecting any set element) that s0 cannot reach threshold
                                                                                                 Figure 1: Overview of functions.
tC with s1 . Similarly, minoverlap(tC , s0 , s2 ) = 10.1, thus
s2 is too large to meet the threshold with s0 . In fact,
minsize(tC , s0 ) = 6.4 and maxsize(tC , s0 ) = 15.6.
                                                                                        The positional filter is stricter than the prefix filter and
   Prefix length. The prefix length is |s0 | − tO + 1 for                            is applied on top of it. The pruning power of the positional
a given overlap threshold tO and set s0 . For normalized                             filter is larger for prefix matches further to right (i.e., when
thresholds t the prefix length does not only depend on s0 ,                          p0 , p1 increase). Since the prefix filter may produce the same
but also on the sets we compare to. If we compare to s1 , the                        candidate pair multiple times (for each match in the prefix),
minimum prefix size of |s0 | is minprefix(t, s0 , s1 ) = |s0 | −                     an interesting situation arises: a pair that passes the posi-
minoverlap(t, s0 , s1 ) + 1. When we index one of the join                           tional filter for the first match may not pass the filter for
partners, we do not know the size of the matching partners                           later matches. Thus, the positional filter is applied to pairs
upfront and need to cover the worst case; this results in the                        that are already in the candidate set whenever a new match
prefix length maxprefix(t, s0 ) = |s0 |−minsize(t, s0 )+1 [7],                       is found. To correctly apply the positional filter we need
which does not depend on s1 . For typical Jaccard thresholds                         to maintain the overlap value for each pair in the candidate
t ≥ 0.8, this reduces the number of tokens to be processed                           set. We illustrate the positional filter with examples.
during the candidate generation phase by 80 % or more.                                  Example 1. Set s0 in Figure 2 is the probing set (prefix
   For self joins we can further reduce the prefix length [12]                       length maxprefix = 4), s1 is the indexed set (prefix length
w.r.t. maxprefix: when the index is built on-the-fly in in-                          midprefix = 2, assuming self join). Set s1 is returned from
creasing order of the sets, then the indexed prefix of s0 will                       the index due to the match on g (first match between s0 and
never be compared to any set s1 with |s1 | < |s0 |. This al-                         s1 ). The required overlap is dminoverlapC (0.8, s0 , s1 )e =
lows us to reduce the prefix length to midprefix(t, s0 ) =                           8. Since there are only 6 tokens left in s1 after the match,
|s0 | − minoverlap(t, s0 , s0 ) + 1.                                                 the maximum overlap we can get is 7, and the pair is pruned.
   Positional filter. The minimum prefix length for a pair                           This is also confirmed by the positional filter condition (1)
of sets is often smaller than the worst case length, which we                        (o = 0, p0 = 3, p1 = 1).
use to build and probe the index. When we probe the index                               Example 2. Assume a situation similar to Figure 2, but
with a token from the prefix of s0 and find a match in the                           the match on g is the second match (i.e., o = 1, p0 = 3,
prefix of set s1 , then the matching token may be outside the                        p1 = 1). Condition (1) holds and the pair can not be pruned,
optimal prefix. If this is the first matching token between                          i.e., it remains in the candidate set.
s0 and s1 , we do not need to consider the pair. In general,                            Example 3. Consider Figure 3 with probing set s0 and
a candidate pair s0 , s1 must be considered only if                                  indexed set s1 . The match on token a adds pair (s0 , s1 ) to
                                                                                     the candidate set. Condition (1) holds for the match on a
  minoverlap(t, s0 , s1 ) ≤ o + min{|s0 | − p0 , |s1 | − p1 }, (1)
                                                                                     (o = 0, p0 = 0, p1 = 0), and the pair is not pruned by
where o is the current overlap (i.e., number of matching                             the positional filter. For the next match (on e), however,
tokens so far excluding the current match) and p0 (p1 ) is                           condition (1) does not hold (o = 1, p0 = 1, p1 = 4) and
the position of the current match in the prefix of s0 (s1 );                         the positional filter removes the pair from the candidate set.
positions start at 0.                                                                Thus, the positional filter does not only avoid pairs to enter
      pred: C(s0 , s1 ) ≥ 0.8                                                     s0 : b c e f g h ? ? ? pr
         ⇒ dminoverlap(s0 , s1 , 0.8)e = 8
                                                                                  s1 : a e h ? ? ? ? ? ? idx
                              7
        s0 : c e f g ? ? ? ? ? ? probing set (pr)                            Figure 4: Verification: where to start?
       s1 :   a g ? ? ? ? ? ? indexed set (idx)
                                                                     pred: J(s0 , s1 ) ≥ 0.7        pred: J(s0 , s1 ) ≥ 0.7
                      7                                                 ⇒ dminoverlap(. . .)e = 6      ⇒ dminoverlap(. . .)e = 5

                                                                      s0 : c d e ? ? ? ? pr           s0 : c d e ? ? ? ? pr
Figure 2: Sets with matching token In prefix: match
impossible due to positions of matching tokens and                    s1 : e ? ? ? ? ? idx            s1 : e ? ? ? ? idx
remaining tokens.                                                       (a) Match impossible              (b) Match possible

  pred: C(s0 , s1 ) ≥ 0.6
                                                                     Figure 5: Impossible and possible set sizes based on
     ⇒ dminoverlap(s0 , s1 , 0.8)e = 8                               position in s0 and the size-dependent minoverlap.
                                 14
    s0 : a e ? ? ? ? ? ? ? ? ? ? ? ? ? ? pr
                                                                     midprefix (indexing set) as discussed in earlier sections.
      s1 : a b c d e ? ? ? ? ? idx                                      Since the sets are sorted, we compute the overlap in a
           +1     +1     5     =7<8                                  merge fashion. At each merge step, we verify if the current
                                                                     overlap and the remaining set size are sufficient to achieve
Figure 3: Sets with two matching tokens: pruning                     the threshold, i.e., we check positional filter condition (1).
of candidate pair by second match.                                      (A) Prefix overlap [12] : At verification time we already
                                                                     know the overlap between the two prefixes of a candidate
                                                                     pair. This piece of information should be leveraged. Note
the candidate set, but may remove them later.                        that we cannot simply continue verification after the two
                                                                     prefixes. This is illustrated in Figure 4: there is 1 match in
2.2    Improving the Prefix Filter                                   the prefixes of s0 and s1 ; when we start verification after the
   The prefix filter often produces candidates that will be          prefixes, we miss token h. Token h occurs after the prefix
removed immediately in the next filter stage, the positional         of s0 but inside the prefix of s1 . Instead, we compare the
filter (see Example 1). Ideally, such candidates are not pro-        last element of the prefixes: for the set with the smaller
duced at all. This issue is addressed in the mpjoin algo-            element (s0 ), we start verification after the prefix (g). For
rithm [7] as outlined below.                                         the other set (s1 ) we leverage the number of matches in the
   Consider condition (1) for the positional filter. We split        prefix (overlap o). Since the leftmost positions where these
the condition into two new conditions by expanding the min-          matches can appear are the first o elements, we skip o tokens
imum such that the conjunction of the new conditions is              and start at position o (token e in s1 ). There is no risk of
equivalent to the positional filter condition:                       double-counting tokens w.r.t. overlap o since we start after
                                                                     the end of the prefix in s0 .
              minoverlap(t, s0 , s1 ) ≤ o + |s0 | − p0        (2)
                                                                        (B) Position of last match [7] : A further improvement is
              minoverlap(t, s0 , s1 ) ≤ o + |s1 | − p1        (3)    to store the position of the last match. Then we start the
                                                                     verification in set s1 after this position (h in s1 , Figure 4).
The mpjoin algorithm leverages condition (2) as follows.
The probing sets s0 are processed in increasing size order, so          Small candidate set vs. fast verification. The po-
|s0 | grows monotonically during the execution of the algo-          sitional filter is applied on each candidate pair returned by
rithm. Hence, for a specific set s1 , minoverlap grows mono-         the prefix filter. The same candidate pair may be returned
tonically. We assume o = 0 (and justify this assumption              multiple times for different matches in the prefix. The po-
later). For a given index entry (s1 , p1 ), the right side of con-   sitional filter potentially removes existing candidate pairs
dition (2) is constant, while the left side can only grow. Af-       when they appear again (cf. Section 2.1). This reduces the
ter the condition fails to hold for the first time, it will never    size of the candidate set, but comes at the cost of (a) lookups
hold again, and the index list entry is removed. For a given         in the candidate set, (b) deletions from the candidate set,
index set s1 , this improvement changes the effective length         and (c) book keeping of the overlaps for each candidate pair.
of the prefix (i.e., the part of the sets where we may detect        Overall, it might be more efficient to batch-verify a larger
matches) w.r.t. a probing set s0 to minprefix(t, s0 , s1 ) =         candidate set than to incrementally maintain the candidates;
|s1 | − minoverlap(t, s0 , s1 ) + 1, which is optimal. On the        Ribeiro and Härder [7] empirically analyze this trade-off.
downside, a shorter prefix may require more work in the
verification phase: in some cases, the verification can start        3.   POSITION-ENHANCED LENGTH FIL-
after the prefix as will be discussed in Section 2.3.
                                                                          TERING
2.3    Verification                                                     In this section, we motivate the position-enhanced length
   Efficient verification techniques are crucial for fast set sim-   filter (PEL), derive the new filter function pmaxsize, dis-
ilarity joins. We revisit a baseline algorithm and two im-           cuss the effect of PEL on self vs. foreign joins, and show how
provements, which affect the verification speed of both false        to apply PEL to previous algorithms.
and true positives. Unless explicitly mentioned, the term              Motivation. The introduction of the position-enhanced
prefix subsequently refers to maxprefix (probing set) resp.          length filter is inspired by examples for positional filtering
           1250                                                     base region. The base region is partitioned into four regions
                                             maxsize                (A, B, C, and D) by the probing set size and pmaxsize. For
                      C              D    probing set size          foreign joins, our filter reduces the base region to A+C. If we
set size


                                                                    assume that all set sizes occur equally likely in the individual
           1000                                                     inverted lists of the index, our filter cuts the number of index
                                                   B                list entries that must be processed by 50%. Since the tokens
                      A       pmaxsize                              are typically ordered by their frequency, the list length will
                                         minsize                    increase with increasing matching position. Thus the gain of
            800
                                                                    PEL in practical settings can be expected to be even higher.
                  0                 100        maxprefix 200        This analysis holds for all parameters of Jaccard and Dice.
                             position in prefix                     For Cosine, the situation is more tricky since pmaxsize is
                                                                    quadratic and describes a parabola. Again, this is in our
              Figure 6: Illustrating possible set sizes.            favor since the parabola is open to the top, and the curve
                                                                    that splits the base region is below the diagonal.
                                                                       For self joins, the only relevant regions are A and B since
like Figure 5(a). In set s1 , the only match in the prefix oc-      the size of the sets is bounded by the probing set size. Our
curs at the leftmost position. Despite this being the leftmost      filter reduces the relevant region from A + B to A. As Fig-
match in s1 , the positional filter removes s1 : the overlap        ure 6 illustrates, this reduction is smaller than the reduction
threshold cannot be reached due the position of the match           for foreign joins. For the similarity functions in Table 1, B
in s0 . Apparently, the position of the token in the probing        is always less than a quarter of the full region A + B. In the
set can render a match of the index sets impossible, inde-          example, region B covers about 0.22 of A + B.
pendently of the matching position in the index set. Let us
analyze how we need to modify the example such that it
passes the positional filter: the solution is to shorten index       Algorithm 1: AllPairs-PEL(Sp , I, t)
set s1 , as shown in Figure 5(b). This suggests that some             Version using pmaxsize for foreign join;
tighter limit on the set size can be derived from the position        input : Sp collection of outer sets, I inverted list index
of the matching token.                                                         covering maxprefix of inner sets, t similarity
   Deriving the PEL filter. For the example in                                 threshold
Figure 5(a) the first part of the positional filter, i.e.,            output: res set of result pairs (similarity at least t)
condition (2), does not hold. We solve the equation                 1 foreach s0 in Sp do
minoverlap(t, s0 , s1 ) ≤ |s0 | − p0 to |s1 | by replacing          2    M = {}; /* Hashmap: candidate set → count */
minoverlap with its definition for the different similarity         3    for p0 ← 0 to maxprefix(t, s0 ) − 1 do
functions. The result is pmaxsize(t, s0 , p0 ), an upper            4        for s1 in Is0 [p] do
bound on the size of eligible sets in the index. This bound         5           if |s1 | < minsize(t, s0 ) then
is at the core of the PEL filter, and definitions of pmaxsize       6               remove index entry with s1 from Is0 [p] ;
for various similarity measures are listed in Table 1.              7           else if |s1 | > pmaxsize(t, s0 , p0 ) then
   Application of PEL. We integrate the pmaxsize                    8               break;
upper bound into the prefix filter. The basic prefix filter         9           else
algorithm processes a probing set as follows: loop over            10               if M [s1 ] = ∅ then
the tokens of the probing set from position p0 = 0 to              11                    M = M ∪ (s1 , 0);
maxprefix(t, s0 ) − 1 and probe each token against the             12               M [s1 ] = M [s1 ] + 1;
index. The index returns a list of sets (their IDs) which          13        end
contain this token. The sets in these lists are ordered by         14    end
increasing size, so we stop processing a list when we hit a        15    /* Verify() verifies the candidates in M */
set that is larger than pmaxsize(t, s0 , p0 ).                     16    res = res ∪ V erif y(s0 , M, t);
   Intuitively, we move half of the positional filter to the       17 end
prefix filter, where we can evaluate it at lower cost: (a) the
value of pmaxsize needs to be computed only once for each
probing token; (b) we check pmaxsize against the size of
                                                                       Algorithm. Algorithm 1 shows AllPairs-PEL2 , a ver-
each index list entry, which is a simple integer comparison.
                                                                    sion of AllPairs enhanced with our PEL filter. AllPairs-
Overall, this is much cheaper than the candidate lookup that
                                                                    PEL is designed for foreign joins, i.e., the index is con-
the positional filter must do for each index match.
                                                                    structed in a preprocessing step before the join is executed.
   Self Joins vs. Foreign Joins. The PEL filter is more             The only difference w.r.t. AllPairs is that AllPairs-PEL uses
powerful on foreign joins than on self joins. In self joins,        pmaxsize(t, s0 , p0 ) instead of maxsize(t, s0 ) in the condi-
the size of the probing set is an upper bound for the set           tion on line 7. The extensions of the algorithms ppjoin and
size in the index. For all the similarity functions in Table 1,     mpjoin with PEL are similar.
pmaxsize is below the probing set size in less than 50%                An enhancement that is limited to ppjoin and mpjoin is to
of the prefix positions. Figure 6 gives an example: The             simplify the positional filter: PEL ensures that no candidate
probing set size is 1000, the Jaccard threshold is 0.8, so          set can fail on the first condition (Equation 2) of the split
minsize(0.8, 1000) = 800, maxsize(0.8, 1000) = 1250, and            positional filter. Therefore, we remove the first part of the
the prefix size is 201. The x-axis represents the position in
the prefix, the y-axis represents bounds for the set size of the    2
                                                                      We use the -PEL suffix for algorithm variants that make
other set. The region between minsize and maxsize is the            use of our PEL filter.
                                                                    collections are identical. Figures 7(a) and 7(b) show the per-
          Table 2: Input set characteristics.                       formance on DBLP with Jaccard similarity threshold 0.75
            #sets in        set size       # of diff.
                                                                    and Cosine similarity 0.85. These thresholds produce result
            collection min      max avg tokens
                                                                    sets of similar size. We observe a speedup of factor 3.5 for
     DBLP    3.9 · 106   2       283  12 1.34 · 106                 AllPairs-PEL over AllPairs with Jaccard, and a speedup of
                     5
     TREC    3.5 · 10    2       628 134     3.4 · 105              3.8 with Cosine. For mpjoin to mpjoin-PEL we observe a
                     5
    ENRON      5 · 10    1 192 000 298       7.3 · 106              speedup of 4.0 with Jaccard and 4.2 with Cosine. Thus, the
                                                                    PEL filter provides a substantial speed advantage on these
                                                                    data points. For other Jaccard thresholds and mpjoin vs.
minimum in the original positional filter (Equation 1), such        mpjoin-PEL, the maximum speedup is 4.1 and the minimum
that the minimum is no longer needed.                               speedup is 1.02. For threshold 0.5, only mpjoin-PEL finishes
   Note that the removal of index entries on line 6 is the eas-     within the time limit of one hour. Among all Cosine thresh-
iest solution to apply minsize, but in real-world scenarios,        olds and mpjoin vs. mpjoin-PEL, the maximum speedup is
it only makes sense for a single join to be executed. For           4.2 (tC = 0.85), the minimum speedup is 1.14 (tC = 0.95).
a similarity search scenario, we recommend to apply binary          We only consider Cosine thresholds tC ≥ 0.75, because the
search on the lists. For multiple joins with the same indexed       non-PEL variants exceed the time limit for smaller thresh-
sets in a row, we suggest to use an overlay over the index          olds. There is no data point where PEL slows down an
that stores the pointer for each list where to start.               algorithm. It is also worth noting that AllPairs-PEL beats
                                                                    mpjoin by a factor of 2.7 with Jaccard threshold tJ = 0.75
4.     EXPERIMENTS                                                  and 3.3 on Cosine threshold tC = 0.85; we observe such
                                                                    speedups also on other thresholds.
   We compare the algorithms AllPairs [4] and mpjoin [7]               Figure 7(c) shows the performance on TREC with Jac-
with and without our PEL extension on both self and for-            card threshold tJ = 0.75. The speedup for AllPairs-PEL
eign joins. Our implementation works on integers, which we          compared to AllPairs is 1.64, and for mpjoin-PEL compared
order by the frequency of appearance in the collection. The         to mpjoin 2.3. The minimum speedup of mpjoin over all
time to generate integers from tokens is not measured in our        thresholds is 1.26 (tJ = 0.95), the maximum speedup is
experiments since it is the same for all algorithms. We also        2.3 (tJ = 0.75). Performance gains on ENRON are slightly
do not consider the indexing time for foreign joins, which          smaller – we observe speedups of 1.15 (AllPairs-PEL over
is considered a preprocessing step. The use of PEL has no           AllPairs), and 1.85 (mpjoin-PEL over mpjoin) on Jaccard
impact on the index construction. The prefix sizes are max-         threshold tJ = 0.75 as illustrated in Figure 7(d). The mini-
prefix for foreign joins and midprefix for self joins. For self     mum speedup of mpjoin over mpjoin-PEL is 1.24 (tJ = 0.9
joins, we include the indexing time in the overall runtime          and 0.95), the maximum speedup is 2.0 (tJ = 0.6).
since the index is built incrementally on-the-fly. We report           Figure 8(a) shows the number of processed index entries
results for Jaccard and Cosine similarity, the results for Dice     (i.e., the overall length of the inverted lists that must be
show similar behavior. Our experiments are executed on the          scanned) for Jaccard threshold tJ = 0.75 on TREC. The
following real-world data sets:                                     number of index entries increases by a factor of 1.67 for
                                                                    AllPairs w.r.t. AllPairs-PEL, and a factor of 4.0 for mpjoin
     • DBLP3 : Snapshot (February 2014) of the DBLP bib-            w.r.t. mpjoin-PEL.
       liographic database. We concatenate authors and ti-             Figure 8(b) shows the number of candidates that must
       tle of each entry and generate tokens by splitting on        be verified for Jaccard threshold tJ = 0.75 on TREC. On
       whitespace.                                                  AllPairs, PEL decreases the number of candidates. This is
                                                                    because AllPairs does not apply any further filters before
     • TREC4 : References from the MEDLINE database,
                                                                    verification. On mpjoin, the number of candidates increases
       years 1987–1991. We concatenate author, title, and
                                                                    by 20%. This is due to the smaller number of matches from
       abstract, remove punctuation, and split on whitespace.
                                                                    the prefix index in the case of PEL: later matches can remove
     • ENRON5 : Real e-mail messages published by FERC              pairs from the candidate set (using the positional filter) and
       after the ENRON bankruptcy. We concatenate sub-              thus decrease its size. However, the larger candidate set
       ject and body fields, remove punctuation, and split on       for PEL does not seriously impact the overall performance:
       whitespace.                                                  the positional filter is also applied in the verification phase,
                                                                    where the extra candidate pairs are pruned immediately.
   Table 2 lists basic characteristics of the input sets. We           Self joins. Due to space constraints, we only show re-
conduct our experiments on an Intel Xeon 2.60GHz machine            sults for DBLP and ENRON, i.e., the input sets with the
with 128 GB RAM running Debian 7.6 ’wheezy’. We com-                smallest and the largest average set sizes, respectively. Fig-
pile our code with gcc -O3. Claims about results on “all”           ure 7(e) and 7(f) show the performance of the algorithms on
thresholds for a particular data set refer to the thresholds        DBLP and ENRON with Jaccard threshold tJ = 0.75. Our
{0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95}. We stop tests whose    PEL filter provides a speed up of about 1.22 for AllPairs,
runtime exceeds one hour.                                           and 1.17 for mpjoin on DBLP. The maximum speedup we
  Foreign Joins. For foreign joins, we join a collection of         observe is 1.70 (AllPairs-PEL vs. AllPairs, tJ = 0.6); for
sets with a copy of itself, but do not leverage the fact that the   tJ = 0.95 there is no speed difference between mpjoin and
                                                                    mpjoin-PEL. On the large sets of ENRON, the performance
3
  http://www.informatik.uni-trier.de/~Ley/db/                       is worse for AllPairs-PEL because verification takes more
4
  http://trec.nist.gov/data/t9_filtering.html                       time than PEL can save in the probing phase (by reducing
5
  https://www.cs.cmu.edu/~enron/                                    the number of processed index entries). There is almost no
         sec                       sec                    sec              400   sec                   sec                                           sec
500                      500                                                                                                                   100
                                                   150
                         400                                               300                    30
400                                                                                                                                             80

          AllPairs-PEL


                                    AllPairs-PEL


                                                            AllPairs-PEL


                                                                                   AllPairs-PEL


                                                                                                                   AllPairs-PEL


                                                                                                                                                      AllPairs-PEL
          mpjoin-PEL


                                    mpjoin-PEL


                                                            mpjoin-PEL


                                                                                   mpjoin-PEL


                                                                                                                                  mpjoin-PEL


                                                                                                                                                      mpjoin-PEL
300                      300                       100                                            20                                            60
                                                                           200
          AllPairs


                                    AllPairs


                                                            AllPairs


                                                                                   AllPairs


                                                                                                        AllPairs


                                                                                                                                                      AllPairs
200       mpjoin         200                                                                                                                    40


                                    mpjoin


                                                            mpjoin


                                                                                   mpjoin


                                                                                                                   mpjoin


                                                                                                                                                      mpjoin
                                                    50                     100                    10
100                      100                                                                                                                    20
     0                     0                         0                       0                     0                                             0
(a) Foreign join,        (b) Foreign join,         (c) Foreign join,       (d) Foreign j., EN-    (e) Self    join,                            (f) Self join, EN-
DBLP, tJ = 0.75.         DBLP, tC = 0.85.          TREC, tJ = 0.75         RON, tJ = 0.75         DBLP, tJ = 0.75                              RON, tJ = 0.75

                                                            Figure 7: Join times.


         8.0e10                                                             the PEL filter improves performance in almost any foreign
                                         1.5e10                             join and also in some self join scenarios, despite the fact that
         6.0e10                                                             it may increase the number of candidates to be verified.
                    AllPairs-PEL


                                                   AllPairs-PEL
                    mpjoin-PEL


                                                   mpjoin-PEL
                                         1.0e10
         4.0e10
                                                                            7.    REFERENCES
                    AllPairs


                                                   AllPairs
                    mpjoin


                                                   mpjoin

         2.0e10                           5.0e9
                                                                             [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
                                                                                 set-similarity joins. In Proc. VLDB, pages 918 – 929,
               0                             0
                                                                                 2006.
         (a) Number of pro-              (b) Number of candi-                [2] N. Augsten, M. H. Böhlen, and J. Gamper. The
         cessed index entries.           dates to be verify.
                                                                                 pq-gram distance between ordered labeled trees. ACM
                                                                                 TODS, 35(1), 2010.
         Figure 8: TREC (foreign join): tJ = 0.75                            [3] N. Augsten, A. Miraglia, T. Neumann, and
                                                                                 A. Kemper. On-the-fly token similarity joins in
                                                                                 relational databases. In Proc. SIGMOD, pages 1495 –
difference between mpjoin and mpjoin-PEL. The maximum                            1506. ACM, 2014.
increase in speed is 9% (threshold 0.8, mpjoin), the maxi-                   [4] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all
mum slowdown is 30% (threshold 0.6, AllPairs).                                   pairs similarity search. WWW, 7:131 – 140, 2007.
  Summarizing, PEL substantially improves the runtime in                     [5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
foreign join scenarios. For self joins, PEL is less effective                    operator for similarity joins in data cleaning. In Proc.
and, in some cases, may even slightly increase the runtime.                      ICDE, page 5. IEEE, 2006.
                                                                             [6] A. Gionis, P. Indyk, and R. Motwani. Similarity
5.       RELATED WORK                                                            search in high dimensions via hashing. In Proc.
                                                                                 VLDB, pages 518–529, 1999.
   Sarawagi and Kirpal [8] first discuss efficient algorithms
for exact set similarity joins. Chaudhuri et al. [5] propose                 [7] L. A. Ribeiro and T. Härder. Generalizing prefix
SSJoin as an in-database operator for set similarity joins                       filtering to improve set similarity joins. Information
and introduce the prefix filter. AllPairs [4] uses the prefix                    Systems, 36(1):62 – 78, 2011.
filter with an inverted list index. The ppjoin algorithm [12]                [8] S. Sarawagi and A. Kirpal. Efficient set joins on
extends AllPairs by the positional filter and introduces the                     similarity predicates. In Proc. SIGMOD, pages 743 –
suffix filter, which reduces the candidate set before the final                  754. ACM, 2004.
verification. The mpjoin algorithm [7] improves over ppjoin                  [9] E. Spertus, M. Sahami, and O. Buyukkokten.
by reducing the number of entries returned from the index.                       Evaluating similarity measures: A large-scale study in
AdaptJoin [10] takes the opposite approach and drastically                       the orkut social network. In Proc. SIGKDD, pages 678
reduces the number of candidates at the expense of longer                        – 684. ACM, 2005.
prefixes. Gionis et al. [6] propose an approximate algorithm                [10] J. Wang, G. Li, and J. Feng. Can we beat the prefix
based on LSH for set similarity joins. Recently, an SQL op-                      filtering?: An adaptive framework for similarity join
erator for the token generation problem was introduced [3].                      and search. In Proc. SIGMOD, pages 85 – 96. ACM,
                                                                                 2012.
                                                                            [11] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient
6.       CONCLUSIONS                                                             algorithm for similarity joins with edit distance
   We presented PEL, a new filter based on the pmaxsize                          constraints. In Proc. VLDB, 2008.
upper bound derived in this paper. PEL can be easily                        [12] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang.
plugged into algorithms that store prefixes in an inverted                       Efficient similarity joins for near-duplicate detection.
list index (e.g., AllPairs, ppjoin, or mpjoin). For these algo-                  ACM TODS, 36(3):15, 2011.
rithms, PEL will effectively reduce the number of list entries
that must be processed. This reduces the overall lookup time
in the inverted list index at the cost of a potentially larger
candidate set. We analyzed this trade-off for foreign joins
and self joins. Our empirical evaluation demonstrated that