=Paper=
{{Paper
|id=Vol-1882/paper10
|storemode=property
|title=Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework
|pdfUrl=https://ceur-ws.org/Vol-1882/paper10.pdf
|volume=Vol-1882
|authors=Fabian Fier
|dblpUrl=https://dblp.org/rec/conf/vldb/Fier17
}}
==Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework==
<pdf width="1500px">https://ceur-ws.org/Vol-1882/paper10.pdf</pdf>
<pre>
             Distributed Similarity Joins on Big Textual Data:
                Toward a Robust Cost-Based Framework

                                                                 Fabian Fier
                                               Supervised by Johann-Christoph Freytag
                                                   Humboldt-Universität zu Berlin
                                                         Unter den Linden 6
                                                       10099 Berlin, Germany
                                                    fier@informatik.hu-berlin.de

ABSTRACT                                                                               45
                                                                                       40
                                                                                       35
                                                                                       30


                                                                          #Instances
Motivated by increasing dataset sizes, various MapReduce-                              25
                                                                                       20
                                                                                       15
based similarity join algorithms have emerged. In our past                             10
                                                                                        5
work (to appear), we compared nine of the most prominent                                0
                                                                                                              Time
algorithms experimentally. Surprisingly, we found that their
runtimes become inhibitively long for only moderately large                                 Figure 1: Straggling Reducer Issue.
datasets. There are two main reasons. First, data grouping
and replication between Map and Reduce relies on input
data characteristics such as word distribution. A skewed                  similarity joins based on a two-phase approach [1, 2, 3, 10,
distribution as it is common for textual data leads to data               14]. They compute a set of candidate pairs which is usually
groups which reveal very unequal computation costs, leading               orders of magnitudes smaller than the cross product. Sub-
to Straggling Reducer issues. Second, each Reduce instance                sequently, they verify if the candidates are similar. We refer
only has limited main memory. Data spilling also leads to                 to them as filter-and-verification approaches. Motivated by
Straggling Reducers. In order to leverage parallelization,                increasing dataset sizes, MapReduce-based distributed ap-
all approaches we investigated rely on high replication and               proaches have emerged [5, 12, 13]. We conducted an exten-
hit this memory limit even with relatively small input data.              sive experimental study on nine current MapReduce-based
In this work, we propose an initial approach toward a join                set similarity join algorithms on textual data (to appear).
framework to overcome both of these issues. It includes                   There are two key findings. First, we compared the runtime
a cost-based grouping and replication strategy which is ro-               of the MapReduce join algorithms to the runtime of compet-
bust against large data sizes and various data characteristics            ing non-distributed algorithms from the recent experimental
such as skew. Furthermore, we propose an addition to the                  survey of Mann et al. [11]. The runtime of MapReduce join
MapReduce programming paradigm. It unblocks the Re-                       algorithms on small datasets is inferior to the runtime of
duce execution by running Reducers on partial intermedi-                  non-distributed approaches. This is not surprising due to
ate datasets, allowing for arbitrarily large data sets between            the MapReduce overhead. The second finding is that none
Map and Reduce.                                                           of the approaches can compute the join on larger (or even
                                                                          arbitrarily large) datasets. The runtimes increase so drasti-
1.    INTRODUCTION                                                        cally that we terminated the executions after a long timeout.
  Similarity joins are an important operation for user rec-                  We identified two main reasons for these runtime issues
ommendations, near-duplicate detection, or plagiarism de-                 on large datasets. First, for every MapReduce-based simi-
tection. They compute similar pairs of objects, such as                   larity join algorithm we investigated we found non-optimal
strings, sets, multisets, or more complex structures. Simi-               input datasets that lead to only a few running join Reduce
larity is expressed by similarity (or distance) functions such            instances while all other instances were left idle. That is,
as Jaccard, Cosine, or Edit. A naive approach to compute                  we often observed Straggling Reducers. Figure 1 shows the
a similarity self-join is to build the cross product over an              compute instance usage of a non-optimal join execution on
input dataset and filter out all non-similar pairs. This ap-              a cluster of 48 compute instances. After roughly half the
proach has a quadratic runtime. In the literature, there                  execution time, only a few instances are used. The instance
are various non-distributed non-parallelized approaches for               usage is directly connected to data grouping and replication
                                                                          between Map and Reduce. All algorithms under investiga-
                                                                          tion exploit and thus rely on certain data characteristics for
                                                                          replication and grouping. The most relevant characteristics
                                                                          are the global token frequency of the input dataset and the
                                                                          number of tokens in each record of a dataset. Stop words,
                                                                          which occur in a majority of records of a dataset, cause
                                                                          skewed data groups within most join approaches we inves-
Proceedings of the VLDB 2017 PhD Workshop, August 28, 2017. Munich,       tigated. As second cause, we identified memory overload
Germany.
Copyright (C) 2017 for this paper by its authors. Copying permitted for   within Reduce instances. All approaches heavily replicate
private and academic purposes.                                            data to leverage parallelization. The original MapReduce
programming paradigm as introduced by Dean et al. [4] re-          sets, there are similarity functions such as Jaccard, Cosine,
quires the Reduce instances to wait for the Map steps to           or Dice. The user chooses a threshold t above which two
finish before the intermediate data groups are sorted and          sets are considered being similar. Formally, given a set S,
grouped by key. When the Reduce buffers are filled, data           a similarity function sim(s1 , s2 ), and a similarity thresh-
is spilled to disk, often causing high runtime penalties. The      old t, the set similarity join computes the set {(s1 , s2 ) ∈
use of Combiners is not possible for similarity joins, because     S × S|sim(s1 , s2 ) ≥ t, s1 6= s2 }.
Reducers are stateful. This limitation is inherent to stan-           A naive approach computes the similarity on all pairs
dard MapReduce.                                                    (s1 , s2 ). Since it has a quadratic runtime, it is not feasi-
   In this paper, we propose an initial approach toward a          ble even for small datasets. In the literature, filter-and-
robust framework to compute distributed similarity joins.          verification approaches emerged. Their basic idea is to gen-
It overcomes the Straggling Reducer issues and the input           erate an (inverted) index over all input records. For each
dataset size limitation we experienced in our past experi-         postings list, they compute the cross product (half of it in
ments. Our approach is twofold. First, we find a group-            the self-join case to be exact) and the union of all these
ing and replication strategy which distributes compute load        cross products. Each distinct record ID pair in the union is
evenly over the existing compute instances. This is challeng-      a candidate pair, because the two records contain at least
ing since it is not sufficient to generate data groups of equal    one common token. These candidate pairs are further veri-
size. The runtime of a join computation within one group is        fied to compute the end result. Sophisticated filtering tech-
dependent on characteristics of the data in the group such         niques keep the indexes and the number of candidate pairs
as record lengths. Second, we enable MapReduce to handle           small. The most prominent filter is the prefix filter [1, 2, 3].
large intermediate datasets by proposing an extension for          Given a record length, a similarity function, and a similar-
MapReduce which unblocks the Reduce execution based on             ity threshold, the prefix length is the minimum number of
statistical information gathered in a preprocessing step.          tokens which need to be indexed to guarantee an overlap of
   The idea of load balancing in MapReduce based on statis-        at least one common token if it is similar to another record.
tics is not new. The TopCluster algorithm [8] is an online            Motivated by increasing dataset sizes, MapReduce-based
approach which includes cardinality estimations at runtime.        versions of the filter-and-verification approach emerged [5,
Our approach on the other hand needs exact data statistics         12, 13]. The main idea is identical to the non-distributed
in order to unblock the Reduce execution. These statistics         approaches. It is to compute an inverted index, to com-
have to be collected before the join execution. Our approach       pute the cross product on each postings list, and to verify
is comparable to the one by Kolb et al. [9], which involves a      the resulting candidate pairs. The inverted index is built
preprocessing MR job to collect data statistics and a join job     as follows. A Map step computes key-value pairs with a to-
which uses the statistics for an optimal data grouping and         ken or a more complex signature as key. The MapReduce
replication. We extend this approach by using the knowlege         framework groups key-value pairs with the same key to one
of the group sizes to unblock the Reduce execution. Further-       Reduce instance. This instance computes the cross product
more, we tailor the grouping and replication to the specific       on the postings list. Depending on the value of the key-value
problem of set similarity joins.                                   pair (all tokens of the input record vs. only the record ID),
   The contributions of this paper are as follows:                 the verification takes place within the Reduce, or there are
                                                                   further MapReduce steps to join the original records to the
     • We propose a first approach toward a robust distributed     candidate pairs for the verification.
       similarity join framework.                                     The key generation of all algorithms known to us relies
                                                                   on characteristics of the input data. In the most basic algo-
     • We define a robust grouping and replication strategy
                                                                   rithm [7], each token in the input record is used as key. Obvi-
       leading to evenly distributed compute loads amongst
                                                                   ously, the number of record groups is equal to the number of
       the available compute nodes.
                                                                   distinct tokens in the input dataset. The size of each record
     • We extend the MapReduce programming paradigm to             group depends on the global frequency of its key token. The
       unblock Reduce execution to handle (potentially arbi-       data replication is dependent on the record lengths. For
       trarily) large datasets.                                    sufficiently large datasets with stop words (tokens which oc-
                                                                   cur in almost every record) and/ or many long records, the
  The structure of the paper is as follows. In Section 2, we       Straggling Reducer effect occurs. More sophisticated ap-
give an overview on the similarity join problem, algorithmic       proaches use a prefix filter, which reduces the number of
approaches, and motivate the need for research with the            tokens for replication to a prefix, which is shorter than the
runtime issues we experienced in our past experiments. In          record length, but still dependent on it. The use of such
Section 3, we introduce our approach for a robust join frame-      filters shifts the Straggling Reducer issue to larger datasets
work and its interaction with an extension of MapReduce to         and/ or datasets with longer records, but does not solve it
unblock Reduce execution. In Section 4, we conclude our            for arbitrarily large datasets.
work and give an outlook on future work.                              We expect the input of the similarity join to be text, which
                                                                   is integer-tokenized by a preprocessing step. The tokeniza-
2.    BACKGROUND                                                   tion may include changing letter cases, stemming, or stop
                                                                   word removal. Depending on the preprocessing, the proper-
   Without loss of generality, we use the set similarity self-
                                                                   ties of input datasets vary by token distribution (stop words,
join as a running example. Our framework can be applied
                                                                   infrequent tokens), dictionary size, and record size. The to-
to other filter-and-verification-based similarity joins as well.
                                                                   ken distribution of textual data is usually Zipfian, which
The set similarity join computes all pairs of similar sets
                                                                   means that there are few very frequent tokens. This is a
(s1 , s2 ) within a set of records S. A similarity function
                                                                   challenge for approaches relying on token distribution.
sim(s1 , s2 ) expresses the similarity between two records. For
                                                                                                                                         Possible Similar Record Lengths
3.     APPROACH                                                                                           1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
                                                                                                      1
   In Figure 2, we illustrate the dataflow of our framework.                                          2
                                                                                                      3
The first step computes exact data statistics. It computes                                            4
                                                                                                      5
record length frequencies and global token frequencies. These                                         6
                                                                                                      7
statistics can be computed in linear time and are highly par-


                                                                               Input Record Length
                                                                                                      8
                                                                                                      9
allelizable. Furthermore, it estimates runtime costs for the                                         10
join execution, based on data samples with differing aver-                                           11
                                                                                                     12
age record lengths. The second step computes the actual                                              13
                                                                                                     14
join. Every Map instance obtains the statistics from the                                             15
                                                                                                     16
first step via a setup function which is called once before the                                      17
                                                                                                     18
input data is read. Based on these statistics, it determines                                         19
                                                                                                     20
a suitable data grouping and replication and assigns keys                                            21
                                                                                                     22
to its output accordingly. Each join Reducer also obtains                                            23
the statistics via the setup function. Using the statistics,
it can compute the exact size of each group and start com-                    Figure 3: Possible Similar Record Lengths for Jaccard and
puting the join on this group once all data for it has com-                   Similarity Threshold 0.7.
pletely arrived. This can happen before all Mappers have
finished their execution. Note that this requires a change
in the original MapReduce. The Reduce-side shuffling peri-                                           160000
odically counts the occurrences of each key in its input. It
                                                                                                     140000
triggers the execution of the first-order function once one of
the groups is complete. The Reducer can run any existing                                             120000
state-of-the art non-distributed similarity join.                                                    100000


                                                                              Number of Records
   In the following, we describe how to find a suitable group-
                                                                                                      80000
ing and replication based on the statistics. We use the Jac-
card similarity function as an example, because it is the                                             60000
most commonly used function in the literature. Our frame-                                             40000
work is also applicable to any other set-based similarity func-
tion. Jaccard is defined by the intersection divided by the                                           20000
                         |a∩b|
union of two records |a∪b|     . Note that records with differing                                             0
                                                                                                                   2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60
lengths can be similar. Figure 3 shows this length rela-                                                                                                                                         57 59
                                                                                                                                                   Length
tionship for a similarity threshold of 0.7. For each record
length on the y axis, it shows on the x axis, which record                                                   Figure 4: Example Record Length Distribution.
lengths have to be considered as join candidates. Let us
assume the input has a length distribution as depicted in
Figure 4. In order to obtain data groups which can be self-
joined independently, we group together all records with the                                                                              1       2      3      4
same length and replicate each group to all larger length                                                                             1
groups it can be similar to. The resulting groups would be
                                                                                                                                      2
{1}, {2}, {3, 4}, {4, 5, 6, 7}, {5, 6, 7, 8}, {6, 7, 8, 9, 10} etc. Note
that these groups have very uneven cardinalities, for exam-                                                                           3
ple |{1}| = 8.000, |{6, 7, 8, 9, 10}| = 688.000 etc.
   In order to distribute the cardinalities evenly, we propose                                                                        4
to apply a hash-based grouping and replication on these
groups. Figure 5 shows an example for a hashing factor
of 4. A hashing function assigns each record to one of 4                      Figure 5: Hash-Based Grouping and Replication with hash-
groups. Each record is distributed 4 times, so that it joins                  ing factor h=4.
each other record in exactly one of the squares in the figure.
Note that there is a tradeoff related to the hashing factor.                     An even data distribution is not sufficient to prevent Strag-
If it is very low, there are only few large groups and the                    gling Reducer effects. The costs of joining a partition with
replication is low. If it is high, there are many small groups                long records is higher than the cost of joining a partition
and the replication is high.                                                  with equally many short records. Let us assume that the
                                                                              Reducer of the join step has a quadratic runtime, which rep-
                                                                              resents the worst case. The runtime costs of computing a
                                                                              self-join on one group of records with cardinality groupSize
                                                                              and with an average record length of avgRecLen can be es-
                                                   Setup                      timated with Equation 1, assuming that the tokens in the
      Input
                                            Partition
                                                            Join
                                                                              records are sorted by a global token order allowing for a
               Statistics     Statistics   Replication              Output
     Dataset     Map          Reduce         Map
                                                           Reduce   Dataset   merge-join. In Figure 6, we show a plot of this cost esti-
                          Sta-
                        tistics                       Join                    mation function. It showsSeite
                                                                                                          that1 the costs grow exponentially
                                                                              with regard to the number of records in the group. The
                                                                              power of the increase grows exponentially with regard to
 Figure 2: Dataflow Graph of our Execution Framework.                         the average length of the records in the group. In order
                                                                                                  6.   REFERENCES
                                                                                                   [1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact
                                                                                                       set-similarity joins. In Proceedings of the 32nd
Costs/Group


                                                                                                       international conference on Very large data bases,
                                                                                                       pages 918–929. VLDB Endowment, 2006.
                                                                                                   [2] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all
                                                                                                       pairs similarity search. In Proceedings of the 16th
                                                                                      100
                                                                                                       international conference on World Wide Web, pages
                                                                                 80          up        131–140. ACM, 2007.
                                                                                          ro
                                                                             60         /G
                                                                                   r ds            [3] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive
              20000                                                             co
                         40000                                       40 Re
                      Numb                                                h
                                                                            of                         operator for similarity joins in data cleaning. In Data
                          er of R            60000                      t
                                  ecord   s/Grou               20 e  ng                                Engineering, 2006. ICDE’06. Proceedings of the 22nd
                                                 p   80000          L
                                                                  g
                                                               Av                                      International Conference on, pages 5–5. IEEE, 2006.
               Figure 6: Cost Estimation for one data group.                                       [4] J. Dean and S. Ghemawat. Mapreduce: Simplified
                                                                                                       data processing on large clusters. OSDI, pages
                                                                                                       137–150, 2004.
 to avoid a Straggling Reducer effect, our aim is to find a                                        [5] D. Deng, G. Li, S. Hao, J. Wang, and J. Feng.
 data grouping and replication which at least limits the max-                                          Massjoin: A mapreduce-based method for scalable
 imum compute costs over all groups or ideally imposes equal                                           string similarity joins. In Data Engineering (ICDE),
 computation costs for each data group. In Figure 6, equal                                             2014 IEEE 30th International Conference on, pages
 computation costs would occur if all groups would exhibit a                                           340–351. IEEE, 2014.
 combination of number of records and average record lengths                                       [6] D. Deng, G. Li, H. Wen, and J. Feng. An efficient
 on an intersection of the graph with a horizontal plane.                                              partition based method for exact set similarity joins.
                                        !                                                              Proceedings of the VLDB Endowment, 9(4):360–371,
                               groupSize                                                               2015.
                                          ∗ 2 ∗ avgRecLen                         (1)              [7] T. Elsayed, J. Lin, and D. W. Oard. Pairwise
                                   2
                                                                                                       document similarity in large collections with
   Our idea is to optimize the overall computation costs with                                          mapreduce. In Proceedings of the 46th Annual Meeting
 the hashing factor h as variable (Equation 2) and the con-                                            of the Association for Computational Linguistics on
 straint that the computation cost of each group may not be                                            Human Language Technologies: Short Papers, pages
 larger than the maximum cost threshold m, which ensures                                               265–268. Association for Computational Linguistics,
 that no Reducer gets overloaded.                                                                      2008.
                         X                                                                         [8] B. Gufler, N. Augsten, A. Reiser, and A. Kemper.
                min             costs(group, h), costs(group, h) ≤ m              (2)                  Load balancing in mapreduce based on scalable
               h∈N+
                       group                                                                           cardinality estimates. In Data Engineering (ICDE),
   The group-wise costs within this equation could either be                                           2012 IEEE 28th International Conference on, pages
 estimated by Equation 1 or it might use runtimes on sampled                                           522–533. IEEE, 2012.
 data from the statistics MapReduce step.                                                          [9] L. Kolb, A. Thor, and E. Rahm. Load balancing for
                                                                                                       mapreduce-based entity resolution. In Data
                                                                                                       Engineering (ICDE), 2012 IEEE 28th International
 4.           CONCLUSIONS, FUTURE WORK                                                                 Conference on, pages 618–629. IEEE, 2012.
    In this paper, we introduced a first approach toward a dis-                                   [10] G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A
 tributed similarity join framework which is robust against                                            partition-based method for similarity joins.
 arbitrary input dataset sizes and data characteristics such                                           Proceedings of the VLDB Endowment, 5(3):253–264,
 as skew. We plan to detail it out, implement it and run ex-                                           2011.
 periments with it. One crucial detail is to ensure that there
                                                                                                  [11] W. Mann, N. Augsten, and P. Bouros. An empirical
 is a sufficient number of record groups which is complete. If
                                                                                                       evaluation of set similarity join techniques. Proceedings
 a Reduce instance collects only non-complete groups, strag-
                                                                                                       of the VLDB Endowment, 9(9):636–647, 2016.
 gling will still occur. Another open detail is the choice of
 the hash function for the join. Grouping and replication                                         [12] A. Metwally and C. Faloutsos. V-smart-join: A
 strategies from existing MapReduce-based similarity join ap-                                          scalable mapreduce framework for all-pair similarity
 proaches could be integrated in the proposed strategy. Es-                                            joins of multisets and vectors. Proceedings of the
 pecially signature creating approaches like MassJoin [5] and                                          VLDB Endowment, 5(8):704–715, 2012.
 sophisticated grouping strategies like MRGroupJoin [6] us-                                       [13] R. Vernica, M. J. Carey, and C. Li. Efficient parallel
 ing the pigeonhole principle are promising.                                                           set-similarity joins using mapreduce. In Proceedings of
    In future experiments, we are especially interested in the                                         the 2010 ACM SIGMOD International Conference on
 tradeoff between replication and group size. Furthermore, it                                          Management of data, pages 495–506. ACM, 2010.
 is interesting if it pays off to use empirical runtime statistics                                [14] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang.
 for the join costs or simply estimate the runtime analytically.                                       Efficient similarity joins for near-duplicate detection.
                                                                                                       ACM Transactions on Database Systems (TODS),
 5.           ACKNOWLEDGMENTS                                                                          36(3):15, 2011.
   This work was supported by the Humboldt Elsevier Ad-
 vanced Data and Text (HEADT) Center.

</pre>