=Paper=
{{Paper
|id=Vol-1882/paper10
|storemode=property
|title=Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework
|pdfUrl=https://ceur-ws.org/Vol-1882/paper10.pdf
|volume=Vol-1882
|authors=Fabian Fier
|dblpUrl=https://dblp.org/rec/conf/vldb/Fier17
}}
==Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework==
Fabian Fier (supervised by Johann-Christoph Freytag)
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
fier@informatik.hu-berlin.de

ABSTRACT

Motivated by increasing dataset sizes, various MapReduce-based similarity join algorithms have emerged. In our past work (to appear), we compared nine of the most prominent algorithms experimentally. Surprisingly, we found that their runtimes become prohibitively long for only moderately large datasets. There are two main reasons. First, data grouping and replication between Map and Reduce relies on input data characteristics such as word distribution. A skewed distribution, as is common for textual data, leads to data groups with very unequal computation costs, causing Straggling Reducer issues. Second, each Reduce instance only has limited main memory. Data spilling also leads to Straggling Reducers. In order to leverage parallelization, all approaches we investigated rely on high replication and hit this memory limit even with relatively small input data. In this work, we propose an initial approach toward a join framework that overcomes both of these issues. It includes a cost-based grouping and replication strategy which is robust against large data sizes and various data characteristics such as skew. Furthermore, we propose an addition to the MapReduce programming paradigm. It unblocks the Reduce execution by running Reducers on partial intermediate datasets, allowing for arbitrarily large datasets between Map and Reduce.

1. INTRODUCTION

Similarity joins are an important operation for user recommendations, near-duplicate detection, or plagiarism detection. They compute similar pairs of objects, such as strings, sets, multisets, or more complex structures. Similarity is expressed by similarity (or distance) functions such as Jaccard, Cosine, or Edit distance. A naive approach to compute a similarity self-join is to build the cross product over an input dataset and filter out all non-similar pairs. This approach has a quadratic runtime. In the literature, there are various non-distributed, non-parallelized approaches for similarity joins based on a two-phase approach [1, 2, 3, 10, 14]. They compute a set of candidate pairs which is usually orders of magnitude smaller than the cross product. Subsequently, they verify whether the candidates are similar. We refer to them as filter-and-verification approaches. Motivated by increasing dataset sizes, MapReduce-based distributed approaches have emerged [5, 12, 13]. We conducted an extensive experimental study on nine current MapReduce-based set similarity join algorithms on textual data (to appear). There are two key findings. First, we compared the runtime of the MapReduce join algorithms to the runtime of competing non-distributed algorithms from the recent experimental survey of Mann et al. [11]. The runtime of MapReduce join algorithms on small datasets is inferior to the runtime of non-distributed approaches. This is not surprising due to the MapReduce overhead. The second finding is that none of the approaches can compute the join on larger (or even arbitrarily large) datasets. The runtimes increase so drastically that we terminated the executions after a long timeout.

We identified two main reasons for these runtime issues on large datasets. First, for every MapReduce-based similarity join algorithm we investigated, we found non-optimal input datasets that lead to only a few running join Reduce instances while all other instances were left idle. That is, we often observed Straggling Reducers.

Figure 1: Straggling Reducer Issue (number of busy compute instances over time).
Figure 1 shows the compute instance usage of a non-optimal join execution on a cluster of 48 compute instances. After roughly half the execution time, only a few instances are used. The instance usage is directly connected to data grouping and replication between Map and Reduce. All algorithms under investigation exploit, and thus rely on, certain data characteristics for replication and grouping. The most relevant characteristics are the global token frequency of the input dataset and the number of tokens in each record. Stop words, which occur in a majority of records of a dataset, cause skewed data groups within most join approaches we investigated. As a second cause, we identified memory overload within Reduce instances. All approaches heavily replicate data to leverage parallelization. The original MapReduce programming paradigm as introduced by Dean et al. [4] requires the Reduce instances to wait for the Map steps to finish before the intermediate data groups are sorted and grouped by key. When the Reduce buffers are filled, data is spilled to disk, often causing high runtime penalties. The use of Combiners is not possible for similarity joins, because Reducers are stateful. This limitation is inherent to standard MapReduce.

In this paper, we propose an initial approach toward a robust framework to compute distributed similarity joins. It overcomes the Straggling Reducer issues and the input dataset size limitation we experienced in our past experiments. Our approach is twofold. First, we find a grouping and replication strategy which distributes compute load evenly over the existing compute instances. This is challenging since it is not sufficient to generate data groups of equal size: the runtime of a join computation within one group depends on characteristics of the data in the group, such as record lengths. Second, we enable MapReduce to handle large intermediate datasets by proposing an extension for MapReduce which unblocks the Reduce execution based on statistical information gathered in a preprocessing step.

The idea of load balancing in MapReduce based on statistics is not new. The TopCluster algorithm [8] is an online approach which includes cardinality estimations at runtime. Our approach, on the other hand, needs exact data statistics in order to unblock the Reduce execution. These statistics have to be collected before the join execution. Our approach is comparable to the one by Kolb et al. [9], which involves a preprocessing MapReduce job to collect data statistics and a join job which uses the statistics for an optimal data grouping and replication. We extend this approach by using the knowledge of the group sizes to unblock the Reduce execution. Furthermore, we tailor the grouping and replication to the specific problem of set similarity joins.

The contributions of this paper are as follows:

• We propose a first approach toward a robust distributed similarity join framework.
• We define a robust grouping and replication strategy leading to evenly distributed compute loads amongst the available compute nodes.
• We extend the MapReduce programming paradigm to unblock Reduce execution to handle (potentially arbitrarily) large datasets.

The structure of the paper is as follows. In Section 2, we give an overview of the similarity join problem and algorithmic approaches, and motivate the need for research with the runtime issues we experienced in our past experiments. In Section 3, we introduce our approach for a robust join framework and its interaction with an extension of MapReduce to unblock Reduce execution. In Section 4, we conclude our work and give an outlook on future work.

2. BACKGROUND

Without loss of generality, we use the set similarity self-join as a running example. Our framework can be applied to other filter-and-verification-based similarity joins as well. The set similarity join computes all pairs of similar sets (s1, s2) within a set of records S. A similarity function sim(s1, s2) expresses the similarity between two records. For sets, there are similarity functions such as Jaccard, Cosine, or Dice. The user chooses a threshold t above which two sets are considered similar. Formally, given a set S, a similarity function sim(s1, s2), and a similarity threshold t, the set similarity join computes the set $\{(s_1, s_2) \in S \times S \mid sim(s_1, s_2) \geq t, s_1 \neq s_2\}$.
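This definition maps directly onto a quadratic pairwise algorithm. The following minimal Python sketch (our illustration, not code from the paper) instantiates it for the set similarity self-join, using Jaccard as the example similarity function; the record IDs, tokens, and threshold are made up for the example.

```python
from itertools import combinations

def jaccard(s1, s2):
    """Jaccard similarity: intersection size divided by union size."""
    return len(s1 & s2) / len(s1 | s2)

def set_similarity_self_join(records, t):
    """Computes {(s1, s2) in S x S | sim(s1, s2) >= t, s1 != s2}."""
    return [(id1, id2)
            for (id1, s1), (id2, s2) in combinations(records.items(), 2)
            if jaccard(s1, s2) >= t]

# Integer-tokenized example records, threshold t = 0.7.
records = {1: {1, 2, 3, 4, 5, 6, 7}, 2: {1, 2, 3, 4, 5, 6, 8}, 3: {9, 10}}
print(set_similarity_self_join(records, 0.7))  # -> [(1, 2)]; Jaccard 6/8 = 0.75
```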
Such a naive approach, which computes the similarity on all pairs (s1, s2), has a quadratic runtime and is not feasible even for small datasets. In the literature, filter-and-verification approaches emerged. Their basic idea is to generate an (inverted) index over all input records. For each postings list, they compute the cross product (half of it, to be exact, in the self-join case) and the union of all these cross products. Each distinct record ID pair in the union is a candidate pair, because the two records contain at least one common token. These candidate pairs are then verified to compute the end result. Sophisticated filtering techniques keep the indexes and the number of candidate pairs small. The most prominent filter is the prefix filter [1, 2, 3]. Given a record length, a similarity function, and a similarity threshold, the prefix length is the minimum number of tokens of a record which need to be indexed to guarantee an overlap of at least one common token with every record it is similar to.
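To make filter-and-verification concrete, here is a hedged sketch of prefix-based candidate generation. It assumes the standard Jaccard prefix length |r| - ceil(t * |r|) + 1 from the prefix-filter literature [1, 2, 3] and a global token order (here simply the integer order of the tokens); the function names are ours, and this is a simplified reading rather than any particular algorithm from the paper.

```python
import math
from collections import defaultdict
from itertools import combinations

def prefix_length(record_len, t):
    """Standard prefix length for Jaccard threshold t: indexing this many
    tokens guarantees one common token with every similar record."""
    return record_len - math.ceil(t * record_len) + 1

def candidate_pairs(records, t):
    """Index only record prefixes, then union the cross products of all
    postings lists; the result is a superset of the true join result."""
    index = defaultdict(list)                    # token -> record IDs
    for rid, tokens in records.items():
        ordered = sorted(tokens)                 # assumed global token order
        for token in ordered[:prefix_length(len(ordered), t)]:
            index[token].append(rid)
    return {pair for postings in index.values()
                 for pair in combinations(postings, 2)}
```

Each candidate pair still has to be verified with the full similarity computation; the filter only bounds how many candidates reach that step.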
Motivated by increasing dataset sizes, MapReduce-based versions of the filter-and-verification approach emerged [5, 12, 13]. The main idea is identical to the non-distributed approaches: compute an inverted index, compute the cross product on each postings list, and verify the resulting candidate pairs. The inverted index is built as follows. A Map step computes key-value pairs with a token or a more complex signature as key. The MapReduce framework groups key-value pairs with the same key to one Reduce instance. This instance computes the cross product on the postings list. Depending on the value of the key-value pair (all tokens of the input record vs. only the record ID), the verification takes place within the Reduce, or further MapReduce steps join the original records to the candidate pairs for the verification.

The key generation of all algorithms known to us relies on characteristics of the input data. In the most basic algorithm [7], each token in the input record is used as key. Obviously, the number of record groups is then equal to the number of distinct tokens in the input dataset. The size of each record group depends on the global frequency of its key token, and the data replication depends on the record lengths. For sufficiently large datasets with stop words (tokens which occur in almost every record) and/or many long records, the Straggling Reducer effect occurs. More sophisticated approaches use a prefix filter, which reduces the number of tokens for replication to a prefix, which is shorter than the record length but still dependent on it. The use of such filters shifts the Straggling Reducer issue to larger datasets and/or datasets with longer records, but does not solve it for arbitrarily large datasets.
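To make the skew problem concrete, the following sketch mimics the basic token-as-key scheme [7], with the shuffle phase simulated locally; the function names and the local grouping are our own illustration, not the paper's or Hadoop's API.

```python
from collections import defaultdict
from itertools import combinations

def map_record(rid, tokens):
    """Map: emit one key-value pair per token; the token is the key."""
    return [(token, rid) for token in tokens]

def reduce_postings(postings):
    """Reduce: cross product over one postings list yields candidate pairs."""
    return list(combinations(sorted(postings), 2))

def run_job(records):
    """Simulated shuffle: group map output by key, then reduce per group."""
    groups = defaultdict(list)
    for rid, tokens in records.items():
        for token, value in map_record(rid, tokens):
            groups[token].append(value)
    # A stop word contained in nearly every record yields one postings list
    # of nearly |S| entries, i.e. one Reduce group with quadratic cost while
    # other groups stay tiny: the Straggling Reducer effect described above.
    return {token: reduce_postings(p) for token, p in groups.items()}
```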
We expect the input of the similarity join to be text, which is integer-tokenized by a preprocessing step. The tokenization may include changing letter cases, stemming, or stop word removal. Depending on the preprocessing, the properties of input datasets vary by token distribution (stop words, infrequent tokens), dictionary size, and record size. The token distribution of textual data is usually Zipfian, which means that there are a few very frequent tokens. This is a challenge for approaches relying on the token distribution.

3. APPROACH

Figure 2: Dataflow Graph of our Execution Framework (statistics MapReduce job feeding a partition/replication Map and join Reduce).

In Figure 2, we illustrate the dataflow of our framework. The first step computes exact data statistics, namely record length frequencies and global token frequencies. These statistics can be computed in linear time and are highly parallelizable. Furthermore, this step estimates runtime costs for the join execution, based on data samples with differing average record lengths. The second step computes the actual join. Every Map instance obtains the statistics from the first step via a setup function which is called once before the input data is read. Based on these statistics, it determines a suitable data grouping and replication and assigns keys to its output accordingly. Each join Reducer also obtains the statistics via the setup function. Using the statistics, it can compute the exact size of each group and start computing the join on a group once all data for it has completely arrived. This can happen before all Mappers have finished their execution. Note that this requires a change to the original MapReduce: the Reduce-side shuffling periodically counts the occurrences of each key in its input and triggers the execution of the user-defined Reduce function once one of the groups is complete. The Reducer can run any existing state-of-the-art non-distributed similarity join.

In the following, we describe how to find a suitable grouping and replication based on the statistics. We use the Jaccard similarity function as an example, because it is the most commonly used function in the literature. Our framework is also applicable to any other set-based similarity function. Jaccard is defined as the intersection divided by the union of two records, $\frac{|a \cap b|}{|a \cup b|}$. Note that records with differing lengths can be similar. Figure 3 shows this length relationship for a similarity threshold of 0.7: for each record length on the y axis, it shows on the x axis which record lengths have to be considered as join candidates.

Figure 3: Possible Similar Record Lengths for Jaccard and Similarity Threshold 0.7.

Let us assume the input has a length distribution as depicted in Figure 4.

Figure 4: Example Record Length Distribution (number of records per record length).

In order to obtain data groups which can be self-joined independently, we group together all records with the same length and replicate each group to all larger length groups it can be similar to. The resulting groups would be {1}, {2}, {3, 4}, {4, 5, 6, 7}, {5, 6, 7, 8}, {6, 7, 8, 9, 10}, etc. Note that these groups have very uneven cardinalities, for example |{1}| = 8,000 and |{6, 7, 8, 9, 10}| = 688,000.

In order to distribute the cardinalities evenly, we propose to apply a hash-based grouping and replication on these groups. Figure 5 shows an example for a hashing factor of 4: a hash function assigns each record to one of 4 groups, and each record is distributed 4 times, so that it joins each other record in exactly one of the squares in the figure.

Figure 5: Hash-Based Grouping and Replication with hashing factor h=4.

Note that there is a tradeoff related to the hashing factor. If it is very low, there are only a few large groups and the replication is low. If it is high, there are many small groups and the replication is high.
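A minimal sketch of such a key assignment follows, under two assumptions of ours: the candidate length range for Jaccard threshold t is taken from the standard length filter (lengths up to floor(l / t)) rather than read off Figure 3, and deduplication across overlapping length groups is left out. `replication_keys` is a hypothetical helper, not the paper's Map code.

```python
import math

def candidate_lengths(length, t):
    """Larger record lengths that can still reach Jaccard >= t with a record
    of the given length (standard length filter bound: l .. floor(l / t))."""
    return range(length, math.floor(length / t) + 1)

def replication_keys(rid, length, t, h):
    """Composite keys under which one record is emitted: one length group per
    candidate length, times h hash cells (cf. Figure 5). Records from two
    different hash groups g_r, g_s meet in exactly the cell
    (min(g_r, g_s), max(g_r, g_s)); records from the same group g should be
    joined only in the diagonal cell (g, g)."""
    g = hash(rid) % h
    cells = {(min(g, j), max(g, j)) for j in range(h)}  # h cells per record
    return [(lg, i, j) for lg in candidate_lengths(length, t)
                       for (i, j) in sorted(cells)]
```

With h = 4, each record is emitted into 4 hash cells per length group, matching the replication factor illustrated in Figure 5.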
An even data distribution is not sufficient to prevent Straggling Reducer effects: the cost of joining a partition with long records is higher than the cost of joining a partition with equally many short records. Let us assume that the Reducer of the join step has a quadratic runtime, which represents the worst case. The runtime cost of computing a self-join on one group of records with cardinality groupSize and an average record length of avgRecLen can then be estimated with Equation 1, assuming that the tokens in the records are sorted by a global token order allowing for a merge-join:

$\binom{groupSize}{2} \cdot 2 \cdot avgRecLen$   (1)

In Figure 6, we show a plot of this cost estimation function. The costs grow quadratically with the number of records in the group and linearly with the average record length.

Figure 6: Cost Estimation for one data group (costs per group over number of records per group and average record length).

In order to avoid a Straggling Reducer effect, our aim is to find a data grouping and replication which at least limits the maximum compute cost over all groups, or ideally imposes equal computation costs for each data group. In Figure 6, equal computation costs would occur if all groups exhibited a combination of number of records and average record length lying on an intersection of the cost surface with a horizontal plane.

Our idea is to optimize the overall computation costs with the hashing factor h as variable (Equation 2), under the constraint that the computation cost of each group may not be larger than a maximum cost threshold m, which ensures that no Reducer gets overloaded:

$\min_{h \in \mathbb{N}^+} \sum_{group} costs(group, h) \quad \text{subject to} \quad costs(group, h) \leq m$   (2)

The group-wise costs within this equation could either be estimated by Equation 1, or one might use runtimes on sampled data from the statistics MapReduce step.
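The following sketch instantiates Equation 1 and one simple way to satisfy the constraint of Equation 2: increase h until the most expensive hash cell stays below the cost threshold m. The even split of each length group across hash cells is our simplifying assumption, as are the example cardinalities and average lengths.

```python
import math

def group_cost(group_size, avg_rec_len):
    """Equation 1: binom(groupSize, 2) pairs, each verified by a merge-join
    over roughly 2 * avgRecLen tokens."""
    return math.comb(group_size, 2) * 2 * avg_rec_len

def smallest_feasible_h(length_groups, m, h_max=4096):
    """Smallest hashing factor h for which no hash cell exceeds the cost
    threshold m. A cell joining two hash groups holds about 2 * size / h
    records if the hash function splits each length group evenly."""
    for h in range(1, h_max + 1):
        if all(group_cost(math.ceil(2 * size / h), avg_len) <= m
               for size, avg_len in length_groups):
            return h
    return None  # no feasible h within the search bound

# Example with the group cardinalities from Section 3 (lengths assumed).
length_groups = [(8_000, 1), (688_000, 8)]
print(smallest_feasible_h(length_groups, m=10**10))
```

Since replication grows with h, the smallest feasible h also keeps the total replicated data volume low; a full optimizer for Equation 2 would additionally compare total costs across feasible values of h.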
4. CONCLUSIONS, FUTURE WORK

In this paper, we introduced a first approach toward a distributed similarity join framework which is robust against arbitrary input dataset sizes and data characteristics such as skew. We plan to work it out in detail, implement it, and run experiments with it. One crucial detail is to ensure that a sufficient number of record groups is complete: if a Reduce instance collects only incomplete groups, straggling will still occur. Another open detail is the choice of the hash function for the join. Grouping and replication strategies from existing MapReduce-based similarity join approaches could be integrated into the proposed strategy. Especially signature-creating approaches like MassJoin [5] and sophisticated grouping strategies like MRGroupJoin [6], which uses the pigeonhole principle, are promising. In future experiments, we are especially interested in the tradeoff between replication and group size. Furthermore, it is interesting whether it pays off to use empirical runtime statistics for the join costs or to simply estimate the runtime analytically.

5. ACKNOWLEDGMENTS

This work was supported by the Humboldt Elsevier Advanced Data and Text (HEADT) Center.

6. REFERENCES

[1] A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 918–929. VLDB Endowment, 2006.
[2] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, pages 131–140. ACM, 2007.
[3] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), page 5. IEEE, 2006.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[5] D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. MassJoin: A MapReduce-based method for scalable string similarity joins. In Proceedings of the 30th International Conference on Data Engineering (ICDE), pages 340–351. IEEE, 2014.
[6] D. Deng, G. Li, H. Wen, and J. Feng. An efficient partition based method for exact set similarity joins. Proceedings of the VLDB Endowment, 9(4):360–371, 2015.
[7] T. Elsayed, J. Lin, and D. W. Oard. Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 265–268. Association for Computational Linguistics, 2008.
[8] B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in MapReduce based on scalable cardinality estimates. In Proceedings of the 28th International Conference on Data Engineering (ICDE), pages 522–533. IEEE, 2012.
[9] L. Kolb, A. Thor, and E. Rahm. Load balancing for MapReduce-based entity resolution. In Proceedings of the 28th International Conference on Data Engineering (ICDE), pages 618–629. IEEE, 2012.
[10] G. Li, D. Deng, J. Wang, and J. Feng. PASS-JOIN: A partition-based method for similarity joins. Proceedings of the VLDB Endowment, 5(3):253–264, 2011.
[11] W. Mann, N. Augsten, and P. Bouros. An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment, 9(9):636–647, 2016.
[12] A. Metwally and C. Faloutsos. V-SMART-Join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors. Proceedings of the VLDB Endowment, 5(8):704–715, 2012.
[13] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 495–506. ACM, 2010.
[14] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3):15, 2011.