<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MapReduce Frameworks: Comparing Hadoop and HPCC</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Fier</string-name>
          <email>fabian.fier@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Höfer</string-name>
          <email>eva.hoefer@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann-Christoph Freytag</string-name>
          <email>freytag@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universität zu Berlin, Institut für Informatik</institution>
          ,
          <addr-line>Unter den Linden 6, 10099 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>MapReduce and Hadoop are often used synonymously. For optimal runtime performance, Hadoop users have to consider various implementation details and configuration parameters. When conducting performance experiments with Hadoop on different algorithms, it is hard to choose a set of such implementation optimizations and configuration options which is fair to all algorithms. By fair we mean default configurations and automatic optimizations provided by the execution system which ideally do not require manual intervention. HPCC is a promising alternative open source implementation of MapReduce. We show that HPCC provides sensible default configuration values allowing for fairer experimental comparisons. On the other hand, we show that HPCC users still have to consider implementing optimizations known from Hadoop.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In our research, we use MapReduce and its implementation Hadoop to
experimentally compare the runtimes of various scalable algorithms. During these
experiments, we identified practical issues that make a fair comparison and
experimental reproducibility hard. Hadoop offers many configuration parameters that
influence the runtimes of programs, such as the number of Reducers or whether
data compression is to be used between Map and Reduce. Furthermore, there are
numerous possible optimizations in Hadoop programs, such as custom datatypes
and very efficient byte-wise comparators. It is possible to "tweak" every
implementation with a certain set of options and implementation optimizations. The
same set of options and optimizations can lead to poor execution times for the
implementations of different algorithms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The literature barely discusses these
configuration parameters and implementation details, although they are crucial
for the validity of experimental results.
      </p>
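      <p>To illustrate, the two parameters named above can be set per job via the
standard Hadoop 2.x API. The following sketch uses placeholder values (8 Reducers,
Snappy compression) rather than settings from our experiments:</p>
      <preformat><![CDATA[
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

// Sketch: setting two of the runtime-relevant parameters discussed above.
// The concrete values are placeholders, not recommendations.
public class JobSetup {
    public static Job configuredJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate Map output before it is shuffled to the Reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "similarity-join-step");
        // Number of Reduce instances; picking a value fair to all algorithms
        // is exactly the hard part discussed above.
        job.setNumReduceTasks(8);
        return job;
    }
}
]]></preformat>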
      <p>
        Another promising open source MapReduce implementation is HPCC (High
Performance Computing Cluster) from LexisNexis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Unlike Hadoop, HPCC
hides many configuration details from the user. We are interested in whether HPCC
might replace Hadoop in our context and whether it allows for a fairer
comparison when implementing MapReduce algorithms in it. As a running example,
we describe a common MapReduce-based textual similarity join algorithm. We
introduce our implementation of this algorithm in Hadoop and its pitfalls
concerning system configuration and code details. We compare this implementation
to our corresponding implementation in HPCC and discuss our findings.
      </p>
      <p>The paper is structured as follows. Section 2 describes HPCC and ECL.
Section 3 contains the textual similarity join problem and the MapReduce-based
approach we use as a running example. Section 4 introduces and compares our
implementations of the algorithm in Hadoop and HPCC. The last section sums
up our findings and gives an outlook on future research on this topic.</p>
    </sec>
    <sec id="sec-2">
      <title>HPCC and ECL</title>
      <p>
        HPCC is an open source parallel distributed system for compute- and
data-intensive computations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It contains a distributed file system. The main user
interface of HPCC is ECL (Enterprise Control Language). ECL follows a
dataflow-oriented programming paradigm and has declarative components. The following
example code (adapted from https://aws.hpccsystems.com/aws/code-samples/)
computes a word count:
      </p>
      <preformat>
// Define record structure "WordLayout" consisting of string "word":
WordLayout := RECORD
  STRING word;
END;

// Read the given dataset into the variable "wordsDS", applying WordLayout:
wordsDS := DATASET([{'HPCC'}, .., {'ANALYTICS'}], WordLayout);

// Define record structure "WordCountLayout" consisting of "word" and "wordCount".
// The count is defined by COUNT(GROUP), which implies that this record
// structure is to be applied to a grouped dataset:
WordCountLayout := RECORD
  wordsDS.word;
  wordCount := COUNT(GROUP);
END;

// Apply WordCountLayout to dataset wordsDS, grouping by "word":
wordCountTable := TABLE(wordsDS, WordCountLayout, word);
      </preformat>
      <p>
        ECL allows users to incorporate user-defined first-order functions written in C++
or Java [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These functions can be called from ECL functions. The semantics
of the Map operator is represented by the ECL function PROJECT, which applies
a user-defined function to each record of a given dataset. Reduce semantics
can be emulated by partitioning and distributing the data with DISTRIBUTE,
sorting it locally with SORT, and running a user-defined function on each group
with ROLLUP. Although HPCC was not originally designed for the MapReduce
programming paradigm, it is straightforward to adapt a MapReduce program to
ECL. Thus, we regard HPCC as an alternative implementation of MapReduce.
      </p>
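      <p>For comparison, a minimal Hadoop word count corresponding to the ECL example
above might look as follows. This is a generic sketch against the standard Hadoop
MapReduce API, not code from our experiments:</p>
      <preformat><![CDATA[
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for each word; Reduce: sum the counts per word.
public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
]]></preformat>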
    </sec>
    <sec id="sec-3">
      <title>Textual Similarity Join</title>
      <p>
        This section describes the textual similarity join problem and outlines the
algorithmic approach from Vernica et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to compute it. We subsequently use this
algorithm as an example to compare Hadoop to HPCC.
      </p>
      <p>The textual all-pairs similarity join is a common operation that detects
similar pairs of objects. Objects can be strings, sets, or multisets. The
similarity is defined by similarity functions such as Cosine or Jaccard similarity.
Applications of this join are near-duplicate removal, document clustering, or
plagiarism detection. Without loss of generality, we assume a self-join on sets.</p>
      <p>Definition 1 (Similarity Join). Given a collection of sets S, a similarity
function sim, and a user-defined threshold θ, a similarity join finds all pairs with a
similarity above the threshold: {⟨s₁, s₂⟩ | sim(s₁, s₂) ≥ θ, s₁ ∈ S, s₂ ∈ S, s₁ ≠ s₂}.</p>
      <p>A naive approach to computing this join is to compare each possible pair
of objects. Due to its quadratic complexity, it is not feasible even for small
datasets. More advanced approaches use a filter-and-verification framework. The
framework consists of two steps. The first step computes candidate pairs, which
are a superset of the result set. Due to the use of filters, the candidate set is much
smaller than the cross product (assuming that a majority of pairs of objects of S
is not similar). The second step computes the actual similarity for each candidate
pair to verify if its similarity is above θ.</p>
      <p>
        A prominent filter-and-verification MapReduce-based algorithm for set
similarity joins is the VernicaJoin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The main filtering idea is to compare only short
prefixes of two objects to generate candidate pairs. Given a similarity function,
a threshold θ, and an object length |s|, we can compute a prefix length. It can
be shown that two objects can only be similar if they have an overlap of at least
1 in their prefixes. One optimization is to sort the words in the objects by their
global frequency in ascending order. This ensures that the prefixes contain only
the least frequent words, which reduces the number of candidate pairs.
      </p>
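      <p>For concreteness, consider the common instantiation for Jaccard similarity,
a standard bound from the prefix-filtering literature which we state here as an
assumption since the formula is not spelled out above: an object s can contribute
a result pair only via its first |s| − ⌈θ·|s|⌉ + 1 tokens.</p>
      <preformat><![CDATA[
// Hedged sketch: the standard prefix-length bound for Jaccard similarity with
// threshold theta (from the prefix-filtering literature; assumed, not quoted
// from this paper). Tokens are assumed sorted by ascending global frequency.
public class PrefixBound {
    static int jaccardPrefixLength(int setSize, double theta) {
        // Two sets can only reach Jaccard >= theta if they overlap within
        // their first |s| - ceil(theta * |s|) + 1 tokens.
        return setSize - (int) Math.ceil(theta * setSize) + 1;
    }
}
]]></preformat>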
      <p>For our experimental comparison of similarity join algorithms, we adapted
VernicaJoin to use already integer-tokenized input (instead of raw string input)
and to output ID pairs of similar objects (instead of string pairs). These changes
enable us to compare this algorithm to others. In the following, we describe our
implementations of this adapted algorithm.</p>
    </sec>
    <sec id="sec-4">
      <title>Comparison of Implementations</title>
      <p>In this section, we introduce our implementations of the previously described
algorithm in Hadoop and HPCC. We discuss the most runtime-relevant details
concerning implementation and configuration. Due to space restrictions, we refer
to the upcoming full version of this paper for experimental results.</p>
      <p>The implementations consist of three steps. In the first step, we compute the
global token frequency. In the second step, we sort the tokens in each object by
this frequency and replicate each object for each token in its prefix. We group
all objects by their prefix tokens and verify for each pair in this group if it meets
the threshold θ. The third step removes duplicates.</p>
      <p>[Figure 1 (placeholder): Data flows of the first step (global token frequency
computation). Hadoop: read input, Map with local Combine, Reduce. HPCC:
PROJECT/NORMALIZE to read the input; local SORT and ROLLUP to add frequencies
locally; hash-partitioning by word followed by SORT and ROLLUP to add frequencies
globally; final sort by frequency.]</p>
      <p>[Figure 2 (placeholder): Data flows of the second and third steps. Hadoop:
Prefix Map and Verification Reduce (Job 3) producing a result with duplicates,
then Dedup Map and Dedup Reduce (Job 4). HPCC: PROJECT/NORMALIZE on the input
dataset (sorted tokens from the previous step); SORT and ROLLUP to sort tokens by
global frequency; NORMALIZE to replicate by prefix; SORT and ROLLUP to compute
the join for each partition locally; deduplication; output dataset.]</p>
      <p>Figure 1 shows the data flows for the first step of the implementations. The
number of Map instances can be influenced, for example, by manually changing the
input split size parameter of Hadoop. The distributed file system of HPCC splits
the data at object borders and evenly distributes it amongst the available compute
nodes. If the subsequent operator can operate on independent data chunks, it is
executed on each data split.</p>
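      <p>As an illustration, the split size, and with it the number of Map instances,
can be capped per job via the standard Hadoop API; the 64 MB value below is a
placeholder, not our experimental setting:</p>
      <preformat><![CDATA[
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: capping the input split size so that more, smaller splits
// (and thus more Map instances) are created.
public class SplitSizeSetup {
    public static void capSplitSize(Job job) {
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
]]></preformat>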
      <p>In Hadoop, we manually set the number of Reduce instances. The default
number is 1. If it is set too low, the computing nodes are under-utilized. If
it is set too high, resources like main memory or the network get overloaded. An
optimal value is usually application- or even data-dependent. HPCC handles
this parallelization implicitly.</p>
      <p>Figure 2 shows the data flows for the second and third step of the
implementations. Consider the data example in the right column. We read the input, sort
the tokens by their frequency in ascending order, and replicate each object for
each token in the prefix. We illustrate the prefix with bold numbers in the
input. Since this replication can be computed on independent data partitions, we
use two boxes for the resulting partitions. We group all records with the same
tokens and compute their pairwise similarity. If two records share more than one
common token in the prefix, the similarity of this pair is computed more than
once. In the last step, we deduplicate the result.</p>
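      <p>The pairwise verification itself reduces to an overlap computation. The following
is our hedged sketch for integer-tokenized objects held as sorted arrays; the names
and the array representation are assumptions, not the paper's code:</p>
      <preformat><![CDATA[
// Jaccard similarity of two integer-token arrays via a merge-style scan.
// Assumes both arrays are duplicate-free and sorted by the same total order
// (e.g., the global-frequency order used above).
public class Verification {
    static double jaccard(int[] a, int[] b) {
        int i = 0, j = 0, overlap = 0;
        // Count |a ∩ b| with a single pass over both arrays.
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { overlap++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        // Jaccard = |a ∩ b| / |a ∪ b|.
        return (double) overlap / (a.length + b.length - overlap);
    }
}
]]></preformat>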
          <p>
            The Map in Hadoop Job 3 uses a setup function, which initially reads the
word frequencies from the first step. It sorts the words in each object according
to their global frequency and computes the prefix length. For each word in the
prefix, it outputs the key-value pair ⟨⟨word, objectLength⟩, object⟩. Note that the
key consists of two integers, the word and the length. As proposed by Vernica
et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we use a combined key consisting of the word and the length of the
containing record. The word in the key partitions the data. The length is used
additionally for the local sort on the Reduce side. The Reducer retrieves the objects
ordered by length. Since it performs a local nested-loop join, we can prune locally
buffered objects which cannot be similar anymore to any subsequent object due
to the length difference. We implemented a custom partitioner, sorter, and
grouper for this. This approach has an impact on the runtime. If the number of
locally buffered objects exceeds memory boundaries, the computation becomes
slow. In ECL, we implement the same approach by sorting the records by length
within each partition. For each partition, we run a user-defined Java function
that computes the similarity join locally. As in Hadoop, the user-defined function
is stateful and can cause memory overflow.
          </p>
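          <p>A hedged sketch of this composite-key machinery follows; the type and member
names are ours, since the paper's code is not published here:</p>
          <preformat><![CDATA[
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key (word, objectLength). The full sort order is (word, length),
// so each Reducer receives the objects for a word ordered by length.
public class WordLengthKey implements WritableComparable<WordLengthKey> {
    int word;    // integer token from the prefix
    int length;  // length of the containing object

    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(word);
        out.writeInt(length);
    }
    @Override public void readFields(DataInput in) throws IOException {
        word = in.readInt();
        length = in.readInt();
    }
    @Override public int compareTo(WordLengthKey o) {
        int c = Integer.compare(word, o.word);
        return c != 0 ? c : Integer.compare(length, o.length);
    }

    // Partition on the word only, so all replicas for a prefix token meet
    // at the same Reducer regardless of object length.
    public static class WordPartitioner extends Partitioner<WordLengthKey, Writable> {
        @Override public int getPartition(WordLengthKey key, Writable value, int numPartitions) {
            return (key.word & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the word only: one reduce() call sees all objects sharing the
    // word, in length order (via compareTo above), enabling length pruning.
    public static class WordGroupingComparator extends WritableComparator {
        public WordGroupingComparator() { super(WordLengthKey.class, true); }
        @Override public int compare(WritableComparable a, WritableComparable b) {
            return Integer.compare(((WordLengthKey) a).word, ((WordLengthKey) b).word);
        }
    }
}
]]></preformat>
          <p>Registering these with job.setPartitionerClass and job.setGroupingComparatorClass
yields the described behavior; the key's compareTo provides the secondary sort by
length on the Reduce side.</p>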
    </sec>
    <sec id="sec-5">
      <title>Summary, Future Work</title>
      <p>We were interested in how HPCC might be an alternative to Hadoop as an
execution platform to allow for a fairer comparison when implementing MapReduce
algorithms. Using the VernicaJoin to implement a textual similarity join, we
showed that a complex MapReduce algorithm can be adapted to HPCC in a
straightforward way. HPCC takes away some configuration details from the user,
like the parallelization degree (number of Map and Reduce instances). However,
ECL still requires its users to carefully partition data so that intermediate buffers
do not get overloaded. It is also necessary to explicitly implement optimizations
such as local Combines. We plan to investigate further the influence of memory
configuration on runtime. Especially in Hadoop, it is usually not clear to the
user how its memory-related parameters impact performance. Furthermore, we
plan to adapt this textual similarity join approach to use even more native ECL
functions rather than user-defined "black box" code. This opens optimization
possibilities which can potentially be integrated into the HPCC system.</p>
      <p>Acknowledgements. This work was supported by the Humboldt Elsevier
Advanced Data and Text (HEADT) Center.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Babu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Towards Automatic Optimization of MapReduce Programs</article-title>
          .
          <source>In: Proceedings of the 1st ACM Symposium on Cloud Computing. ACM</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Middleton</surname>, <given-names>A. M.</given-names></string-name>,
          <string-name><surname>Bayliss</surname>, <given-names>D. A.</given-names></string-name>, and
          <string-name><surname>Halliday</surname>, <given-names>G.</given-names></string-name>:
          <article-title>ECL/HPCC: A Unified Approach to Big Data</article-title>.
          In: <string-name><surname>Furht</surname>, <given-names>B.</given-names></string-name> and
          <string-name><surname>Escalante</surname>, <given-names>A.</given-names></string-name> (eds.):
          <source>Handbook of Data Intensive Computing</source>, pp.
          <fpage>59</fpage>-<lpage>107</lpage>. Springer, New York (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Vernica</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Carey</surname>, <given-names>M. J.</given-names></string-name>, and
          <string-name><surname>Li</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Efficient Parallel Set-Similarity Joins Using MapReduce</article-title>.
          <source>In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data</source>, pp.
          <fpage>495</fpage>-<lpage>506</lpage>. ACM (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>