<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MapReduce Frameworks: Comparing Hadoop and HPCC</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabian Fier</string-name>
          <email>fabian.fier@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Höfer</string-name>
          <email>eva.hoefer@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johann-Christoph Freytag</string-name>
          <email>freytag@informatik.hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universität zu Berlin, Institut für Informatik</institution>
          ,
          <addr-line>Unter den Linden 6, 10099 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>MapReduce and Hadoop are often used synonymously. For optimal runtime performance, Hadoop users have to consider various implementation details and configuration parameters. When conducting performance experiments with Hadoop on different algorithms, it is hard to choose a set of such implementation optimizations and configuration options which is fair to all algorithms. By fair we mean default configurations and automatic optimizations provided by the execution system which ideally do not require manual intervention. HPCC is a promising alternative open source implementation of MapReduce. We show that HPCC provides sensible default configuration values allowing for fairer experimental comparisons. On the other hand, we show that HPCC users still have to consider implementing optimizations known from Hadoop.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In our research, we use MapReduce and its implementation Hadoop to
experimentally compare the runtimes of various scalable algorithms. During these
experiments, we identified practical issues that make a fair comparison and
experimental reproducibility hard. Hadoop offers many configuration parameters that
influence the runtimes of programs, such as the number of Reducers or whether
data compression is to be used between Map and Reduce. Furthermore, there are
numerous possible optimizations in Hadoop programs, such as custom datatypes
and very efficient byte-wise comparators. It is possible to "tweak" every
implementation with a certain set of options and implementation optimizations. The
same set of options and optimizations can lead to poor execution times for the
implementations of different algorithms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The literature barely discusses these
configuration parameters and implementation details, although they are crucial
for the validity of experimental results.
      </p>
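      <p>To illustrate, the two parameters named above can be set per job via the
standard Hadoop 2.x API. The following sketch uses placeholder values (8 Reducers,
Snappy compression) rather than settings from our experiments:</p>
      <preformat><![CDATA[
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

// Sketch: setting two of the runtime-relevant parameters discussed above.
// The concrete values are placeholders, not recommendations.
public class JobSetup {
    public static Job configuredJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate Map output before it is shuffled to the Reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "similarity-join-step");
        // Number of Reduce instances; picking a value fair to all algorithms
        // is exactly the hard part discussed above.
        job.setNumReduceTasks(8);
        return job;
    }
}
]]></preformat>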
      <p>
        Another promising open source MapReduce implementation is HPCC (High
Performance Computing Cluster) from LexisNexis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Unlike Hadoop, HPCC
hides many configuration details from the user. We are interested in whether HPCC
might replace Hadoop in our context and whether it allows for a fairer
comparison when implementing MapReduce algorithms in it. As a running example,
we describe a common MapReduce-based textual similarity join algorithm. We
introduce our implementation of this algorithm in Hadoop and its pitfalls
concerning system configuration and code details. We compare this implementation
to our corresponding implementation in HPCC and discuss our findings.
      </p>
      <p>The paper is structured as follows. Section 2 describes HPCC and ECL.
Section 3 contains the textual similarity join problem and the MapReduce-based
approach we use as a running example. Section 4 introduces and compares our
implementations of the algorithm in Hadoop and HPCC. The last section sums
up our findings and gives an outlook on future research on this topic.</p>
    </sec>
    <sec id="sec-2">
      <title>HPCC and ECL</title>
      <p>
        HPCC is an open source parallel distributed system for compute- and
data-intensive computations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It contains a distributed file system. The main user
interface of HPCC is ECL (Enterprise Control Language). ECL follows a
dataflow-oriented programming paradigm and has declarative components. The following
example code (adapted from https://aws.hpccsystems.com/aws/code-samples/)
computes a word count:
      </p>
      <preformat>
// Define record structure "WordLayout" consisting of string "word":
WordLayout := RECORD
  STRING word;
END;

// Read the given dataset into the variable "wordsDS", applying WordLayout:
wordsDS := DATASET([{'HPCC'}, .., {'ANALYTICS'}], WordLayout);

// Define record structure "WordCountLayout" consisting of "word" and "wordCount".
// The count is defined by COUNT(GROUP), which implies that this record
// structure is to be applied to a grouped dataset:
WordCountLayout := RECORD
  wordsDS.word;
  wordCount := COUNT(GROUP);
END;

// Apply WordCountLayout to dataset wordsDS, grouping by "word":
wordCountTable := TABLE(wordsDS, WordCountLayout, word);
      </preformat>
      <p>
        ECL allows users to incorporate user-defined first-order functions written in C++
or Java [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These functions can be called from ECL functions. The semantics
of the Map operator is represented by the ECL function PROJECT, which applies
a user-defined function to each record of a given dataset. Reduce semantics
can be emulated by partitioning and distributing the data with DISTRIBUTE,
sorting it locally with SORT, and running a user-defined function on each group
with ROLLUP. Although HPCC was not originally designed for the MapReduce
programming paradigm, it is straightforward to adapt a MapReduce program to
ECL. Thus, we regard HPCC as an alternative implementation of MapReduce.
      </p>
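      <p>For comparison, a minimal Hadoop word count corresponding to the ECL example
above might look as follows. This is a generic sketch against the standard Hadoop
MapReduce API, not code from our experiments:</p>
      <preformat><![CDATA[
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for each word; Reduce: sum the counts per word.
public class WordCount {
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
}
]]></preformat>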
    </sec>
    <sec id="sec-3">
      <title>Textual Similarity Join</title>
      <p>
        This section describes the textual similarity join problem and outlines the
algorithmic approach from Vernica et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to compute it. We subsequently use this
algorithm as an example to compare Hadoop to HPCC.
      </p>
      <p>The textual all-pairs similarity join is a common operation that detects
similar pairs of objects. Objects can be strings, sets, or multisets. The
similarity is defined by similarity functions such as Cosine or Jaccard similarity.
Applications of this join are near-duplicate removal, document clustering, or
plagiarism detection. Without loss of generality, we assume a self-join on sets.</p>
      <p>Definition 1 (Similarity Join). Given a collection of sets S, a similarity
function sim, and a user-defined threshold θ, a similarity join finds all pairs with a
similarity above the threshold: {⟨s₁, s₂⟩ | sim(s₁, s₂) ≥ θ, s₁ ∈ S, s₂ ∈ S, s₁ ≠ s₂}.</p>
      <p>A naive approach to computing this join is to compare each possible pair
of objects. Due to its quadratic complexity, it is not feasible even for small
datasets. More advanced approaches use a filter-and-verification framework. The
framework consists of two steps. The first step computes candidate pairs, which
are a superset of the result set. Due to the use of filters, the candidate set is much
smaller than the cross product (assuming that a majority of pairs of objects of S
is not similar). The second step computes the actual similarity for each candidate
pair to verify if its similarity is above θ.</p>
      <p>
        A prominent filter-and-verification MapReduce-based algorithm for set
similarity joins is the VernicaJoin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The main filtering idea is to compare only short
prefixes of two objects to generate candidate pairs. Given a similarity function,
a threshold θ, and an object length |s|, we can compute a prefix length. It can
be shown that two objects can only be similar if they have an overlap of at least
1 in their prefixes. One optimization is to sort the words in the objects by their
global frequency in ascending order. This ensures that the prefixes contain only
the least frequent words, which reduces the number of candidate pairs.
      </p>
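      <p>For concreteness, consider the common instantiation for Jaccard similarity,
a standard bound from the prefix-filtering literature which we state here as an
assumption since the formula is not spelled out above: an object s can contribute
a result pair only via its first |s| − ⌈θ·|s|⌉ + 1 tokens.</p>
      <preformat><![CDATA[
// Hedged sketch: the standard prefix-length bound for Jaccard similarity with
// threshold theta (from the prefix-filtering literature; assumed, not quoted
// from this paper). Tokens are assumed sorted by ascending global frequency.
public class PrefixBound {
    static int jaccardPrefixLength(int setSize, double theta) {
        // Two sets can only reach Jaccard >= theta if they overlap within
        // their first |s| - ceil(theta * |s|) + 1 tokens.
        return setSize - (int) Math.ceil(theta * setSize) + 1;
    }
}
]]></preformat>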
      <p>For our experimental comparison of similarity join algorithms, we adapted
VernicaJoin to use already integer-tokenized input (instead of raw string input)
and to output ID pairs of similar objects (instead of string pairs). These changes
enable us to compare this algorithm to others. In the following, we describe our
implementations of this adapted algorithm.</p>
    </sec>
    <sec id="sec-4">
      <title>Comparison of Implementations</title>
      <p>In this section, we introduce our implementations of the previously described
algorithm in Hadoop and HPCC. We discuss the most runtime-relevant details
concerning implementation and configuration. Due to space restrictions, we refer
to the upcoming full version of this paper for experimental results.</p>
      <p>The implementations consist of three steps. In the first step, we compute the
global token frequency. In the second step, we sort the tokens in each object by
this frequency and replicate each object for each token in its prefix. We group
all objects by their prefix tokens and verify for each pair in this group if it meets
the threshold θ. The third step removes duplicates.</p>
      <p>[Figure 1 (placeholder): Data flows of the first step (global token frequency
computation). Hadoop: read input, Map with local Combine, Reduce. HPCC:
PROJECT/NORMALIZE to read the input; local SORT and ROLLUP to add frequencies
locally; hash-partitioning by word followed by SORT and ROLLUP to add frequencies
globally; final sort by frequency.]</p>
      <p>[Figure 2 (placeholder): Data flows of the second and third steps. Hadoop:
Prefix Map and Verification Reduce (Job 3) producing a result with duplicates,
then Dedup Map and Dedup Reduce (Job 4). HPCC: PROJECT/NORMALIZE on the input
dataset (sorted tokens from the previous step); SORT and ROLLUP to sort tokens by
global frequency; NORMALIZE to replicate by prefix; SORT and ROLLUP to compute
the join for each partition locally; deduplication; output dataset.]</p>
      <p>Figure 1 shows the data flows for the first step of the implementations. The
number of Map instances can be influenced, for example, by manually changing the
input split size parameter of Hadoop. The distributed file system of HPCC splits
the data at object borders and evenly distributes it amongst the available compute
nodes. If the subsequent operator can operate on independent data chunks, it is
executed on each data split.</p>
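      <p>As an illustration, the split size, and with it the number of Map instances,
can be capped per job via the standard Hadoop API; the 64 MB value below is a
placeholder, not our experimental setting:</p>
      <preformat><![CDATA[
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch: capping the input split size so that more, smaller splits
// (and thus more Map instances) are created.
public class SplitSizeSetup {
    public static void capSplitSize(Job job) {
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
]]></preformat>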
      <p>In Hadoop, we manually set the number of Reduce instances. The default
number is 1. If it is set too low, the computing nodes are under-utilized. If
it is set too high, resources like main memory or the network get overloaded. An
optimal value is usually application- or even data-dependent. HPCC handles
this parallelization implicitly.</p>
      <p>Figure 2 shows the data flows for the second and third step of the
implementations. Consider the data example in the right column. We read the input, sort
the tokens by their frequency in ascending order, and replicate each object for
each token in the prefix. We illustrate the prefix with bold numbers in the
input. Since this replication can be computed on independent data partitions, we
use two boxes for the resulting partitions. We group all records with the same
tokens and compute their pairwise similarity. If two records share more than one
common token in the prefix, the similarity of this pair is computed more than
once. In the last step, we deduplicate the result.</p>
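      <p>The pairwise verification itself reduces to an overlap computation. The following
is our hedged sketch for integer-tokenized objects held as sorted arrays; the names
and the array representation are assumptions, not the paper's code:</p>
      <preformat><![CDATA[
// Jaccard similarity of two integer-token arrays via a merge-style scan.
// Assumes both arrays are duplicate-free and sorted by the same total order
// (e.g., the global-frequency order used above).
public class Verification {
    static double jaccard(int[] a, int[] b) {
        int i = 0, j = 0, overlap = 0;
        // Count |a ∩ b| with a single pass over both arrays.
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { overlap++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        // Jaccard = |a ∩ b| / |a ∪ b|.
        return (double) overlap / (a.length + b.length - overlap);
    }
}
]]></preformat>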
          <p>
            The Map in Hadoop Job 3 uses a setup function, which initially reads the
word frequencies from the first step. It sorts the words in each object according
to their global frequency and computes the prefix length. For each word in the
prefix, it outputs the key-value pair ⟨⟨word, objectLength⟩, object⟩. Note that the
key consists of two integers, the word and the length. As proposed by Vernica
et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we use a combined key consisting of the word and the length of the
containing record. The word in the key partitions the data. The length is used
additionally for the local sort on the Reduce side. The Reducer retrieves the objects
ordered by length. Since it performs a local nested-loop join, we can prune locally
buffered objects which cannot be similar anymore to any subsequent object due
to the length difference. We implemented a custom partitioner, sorter, and
grouper for this. This approach has an impact on the runtime. If the number of
locally buffered objects exceeds memory boundaries, the computation becomes
slow. In ECL, we implement the same approach by sorting the records by length
within each partition. For each partition, we run a user-defined Java function
that computes the similarity join locally. As in Hadoop, the user-defined function
is stateful and can cause memory overflow.
          </p>
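          <p>A hedged sketch of this composite-key machinery follows; the type and member
names are ours, since the paper's code is not published here:</p>
          <preformat><![CDATA[
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key (word, objectLength). The full sort order is (word, length),
// so each Reducer receives the objects for a word ordered by length.
public class WordLengthKey implements WritableComparable<WordLengthKey> {
    int word;    // integer token from the prefix
    int length;  // length of the containing object

    @Override public void write(DataOutput out) throws IOException {
        out.writeInt(word);
        out.writeInt(length);
    }
    @Override public void readFields(DataInput in) throws IOException {
        word = in.readInt();
        length = in.readInt();
    }
    @Override public int compareTo(WordLengthKey o) {
        int c = Integer.compare(word, o.word);
        return c != 0 ? c : Integer.compare(length, o.length);
    }

    // Partition on the word only, so all replicas for a prefix token meet
    // at the same Reducer regardless of object length.
    public static class WordPartitioner extends Partitioner<WordLengthKey, Writable> {
        @Override public int getPartition(WordLengthKey key, Writable value, int numPartitions) {
            return (key.word & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group on the word only: one reduce() call sees all objects sharing the
    // word, in length order (via compareTo above), enabling length pruning.
    public static class WordGroupingComparator extends WritableComparator {
        public WordGroupingComparator() { super(WordLengthKey.class, true); }
        @Override public int compare(WritableComparable a, WritableComparable b) {
            return Integer.compare(((WordLengthKey) a).word, ((WordLengthKey) b).word);
        }
    }
}
]]></preformat>
          <p>Registering these with job.setPartitionerClass and job.setGroupingComparatorClass
yields the described behavior; the key's compareTo provides the secondary sort by
length on the Reduce side.</p>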
    </sec>
    <sec id="sec-5">
      <title>Summary, Future Work</title>
      <p>We were interested in how HPCC might be an alternative to Hadoop as an
execution platform to allow for a fairer comparison when implementing MapReduce
algorithms. Using the VernicaJoin to implement a textual similarity join, we
showed that a complex MapReduce algorithm can be adapted to HPCC in a
straightforward way. HPCC takes away some configuration details from the user,
like the parallelization degree (number of Map and Reduce instances). However,
ECL still requires its users to carefully partition data so that intermediate buffers
do not get overloaded. It is also necessary to explicitly implement optimizations
such as local Combines. We plan to investigate further the influence of memory
configuration on runtime. Especially in Hadoop, it is usually not clear to the
user how its memory-related parameters impact performance. Furthermore, we
plan to adapt this textual similarity join approach to use even more native ECL
functions rather than user-defined "black box" code. This opens optimization
possibilities which can potentially be integrated into the HPCC system.</p>
      <p>Acknowledgements. This work was supported by the Humboldt Elsevier
Advanced Data and Text (HEADT) Center.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Babu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Towards Automatic Optimization of MapReduce Programs</article-title>
          .
          <source>In: Proceedings of the 1st ACM Symposium on Cloud Computing. ACM</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Middleton</surname>, <given-names>A. M.</given-names></string-name>,
          <string-name><surname>Bayliss</surname>, <given-names>D. A.</given-names></string-name>, and
          <string-name><surname>Halliday</surname>, <given-names>G.</given-names></string-name>:
          <article-title>ECL/HPCC: A Unified Approach to Big Data</article-title>.
          In: <string-name><surname>Furht</surname>, <given-names>B.</given-names></string-name> and
          <string-name><surname>Escalante</surname>, <given-names>A.</given-names></string-name> (eds.):
          <source>Handbook of Data Intensive Computing</source>, pp.
          <fpage>59</fpage>-<lpage>107</lpage>. Springer, New York (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Vernica</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Carey</surname>, <given-names>M. J.</given-names></string-name>, and
          <string-name><surname>Li</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Efficient Parallel Set-Similarity Joins Using MapReduce</article-title>.
          <source>In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data</source>, pp.
          <fpage>495</fpage>-<lpage>506</lpage>. ACM (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>