<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A computational model for MapReduce job flow</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tommaso Di Noia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marina Mongiello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Di Sciascio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria Elettrica e Dell'informazione Politecnico di Bari Via E. Orabona</institution>
          ,
          <addr-line>4 - 70125 BARI</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>335</fpage>
      <lpage>340</lpage>
      <abstract>
        <p>Massive quantities of data are today processed using parallel computing frameworks that parallelize computations on large distributed clusters consisting of many machines. Such frameworks are adopted in big data analytic tasks as recommender systems, social network analysis, legal investigation that involve iterative computations over large datasets. One of the most used framework is MapReduce, scalable and suitable for data-intensive processing with a parallel computation model characterized by sequential and parallel processing interleaving. Its open-source implementation - Hadoop - is adopted by many cloud infrastructures as Google, Yahoo, Amazon, Facebook. In this paper we propose a formal approach to model the MapReduce framework using model checking and temporal logics to verify properties of reliability and load balancingof the MapReduce job flow.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Introduction and motivation</p>
      <p>
        In this paper we propose a formal approach to model the MapReduce
framework using model checking and tempora l logics to verify some relevant properties
as reliability, load balancing, lack of deadlock of the MapReduce job flow. To
the best of our knowledge only two works have combined MapReduce and model
checking with a di↵erent aim from ours: in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] MapReduce is adopted to
compute distributed CTL algorithms and in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] MapReduce is modeled using CSP
formalism. The remaining of the paper is organized as follows. In Section 2 we
recall basics of model checking and temporal logics, the formalism used to define
and simulate our model. Section 3 provides a brief overview of the main features
of MapReduce. Section 4 proposes our formal model of job flow in Mapreduce
computation while Section 5 proposes an analytical model in the Uppaal model
checker language with properties to be checked. Conclusion and future works are
drawn in the last section.
2
      </p>
      <p>
        Model Checking and Temporal Logics
The logical language we use for the model checking task is the Computation
Tree Logic (CTL), a propositional, branching, temporal logic [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>The syntax of the formulae can be defined, using Backus-Naur form, as
follows (where p is an atomic proposition): , ::= p | ^ | _ | ¬ | !
| $ | EF | EX | EG | E(U ) | AG | AF | AX | A(U ). An
atomic proposition is the formula true or a ground atom CTL formulae can also
contain path quantifiers followed by temporal operators. The path quantifier E
specifies some path from the current state, while the path quantifier A specifies
all paths from the current state. The temporal operators are X, the neXt-state
operator; U , the Until operator; G, the Globally operator; and F the Future
operator. The symbols X, U , G, F cannot occur without being preceded by the
quantifiers E and A.</p>
      <p>
        The semantics of the language is defined through a Kripke structure as the
triple (S, ! , L) where S is a collection of states, ! is a binary relation on S ⇥ S,
stating that the system can move from state to state. Associated with each state
s, the interpretation function L provides the set of atomic propositions L(s) that
are true at that particular state [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The semantics of boolean connectives is as
usual. The semantics for temporal connectives is as follows: X specifies that
holds in the next state along the path. U specifies that holds on every
state along the path until is true. G specifies that holds on every state
along the path. F specifies that there is at least one state along the path in
which is true. The semantics of formulae is defined as follows: EX : holds
in some next state; EF : a path exists such that holds in some Future state ;
EG : a path exists such that holds Globally along the path; E(U ): a path
exists such that Until holds on it; AX : holds in every next state; AF :
for All paths there will be some Future state where holds; AG : for All paths
the property holds Globally; A(U ): All paths satisfy Until . The model
checking problem is the following: Given a model M , an initial state s and a CTL
formula , check whether M, s |= . M |= ehen the formula must be checked
for every state of M . 336
3
      </p>
      <p>
        MapReduce Overview and proposed model
MapReduce is a software framework for solving large-scale computing problems
over large data-sets and data-intensive computing. It has grown to be the
programming model for current distributed systems, i.e. cloud computing. It also
forms the basis of the data-center software stack [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        MapReduce framework was developed at Google Research as a parallel
programming model with an associated implementation. The framework is highly
scalable and location independent. It is used for the generation of data for
Google’s production web search service, for sorting, for data-intensive
applications, for optimizing parallel jobs performance in data-intensive clusters. The
most relevant feature of MapReduce processing is that computation runs on a
large cluster of commodity machines [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; while the main feature with respect
to other existing parallel computational models is the sequential and parallel
computation interleaving.
      </p>
      <p>
        The MapReduce model is made up of the Map and Reduce functions, which
are borrowed from functional languages such as Lisp [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Users’ computations are
written as Map and Reduce functions. The Map function processes a key/value
pair to generate a set of intermediate key/value pairs. The Reduce function
merges all intermediate values associated with the same intermediate key.
Intermediate functions of Shu✏e and Sorting are useful to split and sort the data
chunks to be given in input to the Reduce function. We now define the
computational model for MapReduce framework. We model jobs and tasks in MapReduce
using the flow description as shown in Figure 1.
      </p>
      <p>Definition 1 (MapReduce Graph (MRG)). A MapReduce Graph (MRG) is
a Direct Acyclic Graph G = {N, E}, where nodes N in the computation graph are
the tasks of computation – N = M [ S [ SR [ R (M = map, S = shuf f le, SR =
sort, R = reduce) – and edges e 2 E are such that:
1. E ✓ (M ⇥ S) [ (S ⇥ SR) [ (SR ⇥ R), i.e. “edges connect map with shu✏e,
shu✏e with sort and sort with reduce tasks”;
2. e 2 M ⇥ S breaks input into tokens;
e 2 S ⇥ SR sorts input tokens by type;
e 2 SR ⇥ R gives sort tokens to reducer.
3. Reduce sends input data for cluster allocation to the file system
Definition 2 (MapReduce Task). A MapReduce task t is a token of
computation such that t 2 (M [ S [ SR [ R).</p>
      <p>Definition 3 (MapReduce Job). A MapReduce Job is the sequence t1 !
t2 . . . ! tn of MapReduce tasks where</p>
      <p>tj1 = (Mi, .., Mi+p) ! tj2 = Sk ! tj3 = SRk ! tj4 = Rk
with Mi 2</p>
      <p>M, i = 1 . . . n, Sj 2 S, SRj 2 SR, Rj 2 R, j = 1 . . . m.</p>
      <p>Uppaal simulation model
In this Section we briefly describe the implemented model in the Uppaal2 model
checker’s formalism. Uppaal model is described in XML data format of model
checker description language and shown in the graphical interface of the tool as
a graph model.</p>
      <p>The analytical model is made up of three templates: job, mapper and Google
File System (GFS). Figure 2 shows only the Uppaal graph model of the job
described using the statechart notation. Main states of the job template are Map,
Shu✏e , Sort and Reduce as defined in the theoretical model in Section 3. Other
states manage the flow of the job: Starvation models the state in which the task
waits for unavailable resource, Middle state manages exceptions, receiving input
and receiving keyval manage data during job flow. Finally medium level checks
input data for the Shu✏e state. Mapper template is made up of states: Prevision
that checks the behavior of the mappers working for the given job. The Prevision
state is followed by Error and out of order state in case of wrong behavior of
the mapper, or Work state in case of correct behavior of the mapper. Finally
the GFS template only manages the job execution.</p>
      <p>To test the validity of the approach we simulated the model by instantiating a
given number of jobs with relative mappers. We checked the two main properties
of MapReduce framework, i.e. load balancing and fault tolerance.</p>
      <p>Load Balancing. This property checks that the load of the Job J will be
distributed to all tasks of the Mapper Mi with given map and reduce tasks.
Hence when the job enters the state Map all the previson state of the mapper
are veryfied, this means that the load is balanced between all the mappers.</p>
      <p>EG(J.M ap) ^ AG(Mi.P revision)</p>
      <p>Fault Tolerance. If the Mi map is out of service the job mapper schedules
an alternative task to perform the missed function that belongs to remaining
2 http://www.uppaal.org/
Mi. In the simulation model, mappercount is the counter of the total number of
mappers and mappersched is the variable that counts the scheduled mappers.
The assigned value, m is the number of mappers.</p>
      <p>EF (M.Out of order) ^ AG(mapper count = m ^ mapper sched = m)
From the simulation results we checked that the model ensures load balancing
and fault tolerance.
5</p>
      <p>Conclusion and future work
We proposed a formal model of MapReduce framework for data-intensive and
computing-intensive environment. The model introduces the definition of
MapReduce graph and MapReduce job and task. At this stage of the work we
implemented and simulated the model with Uppaal model checker to verify basics
properties of its computation as fault tolerance, load balancing and lack of
deadlock. We are currently modeling other relevant features as scalability, data
locality and extending the model with advanced job management activities such
as job folding and job chaining.</p>
      <p>We acknowledge support of project “A Knowledge based Holistic Integrated
Research Approach” (KHIRA - PO3N3902 00563 3446857).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Je</surname>
          </string-name>
          <article-title>↵rey Dean and Sanjay Ghemawat</article-title>
          . Mapreduce:
          <article-title>Simplified data processing on large clusters, OSDI04: Sixth symposium on operating system design and implementation</article-title>
          , san francisco, ca, december,
          <year>2004</year>
          . S. Dill,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>McCurley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajagopalan</surname>
          </string-name>
          , D. Sivakumar, ad
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomkins</surname>
          </string-name>
          ,
          <article-title>Self-similarity in the Web</article-title>
          ,
          <string-name>
            <surname>Proc</surname>
            <given-names>VLDB</given-names>
          </string-name>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Feng</given-names>
            <surname>Guo</surname>
          </string-name>
          , Guang Wei, Mengmeng Deng, and
          <string-name>
            <given-names>Wanlin</given-names>
            <surname>Shi</surname>
          </string-name>
          .
          <article-title>Ctl model checking algorithm using mapreduce</article-title>
          .
          <source>In Emerging Technologies for Information Systems, Computing, and Management</source>
          , pages
          <fpage>341</fpage>
          -
          <lpage>348</lpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.R.A.</given-names>
            <surname>Huth</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.D.</given-names>
            <surname>Ryan</surname>
          </string-name>
          . Logic in Computer Science. Cambridge University Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Krishna</given-names>
            <surname>Kant</surname>
          </string-name>
          .
          <article-title>Data center evolution: A tutorial on state of the art, issues, and challenges</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>53</volume>
          (
          <issue>17</issue>
          ):
          <fpage>2939</fpage>
          -
          <lpage>2965</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Edmund</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Clarke</surname>
          </string-name>
          , Orna Grumberg, and
          <string-name>
            <surname>Doron</surname>
            <given-names>A. Peled. Model</given-names>
          </string-name>
          <string-name>
            <surname>Checking</surname>
          </string-name>
          . Cambridge, Massachusetts, USA: MIT press.,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Fan</given-names>
            <surname>Yang</surname>
          </string-name>
          , Wen Su, Huibiao Zhu, and
          <string-name>
            <given-names>Qin</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Formalizing mapreduce with csp</article-title>
          .
          <source>In Engineering of Computer Based Systems (ECBS)</source>
          ,
          <year>2010</year>
          17th IEEE International Conference and Workshops on, pages
          <fpage>358</fpage>
          -
          <lpage>367</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>