=Paper=
{{Paper
|id=Vol-1800/paper5
|storemode=property
|title=Energy-efficient Mapping of Big Data Workflows under Deadline Constraints
|pdfUrl=https://ceur-ws.org/Vol-1800/paper5.pdf
|volume=Vol-1800
|authors=Tong Shu,Chase Wu
|dblpUrl=https://dblp.org/rec/conf/sc/ShuW16
}}
==Energy-efficient Mapping of Big Data Workflows under Deadline Constraints==
Tong Shu and Chase Q. Wu
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
{ts372, chase.wu}@njit.edu

WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah
*Copyright held by the author(s).

ABSTRACT

Large-scale workflows for big data analytics have become a main consumer of energy in data centers, where moldable parallel computing models such as MapReduce are widely applied to meet high computational demands with time-varying computing resources. The granularity of task partitioning in each moldable job of such big data workflows has a significant impact on energy efficiency, which remains largely unexplored. In this paper, we analyze the properties of moldable jobs and formulate a workflow mapping problem to minimize the dynamic energy consumption of a given workflow request under a deadline constraint. Since this problem is strongly NP-hard, we design a fully polynomial-time approximation scheme (FPTAS) for a special case with a pipeline-structured workflow on a homogeneous cluster and a heuristic for the generalized problem with an arbitrary workflow on a heterogeneous cluster. The performance superiority of the proposed solution in terms of dynamic energy saving and deadline missing rate is illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms.

Keywords: Big Data; Workflow Mapping; Green Computing

1 Introduction

Next-generation applications in science, industry, and business domains are producing colossal amounts of data, now frequently termed "big data", which must be analyzed in a timely manner for knowledge discovery and technological innovation. Among many practical computing solutions, workflows have been increasingly employed as an important technique for big data analytics, and consequently such big data workflows have become a main consumer of energy in data centers. Most existing efforts on green computing focused on independent MapReduce jobs and traditional workflows comprised of serial programs; the energy efficiency of big data workflows in Hadoop systems still remains largely unexplored.

Modern computing systems achieve energy saving mainly through two types of techniques: task consolidation, which reduces static energy consumption (SEC) by turning off idle servers, and load balancing, which reduces dynamic energy consumption (DEC) through dynamic voltage and frequency scaling (DVFS), or a combination of both. However, these techniques are not sufficient to address the energy efficiency issue of big data workflows because i) frequently switching a server on and off may reduce its lifespan or cause unnecessary peaks of energy consumption, and ii) DVFS may not always be available on all servers in a cluster. Therefore, we direct our efforts to workflow mapping for dynamic energy saving by adaptively determining the degree of parallelism in each MapReduce job to mitigate the workload overhead while meeting a given performance requirement.

Parallel jobs are generally categorized into three classes with flexibility from low to high: rigid jobs, exemplified by multi-threaded programs running on a fixed number of processors; moldable jobs, exemplified by MapReduce programs running on any number of processors decided prior to execution; and malleable jobs, running on a variable number of processors at runtime [11]. A moldable job typically follows a performance model where the workload of each component task decreases, and the total workload, proportional to DEC, increases, as the number of allotted processors increases [15]. The validity of this model has been verified by many real-life parallel programs in various big data domains, and it serves as a basis of our workflow mapping solution for energy saving of big data workflows.

In this paper, we construct analytical cost models and formulate a workflow mapping problem to minimize the DEC of a workflow under deadline and resource constraints in a Hadoop cluster. This problem is strongly NP-hard because a subproblem, which minimizes the makespan of independent jobs on identical machines under a single resource constraint without considering energy cost, has been proved to be strongly NP-hard [12]. In our problem, it is challenging to balance the trade-off between energy cost and execution time of each component job to determine their respective completion times in MapReduce workflows, despite several previous efforts in traditional workflows, such as the partial critical path method [4].

We start with a special case with a pipeline-structured workflow (a set of linearly arranged jobs with a dependency between any two neighbors along the line) on a homogeneous cluster. We prove this special case to be weakly NP-complete and design a fully polynomial-time approximation scheme (FPTAS) whose time complexity is linear in 1/ε. By leveraging the near optimality and low time complexity of our FPTAS, we design a heuristic for the generalized problem with a directed acyclic graph (DAG)-structured workflow on a heterogeneous cluster. This heuristic iteratively selects the longest chain of unmapped jobs from the workflow and applies our FPTAS to the selected pipeline while taking machine heterogeneity into consideration.

In sum, our work makes the following contributions to the field.
• To the best of our knowledge, our work is among the first to study energy-efficient mapping of big data workflows comprised of moldable jobs in Hadoop systems.
• We prove a deadline-constrained pipeline-structured workflow mapping problem for minimum total (energy) cost to be weakly NP-complete and design an FPTAS for it.
• The performance superiority of the proposed heuristic for the general workflow mapping problem in terms of dynamic energy saving and deadline missing rate is illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms.

The rest of the paper is organized as follows. Section 2 provides a survey of related work. Section 3 formulates a big data workflow mapping problem. We prove a special case to be weakly NP-complete and design an FPTAS for it in Section 4, and design a heuristic for the generalized problem in Section 5. Section 6 evaluates the performance.

2 Related Work

A large number of research efforts have been made to optimize the data replication scheme in the Hadoop distributed file system (HDFS) so that data nodes can be turned off without affecting data availability [16, 5, 8]. Our research on job scheduling is orthogonal to these efforts, and adds an additional level of energy efficiency to Hadoop systems.

2.1 Energy-efficient Job Scheduling in Hadoop

2.1.1 Heterogeneous Computing Environments

Since servers in large-scale clusters are typically upgraded or replaced in an incremental manner, many techniques consider hardware heterogeneity of Hadoop clusters for energy saving. Cardosa et al. proposed static virtual machine (VM) placement algorithms to minimize the cumulative machine uptime of all physical machines (PMs), based on two principles: spatial fitting of VMs on PMs to achieve high resource utilization according to complementary resource requirements from VMs, and temporal fitting of PMs with VMs having similar runtimes to ensure that a server runs at a high utilization level throughout its uptime [6]. Mashayekhy et al. modeled the energy-aware static task scheduling of a MapReduce job as an integer programming problem, and designed two heuristics that assign map/reduce tasks to machine slots to minimize energy consumption while satisfying the service level agreement (SLA) [22]. Cheng et al. proposed a heterogeneity-aware dynamic task assignment approach using ant colony optimization, referred to as E-Ant, to minimize the overall energy consumption of MapReduce applications with heterogeneous workloads in a heterogeneous Hadoop cluster without a priori knowledge of workload properties [9].

2.1.2 Renewable Energy

Several efforts focused on utilizing renewable energy in the operation of Hadoop clusters. Goiri et al. proposed a framework, GreenHadoop, for a data center powered by renewable (green) energy and by carbon-intensive (brown) energy from the electrical grid as a backup. It dynamically schedules MapReduce jobs to minimize brown energy consumption by delaying background computations within their time bounds to match the green energy supply that is not always available [14]. Cheng et al. designed a scheduler for a Hadoop cluster powered by mixed brown and green energy, which dynamically determines resource allocation to heterogeneous jobs based on the estimation of job completion time and the prediction of future resource availability [10].

2.1.3 Resource Allocation

The majority of existing efforts targeted the first generation of Hadoop; work on the second generation of Hadoop, i.e. YARN, is still quite limited. Li et al. proposed a suspend-resume mechanism in YARN to mitigate the overhead of preemption in cluster scheduling, and used a checkpointing mechanism to save the states of jobs for resumption [19]. Their approach dynamically selects appropriate preemption mechanisms based on the progress of a task and its suspend-resume overhead to improve job response time and reduce energy consumption.

2.2 Energy-efficient Workflow Scheduling

Many efforts were made on energy-efficient scheduling of workflows comprised of precedence-constrained serial programs. Some of these approaches targeted virtualized environments by migrating active VMs onto energy-efficient PMs in time [25] or consolidating applications with complementary resource requirements [28]. Zhu et al. developed a workflow scheduling framework, pSciMapper, which consists of two major components: i) online power-aware consolidation, based on available information on the utilization of CPU, memory, disk, and network by each job, and ii) offline analysis, including a hidden Markov model for estimating resource usage per job and kernel canonical correlation analysis for modeling the resource-time and resource-power relationships [28].

Other approaches focused on physical clusters as follows. Lee et al. proposed a static workflow schedule compaction algorithm to consolidate the resource use of a workflow schedule generated by any scheduling algorithm in homogeneous environments [17], and designed two static energy-conscious workflow scheduling algorithms based on DVFS in heterogeneous distributed systems [18]. In [20], three types of DVFS-based heuristics, namely prepower-determination, postpower-determination, and hybrid algorithms, were designed to solve a static problem of joint power allocation and workflow scheduling for schedule length (or energy consumption) minimization under an energy constraint (or a time constraint). Zhang et al. proposed a DVFS-based heuristic to statically maximize workflow reliability under an energy constraint in a heterogeneous cluster [27], and designed a Pareto-based bi-objective genetic algorithm to achieve low energy consumption and high system reliability for static workflow scheduling [26].

2.3 Moldable/Malleable Job Scheduling

Some efforts have been made to minimize the completion time of a workflow comprised of malleable jobs [23, 7, 21], but there exist relatively limited efforts on moldable/malleable job scheduling for energy efficiency. Sanders et al. designed a polynomial-time optimal solution and an FPTAS to statically schedule independent malleable jobs with a common deadline for energy consumption minimization, based on the theoretical power models of a single processor using the DVFS technology, i.e. p = f^α and p = f^α + δ, respectively, where f is the CPU frequency and δ is the constant static power consumption [24].

3 Problem Formulation

3.1 Cost Models

3.1.1 Cluster Model

We consider a heterogeneous Hadoop cluster consisting of a set M of machines connected via high-speed switches, which can be partitioned into homogeneous sub-clusters {C_l}.

Table 1: Notations used in the cost models.
  M = ∪_l C_l : a cluster of machines divided into homogeneous sub-clusters {C_l}
  m_i(N_i, p_i, o_i, P_i) : the i-th machine, equipped with a memory of size o_i and N_i CPU cores of speed p_i and DPC P_i per core at full utilization
  R : the available resource-time table of cluster M
  N_i^A(t) : the number of idle CPU cores on machine m_i at time t
  o_i^A(t) : the size of available memory on machine m_i at time t
  f(G(V, A), d) : a workflow request consisting of a workflow structure of a DAG G(V, A) and a deadline d
  v_j, s_{j,k} : the j-th component job in a workflow and the k-th task in job v_j
  a_{j,j'} : the directed edge from job v_j to job v_{j'}
  t_j^AS, t_j^AF : the actual start and finish time of job v_j
  t^C : the completion time of a workflow
  μ_{i,j} : the percentage of execution time for CPU-bound instructions in job v_j on machine m_i
  o_j : the memory demand per task in job v_j
  w_j(K) : the workload of job v_j partitioned into K tasks
  w_{j,k}(K) : the workload of task s_{j,k} in v_j with K tasks
  K_j, K_j' : the number and the maximum possible number of tasks in v_j
  t_{i,j,k} : the execution time of task s_{j,k} running on machine m_i
  a_{i,j,k}(t) : indicates whether task s_{j,k} is active on machine m_i at time t
  n_{i,j}(t) : the number of running tasks in job v_j on machine m_i at time t
  n_i(t) : the number of CPU cores used by f on machine m_i at time t
  o_i(t) : the size of memory used by workflow f on machine m_i at time t
  E : the DEC of workflow f in cluster M
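The ART table R defined above can be kept as per-machine vectors of idle cores N_i^A(t) and free memory o_i^A(t) over discrete time slots. The following is a minimal Python sketch under that assumption; the class and method names (ARTTable, can_host, allocate) are illustrative, not from the paper.

```python
class ARTTable:
    """Available resource-time (ART) table: per machine, tracks idle CPU
    cores N_i^A(t) and available memory o_i^A(t) over discrete time slots."""

    def __init__(self, machines, horizon):
        # machines: dict machine_id -> (cores N_i, memory o_i)
        self.horizon = horizon
        self.cores = {m: [n] * horizon for m, (n, _) in machines.items()}
        self.mem = {m: [o] * horizon for m, (_, o) in machines.items()}

    def can_host(self, m, start, end, mem_demand):
        # True if machine m has >= 1 idle core and enough memory in [start, end)
        return all(self.cores[m][t] >= 1 and self.mem[m][t] >= mem_demand
                   for t in range(start, end))

    def allocate(self, m, start, end, mem_demand):
        # Reserve one core and mem_demand memory on machine m for [start, end)
        assert self.can_host(m, start, end, mem_demand)
        for t in range(start, end):
            self.cores[m][t] -= 1
            self.mem[m][t] -= mem_demand
```

A scheduler would call can_host before placing each task and allocate once a mapping is committed; real Hadoop/YARN bookkeeping is of course richer than this sketch.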
Each machine m_i is equipped with N_i homogeneous CPU cores of speed p_i and a shared memory of size o_i. For the entire cluster, a central scheduler maintains an available resource-time (ART) table R, which records the number N_i^A(t) ≤ N_i of idle CPU cores and the size o_i^A(t) ≤ o_i of available memory in each machine m_i at time t.

3.1.2 Workflow Model

We consider a user request in the form of a workflow f(G, d), which specifies a workflow structure G and a deadline d. The workflow structure is defined as a DAG G(V, A), where each vertex v_j ∈ V represents a component job, and each directed edge a_{j,j'} ∈ A denotes an execution dependency, i.e. the actual finish time (AFT) t_j^AF of job v_j must not be later than the actual start time (AST) t_{j'}^AS of job v_{j'}. The completion time of the workflow is denoted as t^C. We consider the map and reduce phases of each MapReduce job as two component jobs connected via an execution dependency edge.

3.1.3 MapReduce Model

We consider a MapReduce job v_j running a set of parallel map (or reduce) tasks, each of which requires a memory of size o_j and spends a percentage μ_{i,j} of time executing CPU-bound instructions on a CPU core of machine m_i. In job v_j, generally, as the number K_j of parallel tasks increases, the workload w_{j,k}(K_j) of each task s_{j,k} decreases and the total workload w_j(K_j) = K_j · w_{j,k}(K_j) of all tasks increases. However, the maximum number K_j' of tasks that can be executed in parallel without performance degradation is limited by the cluster capacity, e.g. K_j' ≤ Σ_{m_i ∈ M} min{N_i, ⌊o_i/o_j⌋}. Note that a serial program can be considered as a special case of a MapReduce job with K_j' = 1. The execution time of task s_{j,k} on machine m_i is t_{i,j,k} = w_{j,k}(K_j)/(μ_{i,j} · p_i). Estimating the execution time of a task on any service is an important issue. Many techniques have been proposed, such as code analysis, analytical benchmarking/code profiling, and statistical prediction, which are beyond the scope of this work.

The active state a_{i,j,k}(t) of task s_{j,k} on machine m_i is 1 (or 0) if it is active (or inactive) at time t. The number of active tasks in job v_j on machine m_i at time t is n_{i,j}(t) = Σ_{s_{j,k} ∈ v_j} a_{i,j,k}(t). The number of CPU cores and the size of memory used by all component jobs of a workflow on machine m_i at time t are n_i(t) = Σ_{v_j ∈ V} n_{i,j}(t) and o_i(t) = Σ_{v_j ∈ V} [o_j · n_{i,j}(t)], respectively.

3.1.4 Energy Model

The DEC of a workflow in a cluster is

  E = Σ_{m_i ∈ M} { P_i · Σ_{v_j ∈ V} [ μ_{i,j} · ∫_0^{t^C} n_{i,j}(t) dt ] },

where P_i is the dynamic power consumption (DPC) of a fully utilized CPU core. This model is validated by energy measurements of practical systems in [9].

3.1.5 Mapping Function

We define a workflow mapping function as M : {s_k(v_j) --[t_{j,k}^S, t_{j,k}^E]--> m_i, ∀v_j ∈ V, ∃m_i ∈ M, ∃[t_{j,k}^S, t_{j,k}^E] ⊂ T}, which denotes that the k-th task of the j-th job is mapped onto the i-th machine from time t_{j,k}^S to time t_{j,k}^E. The domain of this mapping function covers all possible combinations of a set V of component jobs of the workflow, a set M of machines, and a time period T of workflow execution.

3.2 Problem Definition

We formulate a deadline- and resource-constrained workflow mapping problem for energy efficiency (EEWM):

Definition 1. EEWM: Given a cluster {m_i(N_i, p_i, o_i, P_i)} of machines with an available resource-time table {N_i^A(t), o_i^A(t)}, and a workflow request f(G(V, A), d), where each job v_j has a set {w_j(K_j) | K_j = 1, 2, ..., K_j'} of workloads for different task partitions, and each task in job v_j has a percentage μ_{i,j} of execution time for CPU-bound instructions on machine m_i and a memory demand o_j, we wish to find a mapping function M : (V, M, T) → {s_k(v_j) --[t_{j,k}^S, t_{j,k}^E]--> m_i} to minimize the dynamic energy consumption

  min_M E,

subject to the following deadline and resource constraints:

  t^C ≤ d,
  t_j^AF ≤ t_{j'}^AS, ∀a_{j,j'} ∈ A,
  n_i(t) ≤ N_i^A(t), ∀m_i ∈ M,
  o_i(t) ≤ o_i^A(t), ∀m_i ∈ M.
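Under the energy model of Section 3.1.4, with piecewise-constant task activity the integral of n_{i,j}(t) reduces to the sum of task durations, so the objective E of Definition 1 can be evaluated directly from a mapping. A minimal sketch follows; the flat tuple encoding of the mapping is a hypothetical convenience, not the paper's notation.

```python
def dynamic_energy(mapping, P, mu):
    """DEC of a mapped workflow: E = sum_i P_i * sum_j mu[i][j] * (total
    busy core-time of job j's tasks on machine i). With piecewise-constant
    activity, the integral of n_ij(t) is the sum of task durations."""
    E = 0.0
    for (job, task, machine, start, end) in mapping:
        E += P[machine] * mu[(machine, job)] * (end - start)
    return E
```

For example, one task of duration 3 on a machine with P_i = 90 W and μ = 0.5 contributes 90 · 0.5 · 3 = 135 units of dynamic energy.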
4 Special Case: Pipeline-structured Workflow

We start with a special case with a PIpeline-structured workflow running on HOmogeneous machines (PHO). We prove it to be weakly NP-complete and design an FPTAS to solve EEWM-PHO. Generally, we may achieve more energy savings on an under-utilized cluster than on a fully utilized cluster. Hence, the problem for a single pipeline-structured workflow is still valuable in real-life systems. The EEWM-PHO problem is defined as follows.

Definition 2. EEWM-PHO: Given I idle homogeneous machines {m_i(N, p, o, P)} and a workflow f(G(V, A), d) containing a chain of J jobs, where each job v_j has a workload list {w_j(K_j) | K_j = 1, 2, ..., K_j'}, and each task in job v_j has a percentage μ_j of execution time for CPU-bound instructions and a memory demand o_j, does there exist a feasible mapping scheme such that the DEC is no more than E?

Figure 1: A constructed network corresponding to a workflow with a pipeline structure. (The network has a source u_0 and a destination u_J; each job v_j contributes vertices v_{j,1}, ..., v_{j,K_j'} between u_{j-1} and u_j.)

4.1 Complexity Analysis

We prove that EEWM-PHO is NP-complete by reducing the two-choice knapsack problem (TCKP) to it.

Definition 3. Two-Choice Knapsack (TCKP): Given J classes of items to pack in a knapsack of capacity H, where each class C_j (j = 1, 2, ..., J) has two items and each item r_{j,l} (l = 1, 2) has a value b_{j,l} and a weight h_{j,l}, is there a choice of exactly one item from each class such that the total value is no less than B and the total weight does not exceed H?

The knapsack problem is a special case of TCKP obtained by putting each item in the knapsack problem together with a dummy item of zero value and zero weight into a class. Since the knapsack problem is NP-complete, so is TCKP.

Theorem 1. EEWM-PHO is NP-complete.

Proof. Obviously, EEWM-PHO ∈ NP. We prove that EEWM-PHO is NP-hard by reducing TCKP to EEWM-PHO. Let ({C_j(b_{j,1}, h_{j,1}, b_{j,2}, h_{j,2}) | 1 ≤ j ≤ J}, B, H) be an instance of TCKP. Without loss of generality, we assume that b_{j,1} > b_{j,2} and h_{j,1} > h_{j,2} > 0. If h_{j,1} < h_{j,2}, r_{j,1} would always be selected. If h_{j,2} = 0, we can always add τ > 0 to h_{j,1}, h_{j,2} and H such that h_{j,2} > 0.

We construct an instance of EEWM-PHO as follows. Let I = 2, d = H, v_j = C_j, K_j' = 2, o_j = o, w_j(1) = h_{j,1} μ_j p, w_j(2) = 2 h_{j,2} μ_j p, μ_j = (B_j − b_{j,1})/(h_{j,1} P), and E = Σ_{1≤j≤J} B_j − B, where B_j = (2 h_{j,2} b_{j,1} − h_{j,1} b_{j,2})/(2 h_{j,2} − h_{j,1}). It is obvious that the construction process can be done in polynomial time.

Then, if job v_j has only one task, its execution time is t_j(1) = w_j(1)/(μ_j p) = h_{j,1}, and its DEC is E_j(1) = t_j(1) μ_j P = B_j − b_{j,1}. If job v_j has two tasks, the execution time of each task is t_j(2) = w_j(2)/(2 μ_j p) = h_{j,2}, and the DEC of job v_j is E_j(2) = 2 t_j(2) μ_j P = B_j − b_{j,2}. Obviously, two tasks in a job are mapped onto two machines simultaneously.

Therefore, if the answer to the given instance of TCKP is YES (or NO), the answer to the constructed instance of EEWM-PHO is also YES (or NO). This completes the proof.

4.2 Approximation Algorithm

We prove that EEWM-PHO is weakly NP-complete and design an FPTAS, as shown in Alg. 1, by reducing this problem to the weakly NP-complete restricted shortest path (RSP) problem [13], which is solvable with an FPTAS.

Given an instance of EEWM-PHO, we construct an instance of RSP according to the pipeline-structured workflow as follows. As illustrated in Fig. 1, the network graph G consists of V = {v_{j,k} | j = 1, ..., J, k = 1, ..., K_j'} ∪ {u_0, u_j | j = 1, ..., J} with a source u_0 and a destination u_J, and E = {e_{2j−1,k}, e_{2j,k} | j = 1, ..., J, k = 1, ..., K_j'}, where e_{2j−1,k} = (u_{j−1}, v_{j,k}) and e_{2j,k} = (v_{j,k}, u_j). Then, we calculate the execution time of job v_j with k tasks as t_j(k) = w_j(k)/(k · p · μ_j), and accordingly its DEC as E_j(k) = k · P · μ_j · t_j(k). Subsequently, we assign the cost c(e) and delay l(e) of each edge e ∈ E as c(e_{2j−1,k}) = E_j(k), l(e_{2j−1,k}) = t_j(k), and c(e_{2j,k}) = l(e_{2j,k}) = 0, and set the delay constraint on a path from u_0 to u_J to be d. As a result, the minimum cost in RSP is exactly the minimum DEC in EEWM-PHO, and if v_{j,k} is on the solution path to RSP, the j-th job has k tasks.

Algorithm 1: EEWM-PHO-FPTAS
Input: A cluster {m_i(N, p, o, P)} and a chain of jobs {v_j} with a deadline d and a set {w_j(K_j)} of workloads
1: Construct a DAG G(V, E) for pipeline {v_j} as shown in Fig. 1, and assign cost E_j(k) and delay t_j(k) to edge e_{2j−1,k} and zero cost and zero delay to edge e_{2j,k};
2: Use the FPTAS in [13] to find the minimum-cost path from u_0 to u_J under delay constraint d with approximation ratio (1 + ε), and convert it to a mapping scheme.

Based on Theorem 1 and the above reduction, we have

Theorem 2. EEWM-PHO is weakly NP-complete.

Let K' = max_{1≤j≤J} K_j'. Then, |V| ≤ JK' + J + 1 and |E| ≤ 2JK' in the constructed graph G. It is obvious that the construction process can be done within time O(JK'). Therefore, EEWM-PHO finds a feasible solution that consumes energy within the least DEC multiplied by (1 + ε) in time O(J²K'²/ε) if the FPTAS in [13] is used to solve RSP in acyclic graphs. Thanks to the special topology in Fig. 1, the time complexity is further reduced to O(JK'(log K' + 1/ε)).
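The core of Alg. 1 is a delay-constrained minimum-cost path search over the chain of per-job options. As an illustration, this chain case can be solved exactly by a pseudo-polynomial dynamic program over integer delays; this is a simplified stand-in for the FPTAS of [13], not that algorithm itself (the cost rounding that yields the (1 + ε) guarantee is omitted).

```python
def min_cost_under_deadline(options, d):
    """Exact pseudo-polynomial DP for the chain-structured RSP instance:
    options[j] = list of (delay, cost) pairs for job j (one per task count);
    pick one option per job so the total delay is <= d and the total cost
    (DEC) is minimal. Returns None if no feasible choice exists."""
    INF = float("inf")
    best = [0.0] + [INF] * d          # best[t] = min cost with total delay t
    for opts in options:
        nxt = [INF] * (d + 1)
        for t in range(d + 1):
            if best[t] < INF:
                for delay, cost in opts:
                    if t + delay <= d and best[t] + cost < nxt[t + delay]:
                        nxt[t + delay] = best[t] + cost
        best = nxt
    ans = min(best)
    return None if ans == INF else ans
```

The DP runs in O(J · d · K') time, pseudo-polynomial in the deadline d; the FPTAS replaces the exact delay axis with a rounded one to achieve the stated polynomial bound.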
5 Algorithm Design for an Arbitrary Workflow on a Heterogeneous Cluster

We consider EEWM with a DAG-structured workflow on a heterogeneous cluster and design a heuristic algorithm, referred to as big-data adaptive workflow mapping for energy efficiency (BAWMEE).

5.1 An Overview of BAWMEE

The key idea of BAWMEE is to partition a DAG into a set of pipelines and then repeatedly employ Alg. 1, with its near optimality and low time complexity, to achieve energy-efficient mapping of each pipeline.

In BAWMEE, each workflow mapping consists of two components: iterative critical path (CP) selection and pipeline mapping. A CP is the longest execution path in a workflow, which can be calculated in linear time. The algorithm starts with computing an initial CP according to the average execution time of each job running in serial on all the machines, followed by a pipeline mapping process. Then, it iteratively computes a CP with the earliest last finish time (LFT) from the remaining unmapped workflow branches, based on the same average execution time of a job as above, and performs a pipeline mapping of the computed CP until no branches are left.

In pipeline mapping, we consider two extreme scenarios: resource/time sufficiency and resource/time insufficiency. In the former case, we only need to focus on energy efficiency, while in the latter case, it may be unlikely to meet the performance requirement. Therefore, we design one algorithm for each of these two scenarios: a heuristic for energy-efficient pipeline mapping (EEPM) under a deadline constraint in Alg. 3, which calls Alg. 1, and a heuristic for minimum delay pipeline mapping (MDPM) with energy awareness in Alg. 4. If Alg. 3 fails to find a feasible mapping scheme due to limited resources, we resort to Alg. 4. In EEPM, due to the homogeneity of tasks in a job, we map all the tasks in the same job onto a homogeneous sub-cluster, hence using Alg. 1 to balance the trade-off between execution time and DEC (directly associated with total workload) for each job on a pipeline. In MDPM, we search for a good task partitioning to minimize the end time of each job through a limited number of tries by reducing the possible numbers of tasks in each job v_j from {1, 2, 3, ..., K_j'} to {1, 2, 2², ..., 2^⌊log K_j'⌋} ∪ {K_j'}.

Each job v_j is associated with a set of pairs of the number K_{j,n} of tasks and the used homogeneous sub-cluster C_{j,n}. Each pair corresponds to a certain execution time t_j(K_{j,n}, C_{j,n}) and DEC e_j(K_{j,n}, C_{j,n}) = P(C_{j,n}) · w_j(K_{j,n})/p(C_{j,n}), where p(C_{j,n}) and P(C_{j,n}) are the speed and the DPC of a fully utilized CPU core on a machine in C_{j,n}, respectively, and w_j(K_{j,n}) is the workload of v_j with K_{j,n} tasks. All the quadruples {(t_j(K_{j,n}, C_{j,n}), e_j(K_{j,n}, C_{j,n}), K_{j,n}, C_{j,n})} are sorted in ascending order of execution time, as listed in Table 2, and are referred to as the time-energy table (TET) Tbl_j of job v_j. Any quadruple with both execution time and DEC larger (worse) than those of another is deleted from Tbl_j.

Table 2: Time-Energy Table Tbl_j of Job v_j.
  t_j(K_{j,1}, C_{j,1}) < t_j(K_{j,2}, C_{j,2}) < ... < t_j(K_{j,n}, C_{j,n})
  e_j(K_{j,1}, C_{j,1}) > e_j(K_{j,2}, C_{j,2}) > ... > e_j(K_{j,n}, C_{j,n})
  K_{j,1}, K_{j,2}, ..., K_{j,n} ∈ [1, K_j']
  C_{j,1}, C_{j,2}, ..., C_{j,n} ⊂ M

Algorithm 2: BAWMEE
Input: a workflow f(G(V, A), d) and an ART table R for sub-clusters {C_l}
1: Tbl ← buildTET(V, {C_l});
2: if simplyMap(f, R({C_l}), Tbl) = True then
3:   return;
4: t_j^LF ← +∞ for ∀v_j ∈ f; t_J^LF ← d for the end job v_J in f;
5: Calculate the average execution time t̄_j of each job v_j running in serial on all the machines;
6: G' ← G;
7: while ∃ an unmapped job ∈ V do
8:   Find the critical path cp ending at a job v with the earliest LFT in G' according to {t̄_j | v_j ∈ G'};
9:   if EEPM(cp, R({C_l}), Tbl) = False then
10:    v ← MDPM(cp, R({C_l}));
11:    if v ≠ Null then
12:      D ← {all the downstream jobs of v in G − G'};
13:      if D ≠ ∅ then
14:        Cancel the mapping of each job v' ∈ D, and add v' and its associated precedence constraints to G';
15:      EAJM(v, R({C_l}));
16:  G' ← G' − {v_j ∈ cp | v_j is mapped};

Algorithm 3: EEPM
Input: a pipeline pl with its EST pl.est and LFT pl.lft, an ART table R({C_l}), and TETs {Tbl_j}
Output: a boolean variable to indicate whether pl or its part is mapped
1: Label the index j of each job in pl from 1 to the length of pl;
2: Calculate the earliest possible start time of the first job in pl as est according to R({C_l});
3: pl.est ← max{est, pl.est};
4: if Σ_{v_j ∈ pl} t_j(K_{j,1}, C_{j,1}) > pl.lft − pl.est then
5:   return False;
6: Convert pipeline pl, where each quadruple in Tbl_j of each job v_j ∈ pl corresponds to one of its mapping options, into a network graph in RSP;
7: Use Alg. 1 to calculate the number K_j of tasks, sub-cluster C(v_j), and start and finish times, t_j^S and t_j^F, for each job v_j;
8: for v_{j+1} ∈ pl do
9:   if t_j^F > t'_LF(v_j) or t_j^F < t'_ES(v_{j+1}) then
10:    pl(1, j).est ← pl.est;
11:    if t_j^F > t'_LF(v_j) then
12:      pl(1, j).lft ← t'_LF(v_j);
13:    else
14:      pl(1, j).lft ← min{t'_ES(v_{j+1}), t'_LF(v_j), pl.lft};
15:    return EEPM(pl(1, j), R({C_l}), Tbl);
16: if ∃ K_j pairs of a CPU core and memory of size o_j in R(C(v_j)) for ∀v_j ∈ pl then
17:  Map all K_j tasks onto C(v_j) from t_j^S to t_j^F for ∀v_j ∈ pl;
18:  return True;
19: return False;

Algorithm 4: MDPM
Input: a pipeline pl and an ART table R for {C_l}
Output: the first job that cannot be mapped
1: for all v_j ∈ pl do
2:   if EAJM(v_j, R({C_l})) > t'_LF(v_j) then
3:     Cancel the mapping of job v_j;
4:     return v_j;
5: return Null.

Algorithm 5: EAJM
Input: a job v_j and an ART table R for sub-clusters {C_l}
Output: the EFT t_j^EF of job v_j
1: Update the TEST t'_ES(v_j); t_j^EF ← +∞;
2: for K ← 1, 2, 4, ..., 2^⌊log K_j'⌋, K_j' do
3:   Calculate the EFT t_j^EF(K) of job v_j with K tasks by minimizing the finish time of each task one by one;
4:   if t_j^EF > t_j^EF(K) then
5:     t_j^EF ← t_j^EF(K); K_j ← K;
6: Map job v_j consisting of K_j tasks until t_j^EF;
7: return t_j^EF.

Figure 2: An example of a workflow structure G (eight jobs with two parallel chains: v_1 → v_2 → v_4 → v_6 → v_8 and v_1 → v_3 → v_5 → v_7 → v_8).

Table 3: Time-Energy Table in the Example
  All options (time, energy, # of tasks, sub-cluster):
    (3, 6, 1, C1), (2, 8, 2, C1), (5, 5, 1, C2), (4, 8, 2, C2)
  ⇒ TET after pruning and sorting:
    (2, 8, 2, C1), (3, 6, 1, C1), (5, 5, 1, C2)

Figure 3: Workflow mapping in the example: (a) BAWMEE; (b) Optimal.

If a job v_j has been mapped, it has AST t_j^AS and AFT t_j^AF. If all the preceding (and succeeding) jobs, in Prec (and Succ), of job v_j are mapped, its earliest start time (EST) (and LFT) can be calculated as

  t_j^ES = 0, if v_j is the start job of workflow f; max_{v_{j'} ∈ Prec(v_j)} t_{j'}^AF, otherwise,

and

  t_j^LF = d, if v_j is the end job of workflow f; min_{v_{j'} ∈ Succ(v_j)} t_{j'}^AS, otherwise,

respectively. If there exist unmapped preceding and succeeding jobs of v_j, its temporary earliest start time (TEST) t'_ES(v_j) and temporary last finish time (TLFT) t'_LF(v_j) can be calculated based on only its mapped preceding and succeeding jobs, respectively. The EST and LFT of a pipeline are the EST of its first job and the LFT of its end job, respectively.

5.2 Algorithm Description

In Alg. 2, BAWMEE first builds a time-energy table for each job by calling buildTET(). If the workflow cannot meet its deadline with each job running the fastest, BAWMEE performs energy-aware job mapping (EAJM) with minimum finish time for each job in a topologically sorted order by calling simplyMap(). Otherwise, BAWMEE employs iterative CP selection to find a CP with the earliest LFT from the unmapped jobs, and performs EEPM or MDPM (if EEPM fails) for the selected CP. If there is any job that cannot be mapped in MDPM, we cancel the mapping of its downstream jobs. If it is the last job of the workflow, we perform EAJM with minimum finish time.
In Alg. 3 of EEPM, we reset the EST for the input pipeline according to the earliest time at which enough resources are made available to its first job. If the pipeline cannot meet its LFT with each job running the fastest, we exit EEPM; otherwise, the mapping of a pipeline with its EST and LFT is converted into the RSP problem with a relaxed resource limit. Accordingly, we calculate the number of tasks, the sub-cluster, and the start/finish time for each job using Alg. 1. Then, we check whether the start and finish times of each job are between its TEST and TLFT in their execution order. If there exists a job that violates the precedence constraint, we divide the pipeline at this job, and use Alg. 3 to compute the mapping of the upstream sub-pipeline with an updated LFT constraint. We repeat this process until we find a sub-pipeline whose mapping meets all precedence constraints. If the cluster is able to provide each job in this sub-pipeline with enough resources based on the mapping result of Alg. 1, we proceed with this mapping; otherwise, we fail to find an EEPM and thus exit. In this case, BAWMEE would proceed to search for an MDPM.

In Alg. 4 of MDPM, we search for the earliest finish time (EFT) of each job using EAJM in their execution order, and thus obtain the EFT of the entire pipeline. In Alg. 5 of EAJM with minimum finish time under resource constraints, we exponentially relax the limit on the maximum number of tasks in a job to make a tradeoff between the optimality and the time complexity of EAJM.

Since EEPM and MDPM are of O(J²K'L[log(K'L) + 1/ε] + M'H) and O(M'HJK' log K') time complexity, respectively, the time complexity of BAWMEE is O(J²K'[JL(1/ε + log(K'L)) + M'H log K']). Here, M' is the number of machines; L is the number of homogeneous sub-clusters; J is the number of jobs; K' is the maximum number of tasks in a job; and H is the number of time slots in the ART table.

5.3 A Numerical Example

In this subsection, we use a simple example to illustrate how BAWMEE achieves energy-efficient workflow mapping without violating precedence constraints. We consider an idle cluster M = C1 ∪ C2 consisting of 4 single-core machines, where C1 = {m1, m2} and C2 = {m3, m4}, which receives a workflow f comprised of homogeneous jobs organized as shown in Fig. 2 with a deadline of 19 time units. The execution time and DEC of a job with different task partitionings on different sub-clusters are calculated and listed on the left side of Table 3. BAWMEE first builds a TET for each job, shown on the right side of Table 3. A pipeline {v1, v2, v4, v6, v8} is selected as the initial CP. We assume that ε is set to 0.02. In an approximation solution of pipeline mapping with an EST of 0 and an LFT of 19, each job has only one task; v1, v2 and v6 are mapped onto machine m1 in C1 from 0 to 3, from 3 to 6, and from 11 to 14, respectively, and v4 and v8 are mapped onto machine m3 in C2 from 6 to 11 and from 14 to 19, respectively.

Then, the second pipeline {v3, v5, v7} is selected as the CP in G − {v1, v2, v4, v6, v8}. In an approximation solution of pipeline mapping with an EST of 3 and an LFT of 14, v3 intends to have one task and be mapped onto C2 from 3 to 8, and v5 and v7 intend to have one task and be mapped onto C1 from 8 to 11 and from 11 to 14, respectively. Since v3 misses its TLFT of 6, the first sub-pipeline {v3} of {v3, v5, v7} is extracted, and the approximation solution of sub-pipeline mapping with an EST of 3 and an LFT of 6 is that v3 has one task and is mapped onto machine m2 in C1 from 3 to 6. Subsequently, the third pipeline {v5, v7} is selected as the CP in G − {v1, v2, v3, v4, v6, v8}, and the approximation solution of its mapping with an EST of 6 and an LFT of 14 is that v5 intends to have one task and be mapped onto C2 from 6 to 9 and v7 intends to have one task and be mapped onto C1 from 9 to 14. Since v7 starts before its TEST of 11, the first sub-pipeline {v5} of {v5, v7} is extracted, and the approximation mapping solution of the sub-pipeline with an EST of 6 and an LFT of 11 is that v5 has one task and is mapped onto machine m4 in C2 from 6 to 11. Finally, the fourth pipeline {v7} is selected as the CP in G − {v1, v2, v3, v4, v5, v6, v8}, and the approximation solution of its mapping with an EST of 11 and an LFT of 14 is that v7 has one task and is mapped onto machine m2 in C1 from 11 to 14. The mapping result of BAWMEE is shown in Fig. 3(a), and its DEC is 45 units. The optimal mapping is shown in Fig. 3(b), and the minimum DEC is 44 units.

6 Performance Evaluation

Table 4: Specifications for Four Types of Machines
  Mach. Type | CPU Model | # of cores | Freq. (GHz) | DPC per core (W) | Mem. (GB)
  1 | 6-core Xeon E7450 | 18 | 2.40 | 90 | 64
  2 | Single Core Xeon | 6 | 3.20 | 92 | 64
  3 | 2-Core Xeon 7150N | 12 | 3.50 | 150 | 64
  4 | Itanium 2 9152M | 8 | 1.66 | 104 | 64

Figure 4: The execution time of a MapReduce job versus the number of tasks (x-axis: the number of map tasks, 22, 29, 51, 96, 187; y-axis: execution time in minutes).

The number of precedence constraints of the workflow is set to 1.5 times the number of jobs, if possible. The maximum possible number of tasks for each job is randomly selected between 12 and 48. The workload of a job is randomly selected between 0.6 × 10^12 and 21.6 × 10^12 CPU cycles when running in serial. The workload w(k) of a job with k > 1 tasks is randomly selected between w(k − 1)[1 + 0.2/(k − 1)] and w(k − 1)[1 + 0.6/(k − 1)]. We calculate the sum t1 of the average execution time of the serial jobs on the critical path and the sum t2 of the average execution time of all serial
jobs according to the CPU speeds of all types of machines, and randomly select a workflow deadline baseline from the 6 Performance Evaluation time range [t1 , t2 ]. The percentage of execution time for We conduct experiments to illustrate the effect of task CPU-bound instructions of a task in each job on each type partitioning on job workload and conduct simulations to of machine is randomly selected between 0.1 and 1.0. The evaluate the performance of BAWMEE in comparison with memory demand of a task in each job is randomly selected two existing algorithms adapted from different scenarios: i) from 0.5GB to 4GB at an interval of 0.5GB. pSciMapper adapted from a workflow mapping algorithm We evaluate these algorithms in a heterogeneous clus- in [28] by applying the interference avoidance to MapRe- ter consisting of machines with four different specifications duce mapping, and ii) EEDA adapted from a MapReduce listed in Table 4, based on 4 types of Intel processors. Each job mapping algorithm integrated with algorithms in [9] homogeneous sub-cluster has the same number of machines. and [10]. Each scheduling simulation lasts for 3 days and is repeated 6.1 Performance Model Validation 20 times with different workflow instances, whose arrivals We consider a computing performance model where the follow the Poisson distribution. In the performance evalua- total workload of a moldable parallel job increases and the tion, each data point represents the average of 20 runs with execution time of each task decreases as the number of tasks a standard deviation. The parameter ǫ in BAWMEE is set increases. For model validation, we conduct an experiment to 0.2. By default, the workflow size is randomly selected to illustrate the effect of task partitioning on job workload between 40 and 60 jobs; the cluster size and the average ar- for big data applications, which is the foundation of this rival interval of workflows are set to be 128 machines and 30 research. 
Towards this goal, we implement a MapReduce minutes, respectively; the deadline factor, which is a coef- program to find out the most common reason for flight can- ficient multiplied by the deadline baseline to determine the cellations based on the airline on-time performance data set actual workflow deadline, is set to 0.1. from [1] and run this program on a computer server equipped The dynamic energy consumption reduction (DECR) over with 2 processors of Intel(R) Xeon(R) CPU E5-2630 with 6 the other algorithms in comparison is defined as cores of 2.30GHz and 64GB memory. The program execu- DECOther − DECBAW M EE tion time is measured and plotted in Fig. 4, which shows DECR(Other) = · 100%, DECOther that the execution time of this MapReduce job increases as the number of tasks increases when the server is fully uti- where DECBAW M EE and DECOther are the average DEC lized during execution, which means that the total workload per workflow achieved by BAWMEE and the other algo- increases with the number of tasks. rithm, respectively. The deadline missing rate (DMR) is de- fined as the ratio of the number of workflows missing their 6.2 Simulation Settings deadlines to the total number of workflows. The unit run- We generate a series of random workflows as follows: (i) ning time (URT) is measured as the average simulation run- randomly select the length L of the critical path of a work- ning time for computing the mapping scheme of each work- flow (no less than 3) and divide the workflow into L layers, flow. The simulation runs on a Linux machine equipped in each of which every job has the same length of the longest with Intel Xeon CPU E5-2620 v3 of 2.4 GHz and a memory path from the start job; (ii) randomly select the number of of 16 GB. 
jobs in each layer except the first and last layers, in which there is only one job; (iii) for each job, add an input edge 6.3 Simulation Results from a randomly selected job in the immediately preceding 6.3.1 Problem Size layer, if absent, and an output edge to a randomly selected For performance evaluation, we consider 20 different prob- job in its downstream layer(s); (iv) randomly pick up two lem sizes from small to large scales, indexed from 1 to 20 as jobs in different layers and add a directed edge from the job tabulated in Table 5. Each problem size is defined as a in the upstream layer to the job in the downstream layer quadruple (|V |, |M |, 1/λ, T ), where 1/λ is the average ar- until we reach the given number of edges. The number of rival interval of workflow requests in minutes, and T is the 40 WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah 80 10 1 Running Time per Workflow Mapping (s) BAWMEE over pSciMapper BAWMEE over EEDA 100 pSciMapper pSciMapper 70 EEDA EEDA 90 BAWMEE BAWMEE 10 0 Deadlline Missing Rate (%) 60 80 DEC Reduction (%) 70 50 60 10 -1 40 50 30 40 10 -2 30 20 20 10 -3 10 10 0 0 10 -4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Problem Index Problem Index Problem Index Figure 5: DECR vs. problem sizes. Figure 6: DMR vs. problem sizes. Figure 7: URT vs. problem sizes. 320 100 0.6 Running Time per Workflow Mapping (s) pSciMapper EEDA BAWMEE pSciMapper 90 EEDA 0.55 280 BAWMEE 0.5 80 Deadline Missing Rate (%) DEC per Workflow (MJ) 240 0.45 70 0.4 200 60 0.35 pSciMapper 160 50 0.3 EEDA 40 BAWMEE 0.25 120 30 0.2 80 20 0.15 10 0.1 40 0.05 0 0 0 .05 .1 .15 .2 .25 .3 .35 .4 .45 .5 .55 .6 .65 .7 .75 .8 .85 .9 .95 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Deadline Factor Deadline Factor Deadline Factor Figure 8: DEC vs. deadlines. Figure 9: DMR vs. 
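The layered random-DAG construction of Sec. 6.2 (steps (i)–(iv)) can be sketched as follows. This is a minimal illustration under our own naming and simplifications (the `generate_workflow` function, the 8-layer cap, and the uniform layer assignment are ours, not the authors'):

```python
import random

def generate_workflow(num_jobs, num_edges, seed=None):
    """Sketch of the layered random-DAG generator (Sec. 6.2, steps i-iv):
    (i)   pick a critical-path length L (no less than 3) and split jobs into L layers,
    (ii)  keep exactly one job in the first and last layers,
    (iii) give each job an input edge from the immediately preceding layer
          and an output edge into a downstream layer,
    (iv)  add extra forward edges until the edge budget is reached, if possible."""
    rng = random.Random(seed)
    L = rng.randint(3, max(3, min(num_jobs, 8)))      # assumed cap of 8 layers
    layers = [[] for _ in range(L)]
    layers[0].append(0)                               # single start job
    layers[-1].append(num_jobs - 1)                   # single end job
    middle = list(range(1, num_jobs - 1))
    rng.shuffle(middle)
    for i in range(1, L - 1):                         # keep every middle layer non-empty
        layers[i].append(middle.pop())
    for v in middle:                                  # spread the rest uniformly
        layers[rng.randint(1, L - 2)].append(v)
    layer_of = {v: i for i, layer in enumerate(layers) for v in layer}
    edges = set()
    for v in range(num_jobs):                         # step (iii)
        i = layer_of[v]
        if i > 0:
            edges.add((rng.choice(layers[i - 1]), v))
        if i < L - 1:
            edges.add((v, rng.choice(layers[rng.randint(i + 1, L - 1)])))
    attempts = 0                                      # step (iv); bounded, "if possible"
    while len(edges) < num_edges and attempts < 100 * num_edges:
        attempts += 1
        u, v = rng.sample(range(num_jobs), 2)
        if layer_of[u] < layer_of[v]:                 # forward edges only, so the graph stays a DAG
            edges.add((u, v))
    return layers, sorted(edges)
```

Under the paper's default settings, `num_edges` would be set to 1.5 times the number of jobs; every edge runs from a lower to a higher layer, so acyclicity holds by construction.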
As the workflow size and arrival frequency increase from index 1 to 20, we increase the resources correspondingly to meet tight deadlines with a deadline factor of 0.1. We plot the DECR, DMR, and URT of pSciMapper, EEDA, and BAWMEE in Figs. 5-7, respectively, which show that BAWMEE saves 58.1% to 67.3% DEC in comparison with pSciMapper and 22.1% to 40.1% DEC in comparison with EEDA, and that the DMR of pSciMapper (or EEDA) minus that of BAWMEE is in the range of 42.6% to 93.4% (or 4.1% to 59.7%). Furthermore, the URT of BAWMEE is on the same order of magnitude as those of pSciMapper and EEDA, and is less than 7.4 seconds even for problem index 20.

Table 5: Problem Sizes

Index | (|V|, |M|, 1/λ, T)  | Index | (|V|, |M|, 1/λ, T)
1     | (3-7, 4, 240, 7)    | 11    | (53-57, 192, 30, 1)
2     | (8-12, 8, 200, 7)   | 12    | (58-62, 256, 25, 1)
3     | (13-17, 12, 160, 7) | 13    | (63-67, 384, 20, 1)
4     | (18-22, 16, 150, 7) | 14    | (68-72, 512, 15, 1)
5     | (23-27, 24, 120, 7) | 15    | (73-77, 768, 12, 1)
6     | (28-32, 32, 105, 3) | 16    | (78-82, 1024, 10, 1/3)
7     | (33-37, 48, 90, 3)  | 17    | (83-87, 1536, 8, 1/3)
8     | (38-42, 64, 60, 3)  | 18    | (88-92, 2048, 6, 1/3)
9     | (43-47, 96, 45, 3)  | 19    | (93-97, 3072, 5, 1/3)
10    | (48-52, 128, 30, 3) | 20    | (98-102, 4096, 4, 1/3)

6.3.2 Deadline Constraint

We evaluate the performance of pSciMapper, EEDA, and BAWMEE in terms of DEC, DMR, and URT under different deadline constraints, obtained from the deadline baseline multiplied by a factor from 0.05 to 1 at an interval of 0.05. The DEC, DMR, and URT of these algorithms are plotted in Figs. 8-10, respectively. These measurements show that BAWMEE saves 56.7% to 71.0% DEC and 8.8% to 43.1% DEC as the deadline increases, and reduces the DMR from 99.8% and 90.4% to 54.2% with a deadline factor of 0.05, in comparison with pSciMapper and EEDA, respectively. The DMR of BAWMEE is close to zero when the deadline factor is larger than 0.1. Additionally, the URT of BAWMEE is less than 0.6 second, and is 89.1% to 1.2 times and 6.6 to 13.2 times of those of pSciMapper and EEDA, respectively. It is worth pointing out that as the deadline increases, the DEC and URT of BAWMEE decrease, because EEPM plays a more significant role than MDPM in BAWMEE. Hence, BAWMEE makes a better tradeoff between DEC and DMR than the other algorithms in comparison, at an acceptable cost of running time.

Figure 10: URT vs. deadlines.

6.3.3 Workflow Size

For scalability test, we run these three algorithms under different average workflow sizes with 5 to 100 jobs per workflow at a step of 5, where the maximum and minimum workflow sizes are 2 jobs more and less than the average workflow size, respectively. We plot the DECR, DMR, and URT of these algorithms in Figs. 11-13, respectively, where we observe that BAWMEE achieves an increasing DECR between 54.8% and 73.2% in comparison with pSciMapper, and between 8.2% and 49.3% in comparison with EEDA. Moreover, BAWMEE misses fewer than 3.0% of deadlines, while pSciMapper and EEDA miss 77.9% to 97.6% and 9.6% to 69.6% of deadlines, respectively. For large workflow sizes with 80 to 100 jobs per workflow that impose high resource demands, BAWMEE achieves a DECR between 8.2% and 11.5% in comparison with EEDA, because it significantly reduces the DMR (the first objective) from over 54.3% to less than 0.1%. The DMR of EEDA experiences a slump under medium workflow sizes, because a higher accuracy can be achieved on the execution progress of a smaller workflow than a larger one, while a further increase in the workflow size may lead to a more severe shortage of computing resources. In addition, the URT of BAWMEE is 75.5% to 3.6 times and 1.3 to 10.2 times of those of pSciMapper and EEDA, respectively.

Figure 11: DECR vs. workflow sizes. Figure 12: DMR vs. workflow sizes. Figure 13: URT vs. workflow sizes.

6.3.4 Cluster Size

We run these three algorithms under different cluster sizes of 64 to 256 machines at a step of 16 for scalability test. The DEC, DMR, and URT of these algorithms are plotted in Figs. 14-16, respectively, where we observe that as the number of machines increases, BAWMEE consumes 54.2% to 69.3% and 6.6% to 43.9% less DEC than pSciMapper and EEDA, respectively, hence exhibiting a satisfactory scalability property with respect to the cluster size. Furthermore, BAWMEE only misses 0.1% to 4.0% of deadlines, while pSciMapper and EEDA miss 77.9% to 98.9% and 7.7% to 78.3% of deadlines, respectively. The increase in the cluster size results in a relatively looser deadline and a more flexible workflow mapping, as a result of which the DMRs of all three algorithms drastically decrease and BAWMEE has more chances to save energy. Moreover, the URT of BAWMEE is less than 6.4 seconds and is comparable with those of pSciMapper and EEDA.

Figure 14: DEC vs. cluster sizes. Figure 15: DMR vs. cluster sizes. Figure 16: URT vs. cluster sizes.

6.3.5 Workflow Structure

We further investigate these three algorithms with various workflow structures, including a random shape, a chain, a tree, a reverse tree, and a diamond. The DEC, DMR, and URT are plotted in Figs. 17-19, respectively, which show that BAWMEE reduces DEC by 63.6%, 56.6%, 86.3%, 86.5% and 86.2% in comparison with pSciMapper, and by 27.6%, 14.7%, 72.1%, 72.5% and 72.2% in comparison with EEDA, in random, chain, tree, reverse tree and diamond structured workflows, respectively. Further, BAWMEE saves less energy in chain-structured workflows than in the others, because the deadline baseline is the tightest for this structure under our deadline generation method. BAWMEE only misses 0.2%, 0%, 1.6%, 0% and 0% of deadlines, while pSciMapper and EEDA miss 89.7%, 100%, 58.0%, 71.9% and 61.6%, and 12.3%, 32.8%, 3.2%, 4.6% and 3.4% of deadlines, in random, chain, tree, reverse tree and diamond structured workflows, respectively. Besides, the URT of BAWMEE is less than 0.6 second, and is 1.0 times, 4.5 times, 1.1%, 0.4% and 0.4%, as well as 9.4 times, 12.7 times, 16.5%, 9.1% and 6.4%, of those of pSciMapper and EEDA in random, chain, tree, reverse tree and diamond structured workflows, respectively.

Figure 17: DEC vs. workflow structures. Figure 18: DMR vs. workflow structures. Figure 19: URT vs. workflow structures.

7 Conclusion

We investigated the properties of moldable jobs and formulated a workflow mapping problem to minimize dynamic energy consumption under deadline and resource constraints. We designed an FPTAS for a special case with a pipeline-structured workflow on a homogeneous cluster, a case which was proved to be NP-complete, and a heuristic for a generalized problem with an arbitrary workflow on a heterogeneous cluster. The performance superiority of the proposed heuristic in terms of dynamic energy saving and deadline missing rate was illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms. Our work reveals that energy-efficient and deadline-aware mapping algorithms tailored to big data workflows can lead to significant energy savings and a higher level of Quality of Service. It is of our future interest to incorporate the proposed mapping algorithms into existing workflow engines in the Hadoop ecosystem, including Oozie [2] and Tez [3], and to evaluate the energy savings for real-life big data workflows.

8 Acknowledgments

This research is sponsored by the National Science Foundation under Grant No. CNS-1560698 with New Jersey Institute of Technology.

9 References

[1] Statistical Computing. http://stat-computing.org/dataexpo/2009/the-data.html.
[2] Apache Oozie. https://oozie.apache.org.
[3] Apache Tez. https://tez.apache.org.
[4] S. Abrishami, M. Naghibzadeh, and D. H. Epema. Cost-driven scheduling of grid workflows using partial critical paths. IEEE TPDS, 23(8):1400–1414, 2012.
[5] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In Proc. of ACM SoCC, pages 217–228, Indianapolis, IN, USA, Jun 2010.
[6] M. Cardosa, A. Singh, H. Pucha, and A. Chandra. Exploiting spatio-temporal tradeoffs for energy-aware MapReduce in the cloud. IEEE Trans. on Computers, 61(12):1737–1751, 2012.
[7] C.-Y. Chen and C.-P. Chu. A 3.42-approximation algorithm for scheduling malleable tasks under precedence constraints. IEEE TPDS, 24(8):1479–1488, 2013.
[8] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz. Energy efficiency for large-scale MapReduce workloads with significant interactive analysis. In Proc. of ACM EuroSys, pages 43–56, Bern, Switzerland, Apr 2012.
[9] D. Cheng, P. Lama, C. Jiang, and X. Zhou. Towards energy efficiency in heterogeneous Hadoop clusters by adaptive task assignment. In Proc. of IEEE ICDCS, pages 359–368, Columbus, OH, USA, Jun-Jul 2015.
[10] D. Cheng, J. Rao, C. Jiang, and X. Zhou. Resource and deadline-aware job scheduling in dynamic Hadoop clusters. In Proc. of IEEE IPDPS, pages 956–965, Hyderabad, India, May 2015.
[11] M. Drozdowski. Scheduling for Parallel Processing. Springer-Verlag London, 2009.
[12] J. Du and J. Y.-T. Leung. Complexity of scheduling parallel task systems. SIAM J. Disc. Math., 2(4):473–487, 1989.
[13] F. Ergun, R. Sinha, and L. Zhang. An improved FPTAS for restricted shortest path. Info. Processing Letters, 83(5):287–291, 2002.
[14] I. Goiri, K. Le, T. D. Nguyen, J. Guitart, J. Torres, and R. Bianchini. GreenHadoop: Leveraging green energy in data-processing frameworks. In Proc. of ACM EuroSys, pages 57–70, Bern, Switzerland, Apr 2012.
[15] T. F. Gonzalez, editor. Handbook of Approximation Algorithms and Metaheuristics. Chapman and Hall/CRC, 2007.
[16] W. Lang and J. M. Patel. Energy management for MapReduce clusters. Proceedings of the VLDB Endowment, 3(1):129–139, 2010.
[17] Y. C. Lee, H. Han, A. Y. Zomaya, and M. Yousif. Resource-efficient workflow scheduling in clouds. Elsevier Knowledge-Based Systems, 80:153–162, 2015.
[18] Y. C. Lee and A. Y. Zomaya. Energy conscious scheduling for distributed computing systems under different operating conditions. IEEE TPDS, 22(8):1374–1381, 2011.
[19] J. Li, C. Pu, Y. Chen, V. Talwar, and D. Milojicic. Improving preemptive scheduling with application-transparent checkpointing in shared clusters. In Proc. of ACM Middleware, pages 222–234, Vancouver, BC, Canada, Dec 2015.
[20] K. Li. Scheduling precedence constrained tasks with reduced processor energy on multiprocessor computers. IEEE Trans. on Computers, 61(12):1668–1681, 2012.
[21] K. Makarychev and D. Panigrahi. Precedence-constrained scheduling of malleable jobs with preemption. In Proc. of ICALP, pages 823–834, Copenhagen, Denmark, Jul 2014.
[22] L. Mashayekhy, M. M. Nejad, D. Grosu, Q. Zhang, and W. Shi. Energy-aware scheduling of MapReduce jobs for big data applications. IEEE TPDS, 26(10):2720–2733, 2015.
[23] V. Nagarajan, J. Wolf, A. Balmin, and K. Hildrum. FlowFlex: Malleable scheduling for flows of MapReduce jobs. In Proc. of ACM/IFIP/USENIX Middleware, pages 103–122, Beijing, China, Dec 2013.
[24] P. Sanders and J. Speck. Energy efficient frequency scaling and scheduling for malleable tasks. In Proc. of Euro-Par, pages 167–178, Rhodes Island, Greece, Aug 2012.
[25] X. Xu, W. Dou, X. Zhang, and J. Chen. EnReal: An energy-aware resource allocation method for scientific workflow executions in cloud environment. IEEE Trans. on Cloud Comp., 4(2):166–179, 2016.
[26] L. Zhang, K. Li, C. Li, and K. Li. Bi-objective workflow scheduling of the energy consumption and reliability in heterogeneous computing systems. Elsevier Info. Sci., in press.
[27] L. Zhang, K. Li, Y. Xu, J. Mei, F. Zhang, and K. Li. Maximizing reliability with energy conservation for parallel task scheduling in a heterogeneous cluster. Elsevier Info. Sci., 319:113–131, 2015.
[28] Q. Zhu, J. Zhu, and G. Agrawal. Power-aware consolidation of scientific workflows in virtualized environments. In Proc. of ACM/IEEE SC, pages 1–12, New Orleans, LA, USA, Nov 2010.