=Paper=
{{Paper
|id=Vol-1800/paper5
|storemode=property
|title=Energy-efficient Mapping of Big Data Workflows under Deadline Constraints
|pdfUrl=https://ceur-ws.org/Vol-1800/paper5.pdf
|volume=Vol-1800
|authors=Tong Shu,Chase Wu
|dblpUrl=https://dblp.org/rec/conf/sc/ShuW16
}}
==Energy-efficient Mapping of Big Data Workflows under Deadline Constraints==
Tong Shu and Chase Q. Wu
Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
{ts372, chase.wu}@njit.edu

WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah
*Copyright held by the author(s).

ABSTRACT

Large-scale workflows for big data analytics have become a main consumer of energy in data centers, where moldable parallel computing models such as MapReduce are widely applied to meet high computational demands with time-varying computing resources. The granularity of task partitioning in each moldable job of such big data workflows has a significant impact on energy efficiency, which remains largely unexplored. In this paper, we analyze the properties of moldable jobs and formulate a workflow mapping problem to minimize the dynamic energy consumption of a given workflow request under a deadline constraint. Since this problem is strongly NP-hard, we design a fully polynomial-time approximation scheme (FPTAS) for a special case with a pipeline-structured workflow on a homogeneous cluster and a heuristic for the generalized problem with an arbitrary workflow on a heterogeneous cluster. The performance superiority of the proposed solution in terms of dynamic energy saving and deadline missing rate is illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms.

Keywords: Big Data; Workflow Mapping; Green Computing

1 Introduction

Next-generation applications in science, industry, and business domains are producing colossal amounts of data, now frequently termed "big data", which must be analyzed in a timely manner for knowledge discovery and technological innovation. Among many practical computing solutions, workflows have been increasingly employed as an important technique for big data analytics, and consequently such big data workflows have become a main consumer of energy in data centers. Most existing efforts on green computing focused on independent MapReduce jobs and traditional workflows comprised of serial programs; the energy efficiency of big data workflows in Hadoop systems still remains largely unexplored.

Modern computing systems achieve energy saving mainly through two types of techniques: task consolidation, which reduces static energy consumption (SEC) by turning off idle servers, and load balancing, which reduces dynamic energy consumption (DEC) through dynamic voltage and frequency scaling (DVFS), or a combination of both. However, these techniques are not sufficient to address the energy efficiency issue of big data workflows because i) frequently switching a server on and off may reduce its lifespan or cause unnecessary peaks of energy consumption, and ii) DVFS may not always be available on all servers in a cluster. Therefore, we direct our efforts to workflow mapping for dynamic energy saving by adaptively determining the degree of parallelism in each MapReduce job to mitigate the workload overhead while meeting a given performance requirement.

Parallel jobs are generally categorized into three classes with flexibility from low to high: rigid jobs, exemplified by multi-threaded programs running on a fixed number of processors; moldable jobs, exemplified by MapReduce programs running on any number of processors decided prior to execution; and malleable jobs, running on a variable number of processors at runtime [11]. A moldable job typically follows a performance model where the workload of each component task decreases, and the total workload, proportional to DEC, increases, as the number of allotted processors increases [15]. The validity of this model has been verified by many real-life parallel programs in various big data domains, and it serves as a basis of our workflow mapping solution for energy saving of big data workflows.

In this paper, we construct analytical cost models and formulate a workflow mapping problem to minimize the DEC of a workflow under deadline and resource constraints in a Hadoop cluster. This problem is strongly NP-hard because a subproblem, which minimizes the makespan of independent jobs on identical machines under a single resource constraint without considering energy cost, has been proved to be strongly NP-hard [12]. In our problem, it is challenging to balance the trade-off between energy cost and execution time of each component job to determine their respective completion times in MapReduce workflows, despite several previous efforts in traditional workflows, such as the partial critical path method [4].

We start with a special case with a pipeline-structured workflow (a set of linearly arranged jobs with a dependency between any two neighbors along the line) on a homogeneous cluster. We prove this special case to be weakly NP-complete and design a fully polynomial-time approximation scheme (FPTAS) whose time complexity is linear in 1/ε. By leveraging the near optimality and low time complexity of our FPTAS, we design a heuristic for the generalized problem with a directed acyclic graph (DAG)-structured workflow on a heterogeneous cluster. This heuristic iteratively selects the longest chain of unmapped jobs from the workflow and applies our FPTAS to the selected pipeline while taking machine heterogeneity into consideration.

In sum, our work makes the following contributions to the field.
• To the best of our knowledge, our work is among the first to study energy-efficient mapping of big data workflows comprised of moldable jobs in Hadoop systems.
• We prove a deadline-constrained pipeline-structured workflow mapping problem for minimum total (energy) cost to be weakly NP-complete and design an FPTAS for it.
• The performance superiority of the proposed heuristic for the general workflow mapping problem in terms of dynamic energy saving and deadline missing rate is illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms.

The rest of the paper is organized as follows. Section 2 provides a survey of related work. Section 3 formulates a big data workflow mapping problem. We prove a special case to be weakly NP-complete and design an FPTAS for it in Section 4, and design a heuristic for the generalized problem in Section 5. Section 6 evaluates the performance.

2 Related Work

A large number of research efforts have been made to optimize the data replication scheme in the Hadoop distributed file system (HDFS) so that data nodes can be turned off without affecting data availability [16, 5, 8]. Our research on job scheduling is orthogonal to these efforts, and adds an additional level of energy efficiency to Hadoop systems.

2.1 Energy-efficient Job Scheduling in Hadoop

2.1.1 Heterogeneous Computing Environments

Since servers in large-scale clusters are typically upgraded or replaced in an incremental manner, many techniques consider hardware heterogeneity of Hadoop clusters for energy saving. Cardosa et al. proposed static virtual machine (VM) placement algorithms to minimize the cumulative machine uptime of all physical machines (PMs), based on two principles: spatial fitting of VMs on PMs to achieve high resource utilization according to complementary resource requirements from VMs, and temporal fitting of PMs with VMs having similar runtimes to ensure that a server runs at a high utilization level throughout its uptime [6]. Mashayekhy et al. modeled the energy-aware static task scheduling of a MapReduce job as an integer programming problem, and designed two heuristics that assign map/reduce tasks to machine slots to minimize energy consumption while satisfying the service level agreement (SLA) [22]. Cheng et al. proposed a heterogeneity-aware dynamic task assignment approach using ant colony optimization, referred to as E-Ant, to minimize the overall energy consumption of MapReduce applications with heterogeneous workloads in a heterogeneous Hadoop cluster without a priori knowledge of workload properties [9].

2.1.2 Renewable Energy

Several efforts focused on utilizing renewable energy in the operation of Hadoop clusters. Goiri et al. proposed a framework, GreenHadoop, for a data center powered by renewable (green) energy and by carbon-intensive (brown) energy from the electrical grid as a backup. It dynamically schedules MapReduce jobs to minimize brown energy consumption by delaying background computations within their time bounds to match the green energy supply that is not always available [14]. Cheng et al. designed a scheduler for a Hadoop cluster powered by mixed brown and green energy, which dynamically determines resource allocation to heterogeneous jobs based on the estimation of job completion time and the prediction of future resource availability [10].

2.1.3 Resource Allocation

The majority of existing efforts targeted the first generation of Hadoop; work on the second generation of Hadoop, i.e. YARN, is still quite limited. Li et al. proposed a suspend-resume mechanism in YARN to mitigate the overhead of preemption in cluster scheduling, and used a checkpointing mechanism to save the states of jobs for resumption [19]. Their approach dynamically selects appropriate preemption mechanisms based on the progress of a task and its suspend-resume overhead to improve job response time and reduce energy consumption.

2.2 Energy-efficient Workflow Scheduling

Many efforts were made on energy-efficient scheduling of workflows comprised of precedence-constrained serial programs. Some of these approaches targeted virtualized environments by migrating active VMs onto energy-efficient PMs in time [25] or consolidating applications with complementary resource requirements [28]. Zhu et al. developed a workflow scheduling framework, pSciMapper, which consists of two major components: i) online power-aware consolidation, based on available information on the utilization of CPU, memory, disk, and network by each job, and ii) offline analysis, including a hidden Markov model for estimating resource usage per job and kernel canonical correlation analysis for modeling the resource-time and resource-power relationships [28].

Other approaches focused on physical clusters as follows. Lee et al. proposed a static workflow schedule compaction algorithm to consolidate the resource use of a workflow schedule generated by any scheduling algorithm in homogeneous environments [17], and designed two static energy-conscious workflow scheduling algorithms based on DVFS in heterogeneous distributed systems [18]. In [20], three types of DVFS-based heuristics, namely prepower-determination, postpower-determination, and hybrid algorithms, were designed to solve a static problem of joint power allocation and workflow scheduling for schedule length (or energy consumption) minimization under an energy constraint (or a time constraint). Zhang et al. proposed a DVFS-based heuristic to statically maximize workflow reliability under an energy constraint in a heterogeneous cluster [27], and designed a Pareto-based bi-objective genetic algorithm to achieve low energy consumption and high system reliability for static workflow scheduling [26].

2.3 Moldable/Malleable Job Scheduling

Some efforts have been made to minimize the completion time of a workflow comprised of malleable jobs [23, 7, 21], but there exist relatively limited efforts on moldable/malleable job scheduling for energy efficiency. Sanders et al. designed a polynomial-time optimal solution and an FPTAS to statically schedule independent malleable jobs with a common deadline for energy consumption minimization, based on the theoretical power models of a single processor using the DVFS technology, i.e. p = f^α and p = f^α + δ, respectively, where f is the CPU frequency and δ is the constant static power consumption [24].

3 Problem Formulation

3.1 Cost Models

3.1.1 Cluster Model

We consider a heterogeneous Hadoop cluster consisting of a set M of machines connected via high-speed switches, which can be partitioned into homogeneous sub-clusters {C_l}.

Table 1: Notations used in the cost models.
  M = ∪_l C_l : a cluster of machines divided into homogeneous sub-clusters {C_l}
  m_i(N_i, p_i, o_i, P_i) : the i-th machine, equipped with a memory of size o_i and N_i CPU cores of speed p_i and DPC P_i per core at full utilization
  R : the available resource-time table of cluster M
  N_i^A(t) : the number of idle CPU cores on machine m_i at time t
  o_i^A(t) : the size of available memory on machine m_i at time t
  f(G(V, A), d) : a workflow request consisting of a workflow structure of a DAG G(V, A) and a deadline d
  v_j, s_{j,k} : the j-th component job in a workflow and the k-th task in job v_j
  a_{j,j'} : the directed edge from job v_j to job v_{j'}
  t_j^AS, t_j^AF : the actual start and finish time of job v_j
  t^C : the completion time of a workflow
  μ_{i,j} : the percentage of execution time for CPU-bound instructions in job v_j on machine m_i
  o_j : the memory demand per task in job v_j
  w_j(K) : the workload of job v_j partitioned into K tasks
  w_{j,k}(K) : the workload of task s_{j,k} in v_j with K tasks
  K_j, K_j' : the number and the maximum possible number of tasks in v_j
  t_{i,j,k} : the execution time of task s_{j,k} running on machine m_i
  a_{i,j,k}(t) : indicates whether task s_{j,k} is active on machine m_i at time t
  n_{i,j}(t) : the number of running tasks in job v_j on machine m_i at time t
  n_i(t) : the number of CPU cores used by f on machine m_i at time t
  o_i(t) : the size of memory used by workflow f on machine m_i at time t
  E : the DEC of workflow f in cluster M
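The ART table R defined above can be kept as per-machine vectors of idle cores N_i^A(t) and free memory o_i^A(t) over discrete time slots. The following is a minimal Python sketch under that assumption; the class and method names (ARTTable, can_host, allocate) are illustrative, not from the paper.

```python
class ARTTable:
    """Available resource-time (ART) table: per machine, tracks idle CPU
    cores N_i^A(t) and available memory o_i^A(t) over discrete time slots."""

    def __init__(self, machines, horizon):
        # machines: dict machine_id -> (cores N_i, memory o_i)
        self.horizon = horizon
        self.cores = {m: [n] * horizon for m, (n, _) in machines.items()}
        self.mem = {m: [o] * horizon for m, (_, o) in machines.items()}

    def can_host(self, m, start, end, mem_demand):
        # True if machine m has >= 1 idle core and enough memory in [start, end)
        return all(self.cores[m][t] >= 1 and self.mem[m][t] >= mem_demand
                   for t in range(start, end))

    def allocate(self, m, start, end, mem_demand):
        # Reserve one core and mem_demand memory on machine m for [start, end)
        assert self.can_host(m, start, end, mem_demand)
        for t in range(start, end):
            self.cores[m][t] -= 1
            self.mem[m][t] -= mem_demand
```

A scheduler would call can_host before placing each task and allocate once a mapping is committed; real Hadoop/YARN bookkeeping is of course richer than this sketch.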
Each machine m_i is equipped with N_i homogeneous CPU cores of speed p_i and a shared memory of size o_i. For the entire cluster, a central scheduler maintains an available resource-time (ART) table R, which records the number N_i^A(t) ≤ N_i of idle CPU cores and the size o_i^A(t) ≤ o_i of available memory in each machine m_i at time t.

3.1.2 Workflow Model

We consider a user request in the form of a workflow f(G, d), which specifies a workflow structure G and a deadline d. The workflow structure is defined as a DAG G(V, A), where each vertex v_j ∈ V represents a component job, and each directed edge a_{j,j'} ∈ A denotes an execution dependency, i.e. the actual finish time (AFT) t_j^AF of job v_j must not be later than the actual start time (AST) t_{j'}^AS of job v_{j'}. The completion time of the workflow is denoted as t^C. We consider the map and reduce phases of each MapReduce job as two component jobs connected via an execution dependency edge.

3.1.3 MapReduce Model

We consider a MapReduce job v_j running a set of parallel map (or reduce) tasks, each of which requires a memory of size o_j and spends a percentage μ_{i,j} of time executing CPU-bound instructions on a CPU core of machine m_i. In job v_j, generally, as the number K_j of parallel tasks increases, the workload w_{j,k}(K_j) of each task s_{j,k} decreases and the total workload w_j(K_j) = K_j · w_{j,k}(K_j) of all tasks increases. However, the maximum number K_j' of tasks that can be executed in parallel without performance degradation is limited by the cluster capacity, e.g. K_j' ≤ Σ_{m_i ∈ M} min{N_i, ⌊o_i/o_j⌋}. Note that a serial program can be considered as a special case of a MapReduce job with K_j' = 1. The execution time of task s_{j,k} on machine m_i is t_{i,j,k} = w_{j,k}(K_j)/(μ_{i,j} · p_i). Estimating the execution time of a task on any service is an important issue. Many techniques have been proposed, such as code analysis, analytical benchmarking/code profiling, and statistical prediction, which are beyond the scope of this work.

The active state a_{i,j,k}(t) of task s_{j,k} on machine m_i is 1 (or 0) if it is active (or inactive) at time t. The number of active tasks in job v_j on machine m_i at time t is n_{i,j}(t) = Σ_{s_{j,k} ∈ v_j} a_{i,j,k}(t). The number of CPU cores and the size of memory used by all component jobs of a workflow on machine m_i at time t are n_i(t) = Σ_{v_j ∈ V} n_{i,j}(t) and o_i(t) = Σ_{v_j ∈ V} [o_j · n_{i,j}(t)], respectively.

3.1.4 Energy Model

The DEC of a workflow in a cluster is

  E = Σ_{m_i ∈ M} { P_i · Σ_{v_j ∈ V} [ μ_{i,j} · ∫_0^{t^C} n_{i,j}(t) dt ] },

where P_i is the dynamic power consumption (DPC) of a fully utilized CPU core. This model is validated by energy measurements of practical systems in [9].

3.1.5 Mapping Function

We define a workflow mapping function as M : {s_k(v_j) --[t_{j,k}^S, t_{j,k}^E]--> m_i, ∀v_j ∈ V, ∃m_i ∈ M, ∃[t_{j,k}^S, t_{j,k}^E] ⊂ T}, which denotes that the k-th task of the j-th job is mapped onto the i-th machine from time t_{j,k}^S to time t_{j,k}^E. The domain of this mapping function covers all possible combinations of a set V of component jobs of the workflow, a set M of machines, and a time period T of workflow execution.

3.2 Problem Definition

We formulate a deadline- and resource-constrained workflow mapping problem for energy efficiency (EEWM):

Definition 1. EEWM: Given a cluster {m_i(N_i, p_i, o_i, P_i)} of machines with an available resource-time table {N_i^A(t), o_i^A(t)}, and a workflow request f(G(V, A), d), where each job v_j has a set {w_j(K_j) | K_j = 1, 2, ..., K_j'} of workloads for different task partitions, and each task in job v_j has a percentage μ_{i,j} of execution time for CPU-bound instructions on machine m_i and a memory demand o_j, we wish to find a mapping function M : (V, M, T) → {s_k(v_j) --[t_{j,k}^S, t_{j,k}^E]--> m_i} to minimize the dynamic energy consumption

  min_M E,

subject to the following deadline and resource constraints:

  t^C ≤ d,
  t_j^AF ≤ t_{j'}^AS, ∀a_{j,j'} ∈ A,
  n_i(t) ≤ N_i^A(t), ∀m_i ∈ M,
  o_i(t) ≤ o_i^A(t), ∀m_i ∈ M.
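Under the energy model of Section 3.1.4, with piecewise-constant task activity the integral of n_{i,j}(t) reduces to the sum of task durations, so the objective E of Definition 1 can be evaluated directly from a mapping. A minimal sketch follows; the flat tuple encoding of the mapping is a hypothetical convenience, not the paper's notation.

```python
def dynamic_energy(mapping, P, mu):
    """DEC of a mapped workflow: E = sum_i P_i * sum_j mu[i][j] * (total
    busy core-time of job j's tasks on machine i). With piecewise-constant
    activity, the integral of n_ij(t) is the sum of task durations."""
    E = 0.0
    for (job, task, machine, start, end) in mapping:
        E += P[machine] * mu[(machine, job)] * (end - start)
    return E
```

For example, one task of duration 3 on a machine with P_i = 90 W and μ = 0.5 contributes 90 · 0.5 · 3 = 135 units of dynamic energy.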
4 Special Case: Pipeline-structured Workflow

We start with a special case with a PIpeline-structured workflow running on HOmogeneous machines (PHO). We prove it to be weakly NP-complete and design an FPTAS to solve EEWM-PHO. Generally, we may achieve more energy savings on an under-utilized cluster than on a fully utilized cluster. Hence, the problem for a single pipeline-structured workflow is still valuable in real-life systems. The EEWM-PHO problem is defined as follows.

Definition 2. EEWM-PHO: Given I idle homogeneous machines {m_i(N, p, o, P)} and a workflow f(G(V, A), d) containing a chain of J jobs, where each job v_j has a workload list {w_j(K_j) | K_j = 1, 2, ..., K_j'}, and each task in job v_j has a percentage μ_j of execution time for CPU-bound instructions and a memory demand o_j, does there exist a feasible mapping scheme such that the DEC is no more than E?

Figure 1: A constructed network corresponding to a workflow with a pipeline structure. (The network has a source u_0 and a destination u_J; each job v_j contributes vertices v_{j,1}, ..., v_{j,K_j'} between u_{j-1} and u_j.)

4.1 Complexity Analysis

We prove that EEWM-PHO is NP-complete by reducing the two-choice knapsack problem (TCKP) to it.

Definition 3. Two-Choice Knapsack (TCKP): Given J classes of items to pack in a knapsack of capacity H, where each class C_j (j = 1, 2, ..., J) has two items and each item r_{j,l} (l = 1, 2) has a value b_{j,l} and a weight h_{j,l}, is there a choice of exactly one item from each class such that the total value is no less than B and the total weight does not exceed H?

The knapsack problem is a special case of TCKP obtained by putting each item in the knapsack problem together with a dummy item of zero value and zero weight into a class. Since the knapsack problem is NP-complete, so is TCKP.

Theorem 1. EEWM-PHO is NP-complete.

Proof. Obviously, EEWM-PHO ∈ NP. We prove that EEWM-PHO is NP-hard by reducing TCKP to EEWM-PHO. Let ({C_j(b_{j,1}, h_{j,1}, b_{j,2}, h_{j,2}) | 1 ≤ j ≤ J}, B, H) be an instance of TCKP. Without loss of generality, we assume that b_{j,1} > b_{j,2} and h_{j,1} > h_{j,2} > 0. If h_{j,1} < h_{j,2}, r_{j,1} would always be selected. If h_{j,2} = 0, we can always add τ > 0 to h_{j,1}, h_{j,2} and H such that h_{j,2} > 0.

We construct an instance of EEWM-PHO as follows. Let I = 2, d = H, v_j = C_j, K_j' = 2, o_j = o, w_j(1) = h_{j,1} μ_j p, w_j(2) = 2 h_{j,2} μ_j p, μ_j = (B_j − b_{j,1})/(h_{j,1} P), and E = Σ_{1≤j≤J} B_j − B, where B_j = (2 h_{j,2} b_{j,1} − h_{j,1} b_{j,2})/(2 h_{j,2} − h_{j,1}). It is obvious that the construction process can be done in polynomial time.

Then, if job v_j has only one task, its execution time is t_j(1) = w_j(1)/(μ_j p) = h_{j,1}, and its DEC is E_j(1) = t_j(1) μ_j P = B_j − b_{j,1}. If job v_j has two tasks, the execution time of each task is t_j(2) = w_j(2)/(2 μ_j p) = h_{j,2}, and the DEC of job v_j is E_j(2) = 2 t_j(2) μ_j P = B_j − b_{j,2}. Obviously, two tasks in a job are mapped onto two machines simultaneously.

Therefore, if the answer to the given instance of TCKP is YES (or NO), the answer to the constructed instance of EEWM-PHO is also YES (or NO). This completes the proof.

4.2 Approximation Algorithm

We prove that EEWM-PHO is weakly NP-complete and design an FPTAS, as shown in Alg. 1, by reducing this problem to the weakly NP-complete restricted shortest path (RSP) problem [13], which is solvable with an FPTAS.

Given an instance of EEWM-PHO, we construct an instance of RSP according to the pipeline-structured workflow as follows. As illustrated in Fig. 1, the network graph G consists of V = {v_{j,k} | j = 1, ..., J, k = 1, ..., K_j'} ∪ {u_0, u_j | j = 1, ..., J} with a source u_0 and a destination u_J, and E = {e_{2j−1,k}, e_{2j,k} | j = 1, ..., J, k = 1, ..., K_j'}, where e_{2j−1,k} = (u_{j−1}, v_{j,k}) and e_{2j,k} = (v_{j,k}, u_j). Then, we calculate the execution time of job v_j with k tasks as t_j(k) = w_j(k)/(k · p · μ_j), and accordingly its DEC as E_j(k) = k · P · μ_j · t_j(k). Subsequently, we assign the cost c(e) and delay l(e) of each edge e ∈ E as c(e_{2j−1,k}) = E_j(k), l(e_{2j−1,k}) = t_j(k), and c(e_{2j,k}) = l(e_{2j,k}) = 0, and set the delay constraint on a path from u_0 to u_J to be d. As a result, the minimum cost in RSP is exactly the minimum DEC in EEWM-PHO, and if v_{j,k} is on the solution path to RSP, the j-th job has k tasks.

Algorithm 1: EEWM-PHO-FPTAS
Input: A cluster {m_i(N, p, o, P)} and a chain of jobs {v_j} with a deadline d and a set {w_j(K_j)} of workloads
1: Construct a DAG G(V, E) for pipeline {v_j} as shown in Fig. 1, and assign cost E_j(k) and delay t_j(k) to edge e_{2j−1,k} and zero cost and zero delay to edge e_{2j,k};
2: Use the FPTAS in [13] to find the minimum-cost path from u_0 to u_J under delay constraint d with approximation ratio (1 + ε), and convert it to a mapping scheme.

Based on Theorem 1 and the above reduction, we have

Theorem 2. EEWM-PHO is weakly NP-complete.

Let K' = max_{1≤j≤J} K_j'. Then, |V| ≤ JK' + J + 1 and |E| ≤ 2JK' in the constructed graph G. It is obvious that the construction process can be done within time O(JK'). Therefore, EEWM-PHO finds a feasible solution that consumes energy within the least DEC multiplied by (1 + ε) in time O(J²K'²/ε) if the FPTAS in [13] is used to solve RSP in acyclic graphs. Thanks to the special topology in Fig. 1, the time complexity is further reduced to O(JK'(log K' + 1/ε)).
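The core of Alg. 1 is a delay-constrained minimum-cost path search over the chain of per-job options. As an illustration, this chain case can be solved exactly by a pseudo-polynomial dynamic program over integer delays; this is a simplified stand-in for the FPTAS of [13], not that algorithm itself (the cost rounding that yields the (1 + ε) guarantee is omitted).

```python
def min_cost_under_deadline(options, d):
    """Exact pseudo-polynomial DP for the chain-structured RSP instance:
    options[j] = list of (delay, cost) pairs for job j (one per task count);
    pick one option per job so the total delay is <= d and the total cost
    (DEC) is minimal. Returns None if no feasible choice exists."""
    INF = float("inf")
    best = [0.0] + [INF] * d          # best[t] = min cost with total delay t
    for opts in options:
        nxt = [INF] * (d + 1)
        for t in range(d + 1):
            if best[t] < INF:
                for delay, cost in opts:
                    if t + delay <= d and best[t] + cost < nxt[t + delay]:
                        nxt[t + delay] = best[t] + cost
        best = nxt
    ans = min(best)
    return None if ans == INF else ans
```

The DP runs in O(J · d · K') time, pseudo-polynomial in the deadline d; the FPTAS replaces the exact delay axis with a rounded one to achieve the stated polynomial bound.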
5 Algorithm Design for an Arbitrary Workflow on a Heterogeneous Cluster

We consider EEWM with a DAG-structured workflow on a heterogeneous cluster and design a heuristic algorithm, referred to as big-data adaptive workflow mapping for energy efficiency (BAWMEE).

5.1 An Overview of BAWMEE

The key idea of BAWMEE is to partition a DAG into a set of pipelines and then repeatedly employ Alg. 1, with its near optimality and low time complexity, to achieve energy-efficient mapping of each pipeline.

In BAWMEE, each workflow mapping consists of two components: iterative critical path (CP) selection and pipeline mapping. A CP is the longest execution path in a workflow, which can be calculated in linear time. The algorithm starts with computing an initial CP according to the average execution time of each job running in serial on all the machines, followed by a pipeline mapping process. Then, it iteratively computes a CP with the earliest last finish time (LFT) from the remaining unmapped workflow branches, based on the same average execution time of a job as above, and performs a pipeline mapping of the computed CP until no branches are left.

In pipeline mapping, we consider two extreme scenarios: resource/time sufficiency and resource/time insufficiency. In the former case, we only need to focus on energy efficiency, while in the latter case, it may be unlikely to meet the performance requirement. Therefore, we design one algorithm for each of these two scenarios: a heuristic for energy-efficient pipeline mapping (EEPM) under a deadline constraint in Alg. 3, which calls Alg. 1, and a heuristic for minimum delay pipeline mapping (MDPM) with energy awareness in Alg. 4. If Alg. 3 fails to find a feasible mapping scheme due to limited resources, we resort to Alg. 4. In EEPM, due to the homogeneity of tasks in a job, we map all the tasks in the same job onto a homogeneous sub-cluster, hence using Alg. 1 to balance the trade-off between execution time and DEC (directly associated with total workload) for each job on a pipeline. In MDPM, we search for a good task partitioning to minimize the end time of each job through a limited number of tries by reducing the possible numbers of tasks in each job v_j from {1, 2, 3, ..., K_j'} to {1, 2, 2², ..., 2^⌊log K_j'⌋} ∪ {K_j'}.

Each job v_j is associated with a set of pairs of the number K_{j,n} of tasks and the used homogeneous sub-cluster C_{j,n}. Each pair corresponds to a certain execution time t_j(K_{j,n}, C_{j,n}) and DEC e_j(K_{j,n}, C_{j,n}) = P(C_{j,n}) · w_j(K_{j,n})/p(C_{j,n}), where p(C_{j,n}) and P(C_{j,n}) are the speed and the DPC of a fully utilized CPU core on a machine in C_{j,n}, respectively, and w_j(K_{j,n}) is the workload of v_j with K_{j,n} tasks. All the quadruples {(t_j(K_{j,n}, C_{j,n}), e_j(K_{j,n}, C_{j,n}), K_{j,n}, C_{j,n})} are sorted in ascending order of execution time, as listed in Table 2, and are referred to as the time-energy table (TET) Tbl_j of job v_j. Any quadruple with both execution time and DEC larger (worse) than those of another is deleted from Tbl_j.

Table 2: Time-Energy Table Tbl_j of Job v_j.
  t_j(K_{j,1}, C_{j,1}) < t_j(K_{j,2}, C_{j,2}) < ... < t_j(K_{j,n}, C_{j,n})
  e_j(K_{j,1}, C_{j,1}) > e_j(K_{j,2}, C_{j,2}) > ... > e_j(K_{j,n}, C_{j,n})
  K_{j,1}, K_{j,2}, ..., K_{j,n} ∈ [1, K_j']
  C_{j,1}, C_{j,2}, ..., C_{j,n} ⊂ M

Algorithm 2: BAWMEE
Input: a workflow f(G(V, A), d) and an ART table R for sub-clusters {C_l}
1: Tbl ← buildTET(V, {C_l});
2: if simplyMap(f, R({C_l}), Tbl) = True then
3:   return;
4: t_j^LF ← +∞ for ∀v_j ∈ f; t_J^LF ← d for the end job v_J in f;
5: Calculate the average execution time t̄_j of each job v_j running in serial on all the machines;
6: G' ← G;
7: while ∃ an unmapped job ∈ V do
8:   Find the critical path cp ending at a job v with the earliest LFT in G' according to {t̄_j | v_j ∈ G'};
9:   if EEPM(cp, R({C_l}), Tbl) = False then
10:    v ← MDPM(cp, R({C_l}));
11:    if v ≠ Null then
12:      D ← {all the downstream jobs of v in G − G'};
13:      if D ≠ ∅ then
14:        Cancel the mapping of each job v' ∈ D, and add v' and its associated precedence constraints to G';
15:      EAJM(v, R({C_l}));
16:  G' ← G' − {v_j ∈ cp | v_j is mapped};

Algorithm 3: EEPM
Input: a pipeline pl with its EST pl.est and LFT pl.lft, an ART table R({C_l}), and TETs {Tbl_j}
Output: a boolean variable to indicate whether pl or its part is mapped
1: Label the index j of each job in pl from 1 to the length of pl;
2: Calculate the earliest possible start time of the first job in pl as est according to R({C_l});
3: pl.est ← max{est, pl.est};
4: if Σ_{v_j ∈ pl} t_j(K_{j,1}, C_{j,1}) > pl.lft − pl.est then
5:   return False;
6: Convert pipeline pl, where each quadruple in Tbl_j of each job v_j ∈ pl corresponds to one of its mapping options, into a network graph in RSP;
7: Use Alg. 1 to calculate the number K_j of tasks, sub-cluster C(v_j), and start and finish times, t_j^S and t_j^F, for each job v_j;
8: for v_{j+1} ∈ pl do
9:   if t_j^F > t'_LF(v_j) or t_j^F < t'_ES(v_{j+1}) then
10:    pl(1, j).est ← pl.est;
11:    if t_j^F > t'_LF(v_j) then
12:      pl(1, j).lft ← t'_LF(v_j);
13:    else
14:      pl(1, j).lft ← min{t'_ES(v_{j+1}), t'_LF(v_j), pl.lft};
15:    return EEPM(pl(1, j), R({C_l}), Tbl);
16: if ∃ K_j pairs of a CPU core and memory of size o_j in R(C(v_j)) for ∀v_j ∈ pl then
17:  Map all K_j tasks onto C(v_j) from t_j^S to t_j^F for ∀v_j ∈ pl;
18:  return True;
19: return False;

Algorithm 4: MDPM
Input: a pipeline pl and an ART table R for {C_l}
Output: the first job that cannot be mapped
1: for all v_j ∈ pl do
2:   if EAJM(v_j, R({C_l})) > t'_LF(v_j) then
3:     Cancel the mapping of job v_j;
4:     return v_j;
5: return Null.

Algorithm 5: EAJM
Input: a job v_j and an ART table R for sub-clusters {C_l}
Output: the EFT t_j^EF of job v_j
1: Update the TEST t'_ES(v_j); t_j^EF ← +∞;
2: for K ← 1, 2, 4, ..., 2^⌊log K_j'⌋, K_j' do
3:   Calculate the EFT t_j^EF(K) of job v_j with K tasks by minimizing the finish time of each task one by one;
4:   if t_j^EF > t_j^EF(K) then
5:     t_j^EF ← t_j^EF(K); K_j ← K;
6: Map job v_j consisting of K_j tasks until t_j^EF;
7: return t_j^EF.

Figure 2: An example of a workflow structure G (eight jobs with two parallel chains: v_1 → v_2 → v_4 → v_6 → v_8 and v_1 → v_3 → v_5 → v_7 → v_8).

Table 3: Time-Energy Table in the Example
  All options (time, energy, # of tasks, sub-cluster):
    (3, 6, 1, C1), (2, 8, 2, C1), (5, 5, 1, C2), (4, 8, 2, C2)
  ⇒ TET after pruning and sorting:
    (2, 8, 2, C1), (3, 6, 1, C1), (5, 5, 1, C2)

Figure 3: Workflow mapping in the example: (a) BAWMEE; (b) Optimal.

If a job v_j has been mapped, it has AST t_j^AS and AFT t_j^AF. If all the preceding (and succeeding) jobs, in Prec (and Succ), of job v_j are mapped, its earliest start time (EST) (and LFT) can be calculated as

  t_j^ES = 0, if v_j is the start job of workflow f; max_{v_{j'} ∈ Prec(v_j)} t_{j'}^AF, otherwise,

and

  t_j^LF = d, if v_j is the end job of workflow f; min_{v_{j'} ∈ Succ(v_j)} t_{j'}^AS, otherwise,

respectively. If there exist unmapped preceding and succeeding jobs of v_j, its temporary earliest start time (TEST) t'_ES(v_j) and temporary last finish time (TLFT) t'_LF(v_j) can be calculated based on only its mapped preceding and succeeding jobs, respectively. The EST and LFT of a pipeline are the EST of its first job and the LFT of its end job, respectively.

5.2 Algorithm Description

In Alg. 2, BAWMEE first builds a time-energy table for each job by calling buildTET(). If the workflow cannot meet its deadline with each job running the fastest, BAWMEE performs energy-aware job mapping (EAJM) with minimum finish time for each job in a topologically sorted order by calling simplyMap(). Otherwise, BAWMEE employs iterative CP selection to find a CP with the earliest LFT from the unmapped jobs, and performs EEPM or MDPM (if EEPM fails) for the selected CP. If there is any job that cannot be mapped in MDPM, we cancel the mapping of its downstream jobs. If it is the last job of the workflow, we perform EAJM with minimum finish time.
In Alg. 3 of EEPM, we reset the EST for the input pipeline according to the earliest time at which enough resources are made available to its first job. If the pipeline cannot meet its LFT with each job running the fastest, we exit EEPM; otherwise, the mapping of a pipeline with its EST and LFT is converted into the RSP problem with a relaxed resource limit. Accordingly, we calculate the number of tasks, the sub-cluster, and the start/finish time for each job using Alg. 1. Then, we check whether the start and finish times of each job are between its TEST and TLFT in their execution order. If there exists a job that violates the precedence constraint, we divide the pipeline at this job, and use Alg. 3 to compute the mapping of the upstream sub-pipeline with an updated LFT constraint. We repeat this process until we find a sub-pipeline whose mapping meets all precedence constraints. If the cluster is able to provide each job in this sub-pipeline with enough resources based on the mapping result of Alg. 1, we proceed with this mapping; otherwise, we fail to find an EEPM and thus exit. In this case, BAWMEE would proceed to search for an MDPM.

In Alg. 4 of MDPM, we search for the earliest finish time (EFT) of each job using EAJM in their execution order, and thus obtain the EFT of the entire pipeline. In Alg. 5 of EAJM with minimum finish time under resource constraints, we exponentially relax the limit on the maximum number of tasks in a job to make a tradeoff between the optimality and the time complexity of EAJM.

Since EEPM and MDPM are of O(J²K'L[log(K'L) + 1/ε] + M'H) and O(M'HJK' log K') time complexity, respectively, the time complexity of BAWMEE is O(J²K'[JL(1/ε + log(K'L)) + M'H log K']). Here, M' is the number of machines; L is the number of homogeneous sub-clusters; J is the number of jobs; K' is the maximum number of tasks in a job; and H is the number of time slots in the ART table.

5.3 A Numerical Example

In this subsection, we use a simple example to illustrate how BAWMEE achieves energy-efficient workflow mapping without violating precedence constraints. We consider an idle cluster M = C1 ∪ C2 consisting of 4 single-core machines, where C1 = {m1, m2} and C2 = {m3, m4}, which receives a workflow f comprised of homogeneous jobs organized as shown in Fig. 2 with a deadline of 19 time units. The execution time and DEC of a job with different task partitionings on different sub-clusters are calculated and listed on the left side of Table 3. BAWMEE first builds a TET for each job, shown on the right side of Table 3. A pipeline {v1, v2, v4, v6, v8} is selected as the initial CP. We assume that ε is set to 0.02. In an approximation solution of pipeline mapping with an EST of 0 and an LFT of 19, each job has only one task; v1, v2 and v6 are mapped onto machine m1 in C1 from 0 to 3, from 3 to 6, and from 11 to 14, respectively, and v4 and v8 are mapped onto machine m3 in C2 from 6 to 11 and from 14 to 19, respectively.

Then, the second pipeline {v3, v5, v7} is selected as the CP in G − {v1, v2, v4, v6, v8}. In an approximation solution of pipeline mapping with an EST of 3 and an LFT of 14, v3 intends to have one task and be mapped onto C2 from 3 to 8, and v5 and v7 intend to have one task and be mapped onto C1 from 8 to 11 and from 11 to 14, respectively. Since v3 misses its TLFT of 6, the first sub-pipeline {v3} of {v3, v5, v7} is extracted, and the approximation solution of sub-pipeline mapping with an EST of 3 and an LFT of 6 is that v3 has one task and is mapped onto machine m2 in C1 from 3 to 6. Subsequently, the third pipeline {v5, v7} is selected as the CP in G − {v1, v2, v3, v4, v6, v8}, and the approximation solution of its mapping with an EST of 6 and an LFT of 14 is that v5 intends to have one task and be mapped onto C2 from 6 to 9 and v7 intends to have one task and be mapped onto C1 from 9 to 14. Since v7 starts before its TEST of 11, the first sub-pipeline {v5} of {v5, v7} is extracted, and the approximation mapping solution of the sub-pipeline with an EST of 6 and an LFT of 11 is that v5 has one task and is mapped onto machine m4 in C2 from 6 to 11. Finally, the fourth pipeline {v7} is selected as the CP in G − {v1, v2, v3, v4, v5, v6, v8}, and the approximation solution of its mapping with an EST of 11 and an LFT of 14 is that v7 has one task and is mapped onto machine m2 in C1 from 11 to 14. The mapping result of BAWMEE is shown in Fig. 3(a), and its DEC is 45 units. The optimal mapping is shown in Fig. 3(b), and the minimum DEC is 44 units.

6 Performance Evaluation

Table 4: Specifications for Four Types of Machines
  Mach. Type | CPU Model | # of cores | Freq. (GHz) | DPC per core (W) | Mem. (GB)
  1 | 6-core Xeon E7450 | 18 | 2.40 | 90 | 64
  2 | Single Core Xeon | 6 | 3.20 | 92 | 64
  3 | 2-Core Xeon 7150N | 12 | 3.50 | 150 | 64
  4 | Itanium 2 9152M | 8 | 1.66 | 104 | 64

Figure 4: The execution time of a MapReduce job versus the number of tasks (x-axis: the number of map tasks, 22, 29, 51, 96, 187; y-axis: execution time in minutes).

The number of precedence constraints of the workflow is set to 1.5 times the number of jobs, if possible. The maximum possible number of tasks for each job is randomly selected between 12 and 48. The workload of a job is randomly selected between 0.6 × 10^12 and 21.6 × 10^12 CPU cycles when running in serial. The workload w(k) of a job with k > 1 tasks is randomly selected between w(k − 1)[1 + 0.2/(k − 1)] and w(k − 1)[1 + 0.6/(k − 1)]. We calculate the sum t1 of the average execution time of the serial jobs on the critical path and the sum t2 of the average execution time of all serial
jobs according to the CPU speeds of all types of machines, and randomly select a workflow deadline baseline from the 6 Performance Evaluation time range [t1 , t2 ]. The percentage of execution time for We conduct experiments to illustrate the effect of task CPU-bound instructions of a task in each job on each type partitioning on job workload and conduct simulations to of machine is randomly selected between 0.1 and 1.0. The evaluate the performance of BAWMEE in comparison with memory demand of a task in each job is randomly selected two existing algorithms adapted from different scenarios: i) from 0.5GB to 4GB at an interval of 0.5GB. pSciMapper adapted from a workflow mapping algorithm We evaluate these algorithms in a heterogeneous clus- in [28] by applying the interference avoidance to MapRe- ter consisting of machines with four different specifications duce mapping, and ii) EEDA adapted from a MapReduce listed in Table 4, based on 4 types of Intel processors. Each job mapping algorithm integrated with algorithms in [9] homogeneous sub-cluster has the same number of machines. and [10]. Each scheduling simulation lasts for 3 days and is repeated 6.1 Performance Model Validation 20 times with different workflow instances, whose arrivals We consider a computing performance model where the follow the Poisson distribution. In the performance evalua- total workload of a moldable parallel job increases and the tion, each data point represents the average of 20 runs with execution time of each task decreases as the number of tasks a standard deviation. The parameter ǫ in BAWMEE is set increases. For model validation, we conduct an experiment to 0.2. By default, the workflow size is randomly selected to illustrate the effect of task partitioning on job workload between 40 and 60 jobs; the cluster size and the average ar- for big data applications, which is the foundation of this rival interval of workflows are set to be 128 machines and 30 research. 
Towards this goal, we implement a MapReduce minutes, respectively; the deadline factor, which is a coef- program to find out the most common reason for flight can- ficient multiplied by the deadline baseline to determine the cellations based on the airline on-time performance data set actual workflow deadline, is set to 0.1. from [1] and run this program on a computer server equipped The dynamic energy consumption reduction (DECR) over with 2 processors of Intel(R) Xeon(R) CPU E5-2630 with 6 the other algorithms in comparison is defined as cores of 2.30GHz and 64GB memory. The program execu- DECOther − DECBAW M EE tion time is measured and plotted in Fig. 4, which shows DECR(Other) = · 100%, DECOther that the execution time of this MapReduce job increases as the number of tasks increases when the server is fully uti- where DECBAW M EE and DECOther are the average DEC lized during execution, which means that the total workload per workflow achieved by BAWMEE and the other algo- increases with the number of tasks. rithm, respectively. The deadline missing rate (DMR) is de- fined as the ratio of the number of workflows missing their 6.2 Simulation Settings deadlines to the total number of workflows. The unit run- We generate a series of random workflows as follows: (i) ning time (URT) is measured as the average simulation run- randomly select the length L of the critical path of a work- ning time for computing the mapping scheme of each work- flow (no less than 3) and divide the workflow into L layers, flow. The simulation runs on a Linux machine equipped in each of which every job has the same length of the longest with Intel Xeon CPU E5-2620 v3 of 2.4 GHz and a memory path from the start job; (ii) randomly select the number of of 16 GB. 
jobs in each layer except the first and last layers, in which there is only one job; (iii) for each job, add an input edge 6.3 Simulation Results from a randomly selected job in the immediately preceding 6.3.1 Problem Size layer, if absent, and an output edge to a randomly selected For performance evaluation, we consider 20 different prob- job in its downstream layer(s); (iv) randomly pick up two lem sizes from small to large scales, indexed from 1 to 20 as jobs in different layers and add a directed edge from the job tabulated in Table 5. Each problem size is defined as a in the upstream layer to the job in the downstream layer quadruple (|V |, |M |, 1/λ, T ), where 1/λ is the average ar- until we reach the given number of edges. The number of rival interval of workflow requests in minutes, and T is the 40 WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah 80 10 1 Running Time per Workflow Mapping (s) BAWMEE over pSciMapper BAWMEE over EEDA 100 pSciMapper pSciMapper 70 EEDA EEDA 90 BAWMEE BAWMEE 10 0 Deadlline Missing Rate (%) 60 80 DEC Reduction (%) 70 50 60 10 -1 40 50 30 40 10 -2 30 20 20 10 -3 10 10 0 0 10 -4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Problem Index Problem Index Problem Index Figure 5: DECR vs. problem sizes. Figure 6: DMR vs. problem sizes. Figure 7: URT vs. problem sizes. 320 100 0.6 Running Time per Workflow Mapping (s) pSciMapper EEDA BAWMEE pSciMapper 90 EEDA 0.55 280 BAWMEE 0.5 80 Deadline Missing Rate (%) DEC per Workflow (MJ) 240 0.45 70 0.4 200 60 0.35 pSciMapper 160 50 0.3 EEDA 40 BAWMEE 0.25 120 30 0.2 80 20 0.15 10 0.1 40 0.05 0 0 0 .05 .1 .15 .2 .25 .3 .35 .4 .45 .5 .55 .6 .65 .7 .75 .8 .85 .9 .95 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Deadline Factor Deadline Factor Deadline Factor Figure 8: DEC vs. deadlines. Figure 9: DMR vs. 
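The layered random-DAG construction of Sec. 6.2 (steps (i)–(iv)) can be sketched as follows. This is a minimal illustration under our own naming and simplifications (the `generate_workflow` function, the 8-layer cap, and the uniform layer assignment are ours, not the authors'):

```python
import random

def generate_workflow(num_jobs, num_edges, seed=None):
    """Sketch of the layered random-DAG generator (Sec. 6.2, steps i-iv):
    (i)   pick a critical-path length L (no less than 3) and split jobs into L layers,
    (ii)  keep exactly one job in the first and last layers,
    (iii) give each job an input edge from the immediately preceding layer
          and an output edge into a downstream layer,
    (iv)  add extra forward edges until the edge budget is reached, if possible."""
    rng = random.Random(seed)
    L = rng.randint(3, max(3, min(num_jobs, 8)))      # assumed cap of 8 layers
    layers = [[] for _ in range(L)]
    layers[0].append(0)                               # single start job
    layers[-1].append(num_jobs - 1)                   # single end job
    middle = list(range(1, num_jobs - 1))
    rng.shuffle(middle)
    for i in range(1, L - 1):                         # keep every middle layer non-empty
        layers[i].append(middle.pop())
    for v in middle:                                  # spread the rest uniformly
        layers[rng.randint(1, L - 2)].append(v)
    layer_of = {v: i for i, layer in enumerate(layers) for v in layer}
    edges = set()
    for v in range(num_jobs):                         # step (iii)
        i = layer_of[v]
        if i > 0:
            edges.add((rng.choice(layers[i - 1]), v))
        if i < L - 1:
            edges.add((v, rng.choice(layers[rng.randint(i + 1, L - 1)])))
    attempts = 0                                      # step (iv); bounded, "if possible"
    while len(edges) < num_edges and attempts < 100 * num_edges:
        attempts += 1
        u, v = rng.sample(range(num_jobs), 2)
        if layer_of[u] < layer_of[v]:                 # forward edges only, so the graph stays a DAG
            edges.add((u, v))
    return layers, sorted(edges)
```

Under the paper's default settings, `num_edges` would be set to 1.5 times the number of jobs; every edge runs from a lower to a higher layer, so acyclicity holds by construction.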
As the workflow size and arrival frequency increase from index 1 to 20, we increase the resources correspondingly to meet tight deadlines with a deadline factor of 0.1. We plot the DECR, DMR, and URT of pSciMapper, EEDA, and BAWMEE in Figs. 5-7, respectively, which show that BAWMEE saves 58.1% to 67.3% DEC in comparison with pSciMapper and 22.1% to 40.1% DEC in comparison with EEDA, and that the DMR of pSciMapper (or EEDA) minus that of BAWMEE is in the range of 42.6% to 93.4% (or 4.1% to 59.7%). Furthermore, the URT of BAWMEE is on the same order of magnitude as those of pSciMapper and EEDA, and is less than 7.4 seconds even for problem index 20.

Table 5: Problem Sizes

Index | (|V|, |M|, 1/λ, T)  | Index | (|V|, |M|, 1/λ, T)
1     | (3-7, 4, 240, 7)    | 11    | (53-57, 192, 30, 1)
2     | (8-12, 8, 200, 7)   | 12    | (58-62, 256, 25, 1)
3     | (13-17, 12, 160, 7) | 13    | (63-67, 384, 20, 1)
4     | (18-22, 16, 150, 7) | 14    | (68-72, 512, 15, 1)
5     | (23-27, 24, 120, 7) | 15    | (73-77, 768, 12, 1)
6     | (28-32, 32, 105, 3) | 16    | (78-82, 1024, 10, 1/3)
7     | (33-37, 48, 90, 3)  | 17    | (83-87, 1536, 8, 1/3)
8     | (38-42, 64, 60, 3)  | 18    | (88-92, 2048, 6, 1/3)
9     | (43-47, 96, 45, 3)  | 19    | (93-97, 3072, 5, 1/3)
10    | (48-52, 128, 30, 3) | 20    | (98-102, 4096, 4, 1/3)

6.3.2 Deadline Constraint

We evaluate the performance of pSciMapper, EEDA, and BAWMEE in terms of DEC, DMR, and URT under different deadline constraints, obtained from the deadline baseline multiplied by a factor from 0.05 to 1 at an interval of 0.05. The DEC, DMR, and URT of these algorithms are plotted in Figs. 8-10, respectively. These measurements show that BAWMEE saves 56.7% to 71.0% DEC and 8.8% to 43.1% DEC as the deadline increases, and reduces the DMR from 99.8% and 90.4% to 54.2% with a deadline factor of 0.05, in comparison with pSciMapper and EEDA, respectively. The DMR of BAWMEE is close to zero when the deadline factor is larger than 0.1. Additionally, the URT of BAWMEE is less than 0.6 second, and is 89.1% to 1.2 times and 6.6 to 13.2 times of those of pSciMapper and EEDA, respectively. It is worth pointing out that as the deadline increases, the DEC and URT of BAWMEE decrease, because EEPM plays a more significant role than MDPM in BAWMEE. Hence, BAWMEE makes a better tradeoff between DEC and DMR than the other algorithms in comparison, at an acceptable cost of running time.

Figure 10: URT vs. deadlines.

6.3.3 Workflow Size

For scalability test, we run these three algorithms under different average workflow sizes with 5 to 100 jobs per workflow at a step of 5, where the maximum and minimum workflow sizes are 2 jobs more and less than the average workflow size, respectively. We plot the DECR, DMR, and URT of these algorithms in Figs. 11-13, respectively, where we observe that BAWMEE achieves an increasing DECR between 54.8% and 73.2% in comparison with pSciMapper, and between 8.2% and 49.3% in comparison with EEDA. Moreover, BAWMEE misses fewer than 3.0% of deadlines, while pSciMapper and EEDA miss 77.9% to 97.6% and 9.6% to 69.6% of deadlines, respectively. For large workflow sizes with 80 to 100 jobs per workflow that impose high resource demands, BAWMEE achieves a DECR between 8.2% and 11.5% in comparison with EEDA, because it significantly reduces the DMR (the first objective) from over 54.3% to less than 0.1%. The DMR of EEDA experiences a slump under medium workflow sizes, because a higher accuracy can be achieved on the execution progress of a smaller workflow than a larger one, while a further increase in the workflow size may lead to a more severe shortage of computing resources. In addition, the URT of BAWMEE is 75.5% to 3.6 times and 1.3 to 10.2 times of those of pSciMapper and EEDA, respectively.

Figure 11: DECR vs. workflow sizes. Figure 12: DMR vs. workflow sizes. Figure 13: URT vs. workflow sizes.

6.3.4 Cluster Size

We run these three algorithms under different cluster sizes of 64 to 256 machines at a step of 16 for scalability test. The DEC, DMR, and URT of these algorithms are plotted in Figs. 14-16, respectively, where we observe that as the number of machines increases, BAWMEE consumes 54.2% to 69.3% and 6.6% to 43.9% less DEC than pSciMapper and EEDA, respectively, hence exhibiting a satisfactory scalability property with respect to the cluster size. Furthermore, BAWMEE only misses 0.1% to 4.0% of deadlines, while pSciMapper and EEDA miss 77.9% to 98.9% and 7.7% to 78.3% of deadlines, respectively. The increase in the cluster size results in a relatively looser deadline and a more flexible workflow mapping, as a result of which the DMRs of all three algorithms drastically decrease and BAWMEE has more chances to save energy. Moreover, the URT of BAWMEE is less than 6.4 seconds and is comparable with those of pSciMapper and EEDA.

Figure 14: DEC vs. cluster sizes. Figure 15: DMR vs. cluster sizes. Figure 16: URT vs. cluster sizes.

6.3.5 Workflow Structure

We further investigate these three algorithms with various workflow structures, including a random shape, a chain, a tree, a reverse tree, and a diamond. The DEC, DMR, and URT are plotted in Figs. 17-19, respectively, which show that BAWMEE reduces DEC by 63.6%, 56.6%, 86.3%, 86.5% and 86.2% in comparison with pSciMapper, and by 27.6%, 14.7%, 72.1%, 72.5% and 72.2% in comparison with EEDA, in random, chain, tree, reverse tree and diamond structured workflows, respectively. Further, BAWMEE saves less energy in chain-structured workflows than in the others, because the deadline baseline is the tightest for this structure under our deadline generation method. BAWMEE only misses 0.2%, 0%, 1.6%, 0% and 0% of deadlines, while pSciMapper and EEDA miss 89.7%, 100%, 58.0%, 71.9% and 61.6%, and 12.3%, 32.8%, 3.2%, 4.6% and 3.4% of deadlines, in random, chain, tree, reverse tree and diamond structured workflows, respectively. Besides, the URT of BAWMEE is less than 0.6 second, and is 1.0 times, 4.5 times, 1.1%, 0.4% and 0.4%, as well as 9.4 times, 12.7 times, 16.5%, 9.1% and 6.4%, of those of pSciMapper and EEDA in random, chain, tree, reverse tree and diamond structured workflows, respectively.

Figure 17: DEC vs. workflow structures. Figure 18: DMR vs. workflow structures. Figure 19: URT vs. workflow structures.

7 Conclusion

We investigated the properties of moldable jobs and formulated a workflow mapping problem to minimize dynamic energy consumption under deadline and resource constraints. We designed an FPTAS for a special case with a pipeline-structured workflow on a homogeneous cluster, a case which was proved to be NP-complete, and a heuristic for a generalized problem with an arbitrary workflow on a heterogeneous cluster. The performance superiority of the proposed heuristic in terms of dynamic energy saving and deadline missing rate was illustrated by extensive simulation results in Hadoop/YARN in comparison with existing algorithms. Our work reveals that energy-efficient and deadline-aware mapping algorithms tailored to big data workflows can lead to significant energy savings and a higher level of Quality of Service. It is of our future interest to incorporate the proposed mapping algorithms into existing workflow engines in the Hadoop ecosystem, including Oozie [2] and Tez [3], and to evaluate the energy savings for real-life big data workflows.

8 Acknowledgments

This research is sponsored by the National Science Foundation under Grant No. CNS-1560698 with New Jersey Institute of Technology.

9 References

[1] Statistical Computing. http://stat-computing.org/dataexpo/2009/the-data.html.
[2] Apache Oozie. https://oozie.apache.org.
[3] Apache Tez. https://tez.apache.org.
[4] S. Abrishami, M. Naghibzadeh, and D. H. Epema. Cost-driven scheduling of grid workflows using partial critical paths. IEEE TPDS, 23(8):1400–1414, 2012.
[5] H. Amur, J. Cipar, V. Gupta, G. R. Ganger, M. A. Kozuch, and K. Schwan. Robust and flexible power-proportional storage. In Proc. of ACM SoCC, pages 217–228, Indianapolis, IN, USA, Jun 2010.
[6] M. Cardosa, A. Singh, H. Pucha, and A. Chandra. Exploiting spatio-temporal tradeoffs for energy-aware MapReduce in the cloud. IEEE Trans. on Computers, 61(12):1737–1751, 2012.
[7] C.-Y. Chen and C.-P. Chu. A 3.42-approximation algorithm for scheduling malleable tasks under precedence constraints. IEEE TPDS, 24(8):1479–1488, 2013.
[8] Y. Chen, S. Alspaugh, D. Borthakur, and R. Katz. Energy efficiency for large-scale MapReduce workloads with significant interactive analysis. In Proc. of ACM EuroSys, pages 43–56, Bern, Switzerland, Apr 2012.
[9] D. Cheng, P. Lama, C. Jiang, and X. Zhou. Towards energy efficiency in heterogeneous Hadoop clusters by adaptive task assignment. In Proc. of IEEE ICDCS, pages 359–368, Columbus, OH, USA, Jun-Jul 2015.
[10] D. Cheng, J. Rao, C. Jiang, and X. Zhou. Resource and deadline-aware job scheduling in dynamic Hadoop clusters. In Proc. of IEEE IPDPS, pages 956–965, Hyderabad, India, May 2015.
[11] M. Drozdowski. Scheduling for Parallel Processing. Springer-Verlag London, 2009.
[12] J. Du and J. Y.-T. Leung. Complexity of scheduling parallel task systems. SIAM J. Disc. Math., 2(4):473–487, 1989.
[13] F. Ergun, R. Sinha, and L. Zhang. An improved FPTAS for restricted shortest path. Info. Processing Letters, 83(5):287–291, 2002.
[14] I. Goiri, K. Le, T. D. Nguyen, J. Guitart, J. Torres, and R. Bianchini. GreenHadoop: Leveraging green energy in data-processing frameworks. In Proc. of ACM EuroSys, pages 57–70, Bern, Switzerland, Apr 2012.
[15] T. F. Gonzalez, editor. Handbook of Approximation Algorithms and Metaheuristics. Chapman and Hall/CRC, 2007.
[16] W. Lang and J. M. Patel. Energy management for MapReduce clusters. Proceedings of the VLDB Endowment, 3(1):129–139, 2010.
[17] Y. C. Lee, H. Han, A. Y. Zomaya, and M. Yousif. Resource-efficient workflow scheduling in clouds. Elsevier Knowledge-Based Systems, 80:153–162, 2015.
[18] Y. C. Lee and A. Y. Zomaya. Energy conscious scheduling for distributed computing systems under different operating conditions. IEEE TPDS, 22(8):1374–1381, 2011.
[19] J. Li, C. Pu, Y. Chen, V. Talwar, and D. Milojicic. Improving preemptive scheduling with application-transparent checkpointing in shared clusters. In Proc. of ACM Middleware, pages 222–234, Vancouver, BC, Canada, Dec 2015.
[20] K. Li. Scheduling precedence constrained tasks with reduced processor energy on multiprocessor computers. IEEE Trans. on Computers, 61(12):1668–1681, 2012.
[21] K. Makarychev and D. Panigrahi. Precedence-constrained scheduling of malleable jobs with preemption. In Proc. of ICALP, pages 823–834, Copenhagen, Denmark, Jul 2014.
[22] L. Mashayekhy, M. M. Nejad, D. Grosu, Q. Zhang, and W. Shi. Energy-aware scheduling of MapReduce jobs for big data applications. IEEE TPDS, 26(10):2720–2733, 2015.
[23] V. Nagarajan, J. Wolf, A. Balmin, and K. Hildrum. FlowFlex: Malleable scheduling for flows of MapReduce jobs. In Proc. of ACM/IFIP/USENIX Middleware, pages 103–122, Beijing, China, Dec 2013.
[24] P. Sanders and J. Speck. Energy efficient frequency scaling and scheduling for malleable tasks. In Proc. of Euro-Par, pages 167–178, Rhodes Island, Greece, Aug 2012.
[25] X. Xu, W. Dou, X. Zhang, and J. Chen. EnReal: An energy-aware resource allocation method for scientific workflow executions in cloud environment. IEEE Trans. on Cloud Comp., 4(2):166–179, 2016.
[26] L. Zhang, K. Li, C. Li, and K. Li. Bi-objective workflow scheduling of the energy consumption and reliability in heterogeneous computing systems. Elsevier Info. Sci., in press.
[27] L. Zhang, K. Li, Y. Xu, J. Mei, F. Zhang, and K. Li. Maximizing reliability with energy conservation for parallel task scheduling in a heterogeneous cluster. Elsevier Info. Sci., 319:113–131, 2015.
[28] Q. Zhu, J. Zhu, and G. Agrawal. Power-aware consolidation of scientific workflows in virtualized environments. In Proc. of ACM/IEEE SC, pages 1–12, New Orleans, LA, USA, Nov 2010.