=Paper=
{{Paper
|id=Vol-1343/paper16
|storemode=property
|title=Dynamic Local Scheduling of Multiple DAGs in Distributed Heterogeneous Systems
|pdfUrl=https://ceur-ws.org/Vol-1343/paper16.pdf
|volume=Vol-1343
}}
==Dynamic Local Scheduling of Multiple DAGs in Distributed Heterogeneous Systems==
Ondřej Votava, Peter Macejko, and Jan Janeček

Czech Technical University in Prague
{votavon1, macejp1, janecek}@fel.cvut.cz
http://cs.fel.cvut.cz

Published in: M. Nečaský, J. Pokorný, P. Moravec (Eds.): Dateso 2015, pp. 1–12, CEUR-WS.org/Vol-1343.

Abstract. A heterogeneous computational platform offers a great ratio between the computational power and the price of the system. Static and dynamic scheduling methods offer a good way to use such systems efficiently, and many algorithms have therefore been proposed in the literature in past years. The aim of this article is to present a dynamic (on-line) algorithm which schedules multiple DAG applications without any central node, creating the schedule only with the knowledge of a node's network neighbourhood. The algorithm achieves a high level of fairness across DAGs, and its total computation time is close to that of standard and well-known competitors.

1 Introduction

A homogeneous (heterogeneous) computational platform consists of a set of identical (different) computers connected by a high-speed communication network [1, 2]. Research has been done in the last few years on how to use these platforms efficiently [3–5], and scheduling is believed to be a good way to exploit the computation capacity these systems offer [1, 6]. The traditional approach is to prepare the schedule before the computation begins [7, 8]. This requires information about the network topology and parameters as well as about each node's computational abilities. With all of this information available, a static (offline) scheduling algorithm can be used. Finding the optimal makespan – i.e. the total time of the computation – is known to be NP-complete [9, 10], so much research has been done and many heuristics have been proposed [11, 12].

Compared to static scheduling, dynamic (online) scheduling allows the schedule to be created as part of the computation process. This allows dynamic algorithms to use feedback from the system and to modify the schedule in accordance with the system's current state. Dynamic algorithms are often used for scheduling multiple applications [13–16] at the same time, so the fairness of the generated schedules is important.

The algorithm presented in this paper does not require global knowledge of the network; it uses only information gathered from a node's neighbours. The schedule-creation phase is a fixed part of the computation cycle. The algorithm is intended for scheduling several DAGs simultaneously and tries to achieve a fair division of tasks among all computing nodes.

The structure of this article, which is an enhanced version of [17], is as follows. In Section 2 we describe the scheduling problem and give a brief summary of related work. In Section 3 we describe the algorithm itself, and in Section 4 we describe the testing environment and the results obtained by running several simulations. In Section 5 we summarize the results of Section 4, show the pros and cons of the presented algorithm, and discuss future improvements.
2 Problem definition

The application model can be described as a directed acyclic graph AM = (V, E, B, C) [12, 18], where:

- V = {v_1, v_2, ..., v_v}, |V| = v, is the set of tasks; a task v_i ∈ V represents a piece of code that has to be executed sequentially on the same machine,
- E = {e_1, e_2, ..., e_e}, |E| = e, is the set of edges; an edge e_j = (v_k, v_l) represents a data dependency, i.e. the task v_l cannot start its computation until the data from task v_k have been received; v_k is called the parent of v_l, and v_l the child of v_k,
- B = {b_1, b_2, ..., b_v}, |B| = v, is the set of computation costs (e.g. numbers of instructions), where b_i ∈ B is the computation cost of the task v_i,
- C = {c_1, c_2, ..., c_e}, |C| = e, is the set of data dependency costs, where c_j = c_{k,l} is the data dependency cost (e.g. amount of data) corresponding to the edge e_j = (v_k, v_l).

A task which has no parents (children) is called an entry (exit) task. If there is more than one entry/exit task in the graph, a new virtual entry/exit task can be added; such a task has zero weight and is connected to the real entry/exit tasks by zero-weight edges.

[Fig. 1. An application can be described using a DAG.]
[Fig. 2. The computation system can be represented as a general graph.]

Since the application can be described as a DAG, we use the terms application, DAG application, and simply DAG as synonyms in this paper.

The computation system consists of a set of computing units, all of which are connected by a high-speed communication network. It can be described by a general graph CS = (P, Q, R, S), where:

- P = {p_1, p_2, ..., p_p}, |P| = p, is the set of computers,
- Q = {q_1, q_2, ..., q_p}, |Q| = p, is the set of speeds of the computers, where q_i is the speed of computer p_i,
- R is a matrix describing the communication costs; its size is p × p,
- S is a matrix describing the communication startup costs; it is usually one-dimensional, i.e. its size is p × 1.

2.1 Merging the application and computation models

Since the structure of the application and the characteristics of the computation system are known, it is straightforward to obtain the computation time of each application task on any computation node. This information is stored in a matrix W of dimensions v × p, where the item at position [i, j] contains the length of the computation of task i on computation unit j. The value of W[i, j] is computed by the equation

\[ W[i,j] = \frac{b_i}{q_j}, \tag{1} \]

where b_i is the computation cost of task v_i and q_j is the speed of computer p_j (e.g. in instructions per second).

The total communication time for a message m (carrying the data dependency of the edge (v_k, v_l)) that is sent from computer i to computer j can be computed by this equation:

\[ c_m = S[i] + R[i][j] \cdot c_{k,l}. \tag{2} \]
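For illustration, here is a minimal Python sketch of equations (1) and (2); the variable names and the toy values are ours, not the paper's:

```python
# Sketch of equations (1) and (2); all values are made-up examples.
b = [14, 25, 36]        # computation costs b_i (e.g. instructions)
q = [1.0, 2.0, 4.0]     # computer speeds q_j (instructions per time unit)

# Equation (1): W[i][j] = b_i / q_j
W = [[b_i / q_j for q_j in q] for b_i in b]

def comm_time(S, R, i, j, c_kl):
    """Equation (2): the sender's startup cost S[i] plus the per-unit
    cost of the link (i, j) times the amount of data c_{k,l}."""
    return S[i] + R[i][j] * c_kl
```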
2.2 Our model description

The content of the matrix W depends on the properties of the computation nodes. However, the computers differ only in a limited way: the "fast" computers are k times faster than the "slow" ones, so the columns of W are usually just multiples of a single column. This information can therefore be reduced to a constant k_p for each processor p. Moreover, when the computation system is allowed to change, the matrix W is not usable either, since it does not reflect any dynamic behaviour.

The structure we decided to use can be described as follows. We select one processor to serve as a reference, p_ref; the computation of one instruction on this processor lasts one time unit. The speedup of a processor p_i is then defined as

\[ SU_p(i) = \frac{q_i}{q_{ref}}, \tag{3} \]

where q_i is the speed of processor p_i and q_ref is the speed of the reference processor. The duration of the computation of a task v_j on the processor p_i is then computed "on the fly" by the equation

\[ time_{j,i} = \frac{b_j}{SU_p(i)}. \tag{4} \]

Finally, the computation platform is described as a set of speedups plus the communication matrices, and the merging of the application and computation models is done as part of the computation. Even the communication matrices may be reduced in our model: they contain only information about a computation node's neighbours and differ from node to node. Still, this is an implementation concern and does not affect the description model, as the "neighbour" matrices are simply parts of the "global" communication matrices.
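A minimal sketch of this reduced, on-the-fly model (equations (3) and (4)); the function names and the reference speed are illustrative assumptions:

```python
# Equations (3) and (4): per-processor speedup relative to a reference
# processor, and task duration computed on the fly instead of from W.
q_ref = 1.0                 # speed of the reference processor p_ref

def speedup(q_i):
    """Equation (3): SU_p(i) = q_i / q_ref."""
    return q_i / q_ref

def task_time(b_j, q_i):
    """Equation (4): duration of task v_j on processor p_i."""
    return b_j / speedup(q_i)

# A task of cost 40 finishes in 20 time units on a processor
# twice as fast as the reference.
assert task_time(40, 2.0) == 20.0
```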
2.3 Related work

Task scheduling, or task mapping, has been an active research area for a long time; several static algorithms have been introduced and dynamic algorithms have been published as well. Most static scheduling algorithms are designed to work with a single DAG. List scheduling algorithms are very popular and often used. HEFT [19] is a simple and effective algorithm used as a reference in our article; it creates a list of tasks sorted by an upward rank¹ and then assigns each task to the processor on which its execution time is minimized. Another algorithm presented in [19] is Critical Path On a Processor (CPOP), which is more complex and optimizes the tasks on a critical path. By modifying list algorithms and allowing tasks to be executed more than once, task duplication algorithms were introduced; HCPFD [2] obtains a better makespan than HEFT in most cases. Task duplication achieves surprisingly good results when applied to a computation model containing multi-core computers [18].

Scheduling multiple DAGs is usually addressed by dynamic algorithms. In [20], a static method for scheduling multiple DAGs was introduced whose aim was not only to optimize the makespan but also to achieve fair sharing of resources among the competing DAGs. The idea of generating a new graph by appending whole DAGs to the current one is used in [21]; compared to [20], this algorithm is dynamic, i.e. the graph is built as each new DAG arrives in the system. A truly dynamic algorithm is described in [14]. It divides the nodes into two groups: the first contains nodes used for computation, the second contains scheduling nodes. The scheduling nodes are independent, and knowledge about the activity of the other scheduling nodes is obtained from usage statistics of the computing nodes; the quality of such scheduling therefore depends on the quality of the statistics the computation nodes produce. Unlike the previous one, the algorithm presented in [16] is based on a single central scheduling unit. It takes the time needed for scheduling and dispatching into account and focuses on reliability costs. Another completely distributed algorithm is presented in [15]. It divides the nodes into groups and uses two levels of scheduling: the high level decides which group to use, and the low level decides which node within that group to use.

The algorithm described in [22] works in a slightly different way. A node works with its neighbourhood, and the tasks of a parallel application are distributed according to the load of the neighbours. If the load of a node becomes too high, the algorithm allows tasks to migrate across the network; a genetic programming technique is used to decide where to migrate a task.

The problem of task scheduling is loosely coupled with network throughput. The description of the network used in this paper is not very close to reality, and problems connected with bottlenecks or varying delay may cause trouble. The behaviour of task scheduling applications running in networks with differing parameters is very well described in [23]; following that article, we assume there are no bottlenecks in the networks.

¹ See [19] for details.

3 Proposed algorithm

In this section we present the algorithm Dynamic Local Multiple DAG (DLMDAG). The algorithm, described in Algorithm 1, is a dynamic task scheduling algorithm that supports both homogeneous and heterogeneous computation platforms.

The main idea of the algorithm is based on the assumption that communication takes only a very short time compared to computation (at least one order of magnitude less). The computation node that creates the schedule for a certain DAG application sends a message to all of its neighbours asking how long the computation of a set of tasks would take if they were computed by that neighbour. It then continues computing its current task, and the replies arrive during this computation. Based on the received data (timestamps), the node builds a schedule for the set of tasks it asked about, sends its neighbours a message saying who should compute which task, and issues another question about the computation time of the next set of tasks.

The algorithm description (Algorithm 1) uses the following terms. A task is called "ready" when all of its data dependencies are fulfilled. Ready tasks are stored in the tasksReady priority queue, ordered by the time computed by computePriority. A task that is ready and has also been scheduled is stored in the computableTasks queue. Each task's representation contains one priority queue storing pairs of a finish time and the neighbour at which that finish time would be achieved; this queue is ordered by time.

The computePriority method creates the timestamps for the tasks; it is invoked when a DAG application arrives at the computation node p_k.
Algorithm 1 The core

    neighbours[], readyTasks            {priority queue of tasks ready to compute}
    computableTasks                     {queue of scheduled tasks for computing}
    if not initialized then
        neighbours = findNeighbours()
        initialized = true
    end if
    if received DAG then
        computePriority(DAG)            {priority of tasks by traversing the DAG}
        push(findReadyTasks(DAG), readyTasks)
    end if
    if received DATA then
        correctDependencies(DATA)
        push(findReadyTasks(DATA.DAG), readyTasks)
    end if
    if received TASK then
        push(TASK, computableTasks)
    end if
    if received REQUEST then
        for all task ∈ REQUEST do
            task.time = howLong(task)   {time for the task + time for tasks in computableTasks}
        end for
        send(REQUEST-REPLY, owner)
    end if
    if received REQUEST-REPLY then
        for all task ∈ REQUEST-REPLY do
            push((task.time, sender), task.orderingQueue)
        end for
    end if
    loop                                {the main loop of the algorithm}
        schedule = createSchedule(tasksReady)   {creates the schedule and removes tasks from the queue}
        for all (task, proc) ∈ schedule do
            send(task, proc)
        end for
        for all n ∈ neighbours do
            send(REQUEST, n)            {tasks from tasksReady}
        end for
        TASK = pop(computableTasks)
        compute(TASK)
        send(DATA, TASK.owner)          {nothing is sent if the task is local}
    end loop

The priority is generated according to the equation

\[ priority(v_j) = time_{j,k} + \max_{\forall v_i \in par(v_j)} priority(v_i), \tag{5} \]

where par(v_j) is the set of parents of node v_j and priority(v_0) = time_{0,k} for the entry task.

The final scheduling is based on the priority queue task.orderingQueue. The scheduling step, described in Algorithm 2, is close to HEFT [19]. One big difference is that our algorithm works with a reduced list of tasks² and is forced to use all neighbours³ even when that is slower than computing at the local site.

² Only ready tasks are scheduled.
³ If there are not enough ready tasks, not all neighbours are used.

Algorithm 2 Scheduling phase

    schedule                            {empty set of pairs}
    num = min(|tasksReady|, |neighbours|)
    for i = 0; i < num; i++ do
        task = pop(tasksReady)
        proc = pop(task.orderingQueue)
        push((task, proc), schedule)
        removeFrom(proc, tasksReady)    {once a neighbour is used, it cannot be scheduled again}
    end for
    return schedule

The algorithm is called local because it uses only information about the node's local neighbourhood. Each node creates its set of neighbours in the initialization stage of the algorithm. Consequently, there are no global matrices R and S; rather, each computational node has its own. The size of the matrix R for the computational node p_i is R_{p_i} = (s_i × s_i), where s_i = |neighbours_i| is the number of neighbours of the node p_i.
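To make the two steps concrete, here is a Python sketch of the priority computation (equation (5)) and of the scheduling step of Algorithm 2. It is our reconstruction, not the authors' implementation; the heap-based queues and helper names are assumptions:

```python
import heapq

def compute_priority(topo_order, parents, local_time):
    """Equation (5): a task's priority is its computation time on the
    local node p_k plus the maximum priority among its parents
    (0 for the entry task).  `topo_order` lists task ids in topological
    order; `local_time[j]` is time_{j,k}."""
    priority = {}
    for j in topo_order:
        priority[j] = local_time[j] + max(
            (priority[i] for i in parents[j]), default=0)
    return priority

def create_schedule(tasks_ready, n_neighbours):
    """Scheduling step of Algorithm 2: pop the most urgent ready tasks
    and pair each with its earliest-finishing, not-yet-used neighbour."""
    schedule, used = [], set()
    for _ in range(min(len(tasks_ready), n_neighbours)):
        _, _, task = heapq.heappop(tasks_ready)    # (-priority, seq, task)
        while task.ordering_queue:                 # (finish_time, neighbour)
            _, proc = heapq.heappop(task.ordering_queue)
            if proc not in used:                   # a neighbour is used once per round
                used.add(proc)
                schedule.append((task, proc))
                break
    return schedule
```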
3.1 Time complexity

The time complexity of the algorithm can be divided into two parts: the computation part, connected with the sorting and scheduling phases of the algorithm, and the communication part, connected with the messages that must be exchanged for the scheduling phase.

Let the DAG consist of v tasks and let the computation node have s neighbours. One question message is sent about every task to all of the neighbours; a question covers between one and s tasks, the precise number depending on the structure of the DAG. The node receiving a question always sends a reply. As the node finishes the scheduling phase of the algorithm, another message carrying the schedule is sent to every neighbour involved in it. The schedule message can be combined with the next question, so between 3v/s and 3v messages are sent in total.

The computation part is dominated by the sorting operations of the algorithm. Two types of priority queues are used, both based on a heap. The first is tasksReady: every task of a DAG is put into this queue once, which costs O(v log v). The second priority queue (task.orderingQueue) stores one entry per neighbour and is used for every task, contributing O(v s log s). The time complexity of the computational part of the algorithm is therefore O(v log v + v s log s).

4 Performance and comparison

The algorithm was implemented in a simulation environment [24] and executed several times for different scenarios. The makespan, unfairness and average utilization of the computing nodes were measured.

The makespan is the total computation time of an application, defined as

\[ makespan(DAG) = finishTime(v_l) - startTime(v_s), \tag{6} \]

where finishTime(v_l) is the time when the last task of the DAG finished and startTime(v_s) is the time when the first task of the DAG began its computation.

Since several DAG applications compete for the shared resources, the execution time of each DAG is longer than when it is the only application in the system. The slowdown of a DAG is the ratio of these execution times:

\[ Slowdown(DAG) = \frac{T_{shared}(DAG)}{T_{single}(DAG)}, \tag{7} \]

where T_shared is the execution time when more than one DAG was scheduled and T_single is the execution time when only this DAG was scheduled. A schedule is fair when all of the DAGs achieve almost the same slowdown [20], and unfair when there are big differences between the slowdowns of the DAGs. The unfairness of a schedule S for a set of n DAGs A = {DAG_1, DAG_2, ..., DAG_n} is defined as

\[ Unfairness(S) = \sum_{\forall d \in A} |Slowdown(d) - AvgSlowdown|, \tag{8} \]

where the average slowdown is defined as

\[ AvgSlowdown = \frac{1}{n} \sum_{\forall d \in A} Slowdown(d). \tag{9} \]

The utilization of a computation unit p_j under the schedule S is computed by this equation:

\[ Util_S(p_j) = \sum_{\forall t \in tasks_S} makespan(t) / totalTime_j, \tag{10} \]

where tasks_S is the set of tasks computed on p_j in the schedule S and totalTime_j is the total time of the simulation, i.e. the time when the last task of all the DAGs finished. The average utilization over the whole set of processors P under the schedule S is then defined as

\[ AvgUtil_S(P) = \frac{1}{p} \sum_{i=1}^{p} Util_S(p_i). \tag{11} \]
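A short sketch of these metrics (equations (7)–(10)); the input numbers are hypothetical:

```python
def slowdown(t_shared, t_single):
    """Equation (7)."""
    return t_shared / t_single

def unfairness(slowdowns):
    """Equations (8) and (9): total absolute deviation from the mean."""
    avg = sum(slowdowns) / len(slowdowns)         # equation (9)
    return sum(abs(s - avg) for s in slowdowns)   # equation (8)

def utilization(task_times_on_p, total_time):
    """Equation (10): share of the simulation time that a processor
    spent computing its assigned tasks."""
    return sum(task_times_on_p) / total_time

# Three DAGs slowed down by 1.5x, 1.33x and 1.17x give an
# unfairness of about 0.33.
print(unfairness([slowdown(90, 60), slowdown(80, 60), slowdown(70, 60)]))
```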
4.1 Testing environment

Three computation platforms containing 5, 10 and 20 computers were created. A full mesh with different communication speeds on individual lines was chosen as the connection network; this yields a network without bottlenecks and allows the algorithm to achieve a minimal makespan [23]. Nodes were divided into groups of 5, and each group was intraconnected by gigabit links with a delay of 2 ms. In the network with ten nodes the groups were connected by 100 Mbit lines, and in the network with 20 nodes the groups were connected as follows:

- 4 groups of 5 nodes, each intraconnected by gigabit links,
- 2 of the groups connected to each other by 100 Mbit lines,
- the 3rd and 4th groups connected to the others by 10 Mbit lines.

Sets of 2, 4, 6, 8 and 10 applications were generated using the method described in [19]. Each application contained 25 tasks with different computation and data dependency costs. The schedule for each set was generated by a simulation⁴ of the DLMDAG algorithm and by the static algorithms HEFT and CPOP. For the static algorithms we used two methods of connecting several DAGs into one: the first executes the DAGs sequentially, one after another; the second generates virtual start and end nodes and connects the DAGs to them with zero-weight edges. In the sequential execution test the DAGs were ordered by the rule that the shorter the makespan of a DAG, the sooner it is executed. In total there were 100 sets of 2 DAGs, 100 sets of 4 DAGs, and so on, and the obtained results were averaged. For DLMDAG, all DAGs arrived in the system at time 0 and at a single node.

⁴ The simulation tool OMNeT++ [24] was used.

4.2 Results

The sequential execution of DAGs by HEFT and CPOP yielded much longer makespans, so those results are not included in the graphs. "HEFT par" and "CPOP par" denote the variants where the DAGs were connected using virtual start and end tasks.

[Fig. 3. Makespan for different numbers of DAGs running concurrently (all platforms).]
[Fig. 4. Unfairness for different numbers of concurrently running DAGs (all platforms).]
[Fig. 5. Utilization for different numbers of DAGs running concurrently (all platforms).]
[Fig. 6. Makespan for different numbers of applications (20 PC platform).]
[Fig. 7. Unfairness for different numbers of applications (20 PC platform).]
[Fig. 8. Makespan for different numbers of applications (5 PC platform).]

The makespan achieved by DLMDAG is very close to that of HEFT and CPOP (Fig. 3); after averaging, the differences were just units of percent. The special case is the architecture of five computers (Fig. 8), where DLMDAG outperforms the others. With 10 or 20 computers in the system (Fig. 6), DLMDAG achieved slightly worse results. Since HEFT and CPOP use the whole structure of the applications while DLMDAG works only with ready tasks, DLMDAG may be unable to use a platform with more devices as efficiently. These results therefore depend on the structure of the scheduled applications: the more parallel an application is, the better the results DLMDAG obtains.

The unfairness (Figs. 4 and 7) of DLMDAG stays at a very low level and grows slowly; the unfairness of HEFT and CPOP is worse in comparison. The utilization of the nodes (Fig. 5) corresponds to the makespan achieved by the algorithms.
As the number of DAGs in the system grows, the utilization of the nodes increases for all algorithms. As mentioned earlier, DLMDAG achieves a high level of parallelization, and therefore the average utilization of all nodes also increases.

5 Conclusion

The algorithm presented in this article is dynamic; it uses no central point for scheduling, nor does it require information about the whole network. DLMDAG is based on local knowledge of the network – only neighbours participate in a schedule – and the schedule is created using several messages by which the computation times are gathered at the scheduling node.

Simulations of the algorithm were executed and the results were compared with traditional offline scheduling algorithms. DLMDAG is able to use the computation resources better than the compared algorithms when there are more tasks in the system than computation units. As the number of computation nodes increases, the results achieved by DLMDAG become worse than the competitors'.

Future work. There are several ways to improve the proposed algorithm. First, computation systems do change, and the algorithm should be able to modify its schedules to reflect network changes. Second, the current algorithm is fixed to the scheduling node and its neighbours, which may cause performance problems; the algorithm could be extended to move an application to another node with different neighbours.

References

1. D. Feitelson, L. Rudolph, U. Schwiegelshohn, K. Sevcik, and P. Wong, "Theory and practice in parallel job scheduling," in Job Scheduling Strategies for Parallel Processing (D. Feitelson and L. Rudolph, eds.), vol. 1291 of Lecture Notes in Computer Science, pp. 1–34, Springer Berlin / Heidelberg, 1997.
2. T. Hagras and J. Janecek, "A high performance, low complexity algorithm for compile-time task scheduling in heterogeneous systems," Parallel Computing, vol. 31, no. 7, pp. 653–670, 2005.
3. H. Kikuchi, R. Kalia, A. Nakano, P. Vashishta, H. Iyetomi, S. Ogata, T. Kouno, F. Shimojo, K. Tsuruta, and S. Saini, "Collaborative simulation grid: multiscale quantum-mechanical/classical atomistic simulations on distributed PC clusters in the US and Japan," in Supercomputing, ACM/IEEE 2002 Conference, p. 63, 2002.
4. D. Kehagias, M. Grivas, G. Pantziou, and M. Apostoli, "A wildly dynamic grid-like cluster utilizing idle time of common PCs," in Telecommunications in Modern Satellite, Cable and Broadcasting Services (TELSIKS 2007), 8th International Conference on, pp. 36–39, Sept. 2007.
5. A. Wakatani, "Parallel VQ compression using PNN algorithm for PC grid system," Telecommunication Systems, vol. 37, pp. 127–135, 2008.
6. M. Maheswaran, T. D. Braun, and H. J. Siegel, "Heterogeneous distributed computing," in Encyclopedia of Electrical and Electronics Engineering, pp. 679–690, John Wiley, 1999.
7. Y.-K. Kwok and I. Ahmad, "Benchmarking the task graph scheduling algorithms," in Proc. IPPS/SPDP, pp. 531–537, 1998.
8. J. Liou and M. Palis, "A comparison of general approaches to multiprocessor scheduling," in Parallel Processing Symposium, International, p. 152, 1997.
9. J. Ullman, "NP-complete scheduling problems," Journal of Computer and System Sciences, vol. 10, no. 3, pp. 384–393, 1975.
10. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.
11. T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther, J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, "A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems," Journal of Parallel and Distributed Computing, vol. 61, no. 6, pp. 810–837, 2001.
12. H. Topcuoglu, S. Hariri, and M. Wu, "Performance-effective and low-complexity task scheduling for heterogeneous computing," IEEE Transactions on Parallel and Distributed Systems, vol. 13, pp. 260–274, 2002.
13. M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund, "Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems," in Heterogeneous Computing Workshop, p. 30, 1999.
14. M. Iverson and F. Özgüner, "Dynamic, competitive scheduling of multiple DAGs in a distributed heterogeneous environment," in Heterogeneous Computing Workshop, p. 70, 1998.
15. M. A. Iverson and F. Özgüner, "Hierarchical, competitive scheduling of multiple DAGs in a dynamic heterogeneous environment," Distributed Systems Engineering, vol. 6, no. 3, p. 112, 1999.
16. X. Qin and H. Jiang, "Dynamic, reliability-driven scheduling of parallel real-time jobs in heterogeneous systems," in Parallel Processing, International Conference on, p. 113, 2001.
17. O. Votava, P. Macejko, J. Kubr, and J. Janeček, "Dynamic Local Scheduling of Multiple DAGs in a Distributed Heterogeneous Systems," in Proceedings of the 2011 International Conference on Telecommunication Systems Management (Dallas, TX), pp. 171–178, American Telecommunications Systems Management Association Inc., 2011.
18. J. Janeček, P. Macejko, and T. M. G. Hagras, "Task scheduling for clustered heterogeneous systems," in IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 2009) (M. Hamza, ed.), pp. 115–120, February 2009.
19. H. Topcuoglu, S. Hariri, and M.-Y. Wu, "Task scheduling algorithms for heterogeneous processors," in Heterogeneous Computing Workshop (HCW '99), Proceedings, Eighth, pp. 3–14, 1999.
20. H. Zhao and R. Sakellariou, "Scheduling multiple DAGs onto heterogeneous systems," in Parallel and Distributed Processing Symposium, International, 2006.
21. J. Barbosa and B. Moreira, "Dynamic job scheduling on heterogeneous clusters," in Parallel and Distributed Computing (ISPDC '09), Eighth International Symposium on, pp. 3–10, 2009.
22. R. de Mello, J. Andrade Filho, L. Senger, and L. Yang, "Grid job scheduling using route with genetic algorithm support," Telecommunication Systems, vol. 38, pp. 147–160, 2008.
23. Y. Kitatsuji, K. Yamazaki, H. Koide, M. Tsuru, and Y. Oie, "Influence of network characteristics on application performance in a grid environment," Telecommunication Systems, vol. 30, pp. 99–121, 2005.
24. A. Varga et al., "The OMNeT++ discrete event simulation system," in Proceedings of the European Simulation Multiconference (ESM 2001), vol. 9, p. 65, 2001.