    Dynamic Local Scheduling of Multiple DAGs in
        Distributed Heterogeneous Systems

            Ondřej Votava, Peter Macejko, and Jan Janeček

                Czech Technical University in Prague
             {votavon1, macejp1, janecek}@fel.cvut.cz
                      http://cs.fel.cvut.cz


          Abstract. A heterogeneous computational platform offers a great ratio
          between computational power and the price of the system. Static and
          dynamic scheduling methods are an effective way to use these systems
          efficiently, and many algorithms have therefore been proposed in the
          literature in past years. The aim of this article is to present a dynamic
          (on-line) algorithm which schedules multiple DAG applications without
          any central node, creating the schedule only with the knowledge of a
          node's network neighbourhood. The algorithm achieves a high level of
          fairness for multiple DAGs, and its total computation time is close to
          that of standard and well-known competitors.


1    Introduction

A homogeneous (heterogeneous) computational platform consists of a set of
identical (different) computers connected by a high-speed communication network
[1, 2]. Research has been done over the last few years on how to use these
platforms efficiently [3–5]. Scheduling is believed to be a good way to exploit
the computation capacity these systems offer [1, 6]. The traditional approach is
to prepare the schedule before the computation begins [7, 8]. This requires
information about the network topology and its parameters as well as the nodes'
computational abilities. With all of this information a static (offline)
scheduling algorithm can be used. Finding the optimal makespan – i.e. the total
time of the computation – is known to be NP-complete [9, 10]. Therefore much
research has been done and many heuristics have been proposed [11, 12].
    Compared to static scheduling, dynamic (online) scheduling allows us to
create the schedule as part of the computation process. This allows dynamic
algorithms to use feedback from the system and to modify the schedule according
to its current state. Dynamic algorithms are often used for scheduling multiple
applications at the same time [13–16], and the fairness of the generated
schedules is therefore important.
    The algorithm presented in this paper does not require global knowledge of
the network; it uses only information gathered from a node's neighbours. The
phase of creating the schedule is a fixed part of the computation cycle. The
algorithm is intended for scheduling several DAGs simultaneously and tries to
achieve a fair division of tasks among all computing nodes.



    The structure of this article, which is an enhanced version of [17], is as
follows. In Section 2 we describe the scheduling problem and give a brief
summary of related work. In Section 3 we describe the algorithm itself, and in
the following section we describe the testing environment and the results we
obtained by running several simulations. In Section 5 we discuss the results
from Section 4, show the pros and cons of the presented algorithm and discuss
future improvements.


2         Problem definition
The application model can be described as a directed acyclic graph AM = (V, E, B, C)
[12, 18], where:
 – V = {v_1, v_2, ..., v_v}, |V| = v is the set of tasks; a task v_i ∈ V represents a
   piece of code that has to be executed sequentially on the same machine,
 – E = {e_1, e_2, ..., e_e}, |E| = e is the set of edges; an edge e_j = (v_k, v_l)
   represents a data dependency, i.e. the task v_l cannot start its computation
   until the data from task v_k has been received; task v_k is called the parent
   of v_l, and v_l is called the child of v_k,
 – B = {b_1, b_2, ..., b_v}, |B| = v is the set of computation costs (e.g. number of
   instructions), where b_i ∈ B is the computation cost of task v_i,
 – C = {c_1, c_2, ..., c_e}, |C| = e is the set of data dependency costs, where
   c_j = c_{k,l} is the data dependency cost (e.g. amount of data) corresponding to
   the edge e_j = (v_k, v_l).
    A task which has no parents or children is called an entry or an exit task,
respectively. If there is more than one entry/exit task in the graph, a new
virtual entry/exit task can be added to it. Such a task has zero weight and is
connected by zero-weight edges to the real entry/exit tasks.
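    As an illustration, the following Python sketch adds such virtual tasks to a
DAG; the dictionary-based DAG representation is our own assumption, not part of
the paper.

# Illustrative sketch: add zero-weight virtual entry/exit tasks to a DAG.
def add_virtual_entry_exit(tasks, edges):
    """tasks: {task_id: computation cost b_i}, edges: {(parent, child): cost c_kl}."""
    has_parent = {child for (_, child) in edges}
    has_child = {parent for (parent, _) in edges}
    entries = [t for t in tasks if t not in has_parent]
    exits = [t for t in tasks if t not in has_child]
    if len(entries) > 1:                  # more than one real entry task
        tasks["v_entry"] = 0              # zero computation cost
        for t in entries:
            edges[("v_entry", t)] = 0     # zero-weight edge to each real entry
    if len(exits) > 1:                    # more than one real exit task
        tasks["v_exit"] = 0
        for t in exits:
            edges[(t, "v_exit")] = 0
    return tasks, edges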

Fig. 1. An application can be described using a DAG.

Fig. 2. The computation system can be represented as a general graph.

    Since the application can be described as a DAG, we use the terms
application, DAG application, or simply DAG as synonyms in this paper.
    The computation system consists of a set of computing units connected by a
high-speed communication network. It can be described using a general graph
CS = (P, Q, R, S), where:
 – P = {p_1, p_2, ..., p_p}, |P| = p is the set of computers,
 – Q = {q_1, q_2, ..., q_p}, |Q| = p is the set of speeds of the computers, where q_i
   is the speed of computer p_i,
 – R is a matrix describing the communication costs; the size of R is p × p,
 – S is a matrix describing the communication startup costs; it is usually
   one-dimensional, i.e. its size is p × 1.

2.1   Merging the application and computation models

Since the structure of the application and the characteristics of the
computation system are known, it is straightforward to obtain the computation
time of each application task on any computation node. This information is
stored in a matrix W of dimensions v × p, where the item at position [i, j]
contains the length of the computation of task i on computation unit j. The
value of W[i, j] is computed by the following equation

                          W[i, j] = b_i / q_j,                              (1)

where b_i is the computation cost of task v_i and q_j is the speed of computer
p_j (e.g. instructions per second).
    The total communication time for a message m (corresponding to the data
dependency of the edge (v_k, v_l)) that is sent from computer i to computer j
can be computed by the equation

                          c_m = S[i] + R[i][j] · c_{k,l}.                   (2)
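    A minimal Python sketch of equations (1) and (2); the list/dict layout of b,
q, R and S is an assumption made for illustration only.

# Sketch of equations (1) and (2); the data layout is assumed.
def build_W(b, q):
    """b[i]: computation cost of task v_i, q[j]: speed of computer p_j.
    Returns the v x p matrix W with W[i][j] = b_i / q_j (eq. 1)."""
    return [[b_i / q_j for q_j in q] for b_i in b]

def comm_time(S, R, i, j, c_kl):
    """Total communication time of a message carrying c_kl units of data
    from computer i to computer j (eq. 2)."""
    return S[i] + R[i][j] * c_kl

# Hypothetical example: 3 tasks, 2 computers
W = build_W(b=[10, 20, 30], q=[1.0, 2.0])
t = comm_time(S=[0.5, 0.5], R=[[0, 0.01], [0.01, 0]], i=0, j=1, c_kl=26)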

2.2   Our model description

The content of the matrix W depends on the properties of the computation
nodes. However, the computers differ only in certain ways: a "fast" computer is
k times faster than a "slow" one. As a consequence, the columns of W are
usually just multiples of a single column, and this information can be reduced
to one constant k_p for each processor p. Moreover, when the computation system
is allowed to change, the matrix W is not usable, since it does not reflect any
dynamic behaviour.
    The structure we decided to use can be described as follows. We select one
processor to serve as a reference, p_ref. The computation of one instruction on
this processor lasts one time unit. The speedup of processor p_i is then
defined as

                          SU_p(i) = q_i / q_ref,                            (3)

where q_i is the speed of processor p_i and q_ref is the speed of the reference
processor.
    The duration of the computation of a task v_j on processor p_i is then
computed "on the fly" by the equation

                          time_{j,i} = b_j / SU_p(i).                       (4)

Finally, the computation platform is described as a set of speedups and the
communication matrices, and the merging of the application and computation
models is done as part of the computation. Even the communication matrices may
be reduced in our model: they contain only information about a computation
node's neighbours and therefore differ from node to node. This is only an
implementation concern and does not affect the description model, as the
"neighbour" matrices are simply parts of the "global" communication matrices.
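    This reduced, per-node view can be sketched in Python as follows; the class
and names are our illustration of equations (3) and (4), not the authors'
implementation.

# Sketch of the speedup-based local model (eqs. 3 and 4); structure is assumed.
class LocalView:
    def __init__(self, q_ref, neighbour_speeds):
        """neighbour_speeds: {neighbour_id: q_i}; q_ref: speed of the
        reference processor (one instruction per time unit)."""
        self.speedup = {n: q_i / q_ref for n, q_i in neighbour_speeds.items()}

    def time_on(self, b_j, neighbour):
        """Duration of a task with cost b_j on the given neighbour (eq. 4)."""
        return b_j / self.speedup[neighbour]

# Hypothetical example: neighbour 'p2' is twice as fast as the reference
view = LocalView(q_ref=1.0, neighbour_speeds={"p2": 2.0, "p3": 0.5})
d = view.time_on(b_j=40, neighbour="p2")   # -> 20 time units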


2.3      Related work

Task scheduling (or task mapping) has been an active research topic for a long
time. Several static algorithms have been introduced, and dynamic algorithms
have been published as well. Most static scheduling algorithms are designed to
work with one DAG. List scheduling algorithms are very popular and often used.
HEFT [19] is a simple and effective algorithm used as a reference in this
article. HEFT creates a list of tasks sorted by their upward rank¹ and then
assigns each task to the processor that minimizes its execution time. Another
algorithm presented in [19] is Critical Path On a Processor (CPOP). This
algorithm is more complex and optimizes tasks on the critical path. By
modifying list algorithms and allowing tasks to be executed more than once,
task duplication algorithms were introduced. HCPFD [2] obtains a better
makespan than HEFT in most cases. Task duplication achieves surprisingly good
results when applied to a computation model containing multi-core computers [18].
    Scheduling of multiple DAGs is usually addressed by dynamic algorithms. In
[20] a static method for scheduling multiple DAGs was introduced, whose aim was
not only to optimize the makespan but also to achieve fair sharing of resources
among the competing DAGs. The idea of generating a new graph by appending whole
DAGs to the current one is used in [21]. Compared to [20], this algorithm is
dynamic, i.e. the graph is built when a new DAG arrives in the system.
    A truly dynamic algorithm is described in [14]. This algorithm divides the
nodes into two groups: the first one contains nodes used for computation, the
second one contains scheduling nodes. The scheduling nodes are independent, and
knowledge about the activity of other scheduling nodes is obtained through
usage statistics of the computing nodes. The quality of such scheduling
therefore depends on the quality of the statistics produced by the computation
nodes.
 ¹ See [19] for details.

    Unlike the previous one, the algorithm presented in [16] is based on one
central scheduling unit. The algorithm takes into account the time for
scheduling and dispatching and focuses on reliability costs. Another completely
distributed algorithm is presented in [15]. This algorithm divides nodes into
groups and uses two levels of scheduling: the high level decides which group to
use and the low level decides which node in the group to use.
    The algorithm described in [22] works in a slightly different way. A node
works with its neighbourhood, and the tasks of a parallel application are
distributed according to the load of the neighbours. If the load of a node is
too high, the algorithm allows the task to be migrated across the network. A
genetic programming technique is used to decide where to migrate the task.
    The problem of task scheduling is loosely coupled with network throughput.
The description of the network used in this paper is not very close to reality,
and problems connected to bottlenecks or varying delay may cause difficulties.
The behaviour of task scheduling applications running in networks with
different parameters is very well described in [23]. Following this article, we
assume there are no bottlenecks in the networks.


3    Proposed algorithm

In this section we present the algorithm Dynamic Local Multiple DAG (DLMDAG).
The algorithm itself, described in Algorithm 1, is a dynamic task scheduling al-
gorithm that supports both homogeneous and heterogeneous computation plat-
forms.
    The main idea of the algorithm is based on the assumption that the
communication takes only a very short time compared to the computation (at
least one order of magnitude less). The computation node, which is the creator
of the schedule for a certain DAG application, sends a message to all of its
neighbours asking how long the computation of a set of tasks would take if they
were computed by that neighbour. It then continues computing its current task,
and during this computation the replies to the question arrive. According to
the received data (timestamps), the node creates a schedule for the set of
tasks asked about in the previous step, then sends a message to its neighbours
with the information about who should compute which task, and generates another
question about the computation time of the next set of tasks.
    The algorithm description (Algorithm 1) uses the following terms. A task is
called "ready" when all of its data dependencies are fulfilled. Ready tasks are
stored in the tasksReady priority queue; the ordering criterion is the time
computed by computePriority. A task that is ready and has also been scheduled
is stored in the computableTasks queue. Each task's representation contains one
priority queue storing pairs of a finish time and the neighbour at which that
finish time would be achieved; this queue is ordered by the time.
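    These structures can be sketched, for example, with Python's heapq module;
this is only one possible reading of the description, not the authors'
implementation.

# Sketch of the data structures used by Algorithm 1 (our reading, using heapq).
import heapq

tasksReady = []        # priority queue of ready tasks, ordered by computePriority
computableTasks = []   # queue of tasks already scheduled to this node

def push_ready(task_id, priority):
    heapq.heappush(tasksReady, (priority, task_id))

class TaskInfo:
    def __init__(self, task_id):
        self.task_id = task_id
        self.orderingQueue = []            # (finish_time, neighbour), ordered by time

    def add_reply(self, finish_time, neighbour):
        heapq.heappush(self.orderingQueue, (finish_time, neighbour))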
    The computePriority method is used to create the timestamps for the tasks.
It is computed when a DAG application arrives at the computation node (p_k).

Algorithm 1 The core
 1: neighbours[], tasksReady {priority queue of tasks ready to compute}
 2: computableTasks {queue of scheduled tasks for computing}
 3: if not initialized then
 4:    neighbours = findNeighbours()
 5:    initialized = true
 6: end if
 7: if received DAG then
 8:    computePriority(DAG) {Priority of tasks by traversing DAG}
 9:    push(findReadyTasks(DAG), tasksReady)
10: end if
11: if received DATA then
12:     correctDependencies(DATA)
13:     push(findReadyTasks(DATA.DAG), tasksReady)
14: end if
15: if received TASK then
16:     push(TASK, computableTasks)
17: end if
18: if received REQUEST then
19:     for all task ∈ REQUEST do
20:        task.time = howLong(task) {time for task + time for tasks in computableTasks}
21:     end for
22:     send(REQUEST-REPLY, owner)
23: end if
24: if received REQUEST-REPLY then
25:     for all task ∈ REQUEST-REPLY do
26:        push((task.time, sender), task.orderingQueue)
27:     end for
28: end if
29: loop {The main loop of algorithm}
30:     schedule = createSchedule(tasksReady) {Creates schedule and removes tasks from queue}
31:     for all (task, proc) ∈ schedule do
32:        send(task, proc)
33:     end for
34:     for all n ∈ neighbours do
35:        send(REQUEST, n) {tasks from tasksReady}
36:     end for
37:     TASK = pop(computableTasks)
38:     compute(TASK)
39:     send(DATA, TASK.owner) {Nothing is sent if the task is local}
40: end loop



The priority of a task is generated according to the equation

              priority(v_j) = time_{j,k} + max_{∀i ∈ par(v_j)} priority(i),          (5)

where par(v_j) is the set of parents of node v_j and priority(v_0) = time_{0,k}.
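    A possible Python sketch of computePriority, assuming the tasks are visited
in topological order (parents before children); the DAG representation is an
assumption made for this example.

# Sketch of computePriority (eq. 5): a traversal of the DAG in topological order.
def compute_priority(tasks_topological, parents, time_on_local):
    """tasks_topological: task ids in topological order,
    parents: {task: [parent tasks]},
    time_on_local: {task: time_{j,k} of the task on the local node p_k}."""
    priority = {}
    for v in tasks_topological:                  # parents are visited first
        par = parents.get(v, [])
        base = max((priority[p] for p in par), default=0)
        priority[v] = time_on_local[v] + base    # entry task: priority = time_{0,k}
    return priority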
    The final scheduling is based on the priority queue task.orderingQueue. The
scheduling step, described in Algorithm 2, is close to HEFT [19]. One big
difference is that our algorithm uses a reduced list of tasks² and is forced to
use all neighbours³, even if that would be slower than computing at the local
site.
    The algorithm is called local because it uses only information about the
node's local neighbourhood. Each node creates its set of neighbours in the
initialization stage of the algorithm. Therefore there are no global matrices R
and S; rather, these matrices exist but are different for each computational
node.
 ² Only ready tasks are scheduled.
 ³ If there are not enough ready tasks, not all neighbours are used.

Algorithm 2 Scheduling phase
    schedule {empty set for pairs}
    num = min(|tasksReady|, |neighbours|)
    for i = 0; i < num; i++ do
       task = pop(tasksReady)
       proc = pop(task.orderingQueue)
       push((task, proc), schedule)
       removeFrom(proc, tasksReady) {Remove proc from the orderingQueues of the remaining ready tasks – once a neighbour is used it cannot be scheduled again}
    end for
    return schedule




The size of the matrix R for the computational node p_i is R_{p_i} = (s_i × s_i),
where s_i = |neighbours_i| is the number of neighbours of node p_i.
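    For illustration, the scheduling phase of Algorithm 2 could be rendered in
Python as below; the handling of removeFrom (dropping an already used neighbour
from the remaining tasks' ordering queues) follows our interpretation of the
pseudocode.

# Sketch of the scheduling phase (Algorithm 2), based on our reading of the paper.
import heapq

def create_schedule(tasksReady, task_info, num_neighbours):
    """tasksReady: heap of (priority, task_id); task_info: {task_id: TaskInfo}."""
    schedule = []
    used = set()                                # neighbours already used this round
    num = min(len(tasksReady), num_neighbours)
    for _ in range(num):
        _, task_id = heapq.heappop(tasksReady)
        queue = task_info[task_id].orderingQueue
        while queue and queue[0][1] in used:    # skip neighbours used this round
            heapq.heappop(queue)
        if not queue:
            continue
        _, proc = heapq.heappop(queue)
        used.add(proc)
        schedule.append((task_id, proc))
    return schedule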

3.1     Time complexity

The time complexity of the algorithm can be divided into two parts: the
computation part, connected to the sorting and scheduling phases of the
algorithm, and the communication part, connected to the need to exchange
messages for the scheduling phase. The DAG consists of v tasks and the
computation node has s neighbours. One question message is sent about every
task to all of the neighbours. A question contains information about one to s
tasks; the precise number depends on the structure of the DAG. A node which
receives a question message always sends a reply to it. When the node finishes
the scheduling phase of the algorithm, another message with the schedule is
sent to every neighbour involved in the schedule. This last message (the
schedule information) can be combined with the next question, so from 3 v/s to
3 v messages are sent in total.
    The computation part is based on the sorting operations of the algorithm.
Two types of priority queues are used, both based on a heap. The first one is
tasksReady: every task of a DAG is put into this queue once, so its time
complexity is O(v log v). The second priority queue (task.orderingQueue) stores
one piece of information for every neighbour; it is used for every task and for
every neighbour, so the time complexity contributed by this queue is
O(v s log s). The time complexity of the computational part of the algorithm is
therefore O(v log v + v s log s).
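    As a purely hypothetical illustration of these bounds, for v = 25 tasks and
s = 4 neighbours:

# Hypothetical numbers only: message and sorting bounds for v = 25, s = 4.
import math

v, s = 25, 4
msgs_lower = 3 * v / s                                  # ~19 messages (questions carry s tasks)
msgs_upper = 3 * v                                      # 75 messages (questions carry one task)
sorting_ops = v * math.log2(v) + v * s * math.log2(s)   # O(v log v + v s log s) term
print(msgs_lower, msgs_upper, sorting_ops)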


4      Performance and comparison

The algorithm was implemented in a simulation environment [24] and was executed
several times for different scenarios. Makespan, unfairness and the average
utilization of the computing nodes were measured.
    Makespan is the total computation time of the application; it is defined as

              makespan(DAG) = finishTime(v_l) − startTime(v_s),              (6)

where finishTime(v_l) is the time when the last task of the DAG was computed
and startTime(v_s) is the time when the first task of the DAG began its
computation.

    Since several DAG applications compete for the shared resources, the
execution time of each DAG is longer than when it is the only application in
the system. The slowdown of a DAG is the ratio of its execution time when it
shares the system with other DAGs to its execution time when it runs alone. It
is defined as

              Slowdown(DAG) = T_shared(DAG) / T_single(DAG),                 (7)

where T_shared is the execution time when more than one DAG was scheduled and
T_single is the execution time when only this DAG was scheduled. A schedule is
fair when all of the DAGs achieve almost the same slowdown [20] and unfair when
there are big differences in the slowdowns of the DAGs. The unfairness of a
schedule S for a set of n DAGs A = {DAG_1, DAG_2, ..., DAG_n} is defined as

              Unfairness(S) = Σ_{∀d∈A} |Slowdown(d) − AvgSlowdown|,          (8)

where the average slowdown is defined as

              AvgSlowdown = (1/n) Σ_{∀d∈A} Slowdown(d).                      (9)

    The utilization of a computation unit p_j in the schedule S is computed by
the equation

              Util_S(p_j) = Σ_{∀t∈tasks_S} makespan(t) / totalTime_j,        (10)

where tasks_S is the set of tasks computed on p_j in the schedule S and
totalTime_j is the total time of the simulation, i.e. the time when the last
task of all DAGs has finished.
    The average utilization over the whole set of processors P for the schedule
S is then defined as

              AvgUtil_S(P) = (1/p) Σ_{i=1}^{p} Util_S(p_i).                  (11)
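    A small Python sketch of the metrics defined by equations (7)–(11), assuming
the measured times are already available; all names are illustrative.

# Sketch of the evaluation metrics (eqs. 7-11); the input layout is assumed.
def slowdown(t_shared, t_single):
    return t_shared / t_single                          # eq. (7)

def unfairness(slowdowns):
    avg = sum(slowdowns) / len(slowdowns)               # eq. (9)
    return sum(abs(s - avg) for s in slowdowns)         # eq. (8)

def avg_utilization(busy_time_per_proc, total_time):
    """busy_time_per_proc: summed task execution time on each processor."""
    utils = [busy / total_time for busy in busy_time_per_proc]   # eq. (10)
    return sum(utils) / len(utils)                      # eq. (11)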


4.1   Testing environment

Three computation platforms containing 5, 10 and 20 computers were created. A
full mesh with different communication speeds on individual links was chosen as
the connection network – this created a network without bottlenecks and allowed
the algorithm to obtain a minimal makespan [23].
    Nodes were divided into groups of 5, and each group used a gigabit
connection with a delay of 2 ms. In the network with ten nodes the groups were
connected by 100 Mbit lines, and in the network with 20 nodes the groups were
connected as follows:
Fig. 3. Makespan for different numbers of DAGs running concurrently (all platforms).

Fig. 4. Unfairness for different numbers of concurrently running DAGs (all platforms).


 – 4 groups of 5 nodes, internally connected by gigabit links,
 – 2 groups connected to each other by 100 Mbit,
 – the 3rd and 4th groups connected to the others by 10 Mbit.

    Sets of 2, 4, 6, 8 and 10 applications were generated using the method
described in [19]. Each application contained 25 tasks with different
computation and data dependency costs. The schedule for each set was generated
by a simulation⁴ of the DLMDAG algorithm and by the static algorithms HEFT and
CPOP.
    We used two methods of combining several DAGs into one for the static
algorithms: the first is sequential execution of the DAGs one after another,
the second is to generate virtual start and end nodes and connect the DAGs to
these nodes with zero-weight edges. In the sequential execution test the DAGs
were ordered by the rule "the shorter the makespan of a DAG, the sooner it is
executed". In total there were 100 sets of 2 DAGs, 100 sets of 4 DAGs, etc.,
and the results obtained were averaged. For DLMDAG, all DAGs arrived in the
system at time 0 and at one node.


4.2    Results

Sequential execution of the DAGs with HEFT and CPOP achieved much longer
makespans and is therefore not included in the graphs. "HEFT par" and "CPOP
par" denote that the combination of DAGs was created using virtual start and
end tasks.
    The makespan achieved by DLMDAG is very close to that of HEFT and CPOP
(Fig. 3); the differences after averaging were just units of percent. The
special case was the architecture with five computers (Fig. 8), where DLMDAG
outperforms the others. When there were 10 or 20 computers in the system
(Fig. 6), DLMDAG achieved slightly worse results.
 ⁴ The simulation tool OMNeT++ [24] was used.
Fig. 5. Utilization for different numbers of DAGs running concurrently (all platforms).

Fig. 6. Makespan for different numbers of applications (20 PC platform).

Fig. 7. Unfairness for different numbers of applications (20 PC platform).

Fig. 8. Makespan for different numbers of applications (5 PC platform).

    Since HEFT and CPOP use the whole structure of the applications while
DLMDAG works only with ready tasks, DLMDAG may be unable to use a platform with
more devices as efficiently. These results therefore depend on the structure of
the scheduled applications: the more parallel an application is, the better the
results DLMDAG obtains.
    The unfairness (Figures 4 and 7) of DLMDAG is at a very low level and grows
slowly. The unfairness obtained by HEFT and CPOP is worse than that of DLMDAG.
    The utilization of nodes (Figure 5) corresponds to the makespan achieved by
the algorithms. As the number of DAGs in the system grows, the utilization of
the nodes increases for all algorithms. As mentioned earlier, DLMDAG achieves a
high level of parallelization and therefore the average utilization of all
nodes also increases.


5    Conclusion

The algorithm presented in this article is dynamic; it does not use any central
point for scheduling, nor does it require information about the whole network.

DLMDAG is based on local knowledge of the network – only the node's neighbours
are involved in the schedule – and the schedule is created using several
messages by which the computation times are gathered at the scheduling node.
Simulations of the algorithm were executed and the results obtained were
compared to traditional offline scheduling algorithms.
    DLMDAG is able to use the computation resources better than the compared
algorithms when there are more tasks in the system than computation units. As
the number of computation nodes increases, the results DLMDAG achieves become
worse than its competitors'.


Future work There are several possibilities for improving the proposed
algorithm. First, computation systems do change; the algorithm should be able
to modify the schedules to reflect network changes. Second, the current
algorithm is fixed to the scheduling node and its neighbours, which may cause
performance problems; the algorithm could be extended to move an application to
some other node with different neighbours.


References
 1. D. Feitelson, L. Rudolph, U. Schwiegelshohn, K. Sevcik, and P. Wong, “Theory and
    practice in parallel job scheduling,” in Job Scheduling Strategies for Parallel Pro-
    cessing (D. Feitelson and L. Rudolph, eds.), vol. 1291 of Lecture Notes in Computer
    Science, pp. 1–34, Springer Berlin / Heidelberg, 1997. 10.1007/3-540-63574-2_14.
 2. T. Hagras and J. Janecek, “A high performance, low complexity algorithm for
    compile-time task scheduling in heterogeneous systems,” Parallel Computing,
    vol. 31, no. 7, pp. 653 – 670, 2005. Heterogeneous Computing.
 3. H. Kikuchi, R. Kalia, A. Nakano, P. Vashishta, H. Iyetomi, S. Ogata, T. Kouno,
    F. Shimojo, K. Tsuruta, and S. Saini, “Collaborative simulation grid: Multiscale
    quantum-mechanical/classical atomistic simulations on distributed pc clusters in
    the us and japan,” in Supercomputing, ACM/IEEE 2002 Conference, p. 63
 4. D. Kehagias, M. Grivas, G. Pantziou, and M. Apostoli, “A wildly dynamic grid-
    like cluster utilizing idle time of common pc,” in Telecommunications in Modern
    Satellite, Cable and Broadcasting Services, 2007. TELSIKS 2007. 8th International
    Conference on, pp. 36 –39, sept. 2007.
 5. A. Wakatani, “Parallel vq compression using pnn algorithm for pc grid system,”
    Telecommunication Systems, vol. 37, pp. 127–135, 2008.
 6. M. Maheswaran, T. D. Braun, and H. J. Siegel, “Heterogeneous distributed
    computing,” in Encyclopedia of Electrical and Electronics Engineering,
    pp. 679–690, John Wiley, 1999.
 7. Y.-K. Kwok and I. Ahmad, “Benchmarking the task graph scheduling algorithms,”
    in Proc. IPPS/SPDP, pp. 531–537, 1998.
 8. J. Liou and M. Palis, “A comparison of general approaches to multiprocessor
    scheduling,” Parallel Processing Symposium, International, vol. 0, p. 152, 1997.
 9. J. Ullman, “Np-complete scheduling problems,” Journal of Computer and System
    Sciences, vol. 10, no. 3, pp. 384 – 393, 1975.
10. M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the
    Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1979.

11. T. D. Braun, H. J. Siegel, N. Beck, L. L. Bölöni, M. Maheswaran, A. I. Reuther,
    J. P. Robertson, M. D. Theys, B. Yao, D. Hensgen, and R. F. Freund, “A com-
    parison of eleven static heuristics for mapping a class of independent tasks onto
    heterogeneous distributed computing systems,” Journal of Parallel and Distributed
    Computing, vol. 61, no. 6, pp. 810 – 837, 2001.
12. H. Topcuoglu, S. Hariri, and M. Wu, “Performance-effective and low-complexity
    task scheduling for heterogeneous computing,” IEEE Transactions on Parallel and
    Distributed Systems, vol. 13, pp. 260–274, 2002.
13. M. Maheswaran, S. Ali, H. J. Siegel, D. Hensgen, and R. F. Freund, “Dynamic
    matching and scheduling of a class of independent tasks onto heterogeneous com-
    puting systems,” Heterogeneous Computing Workshop, vol. 0, p. 30, 1999.
14. M. Iverson and F. Ozguner, “Dynamic, competitive scheduling of multiple dags in
    a distributed heterogeneous environment,” Heterogeneous Computing Workshop,
    vol. 0, p. 70, 1998.
15. M. A. Iverson and F. Özgüner, “Hierarchical, competitive scheduling of multiple
    dags in a dynamic heterogeneous environment,” Distributed Systems Engineering,
    vol. 6, no. 3, p. 112, 1999.
16. X. Qin and H. Jiang, “Dynamic, reliability-driven scheduling of parallel real-time
    jobs in heterogeneous systems,” Parallel Processing, International Conference on,
    vol. 0, p. 0113, 2001.
17. O. Votava, P. Macejko, J. Kubr, and J. Janeček, “Dynamic Local Scheduling of
    Multiple DAGs in a Distributed Heterogeneous Systems,” in Proceedings of the
    2011 International Conference on Telecommunication Systems Management, (Dal-
    las, TX), pp. 171–178, American Telecommunications Systems Management Asso-
    ciation Inc., 2011.
18. J. Janeček, P. Macejko, and T. M. G. Hagras, “Task scheduling for clustered hetero-
    geneous systems,” in IASTED International Conference - Parallel and Distributed
    Computing and Networks (PDCN 2009) (M. Hamza, ed.), pp. 115–120, February
    2009. ISBN: 978-0-88986-783-3, ISBN (CD): 978-0-88986-784-0.
19. H. Topcuoglu, S. Hariri, and M.-Y. Wu, “Task scheduling algorithms for hetero-
    geneous processors,” in Heterogeneous Computing Workshop, 1999. (HCW ’99)
    Proceedings. Eighth, pp. 3 –14, 1999.
20. H. Zhao and R. Sakellariou, “Scheduling multiple dags onto heterogeneous sys-
    tems,” Parallel and Distributed Processing Symposium, International, 2006.
21. J. Barbosa and B. Moreira, “Dynamic job scheduling on heterogeneous clusters,”
    in Parallel and Distributed Computing, 2009. ISPDC ’09. Eighth International
    Symposium on, pp. 3–10, 2009.
22. R. de Mello, J. Andrade Filho, L. Senger, and L. Yang, “Grid job scheduling
    using route with genetic algorithm support,” Telecommunication Systems, vol. 38,
    pp. 147–160, 2008. 10.1007/s11235-008-9101-5.
23. Y. Kitatsuji, K. Yamazaki, H. Koide, M. Tsuru, and Y. Oie, “Influence of network
    characteristics on application performance in a grid environment,” Telecommuni-
    cation Systems, vol. 30, pp. 99–121, 2005. 10.1007/s11235-005-4320-5.
24. A. Varga et al., “The omnet++ discrete event simulation system,” in Proceedings
    of the European simulation multiconference (ESM’2001), vol. 9, p. 65, sn, 2001.