<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clairvoyance versus cooperation in scheduling of independent tasks?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomas Plachetka</string-name>
          <email>plachetka@fmph.uniba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Comenius University</institution>
          ,
          <addr-line>Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>45</lpage>
      <abstract>
        <p>Finding a schedule with minimal makespan for a finite number of independent tasks on a homogeneous network of processors is an NP-hard problem if the durations of all tasks are known. With only partial a-priori knowledge of tasks' time durations, it makes sense to look for online algorithms which guarantee short makespans in terms of small competitive ratios. Three algorithms are analysed and compared in this paper. The chunking algorithm assumes a-priori knowledge of the longest task's duration. The factoring algorithm assumes a-priori knowledge of the ratio between the longest and shortest task's durations. The work-stealing algorithm requires no a-priori knowledge but, unlike the previous two algorithms, requires a mechanism for redistribution of tasks which have already been assigned to processors. It turns out that work-stealing outperforms both the chunking and factoring algorithms when the number of tasks is sufficiently large. The analysis is not only asymptotic; it also provides an accurate (worst-case) prediction of makespans for all the aforementioned algorithms for an arbitrary number of processors and tasks.</p>
        <p>1 Introduction</p>
        <p>Finding an optimal schedule for a given number of independent tasks with known time durations on a given number of (equally fast) parallel processors is NP-hard [9]. An online version of this problem [14] assumes only a partial knowledge of tasks' durations. This online version appears in the same abstract form e.g. in parallel game-tree search, parallel ray tracing, scheduling of independent loops for multiprocessor machines etc. The challenge is to properly control the trade-off between the costs of work imbalance and communication.</p>
        <p>The problem can be stated as follows. There is a master holding W pieces of indivisible work (tasks). Each task can be processed by any of N workers (the master can actually be one of the workers, playing both roles simultaneously). The durations of tasks can be different (e.g. processing of one task can take a second, processing of another task can take a minute). The workers are reliable, equally fast and are willing to complete the entire work as soon as possible (to minimise the makespan, i.e. the parallel time). The master's interest also is the fastest completion of all W tasks. The master and the workers are isolated from one another but can communicate via a reliable asynchronous postal service. The delivery of a packet (message) takes some time called communication latency. More precisely, latency is the time from the moment when a worker becomes idle until the moment when it receives some work to do (or realises that there is no more work to do). Latency is a constant which does not depend on the size of the packet (e.g. on the number of tasks transferred in the packet). The master and the workers know this constant.1</p>
        <p>If the master knows the durations of all tasks, it can compute the shortest schedule offline (solving an NP-hard problem) and send a single packet to each worker. The packet contains all the tasks assigned in the shortest schedule to that worker.</p>
        <p>If the master knows nothing about the durations of all tasks, it can for example send one packet containing W/N arbitrarily chosen tasks to each worker. By doing so, the latency is added only once to the makespan (as in the previous offline algorithm). This algorithm is good in case of equal tasks' durations. But it can happen that all the tasks are very short except W/N tasks which are very long. In a lucky case, each worker will receive a packet which contains one very long task and many short tasks. In an unlucky case, N − 1 workers will receive packets which contain only short tasks, while one worker will receive a packet in which all the tasks are very long. Online analysis is interested in this worst case, where an "adversary" plays against the master and the workers. The intention of the adversary is to make the schedule as long as possible.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1 Note that in the asynchronous message passing model the latency can vary not only for different runs of the same program with the same input, but even for different messages in the same run. It is bounded for a given run, but the bound is not a-priori known. In order to be fair when comparing different algorithms, we assume that the latency is the same constant for all messages and all runs. This is a common assumption in publications relating to the problem in question. (In dedicated practical networks, the latency for a fixed message size is bounded and a-priori known, as it can be measured at run-time before the actual computation begins.)</p>
      <p>? This work was partially supported by the grant VEGA 1/0726/09.</p>
      <p>If the workers are not allowed to return tasks which they have been assigned and if no information on tasks' durations is available, then intuitively no master's algorithm is "better" than the algorithm from the previous paragraph. For example, consider an algorithm where the master sends only one task as a reply to a request from an idle worker. Then the unlucky case is that all the tasks are very short. The cost of communication can then be very high in comparison with the work itself and may exceed the total time of the previous algorithm.</p>
      <p>In a deterministic model [12, 13] it is assumed that the information on maximal and minimal tasks' durations is available a-priori, i.e. before the computation begins. The goal in the cited papers was to find parameter settings which minimise the maximal makespan of the chunking and factoring algorithms. The goal in this paper is to find parameter settings which minimise the competitive ratio of the algorithms.</p>
      <p>All algorithms used in the previous examples use the same scheme. In the beginning, all the workers are idle. Each worker process runs a loop in which it sends a job request to the master process, waits for a job (a set of tasks sent in a message) and then processes all the tasks in the job, one after another. The algorithms only differ in how the master decides the number of tasks (the job size) which it sends when replying to a job request. Without knowledge of a specific task's duration only the job sizes K are important, not the choice of tasks in a job. A generic master's algorithm is shown in Fig. 1.</p>
      <p>Chunking and factoring algorithms make use of partial information on the tasks' durations. They were first investigated in a probabilistic model, where tasks' durations are assumed to be realisations of a random variable with known mean and variance [11, 7, 6, 1, 10].</p>
      <p>The goal in the probabilistic model is to find parameter settings which minimise the expected makespan of the algorithms. Optimal settings have not been found; only rough approximations are known.</p>
      <p>The main contribution of this paper is a competitive analysis and quantitative comparison of three algorithms in the deterministic model. Optimal parameters for the chunking and factoring algorithms are derived. (As it turns out, the optimal parameter settings are almost the same in both scenarios, which is perhaps not surprising, as both scenarios focus on the worst-case input.) The third algorithm is deterministic work stealing. This algorithm requires no a-priori information and has no parameters; however, it requires a mechanism for redistribution of already assigned tasks. This means that the workers are allowed to communicate with one another and are allowed to return assigned, but yet unprocessed, tasks to the master. We prove that the deterministic work stealing algorithm performs better than the previous two algorithms under certain assumptions which usually hold in practical systems. This makes deterministic work stealing very attractive for applications where no a-priori knowledge of tasks' durations is available.</p>
      <p>master generic(int W, int N)
int K;
int work = W;
while (work &gt; 0)
{
wait for a job request from an idle worker;
compute the job size K;
assign a job consisting of K yet unprocessed tasks
to the idle worker;
work = work − K;
}
reply job requests with NO MORE WORK;</p>
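      <p>The generic scheme can be exercised numerically. The following is a minimal sketch (not from the paper; the function name and the event-list representation are my own) which simulates the generic master for a fixed job size and reports the resulting makespan, charging the latency once per assignment:</p>
      <p>```python
import heapq

def makespan(durations, n_workers, latency, job_size):
    """Simulate the generic assignment scheme of Fig. 1 with a
    constant job size: idle workers repeatedly receive the next
    `job_size` yet unprocessed tasks; each assignment costs
    `latency` before the worker starts computing."""
    idle = [0.0] * n_workers      # times at which workers become idle
    heapq.heapify(idle)
    pos, finish = 0, 0.0
    while pos != len(durations):
        t = heapq.heappop(idle)               # next idle worker
        job = durations[pos:pos + job_size]   # K yet unprocessed tasks
        pos += len(job)
        done = t + latency + sum(job)         # latency, then the job
        finish = max(finish, done)
        heapq.heappush(idle, done)
    return finish
```</p>
      <p>For example, 8 unit tasks on 2 workers with L = 0.5 and K = 2 require two assignments per worker and give a makespan of 5.0.</p>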
      <p>The deterministic work stealing algorithm appears in the context of diffusive load balancing, e.g. in [3-5] (optimal load balancing scheme). In the context of diffusive load balancing the data locality is the main concern; the goal is to exploit the structure of the network in order to minimise the cost of a single work redistribution step. In this paper the network structure is ignored and the number of work redistribution steps is minimised.</p>
      <p>Fig. 1. Generic assignment algorithm (without work redistribution).</p>
      <p>Note that when a job request arrives, the master process must decide immediately how many yet unprocessed tasks it assigns in that job (as it might take long until another job request arrives). This decision is based on partial a-priori knowledge of tasks' durations. Only deterministic online algorithms will be considered, i.e. algorithms which do not internally use any source of randomness. Two of the three algorithms studied in this paper, chunking and factoring, follow the scheme from Fig. 1. The third algorithm, deterministic work stealing, uses a more complex scheme.</p>
      <p>The paper is organised as follows. Section 2 introduces the notation and definitions. Sections 3, 4 and 5 present online analysis of the chunking, factoring and work stealing algorithms in the deterministic model (related results for chunking and factoring algorithms in a probabilistic model are briefly summarised in subsections). Performance of these algorithms is compared in Section 6. Section 7 concludes the paper.</p>
      <p>The following notation is used throughout this paper:
M: makespan (total parallel time)
N: number of worker processes; N ≥ 1
W: total number of tasks (total work); W ≥ 1
L: latency (duration of assignment of one job)
ti: tasks' durations (i = 1…W); ti &gt; 0.0
Tmin: minimal task's duration, Tmin = min(i=1…W) ti
Tmax: maximal task's duration, Tmax = max(i=1…W) ti
T: ratio Tmax/Tmin; T ≥ 1.0</p>
    </sec>
    <sec id="sec-2">
      <title>Notation and definitions</title>
    </sec>
    <sec id="sec-3">
      <title>Chunking</title>
      <p>In order to quantify the performance of algorithms, we use the competitive ratio: the ratio between the makespan MS(A) of algorithm S which uses a-priori information A and the best offline makespan Mbest offline, over all sequences t1 … tW of input tasks' durations which conform to the a-priori information A.</p>
      <p>Although CRS(A) and CRR(A) in Definition 2 are limit values for W → ∞, we prefer to keep the comparison parameterised with respect to W, N, L instead of using those limit values. This allows for a finer comparison of algorithms.</p>
      <p>The chunking algorithm [11, 8] always assigns jobs of size K to idling worker processes, where K remains constant (the last assignment may be an exception, where a smaller job is assigned); see Fig. 1. Once a job has been assigned to a worker, this decision cannot be changed: the worker must then compute all the tasks assigned in that job.</p>
      <p>We will prove a general theorem which states that a-priori knowledge of Tmax does not help much: the parallel time of any algorithm (including chunking) is in the worst case comparable with the sequential time for a sufficiently large number of tasks.</p>
      <p>Theorem 1. For all W, N, L, Tmax such that W &gt; N³ + N²(N − 1)Tmax/L, the competitive ratio of an arbitrary assignment algorithm with no work redistribution and with a-priori knowledge of Tmax is at least N (i.e. Ω(N) for W = Ω(N³)).</p>
      <p>Proof. Let W &gt; N³ + N²(N − 1)Tmax/L. Let K denote the maximal job size assigned by an algorithm S.</p>
      <sec id="sec-3-1">
        <title>There are two cases:</title>
        <p>case 1, K ≥ WL/(N²L + N(N − 1)Tmax);
case 2, K &lt; WL/(N²L + N(N − 1)Tmax).</p>
        <p>In case 1, there is Kint such that N ≤ Kint ≤ K and Kint/N is an integer. Kint tasks of the maximal job will have duration Tmax, while all the other tasks will have duration ε → 0. The best offline algorithm computes the Kint long tasks in parallel, whereas the algorithm S computes them sequentially. This implies (as S makes at least as many assignments as the best offline algorithm) CRS(Tmax)(W, N, L) ≥ KintTmax / (KintTmax/N) = N.</p>
        <p>In case 2, consider the latency overhead of the algorithm S, which is at least WL/(NK). Assume that one task has duration Tmax and is assigned in the last job; all the remaining tasks are of an equal duration ε → 0. Hence the makespan of the algorithm S is at least WL/(NK) + Tmax and CRS(Tmax)(W, N, L) ≥ (WL/(NK) + Tmax)/(L + Tmax) ≥ N.</p>
      </sec>
      <sec id="sec-3-2">
        <title>This completes the proof.</title>
        <p>Consider the case where all K tasks of some job are of duration ε → 0 and all the other tasks are of duration Tmax. The competitive ratio of the chunking algorithm using the job size K is then CRchunking(Tmax)(W, N, L) = (KTmax + WL/(NK))/(L + Tmax).</p>
        <sec id="sec-3-2-2">
          <title>-</title>
          <p>The competitive ratio is minimised by setting the first derivative of CRchunking with respect to K to zero and solving for K. This yields K = √(WL(L + Tmax)/(NTmax)).</p>
          <p>Recall that the probabilistic model assumes that tasks' durations are realisations of a random variable with (known) mean μ and (known) standard deviation σ.</p>
          <p>The fixed-size chunking strategy in the probabilistic model was analysed by Kruskal and Weiss [11]. They derived the following estimation of the expected makespan E[M] for the chunk size K:</p>
          <p>E[M] ≈ (W/N)μ + WL/(NK) + σ√(2K ln N)   (1)</p>
          <p>This formula has a nice intuitive interpretation.</p>
          <p>The first term is the time of executing W tasks on N processors on a system with no overhead. The second term describes the latency overhead. The third term describes the load imbalance due to the variation in tasks' durations. Unfortunately, the estimation in Equation 1 only holds if W and K are large and K ≫ log N. If these assumptions hold then also the optimal chunk size Kopt can be estimated:</p>
          <p>K̂opt = (√2 WL/(σN √(ln N)))^(2/3)</p>
          <p>3.1 Chunking in a probabilistic model</p>
          <p>master factoring(int W, int N, float F)
int K;
int counter;
int round = 0;
int work = W;
while (work &gt; 0)
{
round = round + 1;
K = max(work/F, 1);
counter = 0;
while ((counter &lt; N) and (work &gt; 0))
{
counter = counter + 1;
wait for a job request from an idle worker;
assign a job consisting of K yet unprocessed tasks
to the idle worker;
work = work − K;
}
}
reply job requests with NO MORE WORK;</p>
          <p>If the assumptions above do not hold, [11] gives the following estimates for the expected makespan E[M]:</p>
          <p>E[M] ≈ (W/N)μ + WL/(NK) + σ√(2K ln(Nσ√K/μ)) for K ≪ W/N and small √K/N; and E[M] ≈ (W/N)μ + WL/(NK) + Nσ²/μ for K ≪ W/N and large √K/N. However, a tractable analytical expression for the optimal chunk size K could not be derived.</p>
          <p>4 Factoring</p>
          <p>We will derive the optimal factor F, assuming that the ratio T = Tmax/Tmin is known a-priori. (A similar analysis which assumes a-priori knowledge of Tmax and Tmin can be found in [8] and [12].) Denote Ki the job size which is assigned during round i of the factoring algorithm and let wi denote the number of still unassigned tasks at the beginning of round i. In order to be competitive, factoring guarantees that the longest sequential computation of a job of size Ki will not take longer than the shortest parallel computation of the still unassigned wi − Ki tasks on the remaining N − 1 workers: max seq time(Ki) ≤ min par time(wi − Ki, N − 1). In order to minimise the assignment overhead, Ki must be as large as possible. The largest Ki which satisfies the inequality above (and thus guarantees a maximal imbalance of at most 1 task) is Ki = wi/(1 + T(N − 1)).</p>
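          <p>The geometric decrease of the job sizes can be checked with a short sketch (mine, not the paper's; the function name is illustrative) that iterates Ki = wi/(1 + T(N − 1)) and counts the rounds until at most N tasks remain:</p>
          <p>```python
def factoring_rounds(W, N, T):
    """Count the rounds of the factoring algorithm with job size
    K_i = w_i / (1 + T*(N - 1)); each round assigns N equal jobs.
    The remaining tasks (at most N) form a final round of
    single-task jobs."""
    w, rounds = W, 0
    while w > N:
        K = max(int(w / (1 + T * (N - 1))), 1)
        w -= N * K
        rounds += 1
    return rounds + 1
```</p>
          <p>With W = 1000, N = 4 and T = 2 this gives 8 rounds, in line with the logarithmic dependence of the number of rounds on W.</p>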
          <p>Factoring [7, 6, 1, 8] works in rounds, see Fig. 2 (it could also be expressed in the form of the generic algorithm from Fig. 1 by rewriting the procedure compute the job size K, but doing so would make the algorithm more difficult to read). In each round, it assigns N jobs of equal size. The job size is decreased after each round, whereby the job size in a round is a factor of the work remaining (the number of yet unassigned tasks) at the beginning of the round. The factor F remains constant over all rounds. During the last round, single-task jobs are assigned. Once a job has been assigned to a worker, the worker must compute all tasks assigned in that job.</p>
          <p>Note that it is only the assignment overhead which determines the competitive ratio of factoring. For example, the trick with setting the durations of all the tasks of K1 to Tmax and computing them in parallel by the best offline algorithm does not work. The reason is that this does not increase the makespan of factoring at all: K1Tmax = WTmax/(T(N − 1)) = WTmin/(N − 1).</p>
          <p>Theorem 2. For all W, N, L, T the competitive ratio of the factoring algorithm with a-priori knowledge of T using factor F = 1 + T(N − 1) is 1 + O((ln W)/W) and approaches 1 if W → ∞.</p>
          <p>Note that this iteration scheme only requires the knowledge of the coefficient of variation cov of the tasks' probability distribution (cov = σ/μ). There are two extreme cases: 1. If cov = 0 (no variance) then this strategy assigns all jobs in a single round; 2. If cov → ∞ (unbounded variance or negligible tasks' durations) then this scheme assigns jobs of size 1. (This scheme is not strictly factoring in the sense of Section 4 because the factor is not the same constant in subsequent rounds.)</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Work stealing</title>
      <p>So far we assumed that the master process cannot take back its decisions, i.e. once a job has been assigned to a worker, then the job must be processed by that worker. In the work stealing algorithm, the master process can reclaim already assigned but yet unprocessed tasks from the workers. The work stealing algorithm requires no a-priori information (not even the knowledge of the latency). It initially assigns all the tasks in jobs of size W/N to idling worker processes. When a worker becomes idle again, the master reclaims all yet unprocessed tasks from all the worker processes and redistributes them equally back again to all worker processes. The periods between the redistributions are called rounds. Each round adds a penalty L' to the makespan.</p>
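      <p>The redistribution step can be sketched as follows (a minimal illustration of my own, assuming a hypothetical list-of-queues representation of the workers' yet unprocessed tasks):</p>
      <p>```python
def redistribute(queues):
    """One master step of the work stealing scheme: reclaim all yet
    unprocessed tasks from all workers and deal them back out in
    (nearly) equal shares."""
    pool = [task for queue in queues for task in queue]
    n = len(queues)
    share = -(-len(pool) // n)    # ceiling division
    return [pool[i * share:(i + 1) * share] for i in range(n)]
```</p>
      <p>For example, redistribute([[1, 2, 3], [], [4, 5]]) yields [[1, 2], [3, 4], [5]].</p>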
      <p>An implementation of the work stealing algorithm can use two threads of control in each worker process: a "listening thread" which reacts to work redistribution</p>
      <p>Proof. Let r denote the last round at the beginning of which the number of still unassigned tasks wr is at most N (the size of the jobs assigned in round r is Kr = 1 and the number of jobs assigned in round r is at most N). It can be observed that the number of yet unassigned tasks wi at the beginning of round i is equal to wi = W(1 − N/(1 + T(N − 1)))^(i−1).</p>
      <p>Solving wr ≤ N for maximal r yields the number of rounds performed by the factoring algorithm: rfactoring = ln(W/N) / ln((1 + T(N − 1))/((T − 1)(N − 1))).   (2)</p>
      <p>messages by sending all yet unprocessed tasks to the master process; and a "working thread" which computes the tasks and notifies other processes when it runs out of work. These two threads share a queue of tasks. The queue is protected by a semaphore in order to exclude its simultaneous access by both threads. The working thread repeats a loop in which it locks the queue, pops one task, unlocks the queue and starts processing the task. After finishing the task, this procedure repeats until the working thread finds the queue empty. Then it notifies the other processes and waits until the listening thread inserts tasks of the new round into the queue and resumes the computation (or terminates the whole process). The yet unprocessed tasks in a process are the tasks in the queue. A clever implementation of the algorithm amortises the latency by allowing a worker which reacts to a work redistribution message to continue processing the tasks in its queue during the work redistribution.</p>
      <p>(In the probabilistic model, the analogous iteration scheme is wi+1 = wi − N K̂i,opt; xi+1 = 2 + (N²/wi)(σ/μ)².)</p>
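      <p>The two-thread worker described above can be sketched in Python (my own minimal version; the class and method names are illustrative, and a lock plus a counting semaphore stand in for the paper's semaphore-protected queue):</p>
      <p>```python
import threading
from collections import deque

class Worker:
    """The working thread pops tasks from a shared queue under a
    lock; the listening role (played here by the caller) inserts
    the tasks of each round and finally signals termination."""
    def __init__(self):
        self.queue = deque()
        self.lock = threading.Lock()
        self.has_work = threading.Semaphore(0)
        self.stopped = False
        self.processed = 0

    def insert_round(self, tasks):      # listening-thread side
        with self.lock:
            self.queue.extend(tasks)
        self.has_work.release()

    def stop(self):                     # NO MORE WORK
        self.stopped = True
        self.has_work.release()

    def run(self):                      # working thread
        while True:
            self.has_work.acquire()     # wait for a new round
            while True:
                with self.lock:         # lock, pop one task, unlock
                    if not self.queue:
                        break
                    task = self.queue.popleft()
                task()                  # process outside the lock
                self.processed += 1
            if self.stopped:
                return
```</p>
      <p>A full listening thread would call insert_round on a redistribution message and would also hand the yet unprocessed tasks back to the master; that part is omitted here.</p>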
      <p>As the task distribution in work stealing uses a more complex communication pattern (broadcasting and gathering) than the previous algorithms (point-to-point round-trip), we will denote this latency L', whereby L' ≥ L. However, L' differs from L only by a constant factor if N is a constant. This factor depends on the physical mechanism which is used for communication among the processes. Note that L' ≈ L e.g. in a bus network or a network with a complete interconnection graph. Similarly as for factoring, there is no work imbalance at the end of the algorithm, therefore the competitive ratio of work stealing only depends on the number of rounds.</p>
      <p>Theorem 3. For all W, N, L, L' the competitive ratio of the work stealing algorithm with no a-priori knowledge is 1 + O((ln W)/W) and approaches 1 if W → ∞.</p>
      <p>Proof. The number of rounds of work stealing in the worst case can be determined as follows. Assume that one of the worker processes finishes its first job of size W/N, while no other worker process has finished its first task. After the redistribution a second round begins and the same situation happens: one worker process finishes its job, while none of the other worker processes has finished its first task. Etc. The total number of yet unprocessed tasks (in the whole system) is at most wi = W((N − 1)/N)^i at the beginning of round i. At most N tasks are distributed at the beginning of the last round r. Solving wr ≤ N for maximal r yields r = ln(W/N) / ln(N/(N − 1)),   (4) which is accurate when W is large in comparison with N (W ≈ N³ or larger). □</p>
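      <p>The recurrence in this proof is easy to check numerically; the sketch below (mine, with illustrative names) counts the worst-case rounds and compares them with the closed-form bound ln(W/N)/ln(N/(N − 1)):</p>
      <p>```python
import math

def worst_case_rounds(W, N):
    """Worst case of the proof: in each round exactly one worker
    finishes its 1/N share while the others finish nothing, so the
    remaining work shrinks by the factor (N - 1)/N until fewer
    than N tasks remain."""
    w, rounds = float(W), 0
    while w >= N:
        w -= w / N
        rounds += 1
    return rounds

def round_bound(W, N):
    return math.log(W / N) / math.log(N / (N - 1))
```</p>
      <p>For W = 10000 and N = 4 the simulation gives 28 rounds, which is the bound (about 27.2) rounded up.</p>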
      <p>We proved a common upper bound for the competitive ratios of the work stealing and factoring algorithms for W → ∞. Both these algorithms guarantee a perfect balance, therefore we can focus on their number of rounds, which determines the cost of assignment. In order to keep things simple, we will assume L = L' in the sequel. If we directly compare the number of rounds in work stealing (Eq. 4) and factoring (Eq. 2), then work stealing does not perform worse than factoring when T ≥ N + 1, because then rworkstealing/rfactoring = ln((1 + T(N − 1))/((T − 1)(N − 1))) / ln(N/(N − 1)) ≤ 1.   (5)</p>
      <p>However, the comparison above is not fair, because the work stealing algorithm with a-priori knowledge of T = Tmax/Tmin is actually more efficient than in the proof of Theorem 3 (although it does not make use of the knowledge of T). For a given T, let us reconsider the scenario in which always one worker process finishes all its tasks from the first round, while all the other worker processes do as little work as possible.</p>
      <p>While the worker computes its first job of W/N tasks, all the other workers must have computed at least W/(NT) tasks each. So every other worker has at most W(T − 1)/(NT) yet unprocessed tasks; in sum, there are at most W(T − 1)(N − 1)/(NT) yet unprocessed tasks in the whole system (which is less than the W(N − 1)/N tasks in the proof of Theorem 3). In the second round, each worker is assigned W(T − 1)(N − 1)/(N²T) tasks. When a worker finishes its job from the second round, then all the other workers have at most W((T − 1)/(NT))² yet unprocessed tasks each; in sum, there are at most W((T − 1)(N − 1)/(NT))² yet unprocessed tasks in the whole system. Etc. Generally, there are wi = W((T − 1)(N − 1)/(NT))^i yet unprocessed tasks in the whole system at the beginning of round i. At most N tasks are distributed at the beginning of the last round r. Solving wr &lt; N for maximal r yields rworkstealing = ln(W/N) / ln(NT/((T − 1)(N − 1))).   (6)</p>
      <p>Fair comparison of the number of work stealing rounds (Eq. 6) with the number of factoring rounds (Eq. 2) yields rworkstealing/rfactoring = [ln(W/N) / ln(NT/((T − 1)(N − 1)))] / [ln(W/N) / ln((1 + T(N − 1))/((T − 1)(N − 1)))] = ln((1 + T(N − 1))/((T − 1)(N − 1))) / ln(NT/((T − 1)(N − 1))) &lt; 1.</p>
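      <p>The round counts behind this comparison can be evaluated directly (a sketch under the same notation; the function names are mine):</p>
      <p>```python
import math

def r_factoring(W, N, T):
    # Eq. 2: rounds of factoring with F = 1 + T*(N - 1)
    return math.log(W / N) / math.log((1 + T * (N - 1)) / ((T - 1) * (N - 1)))

def r_work_stealing(W, N, T):
    # Eq. 6: worst-case rounds of work stealing for a given T
    return math.log(W / N) / math.log((N * T) / ((T - 1) * (N - 1)))
```</p>
      <p>For any T greater than 1 the logarithm in the denominator of r_work_stealing is the larger one, so its round count is the smaller one, matching the inequality above.</p>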
      <p>This means the work stealing algorithm performs better than the factoring algorithm for all W, N, L, L', T if L' = L. More precisely, the work stealing algorithm performs better for L, L' such that ln((1 + T(N − 1))/((T − 1)(N − 1))) / ln(NT/((T − 1)(N − 1))) &lt; L/L', because then L' rworkstealing ≤ L rfactoring. We stress that an a-priori knowledge of T is rarely available in practice and must therefore be estimated. With an inaccurate estimation of T, the factoring algorithm performs worse than in our analysis.</p>
      <p>The work stealing algorithm is a clear winner. It has no parameters and requires no tuning. Moreover, it can be used (after some modifications) in applications where processes may fail or where the number of tasks may grow at run-time.</p>
      <p>7 Conclusions</p>
      <p>We analysed the online performance of the chunking, factoring and work stealing assignment algorithms in a deterministic model. The chunking algorithm requires an a-priori knowledge of the maximal task's duration and achieves competitive ratio N (which does not depend on W) for W = Ω(N³), where N denotes the number of processes and W denotes the number of tasks. The performance of the chunking algorithm is thus very poor, at least from the point of view of competitive analysis. The factoring algorithm requires an a-priori knowledge of the factor T = Tmax/Tmin. Its competitive ratio is bounded from above by 1 + O(ln(W)/W) and approaches 1 when W → ∞, which is very desirable. The same holds for the deterministic work stealing algorithm, which performs better than the factoring algorithm and requires no a-priori information.</p>
      <p>The last result is valid under two assumptions: 1. the underlying communication mechanism provides an efficient implementation of broadcasting and gathering, which we assume to be as fast as a round-trip point-to-point communication; 2. the communication latency is a constant which does not depend on the message size. The first assumption holds e.g. for bus and fully-switched networks; the second assumption holds for practically all contemporary networks, if the message size does not exceed a certain threshold.</p>
      <p>5. R. Elsasser, B. Monien, A. Frommer, and R. Preis: Optimal diffusion schemes and load balancing on product graphs. Parallel Processing Letters, 14 (1), 2004, 61–73.
6. L. E. Flynn and S. Flynn-Hummel: Scheduling variable-length parallel subtasks. Technical Report RC 15492, IBM Research, 1990.
7. S. Flynn-Hummel, E. Schonberg, and L. E. Flynn: Factoring: A practical and robust method for scheduling parallel loops. In: Proc. of Supercomputing '91, IEEE Computer Society / ACM, 1991, 610–619.
8. S. Fujita: A semi-dynamic multiprocessor scheduling algorithm with an asymptotically optimal competitive ratio. In: Proc. of the 8th International Euro-Par Conference on Parallel Processing, Springer-Verlag, 2002, 240–247.
9. M. R. Garey and D. S. Johnson: Computers and intractability. W. H. Freeman and Company, 1979.
10. T. Hagerup: Allocating independent tasks to parallel processors: An experimental study. Journal of Parallel and Distributed Computing, 47 (2), 1997, 185–197.
11. C. P. Kruskal and A. Weiss: Allocating independent subtasks on parallel processors. IEEE Transactions on Software Engineering, 11 (10), 1985, 1001–1016.
12. T. Plachetka: Perfect load balancing for demand-driven parallel ray tracing. In: B. Monien and R. Feldman (eds), Proc. of Euro-Par 2002, volume 2400 of Lecture Notes in Computer Science, Springer-Verlag, 2002, 410–419.
13. T. Plachetka: Tuning of algorithms for independent task placement in the context of demand-driven parallel ray tracing. In: D. Bartz, B. Raffin, and H.W. Shen (eds), Proc. of the Eurographics/ACM SIGGRAPH Symposium on Parallel Graphics and Visualization (EGPGV), Eurographics Proceedings Series, Eurographics Association, 2004, 101–109.
14. K. Pruhs, J. Sgall, and E. Torng: Online scheduling. In: J.Y.-T. Leung (ed.), Handbook of Scheduling, chapter 15, CRC Press, 2004, 15.1–15.41.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>I.</given-names>
            <surname>Banicescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Flynn-Hummel</surname>
          </string-name>
          :
          <article-title>Balancing processor loads and exploiting data locality in irregular computations</article-title>
          .
          <source>Technical Report RC 19934, IBM Research</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Borodin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>El-Yaniv</surname>
          </string-name>
          :
          <article-title>Online computation and competitive analysis</article-title>
          . Cambridge University Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Diekmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frommer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Monien</surname>
          </string-name>
          :
          <article-title>Efficient schemes for nearest neighbor load balancing</article-title>
          .
          <source>Parallel Computing</source>
          ,
          <volume>25</volume>
          (
          <issue>7</issue>
          ),
          <year>1999</year>
          ,
          <fpage>789</fpage>
          –
          <lpage>812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Elsasser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Monien</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Preis</surname>
          </string-name>
          : <article-title>Optimal and alternating-direction load-balancing schemes</article-title>
          .
          <source>In: Proc. of Euro-Par</source>
          , volume
          <volume>1685</volume>
          of Lecture Notes in Computer Science, Springer-Verlag,
          <year>1999</year>
          ,
          <fpage>280</fpage>
          –
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>