<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clairvoyance versus cooperation in scheduling of independent tasks?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomas Plachetka</string-name>
          <email>plachetka@fmph.uniba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Comenius University</institution>
          ,
          <addr-line>Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>45</lpage>
      <abstract>
        <p>Finding a schedule with minimal makespan for a finite number of independent tasks on a homogeneous network of processors is an NP-hard problem if the durations of all tasks are known. With only partial a-priori knowledge of tasks' time durations, it makes sense to look for online algorithms which guarantee short makespans in terms of small competitive ratios. Three algorithms are analysed and compared in this paper. The chunking algorithm assumes a-priori knowledge of the longest task's duration. The factoring algorithm assumes a-priori knowledge of the ratio between the longest and shortest task's durations. The work-stealing algorithm requires no a-priori knowledge but, unlike the previous two algorithms, requires a mechanism for redistribution of tasks which have already been assigned to processors. It turns out that work-stealing outperforms both the chunking and factoring algorithms when the number of tasks is sufficiently large. The analysis is not only asymptotic; it also provides an accurate (worst-case) prediction of makespans for all the aforementioned algorithms for an arbitrary number of processors and tasks.</p>
        <p>1 Introduction</p>
        <p>Finding an optimal schedule for a given number of independent tasks with known time durations on a given number of (equally fast) parallel processors is NP-hard [9]. An online version of this problem [14] assumes only a partial knowledge of tasks' durations. This online version appears in the same abstract form e.g. in parallel game-tree search, parallel ray tracing, scheduling of independent loops for multiprocessor machines etc. The challenge is to properly control the trade-off between the costs of work imbalance and communication.</p>
        <p>The problem can be stated as follows. There is a master holding W pieces of indivisible work (tasks). Each task can be processed by any of N workers (the master can actually be one of the workers, playing both roles simultaneously). The durations of tasks can be different (e.g. processing of one task can take a second, processing of another task can take a minute). The workers are reliable, equally fast and are willing to complete the entire work as soon as possible (to minimise the makespan, i.e. the parallel time). The master's interest also is the fastest completion of all W tasks. The master and the workers are isolated from one another but can communicate via a reliable asynchronous postal service. The delivery of a packet (message) takes some time called communication latency. More precisely, latency is the time from the moment when a worker becomes idle until the moment when it receives some work to do (or realises that there is no more work to do). Latency is a constant which does not depend on the size of the packet (e.g. on the number of tasks transferred in the packet). The master and the workers know this constant.1</p>
        <p>If the master knows the durations of all tasks, it can compute the shortest schedule offline (solving an NP-hard problem) and send a single packet to each worker. The packet contains all the tasks assigned in the shortest schedule to that worker.</p>
        <p>If the master knows nothing about the durations of all tasks, it can for example send one packet containing W/N arbitrarily chosen tasks to each worker. By doing so, the latency is added only once to the makespan (as in the previous offline algorithm). This algorithm is good in case of equal tasks' durations. But it can happen that all the tasks are very short except W/N tasks which are very long. In a lucky case, each worker will receive a packet which contains one very long task and many short tasks. In an unlucky case, N − 1 workers will receive packets which contain only short tasks, while one worker will receive a packet in which all the tasks are very long. Online analysis is interested in this worst case, where an "adversary" plays against the master and the workers. The intention of the adversary is to make the schedule as long as possible.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1 Note that in the asynchronous message passing model the latency can vary not only for different runs of the same program with the same input, but even for different messages in the same run. It is bounded for a given run, but the bound is not a-priori known. In order to be fair when comparing different algorithms, we assume that the latency is the same constant for all messages and all runs. This is a common assumption in publications relating to the problem in question. (In dedicated practical networks, the latency for a fixed message size is bounded and a-priori known, as it can be measured at run-time before the actual computation begins.)</p>
      <p>? This work was partially supported by the grant VEGA 1/0726/09.</p>
      <p>If the workers are not allowed to return tasks which they have been assigned and if no information on tasks' durations is available, then intuitively no master's algorithm is "better" than the algorithm from the previous paragraph. For example, consider an algorithm where the master sends only one task as a reply to a request from an idle worker. Then the unlucky case is that all the tasks are very short. The cost of communication can then be very high in comparison with the work itself and may exceed the total time of the previous algorithm.</p>
      <p>In a deterministic model [12, 13] it is assumed that the information on maximal and minimal tasks' durations is available a-priori, i.e. before the computation begins. The goal in the cited papers was to find parameter settings which minimise the maximal makespan of the chunking and factoring algorithms. The goal in this paper is to find parameter settings which minimise the competitive ratio of the algorithms.</p>
      <p>All algorithms used in the previous examples use the same scheme. In the beginning, all the workers are idle. Each worker process runs a loop in which it sends a job request to the master process, waits for a job (a set of tasks sent in a message) and then processes all the tasks in the job, one after another. The algorithms only differ in how the master decides the number of tasks (the job size) which it sends when replying to a job request. Without knowledge of a specific task's duration only the job sizes K are important, not the choice of tasks in a job. A generic master's algorithm is shown in Fig. 1.</p>
      <p>Chunking and factoring algorithms make use of partial information on the tasks' durations. They were first investigated in a probabilistic model, where tasks' durations are assumed to be realisations of a random variable with known mean and variance [11, 7, 6, 1, 10].</p>
      <p>The goal in the probabilistic model is to find parameter settings which minimise the expected makespan of the algorithms. Optimal settings have not been found; only rough approximations are known.</p>
      <p>The main contribution of this paper is a competitive analysis and quantitative comparison of three algorithms in the deterministic model. Optimal parameters for the chunking and factoring algorithms are derived. (As it turns out, the optimal parameter settings are almost the same in both scenarios, which is perhaps not surprising, as both scenarios focus on the worst-case input.) The third algorithm is deterministic work stealing. This algorithm requires no a-priori information and has no parameters; however, it requires a mechanism for redistribution of already assigned tasks. This means that the workers are allowed to communicate with one another and are allowed to return assigned, but yet unprocessed, tasks to the master. We prove that the deterministic work stealing algorithm performs better than the previous two algorithms under certain assumptions which usually hold in practical systems. This makes deterministic work stealing very attractive for applications where no a-priori knowledge of tasks' durations is available.</p>
      <p>master generic(int W, int N)
int K;
int work = W;
while (work &gt; 0)
{
wait for a job request from an idle worker;
compute the job size K;
assign a job consisting of K yet unprocessed tasks
to the idle worker;
work = work − K;
}
reply job requests with NO MORE WORK;</p>
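      <p>The generic scheme can be exercised numerically. The following is a minimal sketch (not from the paper; the function name and the event-list representation are my own) which simulates the generic master for a fixed job size and reports the resulting makespan, charging the latency once per assignment:</p>
      <p>```python
import heapq

def makespan(durations, n_workers, latency, job_size):
    """Simulate the generic assignment scheme of Fig. 1 with a
    constant job size: idle workers repeatedly receive the next
    `job_size` yet unprocessed tasks; each assignment costs
    `latency` before the worker starts computing."""
    idle = [0.0] * n_workers      # times at which workers become idle
    heapq.heapify(idle)
    pos, finish = 0, 0.0
    while pos != len(durations):
        t = heapq.heappop(idle)               # next idle worker
        job = durations[pos:pos + job_size]   # K yet unprocessed tasks
        pos += len(job)
        done = t + latency + sum(job)         # latency, then the job
        finish = max(finish, done)
        heapq.heappush(idle, done)
    return finish
```</p>
      <p>For example, 8 unit tasks on 2 workers with L = 0.5 and K = 2 require two assignments per worker and give a makespan of 5.0.</p>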
      <p>The deterministic work stealing algorithm appears in the context of diffusive load balancing, e.g. in [3-5] (optimal load balancing scheme). In the context of diffusive load balancing the data locality is the main concern; the goal is to exploit the structure of the network in order to minimise the cost of a single work redistribution step. In this paper the network structure is ignored and the number of work redistribution steps is minimised.</p>
      <p>Fig. 1. Generic assignment algorithm (without work redistribution).</p>
      <p>Note that when a job request arrives, the master process must decide immediately how many yet unprocessed tasks it assigns in that job (as it might take long until another job request arrives). This decision is based on partial a-priori knowledge of tasks' durations. Only deterministic online algorithms will be considered, i.e. algorithms which do not internally use any source of randomness. Two of the three algorithms studied in this paper, chunking and factoring, follow the scheme from Fig. 1. The third algorithm, deterministic work stealing, uses a more complex scheme.</p>
      <p>The paper is organised as follows. Section 2 introduces the notation and definitions. Sections 3, 4 and 5 present online analysis of the chunking, factoring and work stealing algorithms in the deterministic model (related results for chunking and factoring algorithms in a probabilistic model are briefly summarised in subsections). Performance of these algorithms is compared in Section 6. Section 7 concludes the paper.</p>
      <p>The following notation is used throughout this paper:
M: makespan (total parallel time)
N: number of worker processes; N ≥ 1
W: total number of tasks (total work); W ≥ 1
L: latency (duration of assignment of one job)
ti: tasks' durations (i = 1…W); ti &gt; 0.0
Tmin: minimal task's duration, Tmin = min(i=1…W) ti
Tmax: maximal task's duration, Tmax = max(i=1…W) ti
T: ratio Tmax/Tmin; T ≥ 1.0</p>
    </sec>
    <sec id="sec-2">
      <title>Notation and definitions</title>
    </sec>
    <sec id="sec-3">
      <title>Chunking</title>
      <p>In order to quantify the performance of algorithms, we use the competitive ratio: the ratio between the makespan MS(A) of algorithm S which uses a-priori information A and the best offline makespan Mbest offline, over all sequences t1 … tW of input tasks' durations which conform to the a-priori information A.</p>
      <p>Although CRS(A) and CRR(A) in Definition 2 are limit values for W → ∞, we prefer to keep the comparison parameterised with respect to W, N, L instead of using those limit values. This allows for a finer comparison of algorithms.</p>
      <p>The chunking algorithm [11, 8] always assigns jobs of size K to idling worker processes, where K remains constant (the last assignment may be an exception, where a smaller job is assigned); see Fig. 1. Once a job has been assigned to a worker, this decision cannot be changed: the worker must then compute all the tasks assigned in that job.</p>
      <p>We will prove a general theorem which states that a-priori knowledge of Tmax does not help much: the parallel time of any algorithm (including chunking) is in the worst case comparable with the sequential time for a sufficiently large number of tasks.</p>
      <p>Theorem 1. For all W, N, L, Tmax such that W &gt; N³ + N²(N − 1)Tmax/L, the competitive ratio of an arbitrary assignment algorithm with no work redistribution and with a-priori knowledge of Tmax is at least N (i.e. Ω(N) for W = Ω(N³)).</p>
      <p>Proof. Let W &gt; N³ + N²(N − 1)Tmax/L. Let K denote the maximal job size assigned by an algorithm S.</p>
      <sec id="sec-3-1">
        <title>There are two cases:</title>
        <p>case 1, K ≥ WL/(N²L + N(N − 1)Tmax);
case 2, K &lt; WL/(N²L + N(N − 1)Tmax).</p>
        <p>In case 1, there is Kint such that N ≤ Kint ≤ K and Kint/N is an integer. Kint tasks of the maximal job will have duration Tmax, while all the other tasks will have duration ε → 0. The best offline algorithm computes the Kint long tasks in parallel, whereas the algorithm S computes them sequentially. This implies (as S makes at least as many assignments as the best offline algorithm) CRS(Tmax)(W, N, L) ≥ KintTmax / (KintTmax/N) = N.</p>
        <p>In case 2, consider the latency overhead of the algorithm S, which is at least WL/(NK). Assume that one task has duration Tmax and is assigned in the last job; all the remaining tasks are of an equal duration ε → 0. Hence the makespan of the algorithm S is at least WL/(NK) + Tmax and CRS(Tmax)(W, N, L) ≥ (WL/(NK) + Tmax)/(L + Tmax) ≥ N.</p>
      </sec>
      <sec id="sec-3-2">
        <title>This completes the proof.</title>
        <p>Consider the case where all K tasks of some job are of duration ε → 0 and all the other tasks are of duration Tmax. The competitive ratio of the chunking algorithm using the job size K is then CRchunking(Tmax)(W, N, L) = (KTmax + WL/(NK))/(L + Tmax).</p>
        <sec id="sec-3-2-2">
          <title>-</title>
          <p>The competitive ratio is minimised by setting the first derivative of CRchunking with respect to K to zero and solving for K. This yields K = √(WL(L + Tmax)/(NTmax)).</p>
          <p>Recall that the probabilistic model assumes that tasks' durations are realisations of a random variable with (known) mean μ and (known) standard deviation σ.</p>
          <p>The fixed-size chunking strategy in the probabilistic model was analysed by Kruskal and Weiss [11]. They derived the following estimation of the expected makespan E[M] for the chunk size K:</p>
          <p>E[M] ≈ (W/N)μ + WL/(NK) + σ√(2K ln N)   (1)</p>
          <p>This formula has a nice intuitive interpretation.</p>
          <p>The first term is the time of executing W tasks on N processors on a system with no overhead. The second term describes the latency overhead. The third term describes the load imbalance due to the variation in tasks' durations. Unfortunately, the estimation in Equation 1 only holds if W and K are large and K ≫ log N. If these assumptions hold then also the optimal chunk size Kopt can be estimated:</p>
          <p>K̂opt = (√2 WL/(σN √(ln N)))^(2/3)</p>
          <p>3.1 Chunking in a probabilistic model</p>
          <p>master factoring(int W, int N, float F)
int K;
int counter;
int round = 0;
int work = W;
while (work &gt; 0)
{
round = round + 1;
K = max(work/F, 1);
counter = 0;
while ((counter &lt; N) and (work &gt; 0))
{
counter = counter + 1;
wait for a job request from an idle worker;
assign a job consisting of K yet unprocessed tasks
to the idle worker;
work = work − K;
}
}
reply job requests with NO MORE WORK;</p>
          <p>If the assumptions above do not hold, [11] gives the following estimates for the expected makespan E[M]:</p>
          <p>E[M] ≈ (W/N)μ + WL/(NK) + σ√(2K ln(Nσ√K/μ)) for K ≪ W/N and small √K/N; and E[M] ≈ (W/N)μ + WL/(NK) + Nσ²/μ for K ≪ W/N and large √K/N. However, a tractable analytical expression for the optimal chunk size K could not be derived.</p>
          <p>4 Factoring</p>
          <p>We will derive the optimal factor F, assuming that the ratio T = Tmax/Tmin is known a-priori. (A similar analysis which assumes a-priori knowledge of Tmax and Tmin can be found in [8] and [12].) Denote Ki the job size which is assigned during round i of the factoring algorithm and let wi denote the number of still unassigned tasks at the beginning of round i. In order to be competitive, factoring guarantees that the longest sequential computation of a job of size Ki will not take longer than the shortest parallel computation of the still unassigned wi − Ki tasks on the remaining N − 1 workers: max seq time(Ki) ≤ min par time(wi − Ki, N − 1). In order to minimise the assignment overhead, Ki must be as large as possible. The largest Ki which satisfies the inequality above (and thus guarantees a maximal imbalance of at most 1 task) is Ki = wi/(1 + T(N − 1)).</p>
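          <p>The geometric decrease of the job sizes can be checked with a short sketch (mine, not the paper's; the function name is illustrative) that iterates Ki = wi/(1 + T(N − 1)) and counts the rounds until at most N tasks remain:</p>
          <p>```python
def factoring_rounds(W, N, T):
    """Count the rounds of the factoring algorithm with job size
    K_i = w_i / (1 + T*(N - 1)); each round assigns N equal jobs.
    The remaining tasks (at most N) form a final round of
    single-task jobs."""
    w, rounds = W, 0
    while w > N:
        K = max(int(w / (1 + T * (N - 1))), 1)
        w -= N * K
        rounds += 1
    return rounds + 1
```</p>
          <p>With W = 1000, N = 4 and T = 2 this gives 8 rounds, in line with the logarithmic dependence of the number of rounds on W.</p>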
          <p>Factoring [7, 6, 1, 8] works in rounds, see Fig. 2 (it could also be expressed in the form of the generic algorithm from Fig. 1 by rewriting the procedure compute the job size K, but doing so would make the algorithm more difficult to read). In each round, it assigns N jobs of equal size. The job size is decreased after each round, whereby the job size in a round is a factor of the work remaining (the number of yet unassigned tasks) at the beginning of the round. The factor F remains constant over all rounds. During the last round, single-task jobs are assigned. Once a job has been assigned to a worker, the worker must compute all tasks assigned in that job.</p>
          <p>Note that it is only the assignment overhead which determines the competitive ratio of factoring. For example, the trick with setting the durations of all the tasks of K1 to Tmax and computing them in parallel by the best offline algorithm does not work. The reason is that this does not increase the makespan of factoring at all: K1Tmax = WTmax/(T(N − 1)) = WTmin/(N − 1).</p>
          <p>Theorem 2. For all W, N, L, T the competitive ratio of the factoring algorithm with a-priori knowledge of T using factor F = 1 + T(N − 1) is 1 + O((ln W)/W) and approaches 1 if W → ∞.</p>
          <p>Note that this iteration scheme only requires the knowledge of the coefficient of variation cov of the tasks' probability distribution (cov = σ/μ). There are two extreme cases: 1. If cov = 0 (no variance) then this strategy assigns all jobs in a single round; 2. If cov → ∞ (unbounded variance or negligible tasks' durations) then this scheme assigns jobs of size 1. (This scheme is not strictly factoring in the sense of Section 4 because the factor is not the same constant in subsequent rounds.)</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Work stealing</title>
      <p>So far we assumed that the master process cannot take back its decisions, i.e. once a job has been assigned to a worker, then the job must be processed by that worker. In the work stealing algorithm, the master process can reclaim already assigned but yet unprocessed tasks from the workers. The work stealing algorithm requires no a-priori information (not even the knowledge of the latency). It initially assigns all the tasks in jobs of size W/N to idling worker processes. When a worker becomes idle again, the master reclaims all yet unprocessed tasks from all the worker processes and redistributes them equally back again to all worker processes. The periods between the redistributions are called rounds. Each round adds a penalty L' to the makespan.</p>
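      <p>The redistribution step can be sketched as follows (a minimal illustration of my own, assuming a hypothetical list-of-queues representation of the workers' yet unprocessed tasks):</p>
      <p>```python
def redistribute(queues):
    """One master step of the work stealing scheme: reclaim all yet
    unprocessed tasks from all workers and deal them back out in
    (nearly) equal shares."""
    pool = [task for queue in queues for task in queue]
    n = len(queues)
    share = -(-len(pool) // n)    # ceiling division
    return [pool[i * share:(i + 1) * share] for i in range(n)]
```</p>
      <p>For example, redistribute([[1, 2, 3], [], [4, 5]]) yields [[1, 2], [3, 4], [5]].</p>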
      <p>An implementation of the work stealing algorithm can use two threads of control in each worker process: a "listening thread" which reacts to work redistribution</p>
      <p>Proof. Let r denote the last round at the beginning of which the number of still unassigned tasks wr is at most N (the size of the jobs assigned in round r is Kr = 1 and the number of jobs assigned in round r is at most N). It can be observed that the number of yet unassigned tasks wi at the beginning of round i is equal to wi = W(1 − N/(1 + T(N − 1)))^(i−1).</p>
      <p>Solving wr ≤ N for maximal r yields the number of rounds performed by the factoring algorithm: rfactoring = ln(W/N) / ln((1 + T(N − 1))/((T − 1)(N − 1))).   (2)</p>
      <p>messages by sending all yet unprocessed tasks to the master process; and a "working thread" which computes the tasks and notifies other processes when it runs out of work. These two threads share a queue of tasks. The queue is protected by a semaphore in order to exclude its simultaneous access by both threads. The working thread repeats a loop in which it locks the queue, pops one task, unlocks the queue and starts processing the task. After finishing the task, this procedure repeats until the working thread finds the queue empty. Then it notifies the other processes and waits until the listening thread inserts tasks of the new round into the queue and resumes the computation (or terminates the whole process). The yet unprocessed tasks in a process are the tasks in the queue. A clever implementation of the algorithm amortises the latency by allowing a worker which reacts to a work redistribution message to continue processing the tasks in its queue during the work redistribution.</p>
      <p>(In the probabilistic model, the analogous iteration scheme is wi+1 = wi − N K̂i,opt; xi+1 = 2 + (N²/wi)(σ/μ)².)</p>
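      <p>The two-thread worker described above can be sketched in Python (my own minimal version; the class and method names are illustrative, and a lock plus a counting semaphore stand in for the paper's semaphore-protected queue):</p>
      <p>```python
import threading
from collections import deque

class Worker:
    """The working thread pops tasks from a shared queue under a
    lock; the listening role (played here by the caller) inserts
    the tasks of each round and finally signals termination."""
    def __init__(self):
        self.queue = deque()
        self.lock = threading.Lock()
        self.has_work = threading.Semaphore(0)
        self.stopped = False
        self.processed = 0

    def insert_round(self, tasks):      # listening-thread side
        with self.lock:
            self.queue.extend(tasks)
        self.has_work.release()

    def stop(self):                     # NO MORE WORK
        self.stopped = True
        self.has_work.release()

    def run(self):                      # working thread
        while True:
            self.has_work.acquire()     # wait for a new round
            while True:
                with self.lock:         # lock, pop one task, unlock
                    if not self.queue:
                        break
                    task = self.queue.popleft()
                task()                  # process outside the lock
                self.processed += 1
            if self.stopped:
                return
```</p>
      <p>A full listening thread would call insert_round on a redistribution message and would also hand the yet unprocessed tasks back to the master; that part is omitted here.</p>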
      <p>As the task distribution in work stealing uses a more complex communication pattern (broadcasting and gathering) than the previous algorithms (point-to-point round-trip), we will denote this latency L', whereby L' ≥ L. However, L' differs from L only by a constant factor if N is a constant. This factor depends on the physical mechanism which is used for communication among the processes. Note that L' ≈ L e.g. in a bus network or a network with a complete interconnection graph. Similarly as for factoring, there is no work imbalance at the end of the algorithm, therefore the competitive ratio of work stealing only depends on the number of rounds.</p>
      <p>Theorem 3. For all W, N, L, L' the competitive ratio of the work stealing algorithm with no a-priori knowledge is 1 + O((ln W)/W) and approaches 1 if W → ∞.</p>
      <p>Proof. The number of rounds of work stealing in the worst case can be determined as follows. Assume that one of the worker processes finishes its first job of size W/N, while no other worker process has finished its first task. After the redistribution a second round begins and the same situation happens: one worker process finishes its job, while none of the other worker processes has finished its first task. Etc. The total number of yet unprocessed tasks (in the whole system) is at most wi = W((N − 1)/N)^i at the beginning of round i. At most N tasks are distributed at the beginning of the last round r. Solving wr ≤ N for maximal r yields r = ln(W/N) / ln(N/(N − 1)),   (4) which is accurate when W is large in comparison with N (W ≈ N³ or larger). □</p>
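      <p>The recurrence in this proof is easy to check numerically; the sketch below (mine, with illustrative names) counts the worst-case rounds and compares them with the closed-form bound ln(W/N)/ln(N/(N − 1)):</p>
      <p>```python
import math

def worst_case_rounds(W, N):
    """Worst case of the proof: in each round exactly one worker
    finishes its 1/N share while the others finish nothing, so the
    remaining work shrinks by the factor (N - 1)/N until fewer
    than N tasks remain."""
    w, rounds = float(W), 0
    while w >= N:
        w -= w / N
        rounds += 1
    return rounds

def round_bound(W, N):
    return math.log(W / N) / math.log(N / (N - 1))
```</p>
      <p>For W = 10000 and N = 4 the simulation gives 28 rounds, which is the bound (about 27.2) rounded up.</p>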
      <p>We proved a common upper bound for the competitive ratios of the work stealing and factoring algorithms for W → ∞. Both these algorithms guarantee a perfect balance, therefore we can focus on their number of rounds, which determines the cost of assignment. In order to keep things simple, we will assume L = L' in the sequel. If we directly compare the number of rounds in work stealing (Eq. 4) and factoring (Eq. 2), then work stealing does not perform worse than factoring when T ≥ N + 1, because then rworkstealing/rfactoring = ln((1 + T(N − 1))/((T − 1)(N − 1))) / ln(N/(N − 1)) ≤ 1.   (5)</p>
      <p>However, the comparison above is not fair, because the work stealing algorithm with a-priori knowledge of T = Tmax/Tmin is actually more efficient than in the proof of Theorem 3 (although it does not make use of the knowledge of T). For a given T, let us reconsider the scenario in which always one worker process finishes all its tasks from the first round, while all the other worker processes do as little work as possible.</p>
      <p>While the worker computes its first job of W/N tasks, all the other workers must have computed at least W/(NT) tasks each. So every other worker has at most W(T − 1)/(NT) yet unprocessed tasks; in sum, there are at most W(T − 1)(N − 1)/(NT) yet unprocessed tasks in the whole system (which is less than the W(N − 1)/N tasks in the proof of Theorem 3). In the second round, each worker is assigned W(T − 1)(N − 1)/(N²T) tasks. When a worker finishes its job from the second round, then all the other workers have at most W((T − 1)/(NT))² yet unprocessed tasks each; in sum, there are at most W((T − 1)(N − 1)/(NT))² yet unprocessed tasks in the whole system. Etc. Generally, there are wi = W((T − 1)(N − 1)/(NT))^i yet unprocessed tasks in the whole system at the beginning of round i. At most N tasks are distributed at the beginning of the last round r. Solving wr &lt; N for maximal r yields rworkstealing = ln(W/N) / ln(NT/((T − 1)(N − 1))).   (6)</p>
      <p>Fair comparison of the number of work stealing rounds (Eq. 6) with the number of factoring rounds (Eq. 2) yields rworkstealing/rfactoring = [ln(W/N) / ln(NT/((T − 1)(N − 1)))] / [ln(W/N) / ln((1 + T(N − 1))/((T − 1)(N − 1)))] = ln((1 + T(N − 1))/((T − 1)(N − 1))) / ln(NT/((T − 1)(N − 1))) &lt; 1.</p>
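      <p>The round counts behind this comparison can be evaluated directly (a sketch under the same notation; the function names are mine):</p>
      <p>```python
import math

def r_factoring(W, N, T):
    # Eq. 2: rounds of factoring with F = 1 + T*(N - 1)
    return math.log(W / N) / math.log((1 + T * (N - 1)) / ((T - 1) * (N - 1)))

def r_work_stealing(W, N, T):
    # Eq. 6: worst-case rounds of work stealing for a given T
    return math.log(W / N) / math.log((N * T) / ((T - 1) * (N - 1)))
```</p>
      <p>For any T greater than 1 the logarithm in the denominator of r_work_stealing is the larger one, so its round count is the smaller one, matching the inequality above.</p>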
      <p>This means the work stealing algorithm performs better than the factoring algorithm for all W, N, L, L', T if L' = L. More precisely, the work stealing algorithm performs better for L, L' such that ln((1 + T(N − 1))/((T − 1)(N − 1))) / ln(NT/((T − 1)(N − 1))) &lt; L/L', because then L' rworkstealing ≤ L rfactoring. We stress that an a-priori knowledge of T is rarely available in practice and must therefore be estimated. With an inaccurate estimation of T, the factoring algorithm performs worse than in our analysis.</p>
      <p>The work stealing algorithm is a clear winner. It has no parameters and requires no tuning. Moreover, it can be used (after some modifications) in applications where processes may fail or where the number of tasks may grow at run-time.</p>
      <p>7 Conclusions</p>
      <p>We analysed the online performance of the chunking, factoring and work stealing assignment algorithms in a deterministic model. The chunking algorithm requires an a-priori knowledge of the maximal task's duration and achieves competitive ratio N (which does not depend on W) for W = Ω(N³), where N denotes the number of processes and W denotes the number of tasks. The performance of the chunking algorithm is thus very poor, at least from the point of view of competitive analysis. The factoring algorithm requires an a-priori knowledge of the factor T = Tmax/Tmin. Its competitive ratio is bounded from above by 1 + O(ln(W)/W) and approaches 1 when W → ∞, which is very desirable. The same holds for the deterministic work stealing algorithm, which performs better than the factoring algorithm and requires no a-priori information.</p>
      <p>The last result is valid under two assumptions: 1. the underlying communication mechanism provides an efficient implementation of broadcasting and gathering, which we assume to be as fast as a round-trip point-to-point communication; 2. the communication latency is a constant which does not depend on the message size. The first assumption holds e.g. for bus and fully-switched networks; the second assumption holds for practically all contemporary networks, if the message size does not exceed a certain threshold.</p>
      <p>5. R. Elsasser, B. Monien, A. Frommer, and R. Preis: Optimal diffusion schemes and load balancing on product graphs. Parallel Processing Letters, 14 (1), 2004, 61–73.
6. L. E. Flynn and S. Flynn-Hummel: Scheduling variable-length parallel subtasks. Technical Report RC 15492, IBM Research, 1990.
7. S. Flynn-Hummel, E. Schonberg, and L. E. Flynn: Factoring: A practical and robust method for scheduling parallel loops. In: Proc. of Supercomputing '91, IEEE Computer Society / ACM, 1991, 610–619.
8. S. Fujita: A semi-dynamic multiprocessor scheduling algorithm with an asymptotically optimal competitive ratio. In: Proc. of the 8th International Euro-Par Conference on Parallel Processing, Springer-Verlag, 2002, 240–247.
9. M. R. Garey and D. S. Johnson: Computers and intractability. W. H. Freeman and Company, 1979.
10. T. Hagerup: Allocating independent tasks to parallel processors: An experimental study. Journal of Parallel and Distributed Computing, 47 (2), 1997, 185–197.
11. C. P. Kruskal and A. Weiss: Allocating independent subtasks on parallel processors. IEEE Transactions on Software Engineering, 11 (10), 1985, 1001–1016.
12. T. Plachetka: Perfect load balancing for demand-driven parallel ray tracing. In: B. Monien and R. Feldman (eds), Proc. of Euro-Par 2002, volume 2400 of Lecture Notes in Computer Science, Springer-Verlag, 2002, 410–419.
13. T. Plachetka: Tuning of algorithms for independent task placement in the context of demand-driven parallel ray tracing. In: D. Bartz, B. Raffin, and H.W. Shen (eds), Proc. of the Eurographics/ACM SIGGRAPH Symposium on Parallel Graphics and Visualization (EGPGV), Eurographics Proceedings Series, Eurographics Association, 2004, 101–109.
14. K. Pruhs, J. Sgall, and E. Torng: Online scheduling. In: J.Y.-T. Leung (ed.), Handbook of Scheduling, chapter 15, CRC Press, 2004, 15.1–15.41.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>I.</given-names>
            <surname>Banicescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Flynn-Hummel</surname>
          </string-name>
          :
          <article-title>Balancing processor loads and exploiting data locality in irregular computations</article-title>
          .
          <source>Technical Report RC 19934, IBM Research</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Borodin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>El-Yaniv</surname>
          </string-name>
          :
          <article-title>Online computation and competitive analysis</article-title>
          . Cambridge University Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Diekmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frommer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Monien</surname>
          </string-name>
          :
          <article-title>Efficient schemes for nearest neighbor load balancing</article-title>
          .
          <source>Parallel Computing</source>
          ,
          <volume>25</volume>
          (
          <issue>7</issue>
          ),
          <year>1999</year>
          ,
          <fpage>789</fpage>
          –
          <lpage>812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Elsasser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Frommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Monien</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Preis</surname>
          </string-name>
          : <article-title>Optimal and alternating-direction load-balancing schemes</article-title>
          .
          <source>In: Proc. of Euro-Par</source>
          , volume
          <volume>1685</volume>
          of Lecture Notes in Computer Science, Springer-Verlag,
          <year>1999</year>
          ,
          <fpage>280</fpage>
          –
          <lpage>290</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>