Introduction

Muhammad Nouman Durrani and Jawwad A. Shamsi. Volunteer computing: requirements, chal- lenges, and solutions. Journal of Network and Computer Applications

Split-Merge Model of Workunit Replication in Distributed Computing

0 Alexander Rumyantsev Institute of Appiled Mathematical Research, Karelian Research Centre of RAS 11 Pushkinskaya Str., Petrozavodsk, 185910, Russia Petrozavodsk State University 33 Lenina Pr. , Petrozavodsk, 185910 , Russia 1 Srinivas R. Chakravarthy Department of Industrial and Manufacturing Engineering Kettering University , Flint, MI 48504 , USA

2014

39 369 380

A special class of tasks that demand huge computational resources is known as embarrassingly parallel. Such tasks may be split into a large amount of small independent subtasks, known as workunits. The results of the processed workunits are later assembled. Note that the independence of the workunits makes it inappropriate to run embarrassingly parallel applications on a supercomputer. This is due to the fact that the key feature of a supercomputer (the high speed interconnect) is not utilized during the computation. Thus, one needs to use general purpose instruments like Map-Reduce framework, Beowulf-type clusters for parallel computing (having modern implementations, such as Simple Network of Workstations SNOW package for R language), and Desktop Grids (DG), to get results for such embarassingly parallel tasks. It should be pointed out that DG is a specifically designed, inexpensive, and very powerful option. A common mechanism used in the DG environment (as well as in other modern distributed computing approaches, such as Cloud-based computing), is replication. Replication requires each workunit to be processed by multiple hosts, and when the required number of results from these hosts is obtained, the workunit is said to be completed. The replication is known to reduce latency in a high-throughput systems [JSW15a], and to increase redundancy in RAID storage systems [Tho14]. In DG, the replication mechanism is used together with obtaining the quorum of results, which allows to reduce the probability of malicious activity, increase redundancy and reduce the response time [BYSS+12, HAH09]. The quorum is the number of results (from one to the number of replicas) that are required to conclude that the workunit is complete. Moreover, when the result is obtained only approximately, or there is no easy algorithm to validate the result (e.g. in prime number searching problem), the quorum mechanism is the ultimate solution. Below we use the term workunit to indicate a single subtask of a DG project before replication (and after obtaining a quorum), and the term result to denote the replica of the workunit (it will allow to adopt the more common term customer or task to the DG context). We stress that the replication and quorum settings dramatically affect the performance of the distributed computing systems. Thus, a model is required to study the effect of the aforementioned settings on the key performance metrics of the DG system. The class of queueing models that describes the replication and quorum mechanisms is known as Fork-Join queues. In the Fork-Join n-server queueing system, a single input of workunits is replicated to r results and Copyright c by the paper's authors. Copying permitted for private and academic purposes. In: E. Ivashko, A. Rumyantsev (eds.): Proceedings of the Third International Conference BOINC:FAST 2017, Petrozavodsk, Russia, August 28 - September 01, 2017, published at http://ceur-ws.org

Introduction

dispatched to r servers, each having its own queue, according to the dispatching discipline (e.g. to the first r servers with least residual workload). The r servers act independently and serve the results according to some queueing discipline (e.g. First-Come-First-Served). After being processed, each result is routed to a single unlimited join queue. When the quorum q of results of the same workunit is obtained (and waits in the join queue), the workunit is marked as completed, and the unused results, which are either waiting in the queues, or being served by corresponding servers, are abandoned. Such models, known as (n, r, q) Fork-Join systems, have been studied extensively in [Jos16]. Note however, that, despite the simplicity of the model, only few analytic results are available and that too under stringent conditions (e.g. for the (n, n, n) M/M-type Fork-Join system [Jos16]), and in most cases only the asymptotic bounds or approximations are obtained [BDVD98, BM97]. In particular, the upper bound on the performance of the (n, r, r) Fork-Join system in [JSW15b] is done with the help of another closely related type of model, known as Split-Merge, or Split and Match queues. The key property of the Split-Merge system is that the service of the results of a particular workunit is started only after the previous workunit is completed (note, that in particular if q = 1 and r divides n, then Split-Merge and Fork-Join models coincide). The Split-Merge queues have been extensively studied since three decades ago. Two-server Split-Merge system was considered in [RP85], the departure process from Split-Merge was studied in [Rao90]. An emerging interest to the Split-Merge systems is related to new applications in Cloud and distributed systems [FL15].

In this work we suggest a Split-Merge model of the replication and quorum mechanism of a DG, which we refer below as (n, r, q) Split-Merge model. Our work extends the (n, n, n) Split-Merge model studied in [FL15], and elaborates the Fork-Join type models studied in [Jos16]. We give exact solutions for particular cases of the model that are analytically tractable, and study the effect of parameters, r and q, on the performance of the model by means of simulation.

The work is organized as follows. In section 2 we give a brief description of the DG technology, providing necessary details for the BOINC-based DG (the standard software used for distributed computing), and discuss the limitations of our approach. In section 3 we present the model, providing the necessary notation. In section 4 we validate the model and discuss the results of numerical experiments. The conclusion and possible extensions of the model are given in section 5. 2

BOINC-Based Desktop Grid

DG technology is a distributed computing technology, that allows to utilize the idle resources (central processor unit time, memory and disk space) of a host in order to complete the computational project that consists of loosely coupled workunits. The hosts are either provided by volunteers (the Volunteer Computing, VC [NDS14]), or are a part of some controlled environment, e.g., the computational resources of a company (the Enterprise Desktop Grid, EDG [Iva15]). The project management, workunit generation, replication and result validation via quorum mechanism is performed by the management server of the project. During the project evaluation, the server produces workunits, then generates the necessary number of replicas of a workunit, and dispatches them to the hosts, either on their requests (VC), or impassively (EDG). The server receives and validates the results of computation, and assimilates them. However, it also has to deal with several key difficulties of VC: host availability, trust, malicious activity etc. [NDS14]. We also note, that the VC project requires a lot of social work with the volunteers, however, discussion of these issues is beyond the scope of this paper.

The EDG is a far less studied concept, which, however, has several simplifications compared to the VC. Namely, in most cases the hosts of an EDG may be treated as trusted, reliable, controllable and available. Moreover, since the hosts are directly under control of, say, the enterprise owner, the workunits may be PUSHed to the hosts (i.e. immediately dispatched), rather than PULLed by them (i.e. requested by clients in asynchronous regime, the basic technology used in the VC), and moreover may be canceled on demand [YJKA11]. In the present paper we mostly consider the model of EDG performance controlled by replication and quorum. However, some results of the study may be related to the VC model as well.

We briefly describe the key performance metrics, which are basically studied for the DG [ETA09]: throughput — the number of workunits completed per unit time; latency — the time from workunit generation (arrival) to workunit completion (departure); delay — the time from workunit generation to the beginning of results computation; starvation — the fraction of the idle time of the hosts, which is not used for computation; completion time — the average time it takes to complete a fixed number of workunits. Moreover, the centralized structure of the DG network in some cases makes the project management server become the bottleneck of the system. Indeed, all the tasks related to dispatching, scheduling, result validation and assimilation, are done on a single server. In the current paper we mainly focus on the throughput, latency and delay of the DG model. 3

Stochastic Model of Desktop Grid Performance

We consider an EDG project, that has a fixed number, say n, of identical hosts processing the workunits, generated by management server at instants of a point process with intensity λ. The workunit is immediately replicated and dispatched upon arrival to the hosts that have the least work left to be finished. Let r 6 n be the number of results corresponding to a single workunit requested by the management server from the hosts, and let q 6 r results be required to constitute a quorum. We assume, that the hosts start evaluating the results of a single workunit simultaneously, and the service time of results on each host is iid with distribution function FS . Immediately when q results are obtained, computation of r − q incomplete results (if any) is canceled by the management server. We assume that once received, the result is correct, i.e., the earliest q results received are valid. We also assume an unbounded time for result computation, that is, q results are eventually received. We are interested in the performance metrics of such a computation.

Let S1, . . . , Sr be iid random variables corresponding to the (potential) times of computation of the results of a single workunit. Since the results are started simultaneously, then the time to complete evaluation of the workunit is distributed as the qth order statistics of r r.v., which we denote by Sq:r. Recall, that where F S (x) := 1 − FS (x). In particular, the minimum of S1, . . . , Sr is distributed as

P (Sq:r 6 x) = Xr r

i i=q

FSi (x)F rS−i(x),

P (S1:n 6 x) = 1 − F rS (x).

Thus, the system under study is equivalent to an M/G/ nr queueing system with general service time Sq:r.

Note however, that it is not necessary, that FSq:r (x) := P (Sq:r 6 x) belongs to the same class of distributions, as FS . Obtaining the distribution of qth order statistics analytically is difficult in general. Note that if the service times are exponentially distributed with FS (x) = 1 − e−μx, μ > 0, then Sq:r is exponentially distributed only for q = 1, which provides FSq:r (x) = 1 − e−rμx, r > 1.

The class of distributions, that is closed under the operation of obtaining order statistics, is phase-type (PH) distribution. A continuous-time PH distribution of order m with representation (β, B) is the time to absorption of the continuous-time Markov chain with a fixed number m of transient states (phases), initial distribution of states β = (β1, . . . , βm) and transition subintensity matrix B. The vector b0 = −B1 > 0 gives the intensity of transitions to absorbing state m + 1 (for more details on this type of distributions see e.g. [He14, Neu81]). Recall also, that if S has a (β, B) PH distribution, then ES = −βB−11 and ES2 = 2βB−21 [BKF14]. Let FS be a (β, B) PH distribution.

Now we give a procedure to obtain PH representation of the distribution FSq:r (x) (a detailed discussion on the procedure in case of non-identical PH distributions may be found in [BN17]).

First, we need to extend the phase space. Denote Mk := {1, . . . , m}k the set of mk lexicographically ordered k-tuples corresponding to the states of k processes that have not yet reached absorbing state. Now we construct the process {X (t) :=

X1(t), . . . , Xrq−q(q−1)/2(t) , t > 0} ∈ M := Mr ∪ · · · ∪ Mr−q+1, that corresponds to evolution of the r processes until q absorptions, where the number of components equals r + (r − 1) + · · · + (r − q + 1). Below we explain the evolution of the process in detail. Initially the first r components of the process are independent copies of r.v. with PH(β, B) distribution. Thus, the evolution of the process (X1(t), . . . , Xr(t)) ∈ Mr without absorptions is governed by the mr × mr subintensity matrix B⊕r, where

B⊕r = B ⊕ · · · ⊕ B .

| {rz } Intuitively the notation above means, that only one of the r.v. X1, . . . , Xr makes its own independent transition, that does not end into absorption.

When an absorption occurs, we exclude the absorbing component, switching to the process (Xr+1, . . . , X2r−1) ∈ Mr−1. Thus, the transitions of the process related to absorption of one of the components are governed by mr × mr−1 matrix B0(r). We stress, that the rate [B0(r)]i,i0 of transition from a state i = (i1, . . . , ir) ∈ Mr to the state i0 = (i01, . . . , i0r−1) ∈ Mr−1 depends on the number of possibilities to remove exactly one component of i and arrive at i0 as follows [B0(r)]i,i0 =

X t∈N(i,i0)

b0,it , i ∈ Mr, i0 ∈ Mr−1, where N (i, i0) = {t 6 r − k + 1 : ij = i0j , 1 6 j 6 t, ij+1 = i0j , j > t} is the set of indices of such components of i, that, being removed from i, convert it to i0. We note, however, that, by exploiting the lexicographical order of Mr and Mr−1, a more straightforward expression for the transition rate matrix appears

B(r) = b0⊕r := b0 ⊗ Imr−1 + Im ⊗ b0 ⊗ Imr−2 + · · · + Imr−1 ⊗ b0,

0 (with obvious conventions if r < 2) where Ij is the identity matrix of size j > 1.

Then the transitions of the process X (t) are governed by the following bidiagonal matrix (where zeroes are the zero matrices of the corresponding dimension)

 Bq:r =     B⊕r It remains to note, that the initial distribution of the process is given by and the transition intensity to the super-absorbing state corresponding to exactly q absorptions of r independent processes is given by −Bq:r1. It may be seen, that row sums of the matrix Bq:r are zero for all rows, except the last mr−q+1, which correspond to the phases of process with q − 1 absorptions.

It may be now seen, that the r.v. Sq:r has a PH(βq:r, Bq:r) distribution.

The PH representation allows to obtain immediately the stability condition and performance measures of the system as follows.

Stability condition of the system where, recall, Thus, the maximal throughput of the system equals The explicit forms for the mean stationary delay (ED), as well as the mean latency (EW ), are available for r = n (for more details see Appendix E in [Jos16], case q = r = n is considered in [FL15]), since the system in this case is equivalent to an M/P H/1 system: λESq:r < n r

, ESq:r = −βq:rBq−:r11.

n r θ =

[ESq:r]−1.

ED =

λβq:nBq−:n2 1 1 + λβq:nBq−:n1 1

EW = ESq:r + ED.

For r < n it is possible either to use approximation [Jos16], or rely on simulation. Alternatively, the solution of continuous-time QBD process corresponding to the M/P H/ nr system may be obtained in terms of the intensity (1) (2) (3) (4) (5) (6) and in particular, if r = 1 (no replication at all), the sojourn time equals where

EW = 1 rμ

C(bn/rc, λ/(rμ)) bn/rcrμ − λ

, EW = +

C(n, λ/μ) nμ − λ

, " C(n, λ/μ) = 1 + (1 − ρ)

n! (nρ)n nX−1 (nρ)k #−1 k=0 k! matrix R, which (except some special cases) is derived only numerically. However, discussion of this approach is beyond the scope of this paper and will be presented elsewhere.

We note, that it is easy to sample exactly from qth order statistics of r independent PH(β, B)-type variables by sampling from a single PH(βq:r, Bq:r) r.v. The drawback of this approach is the growing size of the number of phases that requires to process square matrices of order Piq=−01 mr−i = (mr+1 − mr−q+1)/(m − 1). Note however, that it is mostly a technical limitation.

Now we focus on the latency in some particular cases, that are most analytically tractable.

An important particular case is the M/M/ nr -type system with exponential interarrival times. This corresponds to an (n, r, 1)-type replication with exponentially distributed service times (with intensity μ). In this case we have (n, r, 1) split-merge model, i.e., we send r results and wait for the fastest, we obtain the Exp(rμ) service distribution of the minimum of r exponentials, and obtain the M/M/b nr c-type model. That is, the sojourn time equals We set n = r = 3 and q = 2, and derive βq:r from (3), and Bq:r from (2). We consider N = 5 · 105 tasks arriving at epochs of a Poisson process (with intensity λ), with service times generated according to PH(βq:r, Bq:r) distribution. We fix λ = 0.9μ and μ = [ESq:r]−1. Note, that the average stationary latency in the system obtained from (6) is 7.42. The results of the experiment shown on Fig. 1 show the proximity between the theoretical result and practical estimaion.

Next, we study the dependence of the theoretical value of sojourn time (6) in a system with (n, n, q) replication. We fix the PH(β, B) distribution used in the first experiment. We vary r = 1, . . . , 5 and q = 1, . . . , r. We depict the theoretical value obtained from (6) for each pair (q, r) by the size of a circle on the graph. It may be easily seen form Fig. 2, that the sojourn time is increasing both for increasing q and decreasing r.

Finally, we illustrate the system with exponential service times. We consider (n, r, q) replication with n = 100, q = 1, 2, 3 and r = 1, . . . , 10. We perform a parameter sweep experiment for all (q, r) pairs s.t. q 6 r. We also consider the theoretical value (7) for the case q = 1. We depict the dependency of the sample mean latency obtained by simulation of the M/G/b nr c system with Sq:r obtained as an q-th order statistics of r exponentially distributed r.v. and exponentially distributed interarrival times with λ = 0.9b nr c(ESq:r)−1. For a fixed pair (q, r) we generate the input of 5 ·105 tasks and evaluate the delay of each task at arrival by means of multiserver system simulation using the hpcwld package for R language. The results of the simulation are presented on Fig. 3, that coincide with the previous experiment.

+ (7) (8) (9) is the Erlang C-formula. On the other hand, if r = n (full replication), we obtain the M/M/1 system with service intensity nμ, and the mean stationary sojourn time is EW = n1μ + nμ−λ 1 . 4

Numerical Experiments

In this section we discuss the results of several numerical experiments performed for illustrative purposes and quantitative analysis.

First, we study the proximity between the exact value (6) of average stationary sojourn time of the DG model with (n, n, q) replication, and a simple mean estimator obtained by stochastic modeling of the M/P H/1-type system with general service time Sq:r. First we define a 3-phase PH(β, B) distribution by B =  3e+05 4e+05

5e+05

Number of customers

Final Remarks

In this paper we have presented a model of the EDG system with (n, r, q)-type replication and quorum mechanism. The results of numerical experiments illustrate the practical applicability of the model and show good correspondence to the obtained, as well as earlier known, theoretical results.

Note however, that the numerical experiments mostly cover the so-called log-convex distributions. It is known, that the monotonicity of dependence of the mean stationary latency on the values q, r may the opposite, compared to the log-concave distribution (for a detailed discussion on the impact of log-concavity/convexity on latency for the Fork-Join type models see [Jos16]). Moreover, the dependence may not be monotonic for the distributions with the so-called heavy tails (e.g. Pareto distribution). Thus, a more intensive numerical study is required, and we leave this for future research.

We also note, that an extension to heterogeneous servers may be done taking into accounts results of [AM01], and order statistics of heterogeneous PH distributions are discussed in [BN17].

Acknowledgements References

[AM01] The work of AR is partially supported by RFBR, projects 15-07-02341, 15-07-02354, 15-29-07974, 16-07-00622, and by President RF’s grant No.MK-1641.2017.1.

Søren Asmussen and Jakob R. Møller. Calculation of the steady state waiting time distribution in GI/PH/c and MAP/PH/c queues. Queueing systems, 37(1-3):9–29, 2001. [BDVD98] S. Balsamo, L. Donatiello, and N.M. Van Dijk. Bound performance models of heterogeneous parallel processing systems. IEEE Transactions on Parallel and Distributed Systems, 9(10):1041–1056, October 1998.

Peter Buchholz, Jan Kriege, and Iryna Felko. Input Modeling with Phase-Type Distributions and Markov Models. SpringerBriefs in Mathematics. Springer International Publishing, Cham, 2014. Quorum

Simonetta Balsamo and Ivan Mura. On queue length moments in fork and join queuing networks with general service times, pages 218–231. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997. Mogens Bladt and Bo Friis Nielsen. Matrix-Exponential Distributions in Applied Probability, volume 81 of Probability Theory and Stochastic Modelling. Springer US, Boston, MA, 2017. DOI: 10.1007/978-1-4939-7049-0. [JSW15b] [NDS14]

G. Joshi, E. Soljanin, and G. Wornell. Efficient replication of queued tasks for latency reduction in cloud systems. In 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 107–114, Sept 2015.

Gauri Joshi, Emina Soljanin, and Gregory Wornell. Queues with redundancy: Latency-cost analysis. ACM SIGMETRICS Performance Evaluation Review, 43(2):54–56, 2015.

B.M. Rao. On the departure process of the split and match queue. Computers & Operations Research, 17(4):349 – 357, 1990.

B. M. Rao and M. J. M. Posner. Algorithmic and approximation analyses of the split and match queue. Communications in Statistics. Stochastic Models, 1(3):433–456, 1985.

[BYSS+12]

O.A.

Ben-Yehuda ,

Schuster ,

Sharov ,

Silberstein , and

Iosup. ExPERT: Pareto-Efficient Task Replication on Grids and a Cloud . Parallel & Distributed Processing Symposium (IPDPS) , 2012 IEEE 26th International , pages 167 - 178 , May 2012 .

Trilce

Estrada , Michela Taufer, and

David P.

Anderson . Performance Prediction and Analysis of BOINC Projects: An Empirical Study with EmBOINC . Journal of Grid Computing , 7 ( 4 ): 537 - 554 , December 2009 .

Pierre M. Fiorini and Lester

Lipsky . Exact analysis of some split-merge queues . ACM SIGMETRICS Performance Evaluation Review , 43 ( 2 ): 51 - 53 , 2015 .

Eric

Martin Heien ,

David P.

Anderson , and

Kenichi

Hagihara . Computing Low Latency Batches with Unreliable Workers in Volunteer Computing Environments . Journal of Grid Computing , 7 ( 4 ): 501 - 518 , December 2009 .

Qi-Ming He . Fundamentals of Matrix-Analytic Methods . Springer New York, 2014 .

Evgeny

Ivashko . Enterprise desktop grids . In Proceedings of the Second International Conference BOINC-based High Performance Computing: Fundamental Research and Development (BOINC:FAST 2015 ), pages 16 - 21 . CEUR Workshop Proceedings, Vol- 1502 , 2015 .

Gauri

Joshi . Efficient redundancy techniques to reduce delay in Cloud systems . PhD thesis , Massachusetts Institute of Technology, 2016 .

[JSW15a]

Alexander

Thomasian . Analysis of fork/join and related queueing systems . ACM Comput. Surv. , 47 ( 2 ): 17 : 1 - 17 : 71 , August 2014 .

[YJKA11]

Sangho

Yi , Emmanuel Jeannot, Derrick Kondo, and

David P.

Anderson . Towards real-time, volunteer distributed computing . In Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing , pages 154 - 163 . IEEE Computer Society, 2011 .