Optimal Mean Cost Replication in Desktop Grids

Ilya A. Chernov
IAMR KRC RAS, PetrSU
Petrozavodsk, Russia
IAChernov@yandex.ru

Abstract

In this paper we consider optimization problems for task scheduling in Desktop Grids; the approaches used here have been only weakly reported in the literature so far, despite their potential efficiency, demonstrated below by simple estimations. We propose the mean-cost approach and show how it can be used for different purposes, including counter-sabotage and deadline safety; we derive necessary inequalities for the costs and show that task grouping can be efficient for reducing extra losses but can hardly be used as a sort of replication.

1 Introduction

The mathematics of Desktop Grid task scheduling is developing quickly; it is enough to mention the surveys [KMH17, ET12, DS14, XA10, CKBH06]. Much attention is paid to efficiency, in terms of throughput, makespan, or the total time to complete a batch of tasks, and to reliability, defined as a low risk of getting a wrong or invalid answer or no answer at all, including sabotage-tolerance techniques. There are other points of interest, including availability of nodes, load balancing, and minimization of the peak load, that are out of the scope of this work. Obviously, different criteria contradict, or at least conflict with, each other, and some effort has been devoted to improving two or more criteria simultaneously. However, to the best of our knowledge, nobody has considered optimization of the average cost with all conflicting factors converted to the same unit (time, currency, etc.).

In [Che16, CN15] we studied such problems for optimal replication. The only assumption about the computing system was that tasks are solved independently, with a unit cost and the same risk of producing a wrong answer from a fixed set (due to malfunction, errors, malicious activity, etc.). Each task is replicated until ν identical answers are obtained. If a wrong answer is accepted (which will sooner or later show up), some penalty F is added to the overall cost. This approach allows one to choose the optimal replication level and can also be helpful for, e.g., choosing parameters of the computing algorithm: higher precision usually reduces the risk of an error but needs more time, and replication with a cheaper algorithm is sometimes more efficient, provided, of course, that the calculations are independent. Interesting results obtained there were the quick growth of the critical penalties with respect to risk levels ($F \sim p^{-1}$), so that there is no need to know penalties precisely; the low cost of overestimating the optimal replication for rare (and thus valuable) results; and the more careful checking of less valuable answers.

Here we apply the approach to other cases, obtaining rather simple estimations that can be useful (we hope) for developing scheduling algorithms. We also consider estimations for task grouping, which can serve a number of purposes: improving efficiency, checking, and a kind of "fractional" replication (only some tasks of a parcel are double-checked). However, task grouping seems insufficiently studied in the literature so far.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: E. Ivashko, A. Rumyantsev (eds.): Proceedings of the Third International Conference BOINC:FAST 2017, Petrozavodsk, Russia, August 28 - September 01, 2017, published at http://ceur-ws.org
2 The mean-cost approach

2.1 Counter-sabotage scheduling

Assume that a wrong answer can be accepted due to malefactors, and all that is known is the probability p of sending a task to a malicious computing node. We use replication (ν copies) to reveal the malefactors: if at least two replicas produce different results, the task is re-solved and the saboteur is discriminated. In case a wrong answer has been accepted, some penalty F, expressed in terms of the unit cost of a single processing of a task, is paid. Then the average total cost is

$$E = \nu + p^\nu F.$$

This function has a minimum either at ν = 1 (no replication) or at some ν > 1, because for very high ν it grows almost linearly to infinity. The necessary condition for the minimum is

$$p^\nu \ln\frac{1}{p} = \frac{1}{F},$$

which means that the optimal ν depends on the logarithm of the penalty and on the risk level. If p is small, so that $p^2 \ll p$, we are able to derive simple tests: the replication level ν + 1 is better than ν if $p^\nu F > 1$. So, here again the critical penalties (those that force a change of the replication level) are reciprocals of the risk level.

2.2 Deadlines

Consider a reliable computing system with equal nodes and pseudo-answers yes and no, obtained if at least one replica (respectively, no replica) is solved before the deadline D, expressed in terms of the unit solving cost for an answer. The probability of violating the deadline (i.e., of the "no" answer) is p, and if the "no" is "accepted", i.e., all replicas violated the deadline, some penalty is added to the overall cost. Denote the number of replicas by ν. The probability of i answers obtained in time and ν − i not is $P = C_\nu^i q^i p^{\nu-i}$. The cost in this case is $i \cdot 1 + (\nu - i)D$, so the random cost is the sum of two binomial random variables; its average equals $E = \nu(1 - p + Dp)$. A violated deadline means the penalty F, so we add the expected penalty:

$$E = \nu(1 - p + Dp) + p^\nu F.$$

To minimize it, consider the derivative $E' = (1 - p + Dp) + p^\nu F \ln p$, so that $\nu^*$ satisfies

$$p^\nu = \frac{1 - p + Dp}{-F \ln p}.$$

A necessary condition for ν > 1 is $-pF \ln p > 1 + (D - 1)p$. Note that again the critical penalties are reciprocal to the risk and that the relation between F and p is also logarithmic.

Now assume that the response time is distributed with some CDF f, and also that a node does not respond at all with probability p̄. Then the probability p of missing the deadline D is

$$p = \bar{p} + (1 - f(D))(1 - \bar{p}).$$

This allows one to choose the optimal deadline together with the optimal replication, provided that the distribution function f is known. Let us consider the case p̄ = 0. If $(1 - f(D))D \to 0$ as $D \to \infty$ (the case for most popular distributions), then the optimal deadline is infinite; this means that the deadline is useless, and it is reasonable to just wait for an answer, paying, on average, the mean unit cost. However, for the Cauchy distribution $(1 - f(D))D \to 1$, so that $E \to 2$; and in the exotic case of a CDF f(D) such that $D(1 - f(D)) \to \infty$, some finite deadline is still necessary. However, the usual case for a Desktop Grid is p̄ > 0: nodes may leave the project, temporarily or permanently.
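As a quick illustration (ours, not from the paper): since the counter-sabotage cost $E = \nu + p^\nu F$ is the deadline cost with D = 1, the optimal replication level in both subsections can be found by a bounded scan over integer ν. A minimal Python sketch; the names mean_cost and opt_nu are ours:

# Minimal sketch: scan the integer replication level nu minimizing
# E(nu) = nu * (1 - p + D*p) + p**nu * F   (Sec. 2.2; D = 1 gives Sec. 2.1).
def mean_cost(nu, p, F, D=1.0):
    return nu * (1.0 - p + D * p) + p**nu * F

def opt_nu(p, F, D=1.0, nu_max=100):
    # E grows almost linearly for large nu, so a bounded scan suffices.
    return min(range(1, nu_max + 1), key=lambda nu: mean_cost(nu, p, F, D))

# Example: p = 0.01, F = 10**4; critical penalties behave like 1/p.
print(opt_nu(0.01, 1e4))  # -> 2: one extra replica pays off

A scan is preferable here to solving the transcendental condition $p^\nu \ln(1/p) = 1/F$ exactly, since ν must be an integer anyway.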
Now assume that there are M increasing deadlines $D_j$, j = 1, ..., M, and that the probability of violating deadline j is $p_j$. Add the artificial deadlines $D_0 = 0$ and $D_{M+1} = D_M$; in fact, the last deadline is infinite, but if $D_M$ is violated, the task is cancelled. The sequence $p_j$ decreases with respect to j, with $p_0 = 1$ (no task can be returned instantly) and $p_{M+1} = 0$ (an infinite deadline cannot be violated). Violation of the deadline $D_j$ costs the penalty $F_j$; the penalties increase with respect to j, and let $F_1 = 0$ (no penalty for quick work). Then the average cost is

$$E = \nu \sum_{j=1}^{M+1} p_{j-1}(1 - p_j) D_j + \sum_{i=1}^{M+1} \left(1 - p_i^\nu\right) p_{i-1}^\nu F_i.$$

This function has a minimum at some ν ≥ 1 because for large ν it grows almost linearly: the second term tends to zero quite quickly as ν → ∞.

2.3 The reliable computer and penalty/cost estimations

Let us estimate the penalty using a reliable computer with p = 0 but a high cost C of usage. The set of answers is binary (yes/no) with no information on their probabilities. It is clear that if this computer is better (from the point of view of the mean total cost) than the Desktop Grid, the latter is just useless. So assume that p = 0.5, which means that tossing a coin is as good as solving tasks. Then the expected penalty is 0.5F, which must exceed C: otherwise tossing a coin is better than solving all tasks on reliable nodes. So F > 2C is a necessary condition on the penalty value. If there are S possible answers, guessing uniformly is right with probability 1/S, so the estimation becomes (S − 1)F > SC. In the case of known probabilities $\alpha_i$ of the answers, tossing a coin is replaced by guessing the most likely answer (which has probability Q), yielding the estimate (1 − Q)F > C. Indeed, choosing a distribution $x_i$ for guessing an answer in order to maximize the probability of guessing right, we have the linear optimization problem

$$\sum_i \alpha_i x_i \to \max, \quad \sum_i x_i = 1, \quad x_i \ge 0.$$

Assume that the $\alpha_i$ decrease and eliminate $x_1$ using the constraint to get the linear problem

$$\alpha_1 + \sum_{i>1} (\alpha_i - \alpha_1) x_i \to \max, \quad \sum_{i>1} x_i \le 1, \quad x_i \ge 0.$$

The constant gradient has non-positive components $\alpha_i - \alpha_1$, so the maximum is attained where all $x_i$, i > 1, vanish; hence $x_1 = 1$, i.e., one should always guess the most likely answer. Note that the uniform distribution is the entropy-maximizing distribution on a finite set of points with no restrictions. In the case of a countable set such a solution does not exist; however, fixing the expectation W > 0 of the distribution provides the solution, which is the geometric distribution $P(1 - P)^i$ on i = 0, 1, ..., where

$$P = \frac{1}{W + 1}.$$

The discussed estimation is then (1 − P)F > C.

Now let us obtain the other estimation. In the case of relatively reliable computing nodes, the optimal cost must be less than C: otherwise there is no need to use the Desktop Grid. If p is small, so that $p^2$ is negligible with respect to p, then a wrong result is accepted with probability

$$p^\nu \sum_{i=0}^{\nu-1} \binom{\nu + i - 1}{\nu - 1} = p^\nu \binom{2\nu - 1}{\nu - 1} = \frac{p^\nu}{2} \binom{2\nu}{\nu} \sim \frac{(4p)^\nu}{2\sqrt{\pi\nu}}.$$

The average cost without the penalty, discarding terms with p, is ν: most of the time we get ν correct answers in a row. So the reliable cost must be high enough to make using the Desktop Grid reasonable:

$$\nu + \frac{(4p)^\nu}{2\sqrt{\pi\nu}}\, F < C.$$

Now let the Desktop Grid contain several groups of computing nodes that differ in reliability and cost, so that each group i solves a task at some mean cost $E_i > 0$ provided by its own quorum $\nu_i$. The penalty F for accepting a wrong answer is the same for all groups. If we try to determine the relative amounts $\psi_i$ of tasks sent to each group in order to minimize the cost, we get the simple linear optimization problem

$$T = \sum_i \psi_i E_i \to \min, \quad \sum_i \psi_i = 1, \quad \psi_i \ge 0.$$

Without loss of generality, assume that the $E_i$ increase with respect to i. This problem has the solution $\psi_1 = 1$, all other $\psi_i = 0$, so all tasks are computed on the cheapest (with replication taken into account) nodes. Indeed, express $\psi_1$ from the constraint and substitute it into the expression for T:

$$T = E_1 + \sum_{i>1} \psi_i (E_i - E_1) \to \min, \quad \sum_{i>1} \psi_i \le 1, \quad \psi_i \ge 0.$$

The gradient of T is positive, so the descent eliminates all $\psi_i$, i > 1, making $\psi_1 = 1$. Note that a similar argument remains true in a rather general nonlinear case.

However, a high load on a node would generally increase its cost per task, simply because more tasks need to be solved. Then the optimal $\psi_i$ are such that all costs with $\psi_i > 0$ are equal, while the costs with $\psi_i = 0$ are higher. This means that hopeless nodes are excluded, while nodes with an acceptable cost/reliability ratio are loaded in such a way that the average total cost is the same; this allows one to estimate the basic cost of computing a task. If the nodes choose their $\psi_i$ themselves and are paid the same T, this is a congestion game; such games were applied to scheduling in [NIT17]. The game approach allows the cost to be estimated in a decentralized way.
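To illustrate this equal-cost balancing numerically (our sketch; the paper prescribes no algorithm), assume a simple linear load model $E_i(\psi_i) = a_i + b_i \psi_i$, where the coefficients a_i, b_i and the model itself are hypothetical, and find the common cost level T by bisection:

# Sketch of equal-cost load balancing under an assumed linear load model
# E_i(psi) = a_i + b_i * psi (an illustration only, not the paper's model).
import numpy as np

def balance(a, b, tol=1e-12):
    # Find psi >= 0 with sum(psi) = 1 such that E_i(psi_i) is the same
    # for all loaded groups, while unloaded groups are more expensive.
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo, hi = a.min(), (a + b).max() + 1.0    # bracket for the common cost T
    while hi - lo > tol:
        T = 0.5 * (lo + hi)
        psi = np.maximum(0.0, (T - a) / b)   # groups with a_i >= T get no load
        lo, hi = (T, hi) if psi.sum() < 1.0 else (lo, T)
    return psi

print(balance(a=[1.0, 1.2, 3.0], b=[0.5, 0.5, 0.5]))  # -> approx. [0.7, 0.3, 0.0]

The hopeless third group is excluded, while the two loaded groups end up with the same cost, matching the characterization above.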
3 Groups of tasks

Forming task parcels that contain many tasks and are treated as a single task can be useful in Desktop Grid computing. We faced the problem of performance loss due to much additional work when a Desktop Grid solved many very quick tasks, so that the server spent much time processing reports and task-retrieval requests. Grouping the tasks into parcels of 1000 simple tasks each improved the performance significantly.

3.1 Reducing additional costs

Let us estimate the reasonable parcel size depending on the reliability of the computing nodes and the additional costs. Denote the parcel size by n and the total number of tasks by N. The cost of evaluating a single task is taken as the unit; it includes the cost (or time) of uploading the task and all necessary information. Let c be the cost of the additional work for the whole parcel. Assume that the probability of solving a task is q (so p = 1 − q is the probability of a fault) and that tasks are solved independently. Then the probability of processing the whole parcel with no errors is $q^n$; in case of a fault the whole parcel needs to be solved again. Solving a parcel in k attempts costs (n + c)k units and has probability $(1 - q^n)^{k-1} q^n$, so the expected cost of solving a parcel is

$$E_n = \frac{n + c}{q^n}.$$

Solving all N tasks costs approximately N/n times more, which gives us the average cost per task:

$$E = \frac{N}{n} \cdot \frac{E_n}{N} = \frac{n + c}{n q^n}. \quad (1)$$

To find the minimal value of this cost, set

$$E' = \frac{n q^n - (n + c)\left(q^n + n q^n \ln q\right)}{n^2 q^{2n}} = \frac{n - (n + c)(1 + n \ln q)}{n^2 q^n} = 0.$$

This equation reduces to $A n^2 + A c n - c = 0$, where $A = -\ln q$. This quadratic equation has the single positive root

$$n^* = \frac{\sqrt{A^2 c^2 + 4Ac} - Ac}{2A}.$$

For example, for c = 1 and $p = 10^{-2}$ we have $n^* \approx 9$. Note that $n^*$ does not depend on the total number N of tasks. Consider the inequality $n^* > 1$ to see when using parcels is reasonable at all: this holds only if $q > e^{-1}$ and

$$c > \frac{A}{1 - A}, \quad A = \ln\frac{1}{q}.$$

In other words, grouping can help if the nodes are reliable enough and the additional costs are high enough. For the reliability reported at the BOINC web site (q ≈ 0.95), grouping seems reasonable once the additional cost exceeds about 5.4% of the cost of a single task: c > 0.054.
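The closed-form root is easy to evaluate; a minimal Python sketch (the helper name is ours) reproduces both numerical examples above:

# Sketch: optimal parcel size n* = (sqrt(A^2 c^2 + 4 A c) - A c) / (2 A),
# A = -ln q, which minimizes E(n) = (n + c) / (n * q**n) from Eq. (1).
import math

def opt_parcel_size(q, c):
    A = -math.log(q)
    return (math.sqrt(A * A * c * c + 4 * A * c) - A * c) / (2 * A)

print(round(opt_parcel_size(0.99, 1.0)))  # -> 9     (c = 1, p = 10**-2)
A = -math.log(0.95)
print(round(A / (1 - A), 3))              # -> 0.054 (threshold for q = 0.95)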
3.2 Embedding test tasks

Let us assume that tasks are sent to the computing nodes in parcels of N tasks each; in order to reduce the risk of accepting a wrong answer, k answers of each parcel are replicated by being included into other parcels. If at least one replicated task produces different answers, the whole parcel is solved again; as a first approximation, we assume that a parcel solved again is solved correctly. Such embedding of test tasks into parcels can be called "fractional replication": a part of each parcel is replicated. A wrong answer is obtained with some probability p, and the probability of the right answer is q = 1 − p. If a wrong answer is accepted, some penalty F is added to the overall computation cost.

The cost function consists of the additional time spent on checks and of the penalties. There is no alarm if both replicas of each replicated task produce the same answer, either right or wrong; in other words, the probability of the "no alarm" case is

$$\sum_{i=0}^{k} \binom{k}{i} \left(p^i q^{k-i}\right)^2.$$

All terms except i = 0 contain errors and, therefore, penalties. The following cases are possible:

1. No errors; the additional cost is k, the probability is $q^{N+k}$.
2. An error, all control tasks solved correctly; the cost is k + F, the probability is $q^{2k}(1 - q^{N-k}) = q^{2k} - q^{N+k}$.
3. An error among the control tasks, no alarm; the cost is k + F, the probability is $\sum_{i=1}^{k} \binom{k}{i} \left(p^i q^{k-i}\right)^2$.
4. An error detected; the cost is k + N, the probability is $1 - \sum_{i=0}^{k} \binom{k}{i} \left(p^i q^{k-i}\right)^2$.

So the average cost is

$$E_N^k = k + F\left(q^{2k} - q^{N+k} + \sum_{i=1}^{k} \binom{k}{i} p^{2i} q^{2(k-i)}\right) + N\left(1 - \sum_{i=0}^{k} \binom{k}{i} p^{2i} q^{2(k-i)}\right).$$

This quantity increases with respect to N for a fixed k; as k ≤ N, the optimal N equals k. This means that no grouping is necessary for optimizing this cost; however, the parcel size N can be fixed for other reasons, as noted above. Then the solution to the optimization problem is N = k, which means complete replication (duplication) of the whole parcel. These preliminary estimations seem promising for the improved task-parcel management to be developed.
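A direct transcription of $E_N^k$ into Python (our sketch) makes the growth in N easy to observe numerically:

# Sketch: expected cost E_N^k of a parcel of N tasks with k embedded
# test tasks, transcribed from the displayed formula; q = 1 - p.
from math import comb

def expected_cost(N, k, p, F):
    q = 1.0 - p
    agree = lambda i: comb(k, i) * p**(2 * i) * q**(2 * (k - i))
    err_no_alarm = sum(agree(i) for i in range(1, k + 1))   # case 3
    no_alarm = agree(0) + err_no_alarm                      # cases 1-3
    return (k
            + F * (q**(2 * k) - q**(N + k) + err_no_alarm)  # penalties, cases 2-3
            + N * (1.0 - no_alarm))                         # re-solving, case 4

print([round(expected_cost(N, 5, 0.01, 1e3), 1) for N in (5, 10, 20)])

The printed costs increase with N for fixed k, in line with the conclusion that the mean-cost optimum is the degenerate choice N = k.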
Conclusion

The mean-cost approach has been shown here to be effective not only for improving performance by replication, but also for increasing sabotage tolerance, decreasing deadline-violation losses, and effectively using nodes with various reliability/cost ratios. Penalties and costs can be estimated a priori if they are not known precisely. Task grouping (formation of task parcels) can improve efficiency due to lower additional costs of client-server interaction and a lower server load, provided the computing nodes are reliable while the additional losses are relatively high; estimations for these quantities have been presented. Adding test tasks to parcels, though possibly useful, can hardly serve as a "partial replication" method if the only optimization criterion is the mean cost.

Acknowledgements

This work is supported by the Russian Foundation for Basic Research (grant number 16-07-00622).

References

[Che16] I. Chernov. Theoretical study of replication in desktop grid computing: Minimizing the mean cost. In Proceedings of the 2nd Applications in Information Technology (ICAIT-2016), International Conference on, pages 125–129, Aizu-Wakamatsu, Japan, 2016.

[CKBH06] S.J. Choi, H.S. Kim, E.J. Byun, and C.S. Hwan. A taxonomy of desktop grid systems focusing on scheduling. Technical report KU-CSE-2006-1120-02, Department of Computer Science and Engineering, Korea University, 2006.

[CN15] I.A. Chernov and N.N. Nikitina. Virtual screening in a desktop grid: Replication and the optimal quorum. In V. Malyshkin, editor, Parallel Computing Technologies, International Conference on, volume 9251, pages 258–267. Springer, 2015.

[DS14] N.M. Durrani and J.A. Shamsi. Volunteer computing: requirements, challenges, and solutions. Journal of Network and Computer Applications, 39:369–380, March 2014.

[ET12] T. Estrada and M. Taufer. Challenges in designing scheduling policies in volunteer computing. In C. Cérin and G. Fedak, editors, Desktop Grid Computing, pages 167–190. CRC Press, 2012.

[KMH17] M.Kh. Khan, T. Mahmood, and S.I. Hyder. Scheduling in desktop grid systems: Theoretical evaluation of policies and frameworks. International Journal of Advanced Computer Science and Applications, 8(1):119–127, 2017.

[NIT17] N. Nikitina, E. Ivashko, and A. Tchernykh. Congestion game scheduling for virtual drug screening optimization. Submitted to The Journal of Chemical Information and Modeling, 2017.

[XA10] F. Xhafa and A. Abraham. Computational models and heuristic methods for grid scheduling problems. Future Generation Computer Systems, 26(4):608–621, April 2010.