Optimal Mean Cost Replication in Desktop Grids

Ilya A. Chernov
IAMR KRC RAS, PetrSU
Petrozavodsk, Russia
IAChernov@yandex.ru

Abstract

In this paper we consider optimization problems for task scheduling in Desktop Grids; the approaches used here have been only weakly reported in the literature so far, despite their potential efficiency, demonstrated below by simple estimations. We propose the mean-cost approach and show how it can be used for different purposes, including counter-sabotage and deadline safety; we derive necessary inequalities for the costs and show that task grouping can be efficient for reducing extra losses but can hardly be used as a sort of replication.

1 Introduction

The mathematics of Desktop Grid task scheduling is developing quickly; it is enough to mention the surveys [KMH17, ET12, DS14, XA10, CKBH06]. Much attention is paid to efficiency, in terms of throughput, makespan, or the total time to complete a batch of tasks, and to reliability, defined as a low risk of getting a wrong or invalid answer or no answer at all, including sabotage-tolerance techniques. There are other points of interest, including availability of nodes, load balancing, and minimization of the peak load, that are out of the scope of this work. Obviously, different criteria contradict, or at least conflict with, each other, and some effort has been devoted to improving two or more criteria simultaneously. However, to the best of our knowledge, nobody has considered optimization of the average cost with all conflicting factors converted to the same unit (time, currency, etc.).

In [Che16, CN15] we studied such problems for optimal replication. The only assumption about the computing system was that tasks are solved independently, with a unit cost and the same risk of producing a wrong answer from a fixed set (due to malfunction, errors, malicious activity, etc.). Each task is replicated until ν identical answers are obtained. If a wrong answer is accepted (which will sooner or later show up), some penalty F is added to the overall cost. This approach allows one to choose the optimal replication level and can also be helpful for, e.g., choosing parameters of the computing algorithm: higher precision usually reduces the risk of an error but needs more time, and replication with a cheaper algorithm is sometimes more efficient, provided, of course, that the calculations are independent. Interesting results obtained there were the quick growth of the critical penalties with respect to risk levels ($F \sim p^{-1}$), so that there is no need to know penalties precisely; the low cost of overestimating the optimal replication for rare (and thus valuable) results; and the more careful checking of less valuable answers.

Here we apply the approach to other cases, obtaining rather simple estimations that can be useful (we hope) for developing scheduling algorithms. We also consider estimations for task grouping, which can serve a number of purposes: improving efficiency, checking, and a kind of "fractional" replication (only some tasks of a parcel are double-checked). However, task grouping seems insufficiently studied in the literature so far.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: E. Ivashko, A. Rumyantsev (eds.): Proceedings of the Third International Conference BOINC:FAST 2017, Petrozavodsk, Russia, August 28 - September 01, 2017, published at http://ceur-ws.org
2 The mean-cost approach

2.1 Counter-sabotage scheduling

Assume that a wrong answer can be accepted due to malefactors, and all that is known is the probability p of sending a task to a malicious computing node. We use replication (ν copies) to reveal the malefactors: if at least two replicas produce different results, the task is re-solved and the saboteur is discriminated. In case a wrong answer has been accepted, some penalty F, expressed in terms of the unit cost of a single processing of a task, is paid. Then the average total cost is

$$E = \nu + p^\nu F.$$

This function has a minimum either at ν = 1 (no replication) or at some ν > 1, because for very high ν it grows almost linearly to infinity. The necessary condition for the minimum is

$$p^\nu \ln\frac{1}{p} = \frac{1}{F},$$

which means that the optimal ν depends on the logarithm of the penalty and on the risk level. If p is small, so that $p^2 \ll p$, we are able to derive simple tests: the replication level ν + 1 is better than ν if $p^\nu F > 1$. So, here again the critical penalties (those that force a change of the replication level) are reciprocals of the risk level.

2.2 Deadlines

Consider a reliable computing system with equal nodes and pseudo-answers yes and no, obtained if at least one replica (respectively, no replica) is solved before the deadline D, expressed in terms of the unit solving cost for an answer. The probability of violating the deadline (i.e., of the "no" answer) is p, and if the "no" is "accepted", i.e., all replicas violated the deadline, some penalty is added to the overall cost. Denote the number of replicas by ν. The probability of i answers obtained in time and ν − i not is $P = C_\nu^i q^i p^{\nu-i}$. The cost in this case is $i \cdot 1 + (\nu - i)D$, so the random cost is the sum of two binomial random variables; its average equals $E = \nu(1 - p + Dp)$. A violated deadline means the penalty F, so we add the expected penalty:

$$E = \nu(1 - p + Dp) + p^\nu F.$$

To minimize it, consider the derivative $E' = (1 - p + Dp) + p^\nu F \ln p$, so that $\nu^*$ satisfies

$$p^\nu = \frac{1 - p + Dp}{-F \ln p}.$$

A necessary condition for ν > 1 is $-pF \ln p > 1 + (D - 1)p$. Note that again the critical penalties are reciprocal to the risk and that the relation between F and p is also logarithmic.

Now assume that the response time is distributed with some CDF f, and also that a node does not respond at all with probability p̄. Then the probability p of missing the deadline D is

$$p = \bar{p} + (1 - f(D))(1 - \bar{p}).$$

This allows one to choose the optimal deadline together with the optimal replication, provided that the distribution function f is known. Let us consider the case p̄ = 0. If $(1 - f(D))D \to 0$ as $D \to \infty$ (the case for most popular distributions), then the optimal deadline is infinite; this means that the deadline is useless, and it is reasonable to just wait for an answer, paying, on average, the mean unit cost. However, for the Cauchy distribution $(1 - f(D))D \to 1$, so that $E \to 2$; and in the exotic case of a CDF f(D) such that $D(1 - f(D)) \to \infty$, some finite deadline is still necessary. However, the usual case for a Desktop Grid is p̄ > 0: nodes may leave the project, temporarily or permanently.
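As a quick illustration (ours, not from the paper): since the counter-sabotage cost $E = \nu + p^\nu F$ is the deadline cost with D = 1, the optimal replication level in both subsections can be found by a bounded scan over integer ν. A minimal Python sketch; the names mean_cost and opt_nu are ours:

# Minimal sketch: scan the integer replication level nu minimizing
# E(nu) = nu * (1 - p + D*p) + p**nu * F   (Sec. 2.2; D = 1 gives Sec. 2.1).
def mean_cost(nu, p, F, D=1.0):
    return nu * (1.0 - p + D * p) + p**nu * F

def opt_nu(p, F, D=1.0, nu_max=100):
    # E grows almost linearly for large nu, so a bounded scan suffices.
    return min(range(1, nu_max + 1), key=lambda nu: mean_cost(nu, p, F, D))

# Example: p = 0.01, F = 10**4; critical penalties behave like 1/p.
print(opt_nu(0.01, 1e4))  # -> 2: one extra replica pays off

A scan is preferable here to solving the transcendental condition $p^\nu \ln(1/p) = 1/F$ exactly, since ν must be an integer anyway.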
Now assume that there are M increasing deadlines $D_j$, j = 1, ..., M, and that the probability of violating deadline j is $p_j$. Add the artificial deadlines $D_0 = 0$ and $D_{M+1} = D_M$; in fact, the last deadline is infinite, but if $D_M$ is violated, the task is cancelled. The sequence $p_j$ decreases with respect to j, with $p_0 = 1$ (no task can be returned instantly) and $p_{M+1} = 0$ (an infinite deadline cannot be violated). Violation of the deadline $D_j$ costs the penalty $F_j$; the penalties increase with respect to j, and let $F_1 = 0$ (no penalty for quick work). Then the average cost is

$$E = \nu \sum_{j=1}^{M+1} p_{j-1}(1 - p_j) D_j + \sum_{i=1}^{M+1} \left(1 - p_i^\nu\right) p_{i-1}^\nu F_i.$$

This function has a minimum at some ν ≥ 1 because for large ν it grows almost linearly: the second term tends to zero quite quickly as ν → ∞.

2.3 The reliable computer and penalty/cost estimations

Let us estimate the penalty using a reliable computer with p = 0 but a high cost C of usage. The set of answers is binary (yes/no) with no information on their probabilities. It is clear that if this computer is better (from the point of view of the mean total cost) than the Desktop Grid, the latter is just useless. So assume that p = 0.5, which means that tossing a coin is as good as solving tasks. Then the expected penalty is 0.5F, which must exceed C: otherwise tossing a coin is better than solving all tasks on reliable nodes. So F > 2C is a necessary condition on the penalty value. If there are S possible answers, guessing uniformly is right with probability 1/S, so the estimation becomes (S − 1)F > SC. In the case of known probabilities $\alpha_i$ of the answers, tossing a coin is replaced by guessing the most likely answer (which has probability Q), yielding the estimate (1 − Q)F > C. Indeed, choosing a distribution $x_i$ for guessing an answer in order to maximize the probability of guessing right, we have the linear optimization problem

$$\sum_i \alpha_i x_i \to \max, \quad \sum_i x_i = 1, \quad x_i \ge 0.$$

Assume that the $\alpha_i$ decrease and eliminate $x_1$ using the constraint to get the linear problem

$$\alpha_1 + \sum_{i>1} (\alpha_i - \alpha_1) x_i \to \max, \quad \sum_{i>1} x_i \le 1, \quad x_i \ge 0.$$

The constant gradient has non-positive components $\alpha_i - \alpha_1$, so the maximum is attained where all $x_i$, i > 1, vanish; hence $x_1 = 1$, i.e., one should always guess the most likely answer. Note that the uniform distribution is the entropy-maximizing distribution on a finite set of points with no restrictions. In the case of a countable set such a solution does not exist; however, fixing the expectation W > 0 of the distribution provides the solution, which is the geometric distribution $P(1 - P)^i$ on i = 0, 1, ..., where

$$P = \frac{1}{W + 1}.$$

The discussed estimation is then (1 − P)F > C.

Now let us obtain the other estimation. In the case of relatively reliable computing nodes, the optimal cost must be less than C: otherwise there is no need to use the Desktop Grid. If p is small, so that $p^2$ is negligible with respect to p, then a wrong result is accepted with probability

$$p^\nu \sum_{i=0}^{\nu-1} \binom{\nu + i - 1}{\nu - 1} = p^\nu \binom{2\nu - 1}{\nu - 1} = \frac{p^\nu}{2} \binom{2\nu}{\nu} \sim \frac{(4p)^\nu}{2\sqrt{\pi\nu}}.$$

The average cost without the penalty, discarding terms with p, is ν: most of the time we get ν correct answers in a row. So the reliable cost must be high enough to make using the Desktop Grid reasonable:

$$\nu + \frac{(4p)^\nu}{2\sqrt{\pi\nu}}\, F < C.$$

Now let the Desktop Grid contain several groups of computing nodes that differ in reliability and cost, so that each group i solves a task at some mean cost $E_i > 0$ provided by its own quorum $\nu_i$. The penalty F for accepting a wrong answer is the same for all groups. If we try to determine the relative amounts $\psi_i$ of tasks sent to each group in order to minimize the cost, we get the simple linear optimization problem

$$T = \sum_i \psi_i E_i \to \min, \quad \sum_i \psi_i = 1, \quad \psi_i \ge 0.$$

Without loss of generality, assume that the $E_i$ increase with respect to i. This problem has the solution $\psi_1 = 1$, all other $\psi_i = 0$, so all tasks are computed on the cheapest (with replication taken into account) nodes. Indeed, express $\psi_1$ from the constraint and substitute it into the expression for T:

$$T = E_1 + \sum_{i>1} \psi_i (E_i - E_1) \to \min, \quad \sum_{i>1} \psi_i \le 1, \quad \psi_i \ge 0.$$

The gradient of T is positive, so the descent eliminates all $\psi_i$, i > 1, making $\psi_1 = 1$. Note that a similar argument remains true in a rather general nonlinear case.

However, a high load on a node would generally increase its cost per task, simply because more tasks need to be solved. Then the optimal $\psi_i$ are such that all costs with $\psi_i > 0$ are equal, while the costs with $\psi_i = 0$ are higher. This means that hopeless nodes are excluded, while nodes with an acceptable cost/reliability ratio are loaded in such a way that the average total cost is the same; this allows one to estimate the basic cost of computing a task. If the nodes choose their $\psi_i$ themselves and are paid the same T, this is a congestion game; such games were applied to scheduling in [NIT17]. The game approach allows the cost to be estimated in a decentralized way.
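To illustrate this equal-cost balancing numerically (our sketch; the paper prescribes no algorithm), assume a simple linear load model $E_i(\psi_i) = a_i + b_i \psi_i$, where the coefficients a_i, b_i and the model itself are hypothetical, and find the common cost level T by bisection:

# Sketch of equal-cost load balancing under an assumed linear load model
# E_i(psi) = a_i + b_i * psi (an illustration only, not the paper's model).
import numpy as np

def balance(a, b, tol=1e-12):
    # Find psi >= 0 with sum(psi) = 1 such that E_i(psi_i) is the same
    # for all loaded groups, while unloaded groups are more expensive.
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo, hi = a.min(), (a + b).max() + 1.0    # bracket for the common cost T
    while hi - lo > tol:
        T = 0.5 * (lo + hi)
        psi = np.maximum(0.0, (T - a) / b)   # groups with a_i >= T get no load
        lo, hi = (T, hi) if psi.sum() < 1.0 else (lo, T)
    return psi

print(balance(a=[1.0, 1.2, 3.0], b=[0.5, 0.5, 0.5]))  # -> approx. [0.7, 0.3, 0.0]

The hopeless third group is excluded, while the two loaded groups end up with the same cost, matching the characterization above.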
3 Groups of tasks

Forming task parcels that contain many tasks and are treated as a single task can be useful in Desktop Grid computing. We faced the problem of performance loss due to much additional work when a Desktop Grid solved many very quick tasks, so that the server spent much time processing reports and task-retrieval requests. Grouping the tasks into parcels of 1000 simple tasks each improved the performance significantly.

3.1 Reducing additional costs

Let us estimate the reasonable parcel size depending on the reliability of the computing nodes and the additional costs. Denote the parcel size by n and the total number of tasks by N. The cost of evaluating a single task is taken as the unit; it includes the cost (or time) of uploading the task and all necessary information. Let c be the cost of the additional work for the whole parcel. Assume that the probability of solving a task is q (so p = 1 − q is the probability of a fault) and that tasks are solved independently. Then the probability of processing the whole parcel with no errors is $q^n$; in case of a fault the whole parcel needs to be solved again. Solving a parcel in k attempts costs (n + c)k units and has probability $(1 - q^n)^{k-1} q^n$, so the expected cost of solving a parcel is

$$E_n = \frac{n + c}{q^n}.$$

Solving all N tasks costs approximately N/n times more, which gives us the average cost per task:

$$E = \frac{N}{n} \cdot \frac{E_n}{N} = \frac{n + c}{n q^n}. \quad (1)$$

To find the minimal value of this cost, set

$$E' = \frac{n q^n - (n + c)\left(q^n + n q^n \ln q\right)}{n^2 q^{2n}} = \frac{n - (n + c)(1 + n \ln q)}{n^2 q^n} = 0.$$

This equation reduces to $A n^2 + A c n - c = 0$, where $A = -\ln q$. This quadratic equation has the single positive root

$$n^* = \frac{\sqrt{A^2 c^2 + 4Ac} - Ac}{2A}.$$

For example, for c = 1 and $p = 10^{-2}$ we have $n^* \approx 9$. Note that $n^*$ does not depend on the total number N of tasks. Consider the inequality $n^* > 1$ to see when using parcels is reasonable at all: this holds only if $q > e^{-1}$ and

$$c > \frac{A}{1 - A}, \quad A = \ln\frac{1}{q}.$$

In other words, grouping can help if the nodes are reliable enough and the additional costs are high enough. For the reliability reported at the BOINC web site (q ≈ 0.95), grouping seems reasonable once the additional cost exceeds about 5.4% of the cost of a single task: c > 0.054.
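The closed-form root is easy to evaluate; a minimal Python sketch (the helper name is ours) reproduces both numerical examples above:

# Sketch: optimal parcel size n* = (sqrt(A^2 c^2 + 4 A c) - A c) / (2 A),
# A = -ln q, which minimizes E(n) = (n + c) / (n * q**n) from Eq. (1).
import math

def opt_parcel_size(q, c):
    A = -math.log(q)
    return (math.sqrt(A * A * c * c + 4 * A * c) - A * c) / (2 * A)

print(round(opt_parcel_size(0.99, 1.0)))  # -> 9     (c = 1, p = 10**-2)
A = -math.log(0.95)
print(round(A / (1 - A), 3))              # -> 0.054 (threshold for q = 0.95)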
3.2 Embedding test tasks

Let us assume that tasks are sent to the computing nodes in parcels of N tasks each; in order to reduce the risk of accepting a wrong answer, k answers of each parcel are replicated by being included into other parcels. If at least one replicated task produces different answers, the whole parcel is solved again; as a first approximation, we assume that a parcel solved again is solved correctly. Such embedding of test tasks into parcels can be called "fractional replication": a part of each parcel is replicated. A wrong answer is obtained with some probability p, and the probability of the right answer is q = 1 − p. If a wrong answer is accepted, some penalty F is added to the overall computation cost.

The cost function consists of the additional time spent on checks and of the penalties. There is no alarm if both replicas of each replicated task produce the same answer, either right or wrong; in other words, the probability of the "no alarm" case is

$$\sum_{i=0}^{k} \binom{k}{i} \left(p^i q^{k-i}\right)^2.$$

All terms except i = 0 contain errors and, therefore, penalties. The following cases are possible:

1. No errors; the additional cost is k, the probability is $q^{N+k}$.
2. An error, all control tasks solved correctly; the cost is k + F, the probability is $q^{2k}(1 - q^{N-k}) = q^{2k} - q^{N+k}$.
3. An error among the control tasks, no alarm; the cost is k + F, the probability is $\sum_{i=1}^{k} \binom{k}{i} \left(p^i q^{k-i}\right)^2$.
4. An error detected; the cost is k + N, the probability is $1 - \sum_{i=0}^{k} \binom{k}{i} \left(p^i q^{k-i}\right)^2$.

So the average cost is

$$E_N^k = k + F\left(q^{2k} - q^{N+k} + \sum_{i=1}^{k} \binom{k}{i} p^{2i} q^{2(k-i)}\right) + N\left(1 - \sum_{i=0}^{k} \binom{k}{i} p^{2i} q^{2(k-i)}\right).$$

This quantity increases with respect to N for a fixed k; as k ≤ N, the optimal N equals k. This means that no grouping is necessary for optimizing this cost; however, the parcel size N can be fixed for other reasons, as noted above. Then the solution to the optimization problem is N = k, which means complete replication (duplication) of the whole parcel. These preliminary estimations seem promising for the improved task-parcel management to be developed.
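A direct transcription of $E_N^k$ into Python (our sketch) makes the growth in N easy to observe numerically:

# Sketch: expected cost E_N^k of a parcel of N tasks with k embedded
# test tasks, transcribed from the displayed formula; q = 1 - p.
from math import comb

def expected_cost(N, k, p, F):
    q = 1.0 - p
    agree = lambda i: comb(k, i) * p**(2 * i) * q**(2 * (k - i))
    err_no_alarm = sum(agree(i) for i in range(1, k + 1))   # case 3
    no_alarm = agree(0) + err_no_alarm                      # cases 1-3
    return (k
            + F * (q**(2 * k) - q**(N + k) + err_no_alarm)  # penalties, cases 2-3
            + N * (1.0 - no_alarm))                         # re-solving, case 4

print([round(expected_cost(N, 5, 0.01, 1e3), 1) for N in (5, 10, 20)])

The printed costs increase with N for fixed k, in line with the conclusion that the mean-cost optimum is the degenerate choice N = k.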
Conclusion

The mean-cost approach has been shown here to be effective not only for improving performance by replication, but also for increasing sabotage tolerance, decreasing deadline-violation losses, and effectively using nodes with various reliability/cost ratios. Penalties and costs can be estimated a priori if they are not known precisely. Task grouping (formation of task parcels) can improve efficiency due to lower additional costs of client-server interaction and a lower server load, provided the computing nodes are reliable while the additional losses are relatively high; estimations for these quantities have been presented. Adding test tasks to parcels, though possibly useful, can hardly serve as a "partial replication" method if the only optimization criterion is the mean cost.

Acknowledgements

This work is supported by the Russian Foundation for Basic Research (grant number 16-07-00622).

References

[Che16] I. Chernov. Theoretical study of replication in desktop grid computing: Minimizing the mean cost. In Proceedings of the 2nd Applications in Information Technology (ICAIT-2016), International Conference on, pages 125–129, Aizu-Wakamatsu, Japan, 2016.

[CKBH06] S.J. Choi, H.S. Kim, E.J. Byun, and C.S. Hwan. A taxonomy of desktop grid systems focusing on scheduling. Technical report KU-CSE-2006-1120-02, Department of Computer Science and Engineering, Korea University, 2006.

[CN15] I.A. Chernov and N.N. Nikitina. Virtual screening in a desktop grid: Replication and the optimal quorum. In V. Malyshkin, editor, Parallel Computing Technologies, International Conference on, volume 9251, pages 258–267. Springer, 2015.

[DS14] N.M. Durrani and J.A. Shamsi. Volunteer computing: requirements, challenges, and solutions. Journal of Network and Computer Applications, 39:369–380, March 2014.

[ET12] T. Estrada and M. Taufer. Challenges in designing scheduling policies in volunteer computing. In C. Cérin and G. Fedak, editors, Desktop Grid Computing, pages 167–190. CRC Press, 2012.

[KMH17] M.Kh. Khan, T. Mahmood, and S.I. Hyder. Scheduling in desktop grid systems: Theoretical evaluation of policies and frameworks. International Journal of Advanced Computer Science and Applications, 8(1):119–127, 2017.

[NIT17] N. Nikitina, E. Ivashko, and A. Tchernykh. Congestion game scheduling for virtual drug screening optimization. Submitted to The Journal of Chemical Information and Modeling, 2017.

[XA10] F. Xhafa and A. Abraham. Computational models and heuristic methods for grid scheduling problems. Future Generation Computer Systems, 26(4):608–621, April 2010.