Mathematical model of a ”tail” computation in a Desktop Grid Evgeny Ivashko Petrozavodsk State University Institute of Applied Mathematical Research Karelian Research Centre of RAS, Petrozavodsk, Russia ivashko@krc.karelia.ru Abstract. Task scheduling is an important problem of Desktop Grids. Among other special problems there is a ”tail” computation problem. It is related to a final stage of computation in a Desktop Grid, when the number of tasks is less than the number of computing nodes. The excess of computing power could be used to implement redundant computing. Special mathematical models are needed to reduce duration of this final stage. In the paper a mathematical model of the ”tail” computation is considered. It allows to estimate probability of reducing the time needed to finish a certain task by a certain computing node. So, a task with the maximal probability of time reducing effect could be replicated to a free node. 1 Introduction Desktop Grids are of special interest as a cheap, easy to install and support, and potentially powerful high-throughput computing tool. Desktop Grid is a computational paradigm based on a distributed computing system which uses idle time of non-dedicated geographically distributed general- purpose computing nodes (usually, personal computers) connected over Internet. Desktop Grids popularity is motivated by quick growth of the number of personal computers, as well as huge increase in their performance. The first large volunteer computing project SETI@home was launched in 1999, providing the basis for development of the BOINC platform. There are a number of middleware systems for Desktop Grid computing. However, the open source BOINC platform [EFT+ 06] is nowadays considered as a de facto standard among them. Today there are more than 60 active BOINC projects utilizing more than 15 million computers worldwide [boi22]. So, Desktop Grids hold their place among other high-performance systems, such as Computing Grid systems, computing clusters, and supercomputers. A Desktop Grid consists of a (large) number of computing nodes and a server which distributes tasks among the nodes. The workflow is as follows. A node asks Mathematical model of a ”tail” computation in a Desktop Grid 55 the server for work; the server replies sending one or more tasks to the node. The node performs calculations and finishing, sends the result (which is a solution of a task or an error report) back to the server. Task scheduling in Desktop Grids is an important problem. Among other special problems there is the ”tail” computation problem. It is related to a final stage of computation in a Desktop Grid, when the number of tasks is less than the number of computing nodes. Special mathematical models are needed to reduce duration of this final stage. In the paper a mathematical model of the ”tail” computation is considered. The structure of the paper is following. Section 2 describes motivation and related works. Section 3 is devoted to a mathematical model of the ”tail” com- putation. Finally, Section 4 contains final remarks and conclusion. 2 Motivation and Related Works As BOINC is the most popular Desktop Grid middleware, the paper is aimed at scheduling problem based on BOINC workflow. BOINC is based on the client-server architecture. The client part is software, which is able to employ idle resources of a computer for computations within one or multiple BOINC projects. It is available for computers with various hardware and software characteristics. The server part of BOINC consists of several subsystems responsible for tasks generation, distribution, results verification, assimilation, etc. It is based on Open Source software: Linux, Apache, mariadb, php, etc. For each computational task BOINC holds multiple mostly independent in- stances or replicas. Replicas are computed separately with different nodes, and then their results are compared with the aims of verification. The quorum con- cept is used to define the necessary number of successful results to obtain for one computational task. One of the received results should be accepted as the correct via results ”voting”. This mechanism is useful to overcome processing er- rors and sabotage. Moreover, the BOINC settings allow to create and distribute more task replicas dynamically as needed. BOINC employs PULL model to interact with computing nodes. It sometimes (in case of large number of computing nodes and low computational time of a single task) lead to high server load. To avoid it can be used PUSH model (a case of Enterprise Desktop Grid [Iva15]); another way is to reduce an overall server load forming effective-size parcels of tasks instead of sending to a client a single task [MNI15]. A deadline is set for each task instance to limit its completion time. If the server does not get a result before the deadline, the task instance is considered lost. This happened when a task processing fall into infinite loop or computing node itself become unavailable (abandon the Desktop Grid) and the server can not determine that. The replication mechanism as a form of redundant computing serves a num- ber of purposes. The main purpose is to increase reliability, by increasing the 56 Evgeny Ivashko chance to obtain the correct answer in time even if some nodes become un- available without having finished the task. This helps, in its turn, to improve efficiency (in particular, throughput of successful results). Replication and voting are incredibly important in volunteer computing as a counter-sabotage defence measure. The reputational quorum is a method of voting with higher-reputation (for example, in terms of reliability) nodes’ votes valued more: this approach employs both replication and reputation techniques the same time. In more de- tails the architecture of BOINC is described in [And04]. The problem of search for optimal replication parameters is studied in practice in [Kur16]; there are also different mathematical models developed with the same objective, for ex- ample, [MNI15]. A valuable problem of task scheduling in Desktop Grids is optimization of the “tail” computation. A distributed computational experiment involving a batch of tasks inherently consists of two stages (see Fig. 1). At the first one the number of tasks is greater than the number of computing nodes (in the very beginning it can be much more). At this stage the computational power is limiting the performance, so it is reasonable to supply each node with a unique task (without replication; from the point of view of the makespan replication is useless as it was shown in [GL04]). With time, the number of unprocessed tasks decreases until it is equal to the number of computing nodes: then the second stage called the ”tail” starts. At this stage there is excess of computational power which could be used to implement redundant computing to reduce overall computing time. The ”tail” computation problem is studied in [KCC07,BYSS+ 12]. Fig. 1. Two stages of batch completion in a Desktop Grid. A number of research problems related to specifics of Desktop Grids are connected to the two-staged batch completion. One of them is the fastest ex- periment or batch completion. In practice, the “tail” computation takes a long time (usually, about two or three times greater than deadline) because of missing deadlines. A computational network does not have information on the current status of tasks computation. So, the “tail” accumulates a lot of nodes that have abandoned the computing network. As a certain task assigned to such a node vi- Mathematical model of a ”tail” computation in a Desktop Grid 57 olates the deadline, it is assigned again to a different node, possibly also going to leave the network soon. So, this prolongs the “tail”. The solution to the problem is in the redundant computing: currently processing tasks are assigned to vacant computing nodes. This strategy significantly increases the chances that at least one copy is solved in time. Employing this strategy, one should take into ac- count characteristics of computational nodes (availability, reliability, computing power), processing the same task; accumulated task processing time; expected task completion time; and so on. The fastest batch completion problem is more complex if a new tasks batch should be started after the current batch completion. In this case redundant computing reduces accessible computing power. One more complicated case is connected to inter-dependency between the tasks of the new batch and comple- tion of certain tasks of the current batch. 3 Mathematical Model To study the ”tail” computation problem we present the following mathematical model. There is a Desktop Grid consisting of n computing nodes. Define ak as computing performance of k-th node, k = 1, .., n. Let computing complexity T is the same for all the tasks and known in advance. So, T Tk = , k = 1, .., n. ak is working time needed for k-th computing node to finish a task. But a computing node of a Desktop Grid does not work continuously, so suppose that there is a probability distribution function Fk (x) describing the probability of k-th node to finish a task in certain time. The form of the probability distribution function and its parameters are depend on computing node itself. In the simplest case it is availability rate. Define d as deadline time of a task. Rc Having all thing considered, dFk (x) is probability to finish a task in time Tk Rd c; dFk (x) is a probability to successfully finish a task, i.e. not to miss the Tk Rd R∞ deadline; and 1 − dFk (x) = dFk (x) is probability to miss the deadline. Tk d Redundant computing can be used to reduce time needed to finish a batch. At the beginning of the second stage there is a node appears which can not be equipped with a unique task. So, it can be used to reduce ”tail” computation. The main question is following: What task should be replicated to reduce ”tail” computing? Let n − 1 nodes are equipped with unique tasks; each node is already have been computing its task for time tk , k = 1, .., n − 1. 58 Evgeny Ivashko Probability of the node n to finish a task before the node k finishes it is   d Zd Zd Zd Z P (n, k) = dFn (x) dFk (y) + 1 − dFk (y) dFn (x), k = 1, .., n − 1. Tn x+tk Tk Tk (1) Formula 1 can be used to make a decision which task should be replicated. A free node should take task of a node k ∗ maximizing probability to reduce task computation time: k ∗ = arg max P (n, k). k 4 Conclusion The concept of Desktop Grid is a valuable part of high-performance computing industry. It allows to gather needed resources quick and easy to solve certain types of computational problems. Task scheduling plays crucial role in providing computing performance of a Desktop Grid. This problem is complicated by intrinsic characteristics of Desktop Grids. So, big attention is payed to task scheduling. One of the important problems in the domain of task scheduling is ”tail” computation. It arises when the number of tasks become less than the number of computing nodes. Because of unreliable nature of nodes of Desktop Grids the ”tail” stage can take much time despite of excess of computing power. The excess of computing power could be used to implement redundant computing. The special mathematical models should be developed to reduce time needed to complete this stage. In the paper a mathematical model of ”tail” computation is presented. It allows to estimate probability of reducing the time needed to a certain computing node to finish a certain task. So, a task with the maximal probability of time reducing effect could be replicated to a free node. The presented mathematical model is based on probability to reduce com- puting time of a certain task. But there are different criteria could be used to op- timize ”tail” stage of computation. The examples of these criteria are mean com- puting time reduction, game theoretical multiobjective payoffs, or look-ahead criteria, etc. Acknowledgements This work is supported by the Russian Foundation for Basic Research (grant numbers 16-07-00622, 15-07-02354, and 15-29-07974). References [And04] David P. Anderson. Boinc: A system for public-resource computing and storage. In Proceedings of the 5th IEEE/ACM International Workshop on Mathematical model of a ”tail” computation in a Desktop Grid 59 Grid Computing, GRID ’04, pages 4–10, Washington, DC, USA, 2004. IEEE Computer Society. [boi22] BOINCstats. In https://boincstats.com, 2017-06-22. [BYSS+ 12] Orna Agmon Ben-Yehuda, Assaf Schuster, Artyom Sharov, Mark Sil- berstein, and Alexandru Iosup. Expert: Pareto-efficient task replication on grids and a cloud. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 167–178. IEEE, 2012. [EFT+ 06] T. Estrada, D.A. Flores, M. Taufer, P.J. Teller, A. Kerstens, and D.P. An- derson. The effectiveness of threshold-based scheduling policies in BOINC projects. In e-Science and Grid Computing, 2006. e-Science’06. Second IEEE International Conference on, pages 88–88. IEEE, 2006. [GL04] Gaurav D Ghare and Scott T Leutenegger. Improving speedup and response times by replicating parallel programs on a SNOW. In Workshop on Job Scheduling Strategies for Parallel Processing, pages 264–287. Springer, 2004. [Iva15] E. Ivashko. Enterprise desktop grids. In CEUR Workshop Proceedings, volume 1502, pages 16–21, 2015. [KCC07] D. Kondo, A.A. Chien, and H. Casanova. Scheduling task parallel applica- tions for rapid turnaround on enterprise desktop grids. 5(4):379–405, oct 2007. [Kur16] I. Kurochkin. Determination of replication parameters in the project of the voluntary distributed computing NetMax@home. In International scientific conference "High technologies. Business. Society." 14-17.03.2016, Borovets, Bulgaria, pages 10–12, 2016. [MNI15] V.V. Mazalov, N.N. Nikitina, and E.E Ivashko. Task scheduling in a desk- top grid to minimize the server load. In V. Malyshkin, editor, Parallel Computing Technologies, International Conference on, volume 9251, pages 273–278. Springer, 2015.