Proceedings of the 9th International Conference “Distributed Computing and Grid Technologies in Science and Education” (GRID’2021), Dubna, Russia, July 5-9, 2021 HPC WORKLOAD BALANCING ALGORITHM FOR CO- SCHEDULING ENVIRONMENTS R.I. Kuchumov, V.V. Korkhova Saint Petersburg State University, 7/9 Universitetskaya nab., St. Petersburg, 199034, Russia E-mail: a v.korkhov@spbu.ru The goal of this research work is to reduce wait time of HPC (high performance computing) applications in schedulers queue by applying a co-scheduling strategy. This strategy allows the execution of more than one task with different non-overlapping requirements for computational resources simultaneously. Co-scheduling strategy reduces task queue wait time and improves utilization of cluster resources when compared to the scheduling strategies that do not allow for parallel task execution on the same machine. We have proposed a method for measuring application processing speed in its run-time, which can be used as a feedback for scheduling strategies. In this work, we have formalized the co-scheduling problem and proposed strategies for solving it. For some strategies we have shown analytically the upper bounds values of their competitive ratios. Besides that for the proposed scheduling strategies we ran numerical experiments using imitation models to show how they compare to the optimal strategy. Keywords: Co-scheduling, HPC, Scheduling theory, Stochastic Optimization. Ruslan I. Kuchumov, Vladimir V. Korkhov Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 133 Proceedings of the 9th International Conference “Distributed Computing and Grid Technologies in Science and Education” (GRID’2021), Dubna, Russia, July 5-9, 2021 1. Introduction There are several case studies (e.g [1-3]) showing that by running more than one HPC application on the same computational node it is possible to increase cluster throughput, decrease schedule makespan and to decrease power consumption. The scheduling strategy that allows to run multiple tasks on the same machine in the literature is usually referred to as co-scheduling or co- location strategies. Commonly used batch schedulers do not allow implement such strategy in a controllable way, as they work either by assigning the whole computational node or by assigning a subset of CPU cores (so called slots) to a single task without considering task performance degradation. In this work we propose a method of measuring task processing speed in its runtime. This metric can be used as a feedback for dynamic scheduling algorithms to quantitatively evaluate co- scheduling decisions. We have formalized co-scheduling problem in terms of scheduling theory and have proposed several strategies that vary by the amount of initial information about each task they require and by the optimality of the schedules they produce. We have analytically shown how greedy strategy compares to an optimal strategy and based on that proposed several stochastic strategies that approximate a greedy strategy and use task processing speed metric as a feedback. 2. Co-Scheduling problem and its strategies To provide scheduling strategies that implement a co-scheduling approach, we formalised the problem in terms of scheduling theory. We have considered a static problem definition where the number of tasks and their parameters (processing speed, amount of work) does not change in time. In this work we consider makespan (completion time of the last task) as an objective function. Denote a set of n tasks as T={T1,...,Tn}. Each Ti requires a bi amount of work units to be completed. Values of bi form an n-dimensional vector b. Any combination (subset) for T can be simultaneously on the same machine. There are a total of m = 2n-1 possible combinations of n tasks. We will denote Sj as a combination of tasks T (Sj is in the power set of T and non-empty). Denote ai,j as a speed of Ti in combination Sj. Without loss of generosity we will consider the processing speed of any task Ti running alone (i.e. in combination Sj = {Ti}) to be equal to 1. Values ai,j form a matrix A with dimensions n by m. As task processing speed can not increase, when a new tasks are added to the combination, we can introduce the following constraint on ai,j values: ai,p ≤ ai, q, ∀ (p, q) | Sp ⊆ Sq, i = 1,..., n (1) At first, we will consider a deterministic offline problem to provide an optimal solution. In this problem definition, exac values of all task parameters (A, b) are known in advance. Denote xj as a total amount of time the combination Sj were run in a schedule. Then the problem of makespan optimization can be solved by reducing it to a linear programming problem of minimizing a sum of xj, subject to Ax=b and x ≥ 0. This solution can not be used in practice as it requires all task data to be available. To mitigate this, we have considered an online problem definition, where values of the vector b are unknown in advance. To solve this problem, we have proposed to use a scheduling strategy that selects the combination of remaining tasks with the maximum sum of tasks processing speed. We refer to this strategy as FCS (Fastest Combination Speed-first). We have analytically shown [4] that the makespan of FCS strategy can not be larger than the optimal makespan by more than 2 times. Online definition of this problem still can not be applied in schedulers implementations, as it requires values of matrix A to be known in advance. As FCS strategy makes local optimal decisions (a greedy strategy), provides a small competitive ratio and values of task processing speed can be measured in practice, we have covered several strategies that approximate FCS strategy. We have considered a non-deterministic definition of co-scheduling problem, where task processing speed ai,j are normal random variables and their parameters can be estimated in tasks run- 134 Proceedings of the 9th International Conference “Distributed Computing and Grid Technologies in Science and Education” (GRID’2021), Dubna, Russia, July 5-9, 2021 time. To do that, we have relaxed constraint (1) by defining variances of ai,j so that constraint (1) holds with a given probability. We have proposed several stochastic search strategies that approximate FCS by searching for the combination with maximum sum of task processing speed. To interpolate values of ai,j and ensure that task variance increases with an increase of uncertainty, we used linear spline interpolation as described in [4]. In this work, we have considered 3 search based strategies, varying by their acquisition functions: PI (probability of improvements), EI (expected improvements) and UCB (upper confidence bounds). We have compared them with FCS and the optimal strategy using the simulation software that we developed. Task processing speed values in different combinations were measured in the experiments and then they were used to generate similar input for numerical simulations. Simulation results are shown in figure 1. Figure 1. Comparison of the lowest competitive ratio values produced by different scheduling strategies. Speedup rate range is a range of task processing speed decrements when the number of tasks in combinations increases. 3. Processing speed metrics Tasks running on the same computational node may share bandwidth of such resources as memory bus, network card, last level cache, disk bandwidth, etc. When bandwidth for any of these resources is fully utilized, task processing speed decreases as CPU instructions performing access to the resource take more cycles to complete. That is, when multiple tasks interfere with each other due to co-scheduling IPC (instructions per cycle) values will decrease. Besided IPC, CPU time may also be affected by co-scheduling as operating system scheduler may assign more than one task thread to a single CPU core. In this case, CPU time would be shared between all of the threads and the more threads there are, the smaller fraction of CPU time would be available to each task. In this work we proposed to measure task processing speed in runtime as IPC multiplied by CPU time fraction values averaged across all tasks threads. To show that this metric indeed measures task processing speed, we found a set of benchmark applications that have a constant IPC value throughout their runtime and compared how this metric changes in different conditions compared to the change of total task processing time. For that we picked 10 benchmark applications from NPB and 135 Proceedings of the 9th International Conference “Distributed Computing and Grid Technologies in Science and Education” (GRID’2021), Dubna, Russia, July 5-9, 2021 Parsec and a few of our own benchmarks and run them in combinations of a different size. A complete list of benchmarks is presented in [4]. Results for some benchmarks are shown in figure (2) along with a complete list of benchmarks. The figure shows a scatter plot of processing time ratio version processing speed ratio (as measured by the proposed metric). Each point corresponds to measurements of a single task in combinations with other tasks. Figure 2. Scatter plots of task processing time ratio version processing speed ratio. Each point corresponds to tasks run in different combinations with other tasks. Red line is a line where ratios are equal. 4. Results and Conclusion We have formalized the co-scheduling problem in terms of scheduling theory and have proposed multiple strategies that provide a solution for the problem. An optimal strategy was proposed for offline deterministics problem definitions. For online deterministic problem we have proposed a greedy strategy (FCS) and have analytically shown that its competitive ratio can not be larger than two. For non-deterministics counterpart problem we have proposed three search-based strategies (PI, EI, UCB) that use different acquisition functions. Using numerical simulations we have shown (figure 1) how competitive ratio compares between different scheduling strategies. Simulation results show that PI produces a lower value of competitive ratio than EI and UCB strategies for almost all problem input ranges. We have proposed a method for measuring task processing speed in its runtime and have validated it using test applications that mimic HPC workloads. Results of the evaluation (figure 2) showed that this metric measures task processing speed with a very high accuracy in different conditions. These results allow to apply proposed scheduling strategies in schedulers implementations. 5. Acknowledgement Research has been supported by the RFBR grant 19-37-90138. 136 Proceedings of the 9th International Conference “Distributed Computing and Grid Technologies in Science and Education” (GRID’2021), Dubna, Russia, July 5-9, 2021 References [1] Trinitis, Carsten, and Josef Weidendorfer, eds. Co-scheduling of HPC applications. Vol. 28. IOS Press, 2017. [2] de Blanche, Andreas, and Thomas Lundqvist. "Node Sharing for Increased Throughput and Shorter Runtimes–an Industrial Co-Scheduling Case Study." Proceedings of the 3rd Workshop on Co- Scheduling of HPC Applications (COSH 2018). 2018. [3] Breslow, Alex D., et al. "The case for colocation of hpc workloads." Concurrency and Computation: Practice and Experience Preprint (2012). [4] Kuchumov, R.; Korkhov, V. Analytical and Numerical Evaluation of Co-Scheduling Strategies and Their Application. Preprints 2021, 2021090053 (doi: 10.20944/preprints202109.0053.v1). 137