
Aspects of the Assessment of the Quality of Loading Hybrid High-Performance Computing Cluster

0 Computing Center of Far Eastern Branch Russian Academy of Sciences, Khabarovsk, Russia

1 Federal research center 'Computer Science and Control' of the Russian Academy of Sciences, Moscow, Russia

2019

16 19

The article proposes a method for estimating workload, based on the calculation of the peak performance that is required to perform computational tasks. A system of dynamic priorities for computing tasks is considered, based on the resource efficiency indicators of the high-performance cluster.

Keywords: high-performance computing cluster; hybrid architecture; graphics accelerator; performance efficiency; profiling; dynamic priority.

Introduction

The most important issue in the operation of a high-performance computing cluster is to ensure the full utilization of its resources. This is necessary for solving scientific problems and for ensuring the return on investment (ROI).

We can distinguish two main areas in this problem [1, 2]:

- ensuring the execution of the maximum possible number of applications in a certain period of time;
- the most efficient use of cluster resources by user applications.

An important operational task is to determine the degree of loading of the cluster, because this makes it possible to plan the provision of resources, to assess the need for modernization, and to determine the quality of the services.

As a rule, the workload is defined as the ratio of the metric (parameter) of the workload to the maximum possible value of this parameter. The metric is determined by measurement or calculation.

The article proposes a new method for calculating the value of the workload using the peak performance of the cluster.

A high workload of the HPC cluster does not mean efficient use of its resources. It is possible that the resources requested by the application are not used and are idle. In this case, the workload factor of the cluster can be high, but the quality of the tasks is low.

To provide an advantage to applications that efficiently use the resources of the cluster, the article discusses a system of dynamic priorities. The system is based on determining the coefficient of profiling and using it to change the priorities of applications.

Note that there is a difference between the theoretically possible performance of the cluster (peak performance) and practically achievable results. The latter are determined by various tests and vary greatly depending on the type of tasks and the configuration of the cluster [4].

To estimate the workload of a hybrid high-performance computing cluster, we use peak performance. It is defined as the sum of the peak performances of its components, the nodes (1).

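$$P_{peak} = \sum_{i=1}^{N_{host}} P_{host_i}, \qquad (1)$$

where Ppeak is peak performance of the computing cluster,

Phost i – peak performance (Phost) of the i-th node of the computing cluster,

Nhost – number of nodes in the computing cluster.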

Note that the summation does not take into account the performance losses that occur when the nodes interact over the computer network connecting them (interconnect) [4].

The peak performance of a node, Phost, is defined as the sum of the performance of the central processors of the node (Pcpu) and of its graphics accelerators (Pgpu). It is assumed that they are fully loaded with floating point operations, do not perform any other operations, and that there are no data transfer losses between the central processors and the graphics accelerators (2).

$$P_{host} = N_{cpu} P_{cpu} + N_{gpu} P_{gpu}, \qquad (2)$$

where Ncpu is number of CPUs in the compute node,

Ngpu – number of graphics accelerators in the compute node,

Pcpu – peak CPU performance,

Pgpu – graphics accelerator peak performance.

To calculate the peak performance of the CPU (3), we assume that the cores operate in parallel, that each core can process a group of threads, and that each thread can perform several operations in parallel if several execution units are available for this. Such a multi-core, multithreaded architecture is characteristic of modern classical processors from various manufacturers.

$$P_{cpu} = n_{core} \, n_{stream} \, n_{unit} \, F_{cpu}, \qquad (3)$$

where ncore is number of CPU cores,

nstream – the number of threads processed by a CPU core,

nunit – the number of execution units per thread, which corresponds to the number of operations performed in one thread per cycle,

Fcpu – CPU frequency.

To assess the performance of graphics accelerators, we use the modern accelerator architecture of NVIDIA. Consider the Tesla Volta family of accelerators, as the most popular. These accelerators contain cuda- and tensor-cores, which perform parallel operations on floating-point numbers and matrices. The performance of a graphics accelerator is defined as the sum of the performance of all cores, without taking into account performance losses on scheduling and interaction (4) [5].

$$P_{gpu} = P_{cuda} + P_{tensor}, \qquad (4)$$

where Pcuda is total performance of the graphics accelerator cuda-cores,

Ptensor – total performance of tensor-cores of the graphics accelerator.

Let us determine the performance value of the cuda-cores of the graphics accelerator using formula (5), assuming that the floating-point operation is performed in one clock cycle.

Tensor-cores perform a multiplication of square matrices in one clock cycle. When calculating the number of operations performed in this case, we take into account that computing each element of the resulting matrix requires a number of multiplications equal to the order of the matrix, and one fewer additions. For matrices of order r this gives r multiplications and r − 1 additions for each of the r² elements. Thus, the total performance of tensor-cores is calculated as (6).
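The operation count used for tensor-cores can be checked with a short sketch. The code below is only an illustration: it counts the scalar multiplications and additions of a naive multiplication of two square matrices of order r and compares the total with the closed form r²(2r − 1). The function names are hypothetical and do not come from the article.

```python
def count_matmul_operations(r: int) -> int:
    """Count scalar operations in a naive r x r by r x r matrix multiplication."""
    multiplications = 0
    additions = 0
    for _row in range(r):
        for _col in range(r):
            # Each output element is a dot product of length r:
            # r multiplications followed by r - 1 additions.
            multiplications += r
            additions += r - 1
    return multiplications + additions


def closed_form_operations(r: int) -> int:
    """Closed form for the same count: r^2 * (2r - 1)."""
    return r * r * (2 * r - 1)


if __name__ == "__main__":
    for order in (2, 4, 8):
        print(order, count_matmul_operations(order), closed_form_operations(order))
```

For any order r the two functions return the same value, which is the per-cycle factor that multiplies the number of tensor-cores and the accelerator frequency.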

Note that the precision of floating point operations may differ between cores. For example, in the NVIDIA Tesla V100 graphics accelerator, cuda-cores work with double precision numbers, and tensor-cores with single precision numbers. This method of performance evaluation does not take this feature into account.

The total performance of cuda- and tensor-cores is determined by formulas (5) and (6).

$$P_{cuda} = n_{cuda} F_{gpu}, \qquad (5)$$

$$P_{tensor} = n_{tensor} \, r^2 (2r - 1) \, F_{gpu}, \qquad (6)$$

where ncuda is number of graphics accelerator cuda-cores,

ntensor – number of tensor-cores of the graphics accelerator,

r – square matrix order,

Fgpu – graphics accelerator frequency.

Thus, the peak performance of the graphics accelerator is calculated by the formula (7).

$$P_{gpu} = \left( n_{cuda} + n_{tensor} \, r^2 (2r - 1) \right) F_{gpu} \qquad (7)$$

The peak performance of the computing node of a hybrid high-performance computing cluster is calculated by the formula (8).

$$P_{host} = N_{cpu} \, n_{core} \, n_{stream} \, n_{unit} \, F_{cpu} + N_{gpu} \left( n_{cuda} + n_{tensor} \, r^2 (2r - 1) \right) F_{gpu} \qquad (8)$$

The total peak performance of a hybrid high-performance computing cluster is calculated as (1). As shown above, the performance of the HPC cluster is calculated as the sum of the performances of its components and is expressed by the number of floating-point operations performed per second.
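The chain of formulas from the CPU and accelerator level up to the whole cluster can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class names and all numeric values in the example are hypothetical, and the formulas are taken in the form P_cpu = n_core · n_stream · n_unit · F_cpu, P_gpu = (n_cuda + n_tensor · r²(2r − 1)) · F_gpu, P_host = N_cpu · P_cpu + N_gpu · P_gpu, with the cluster peak as the sum over nodes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Cpu:
    n_core: int     # number of cores
    n_stream: int   # threads processed by one core
    n_unit: int     # operations per thread per cycle
    f_cpu: float    # clock frequency, Hz

    def peak(self) -> float:
        return self.n_core * self.n_stream * self.n_unit * self.f_cpu


@dataclass(frozen=True)
class Gpu:
    n_cuda: int     # number of cuda-cores
    n_tensor: int   # number of tensor-cores
    r: int          # order of the square matrices handled by a tensor-core
    f_gpu: float    # clock frequency, Hz

    def peak(self) -> float:
        p_cuda = self.n_cuda * self.f_gpu
        p_tensor = self.n_tensor * self.r ** 2 * (2 * self.r - 1) * self.f_gpu
        return p_cuda + p_tensor


@dataclass(frozen=True)
class Node:
    cpu: Cpu
    n_cpu: int      # CPUs in the node
    gpu: Gpu
    n_gpu: int      # accelerators in the node

    def peak(self) -> float:
        return self.n_cpu * self.cpu.peak() + self.n_gpu * self.gpu.peak()


def cluster_peak(nodes: Iterable[Node]) -> float:
    """Sum of node peaks; interconnect losses are ignored, as in the text."""
    return sum(node.peak() for node in nodes)


if __name__ == "__main__":
    # Hypothetical hardware parameters, chosen only for illustration.
    cpu = Cpu(n_core=16, n_stream=2, n_unit=4, f_cpu=2.0e9)
    gpu = Gpu(n_cuda=5000, n_tensor=600, r=4, f_gpu=1.5e9)
    node = Node(cpu=cpu, n_cpu=2, gpu=gpu, n_gpu=4)
    print(f"{cluster_peak([node] * 10):.3e} FLOP/s")
```

With these illustrative numbers the accelerators dominate the node peak, which is why the reservation of whole accelerators matters so much for the workload estimate.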

The peak estimate differs from the actual performance, which is determined on the basis of various tests. However, as noted above, in this method we use the peak values.

To estimate the requirements of applications to the resources of a hybrid high-performance computing cluster, we calculate the number of operations required for the execution of the application (10).

For each application, a number of CPU cores, a number of graphics accelerators, and a runtime are reserved. We take into account that graphics accelerators are reserved whole, whereas central processors are reserved by cores. Therefore, the total number of operations of the hybrid high-performance computing cluster performed by the task over a given time t, Opapp(t), is determined by the number of CPU cores (Rcore) and graphics accelerators (Rgpu) reserved by the application:

$$Op_{app}(t) = \left( \frac{R_{core}}{n} P_{cpu} + R_{gpu} P_{gpu} \right) t, \qquad (10)$$

where Rcore is the number of cores reserved by the application,

Rgpu – the number of graphics accelerators reserved by the application,

n – total number of cores in CPU.
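The reservation rule can be illustrated with a short sketch. It assumes that formula (10) has the form Op_app(t) = (R_core / n · P_cpu + R_gpu · P_gpu) · t, which is a reading of the definitions above rather than something the text states explicitly. The function name and the numeric values are hypothetical.

```python
def app_operations(r_core: int, r_gpu: int, n: int,
                   p_cpu: float, p_gpu: float, t: float) -> float:
    """Peak number of operations reserved by one application over run time t.

    Assumed reading of formula (10): CPU cores are reserved individually, so
    the CPU share is r_core / n of one CPU's peak performance; graphics
    accelerators are reserved whole.
    """
    if n <= 0:
        raise ValueError("n (cores per CPU) must be positive")
    if t < 0:
        raise ValueError("run time must be non-negative")
    reserved_rate = (r_core / n) * p_cpu + r_gpu * p_gpu  # FLOP/s
    return reserved_rate * t


if __name__ == "__main__":
    # Hypothetical request: 8 of 16 cores and 2 accelerators for one hour.
    ops = app_operations(r_core=8, r_gpu=2, n=16,
                         p_cpu=2.56e11, p_gpu=1.083e14, t=3600.0)
    print(f"{ops:.3e} operations")
```

Because accelerators are counted whole, an application that reserves an accelerator but barely uses it still contributes its full peak to this estimate, which is the motivation for the profiling coefficient discussed later.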

After calculations for all applications i = 1…N whose execution falls within the period T, we obtain the total number of operations required for the execution of applications over the period T (11):

$$Op_{app}(T) = \sum_{i=1}^{N} Op_{app_i}(t_i) \qquad (11)$$

It is possible that only part of the execution time of an application falls on this period. In this case, when estimating the resources used, only the time interval ti belonging to T is taken into account.

[Figure 1: applications App1, App2, App3, App4, …, Appn, Appn+1 on the time axis t, relative to the interval T from T0 to T1]

The resource of the hybrid high-performance computing cluster in the time interval is the peak number of floating point operations available to users during this interval. The total number of operations of the hybrid high-performance computing cluster Op(T) on the time interval T is defined as:

$$Op(T) = P_{peak} \, T,$$

where T is time interval.
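The treatment of applications that only partly overlap the interval T can be sketched as follows. This is an illustrative sketch under stated assumptions: each application is described by its start time, end time and reserved peak rate in FLOP/s; the available resource is taken as P_peak multiplied by the length of the interval; and the workload is taken as the ratio of reserved to available operations, following the general definition of workload as a ratio to the maximum possible value. The names `Run`, `overlap`, `reserved_operations` and `workload` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Run:
    start: float   # start time of the application, seconds
    end: float     # end time of the application, seconds
    rate: float    # reserved peak performance, FLOP/s


def overlap(run: Run, t0: float, t1: float) -> float:
    """Length of the part of the run that lies inside [t0, t1]."""
    return max(0.0, min(run.end, t1) - max(run.start, t0))


def reserved_operations(runs: Iterable[Run], t0: float, t1: float) -> float:
    """Operations reserved by all runs, counting only time inside [t0, t1]."""
    return sum(run.rate * overlap(run, t0, t1) for run in runs)


def workload(runs: Iterable[Run], p_peak: float, t0: float, t1: float) -> float:
    """Ratio of reserved operations to the peak operations available."""
    available = p_peak * (t1 - t0)
    if available <= 0:
        raise ValueError("interval and peak performance must be positive")
    return reserved_operations(runs, t0, t1) / available


if __name__ == "__main__":
    # Hypothetical runs on a cluster with a peak of 100 FLOP/s, T = [0, 10].
    runs = [
        Run(start=-5.0, end=5.0, rate=40.0),    # started before T0
        Run(start=2.0, end=8.0, rate=50.0),     # entirely inside T
        Run(start=9.0, end=20.0, rate=100.0),   # finishes after T1
        Run(start=12.0, end=15.0, rate=10.0),   # outside T, contributes nothing
    ]
    print(workload(runs, p_peak=100.0, t0=0.0, t1=10.0))
```

Clipping each run to the interval keeps long-running applications from being counted in full in every reporting period they touch.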

To obtain an indicator of the use of computing resources, it is proposed to introduce the task profiling coefficient as an integral indicator of the quality of resource use (13), where Kprof is the profiling coefficient.

Using such an assessment helps to avoid a situation in which the application does not use, or irrationally uses, the resources requested from the computing cluster. The coefficient of profiling is obtained by running a user application under the control of a special debugging tool, a profiler, which makes it possible to determine the degree of resource utilization, the execution time of individual code sections, bottlenecks, and problems of memory usage. Development packages include profilers both for code executed on central processors and for code executed on graphics accelerators.

Information on program profiling should be available both to the developer of a scientific application and to the division operating the computing cluster. This is necessary to take measures to improve the efficiency of the program code and increase the efficiency of the functioning of the hybrid HPC cluster as a whole.

Obviously, applications with a high profiling coefficient improve the quality indicator of a high-performance cluster workload. Therefore, a competitive advantage should be given to such applications. This encourages users to improve their calculation algorithms and to take into account the capabilities of the computing cluster. A classic way of encouraging tasks with a high profiling coefficient is to introduce a system of dynamic priorities based on the profiling coefficient.

The introduction of dynamic priorities makes it possible to change the priority of an application, within certain limits, depending on its quality. This service policy is especially useful when the computing cluster is heavily loaded. It improves the quality of resource use and reduces the workload, and it gives an advantage to the applications that make the fullest use of the cluster’s resources.

The decision to change the priority should be made on the basis of a comparison of the measured profiling coefficient with the recommended one, which is determined by an expert. It is possible to set several threshold values of the profiling coefficient, each with its own priority rule. For example, for two quality thresholds (profiling coefficients K1, K2) that divide the set of applications into three subsets of quality “low”, “medium”, and “high”, the dynamic priority can be calculated using a piecewise linear function (14).

$$Pr_{dyn} = Pr_{base} \cdot \begin{cases} C_0 K_{prof} - C_0 K_1 + 1, & K_{prof} < K_1 \\ C_1 K_{prof} - C_1 K_1 + 1, & K_1 \le K_{prof} < K_2 \\ C_2 K_{prof} - C_2 K_2 + C_1 K_2 - C_1 K_1 + 1, & K_{prof} \ge K_2 \end{cases} \qquad (14)$$

where Prdyn is dynamic application priority;

Prbase – basic application priority;

Kprof – coefficient derived from application execution profiling;

K1, K2 – expert profiling coefficients;

C0, C1, C2 – expert change factors.

Figure 2 shows an example of the dependence of dynamic priority on the values of K and C with Prbase = 1.

Thus, when the profiling coefficient is below K1, the priority decreases linearly relative to the base value; when K1 is exceeded, the priority increases linearly. If K2 is exceeded, the priority grows with a different slope, set by C2. The recommended profiling coefficients K and coefficients C are determined by an expert method, based on the characteristics of the functioning and loading of the computing cluster.
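The behaviour of the priority rule can be sketched in Python. The sketch assumes that function (14) is a continuous piecewise linear factor applied to Prbase, with thresholds K1 < K2 and slopes C0, C1, C2, equal to 1 at Kprof = K1; this reading follows from the description above, and the function name and all numeric values are hypothetical.

```python
def dynamic_priority(k_prof: float, pr_base: float,
                     k1: float, k2: float,
                     c0: float, c1: float, c2: float) -> float:
    """Dynamic priority as a piecewise linear function of k_prof.

    Assumed reading of formula (14): the factor equals 1 at k_prof == k1,
    has slope c0 below k1, slope c1 between k1 and k2, and slope c2 above k2.
    The segments join continuously at k1 and k2.
    """
    if not k1 < k2:
        raise ValueError("thresholds must satisfy k1 < k2")
    if k_prof < k1:
        factor = c0 * (k_prof - k1) + 1.0
    elif k_prof < k2:
        factor = c1 * (k_prof - k1) + 1.0
    else:
        factor = c2 * (k_prof - k2) + c1 * (k2 - k1) + 1.0
    return pr_base * factor


if __name__ == "__main__":
    # Hypothetical expert settings, chosen only for illustration.
    settings = dict(pr_base=1.0, k1=0.4, k2=0.7, c0=0.5, c1=0.5, c2=1.0)
    for k in (0.2, 0.4, 0.7, 0.9):
        print(k, dynamic_priority(k, **settings))
```

Writing each segment relative to its threshold makes the continuity at K1 and K2 explicit, so a small change in the measured coefficient never causes a jump in priority.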

[Figure 2: dynamic priority Prdyn versus the profiling coefficient (horizontal axis from 0 to 1.0), with Prbase = 1]

In the early stages of cluster operation, when technologies and algorithms are being debugged, the priority should be changed to the minimum extent. The requirements for grading factors should not be too high. Therefore, the values of C, which determine the slope of the straight lines, should be chosen close to zero; this results in only a slight change in priorities.

With increasing workload on the computing cluster and the need for more accurate task management, the values of C can be increased. This leads to a more significant change in priority relative to the baseline when Kprof deviates significantly from K1.

Conclusion

The proposed methodology for estimating the workload on the hybrid HPC cluster makes it possible to determine how fully and efficiently the resources of the hybrid cluster are used. On the basis of the results obtained, it is possible to determine indicators of ROI, plan the work of the cluster, and determine the need for modernization.

The system of dynamic priorities will make it possible to control the quality of resource utilization of hybrid high-performance computing clusters when they run different types of applications from various fields of science and technology.

Acknowledgements

The research is partially supported by the Russian Foundation for Basic Research (project 18-29-03100).

References