<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Aspects of the Assessment of the Quality of Loading Hybrid High-Performance Computing Cluster</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computing Center of Far Eastern Branch Russian Academy of Sciences</institution>
          ,
          <addr-line>Khabarovsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Federal research center 'Computer Science and Control' of the Russian Academy of Sciences</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>The article proposes a method for estimating workload based on the calculation of the peak performance required to perform computational tasks. A system of dynamic priorities for computing tasks is considered, based on the resource efficiency indicators of the high-performance cluster. Keywords: high-performance computing cluster; hybrid architecture; graphics accelerator; performance efficiency; profiling; dynamic priority.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The most important issue in the operation of a high-performance computing cluster is ensuring the full
utilization of its resources. This is necessary for solving scientific problems and ensuring a return on investment
(ROI).</p>
      <sec id="sec-1-1">
        <title>We can distinguish two main areas in this problem [1, 2]:</title>
        <p>- executing the maximum possible number of applications in a given period of time;
- making the most efficient use of cluster resources by user applications.</p>
        <p>An important operational issue is determining the degree of cluster loading, because it makes it possible to plan
the provision of resources, to assess the need for modernization, and to determine the quality of service.</p>
        <p>As a rule, the workload is defined as the ratio of the metric (parameter) of the workload to the maximum possible
value of this parameter. The metric is determined by measurement or calculation.</p>
        <p>The article proposes a new method for calculating the value of the workload using the peak performance of the
cluster.</p>
        <p>A high workload of the HPC cluster does not mean efficient use of its resources. It is possible that the resources
requested by the application are not used and are idle. In this case, the workload factor of the cluster can be high, but
the quality of the tasks is low.</p>
        <p>To provide an advantage to applications that efficiently use the resources of the cluster, the article discusses a
system of dynamic priorities. The system is based on determining the coefficient of profiling and using it to change
the priorities of applications.</p>
        <p>Note that there is a difference between the theoretically possible performance of a cluster (peak performance) and
practically achievable results. The latter are determined by various tests and vary greatly depending on the type of
tasks and the configuration of the cluster [4].</p>
        <p>To estimate the workload of a hybrid high-performance computing cluster, we use peak performance. It is defined
as the sum of the peak performance of its component nodes (1).</p>
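        <p><disp-formula><tex-math>P_{peak} = \sum_{i=1}^{N_{host}} P_{host,i}, \qquad (1)</tex-math></disp-formula>
where Ppeak is the peak performance of the computing cluster,
Nhost is the number of nodes in the cluster,</p>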
      </sec>
      <sec id="sec-1-2">
        <title>Phost i – peak performance (Phost) of the i-th node of the computing cluster</title>
        <p>Note that the summation does not take into account the performance losses that occur when the nodes interact over
the computer network connecting them (interconnect) [4].</p>
        <p>The peak performance of a node, Phost, is defined as the sum of the peak performance of its central processors
(Pcpu) and of its graphics accelerators (Pgpu). It is assumed that they are fully loaded with floating-point operations,
do not perform any other operations, and that there are no data transfer losses between the central processors and the
graphics accelerators (2).</p>
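        <p><disp-formula><tex-math>P_{host} = N_{cpu} P_{cpu} + N_{gpu} P_{gpu}, \qquad (2)</tex-math></disp-formula>
where Ncpu is the number of CPUs in the compute node,</p>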
      </sec>
      <sec id="sec-1-3">
        <title>Ngpu – number of graphics accelerators in the compute node,</title>
      </sec>
      <sec id="sec-1-4">
        <title>Pcpu – peak CPU performance,</title>
      </sec>
      <sec id="sec-1-5">
        <title>Pgpu – graphics accelerator peak performance.</title>
        <p>To calculate the peak performance of the CPU (3), we assume that the cores operate in parallel, that each core
can process a group of threads, and that each thread can perform several operations in parallel if several execution
units are available to it. Such a core-and-thread architecture is characteristic of modern general-purpose processors
from various manufacturers.</p>
        <p><disp-formula><tex-math>P_{cpu} = n_{core} \, n_{stream} \, n_{unit} \, F_{cpu}, \qquad (3)</tex-math></disp-formula>
where ncore is the number of CPU cores,
nstream – the number of threads processed by a CPU core,
nunit – the number of execution units per thread, which corresponds to the number of operations performed in one thread per cycle,
Fcpu – the CPU frequency.</p>
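        <p>Formula (3) is a plain product of four factors, so it can be checked with a few lines of Python. The sketch below is only an illustration: the function name cpu_peak_flops and the sample values (16 cores, 2 threads per core, 8 operations per thread per cycle, 2.5 GHz) are assumptions and do not describe any processor mentioned in the article.</p>
        <preformat>
```python
def cpu_peak_flops(n_core: int, n_stream: int, n_unit: int, f_cpu_hz: float) -> float:
    """Peak CPU performance by formula (3): cores x threads x units x frequency."""
    return n_core * n_stream * n_unit * f_cpu_hz


# Hypothetical CPU used only for illustration (not taken from the article).
p_cpu = cpu_peak_flops(n_core=16, n_stream=2, n_unit=8, f_cpu_hz=2.5e9)
print(f"P_cpu = {p_cpu / 1e9:.0f} GFLOPS")
```
        </preformat>
        <p>Passing the frequency in hertz keeps the result in floating-point operations per second, the unit used for all peak values in this method.</p>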
        <p>To assess the performance of graphics accelerators, we use the modern accelerator architecture of NVIDIA.
Consider the Tesla Volta family of accelerators as the most popular. These accelerators contain CUDA cores and
tensor cores, which perform parallel operations on floating-point numbers and matrices. The performance of
a graphics accelerator is defined as the sum of the performance of all its cores, without taking into account
performance losses due to scheduling and interaction (4) [5].</p>
        <p><disp-formula><tex-math>P_{gpu} = P_{cuda} + P_{tensor}, \qquad (4)</tex-math></disp-formula>
where Pcuda is the total performance of the CUDA cores of the graphics accelerator,</p>
      </sec>
      <sec id="sec-1-6">
        <title>Ptensor – total performance of tensor-cores of the graphics accelerator.</title>
        <p>Let us determine the performance of the CUDA cores of the graphics accelerator using formula (5), assuming
that each core performs one floating-point operation per clock cycle.</p>
        <p>Tensor cores perform a multiplication of square matrices in one clock cycle. When counting the operations
performed in this case, we take into account that each element of the resulting matrix requires as many
multiplications as the order r of the matrix and one fewer additions, that is, 2r − 1 operations per element and
r²(2r − 1) operations per matrix product. Thus, the total performance of the tensor cores is calculated as (6).</p>
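        <p>As a worked instance of this count, assume matrices of order r = 4. This value is an illustrative assumption; the text above does not fix r.</p>
        <p><disp-formula><tex-math>r = 4: \quad r^2 (2r - 1) = 16 \cdot (4 + 3) = 112 \ \text{operations per tensor core per clock cycle}</tex-math></disp-formula></p>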
        <p>Note that the precision of floating-point operations may differ between core types. For example, in the
NVIDIA Tesla V100 graphics accelerator the CUDA cores work with double-precision numbers and the tensor cores with
single-precision numbers. This method of performance evaluation does not take this feature into account.</p>
      </sec>
      <sec id="sec-1-7">
        <title>The total performance of the CUDA cores and tensor cores is determined by formulas (5) and (6).</title>
        <p>where ncuda is the number of CUDA cores of the graphics accelerator,
ntensor – the number of tensor cores of the graphics accelerator,
r – the order of the square matrix,</p>
      </sec>
      <sec id="sec-1-8">
        <title>Fgpu – graphics accelerator frequency.</title>
      </sec>
      <sec id="sec-1-9">
        <title>Thus, the peak performance of the graphics accelerator is calculated by the formula (7).</title>
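        <p><disp-formula><tex-math>P_{cuda} = n_{cuda} F_{gpu}, \qquad (5)</tex-math></disp-formula></p>
        <p><disp-formula><tex-math>P_{tensor} = n_{tensor} \, r^2 (2r - 1) \, F_{gpu}, \qquad (6)</tex-math></disp-formula></p>
        <p><disp-formula><tex-math>P_{gpu} = \left( n_{cuda} + n_{tensor} \, r^2 (2r - 1) \right) F_{gpu}. \qquad (7)</tex-math></disp-formula></p>
        <p>The peak performance of a compute node of a hybrid high-performance computing cluster is calculated by
formula (8).</p>
        <p><disp-formula><tex-math>P_{host} = N_{cpu} \, n_{core} \, n_{stream} \, n_{unit} \, F_{cpu} + N_{gpu} \left( n_{cuda} + n_{tensor} \, r^2 (2r - 1) \right) F_{gpu}. \qquad (8)</tex-math></disp-formula></p>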
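        <p>The composition of formulas (4) to (8) and (1) can be sketched in Python as follows. All names here (gpu_peak_flops, Node, cluster_peak_flops) are hypothetical helpers, and the input figures (5120 CUDA cores, 640 tensor cores, 4×4 matrices, 1.53 GHz, 640 GFLOPS per CPU, eight identical nodes) are illustrative assumptions rather than data from the article.</p>
        <preformat>
```python
from dataclasses import dataclass


def gpu_peak_flops(n_cuda: int, n_tensor: int, r: int, f_gpu_hz: float) -> float:
    """Peak accelerator performance by formula (7)."""
    p_cuda = n_cuda * f_gpu_hz                           # formula (5)
    p_tensor = n_tensor * r**2 * (2 * r - 1) * f_gpu_hz  # formula (6)
    return p_cuda + p_tensor                             # formula (4)


@dataclass
class Node:
    n_cpu: int    # N_cpu, number of CPUs in the node
    n_gpu: int    # N_gpu, number of accelerators in the node
    p_cpu: float  # P_cpu, peak FLOPS of one CPU
    p_gpu: float  # P_gpu, peak FLOPS of one accelerator

    def peak_flops(self) -> float:
        """Node peak performance by formulas (2) and (8)."""
        return self.n_cpu * self.p_cpu + self.n_gpu * self.p_gpu


def cluster_peak_flops(nodes: list[Node]) -> float:
    """Cluster peak performance by formula (1): sum over all nodes."""
    return sum(node.peak_flops() for node in nodes)


# Illustrative inputs only (assumptions, not measurements).
p_gpu = gpu_peak_flops(n_cuda=5120, n_tensor=640, r=4, f_gpu_hz=1.53e9)
node = Node(n_cpu=2, n_gpu=4, p_cpu=640e9, p_gpu=p_gpu)
print(f"P_gpu  = {p_gpu / 1e12:.1f} TFLOPS")
print(f"P_peak = {cluster_peak_flops([node] * 8) / 1e12:.1f} TFLOPS")
```
        </preformat>
        <p>Keeping the per-node values in a small record makes it straightforward to describe clusters whose nodes differ in the number of CPUs or accelerators, which is the case formula (1) is written for.</p>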
      </sec>
      <sec id="sec-1-10">
        <title>The total peak performance of a hybrid high-performance computing cluster is calculated as (1). As shown above, the performance of the HPC cluster is calculated as the sum of the performances of its components and is expressed by the number of floating-point operations performed per second.</title>
        <p>The peak estimate differs from the actual performance, which is determined on the basis of various tests.
However, as noted above, this method uses the peak values.</p>
        <p>To estimate the requirements of applications to the resources of a hybrid high-performance computing cluster, we
calculate the number of operations required for the execution of the application (10).</p>
        <p>For each application, a number of CPU cores, a number of graphics accelerators, and a runtime are reserved. We
take into account that graphics accelerators are reserved as whole devices, whereas central processors are reserved
by core. Therefore, the total number of operations of the hybrid high-performance computing cluster attributed to the
application over a given time t, Opapp(t), is determined by the number of CPU cores (Rcore) and graphics
accelerators (Rgpu) reserved by the application:
<disp-formula><tex-math>Op_{app}(t) = t \left( \frac{R_{core}}{n} P_{cpu} + R_{gpu} P_{gpu} \right), \qquad (10)</tex-math></disp-formula>
where Rcore is the number of cores reserved by the application,</p>
      </sec>
      <sec id="sec-1-11">
        <title>Rgpu – the number of graphics accelerators reserved by the application,</title>
        <p>n – total number of cores in CPU.</p>
        <p>After performing the calculation for all applications i = 1…N whose execution falls within the period T, we
obtain the total number of operations required to execute the applications over the period T (11):</p>
        <p><disp-formula><tex-math>Op_{app}(T) = \sum_{i=1}^{N} Op_{app,i}(t_i). \qquad (11)</tex-math></disp-formula></p>
        <p>It is possible that only part of the execution time of an application falls within this period. In this case,
when estimating the resources used, only the time interval ti belonging to T is taken into account.</p>
        <p>The resource of the hybrid high-performance computing cluster in the time interval is the peak number of
floating-point operations available to users during this interval. The total number of operations of the hybrid
high-performance computing cluster, Op(T), over the time interval T is defined as:</p>
        <p><disp-formula><tex-math>Op(T) = P_{peak} \, T, \qquad (12)</tex-math></disp-formula>
where T is the time interval.</p>
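        <p>The workload estimate can be sketched in Python under three stated assumptions: that formula (10) scales CPU performance by the share of reserved cores, Rcore/n; that only the part of each run inside the interval counts; and that the workload is the ratio of operations reserved by applications to operations available in the interval, following the ratio definition given in the Introduction. The function names and the toy numbers are hypothetical.</p>
        <preformat>
```python
def app_operations(r_core: int, r_gpu: int, n: int,
                   p_cpu: float, p_gpu: float, t: float) -> float:
    """Operations reserved by one application over time t (assumed form of (10))."""
    return (r_core / n * p_cpu + r_gpu * p_gpu) * t


def overlap(start: float, end: float, t0: float, t1: float) -> float:
    """Length of the part of [start, end] that lies inside [t0, t1]."""
    return max(0.0, min(end, t1) - max(start, t0))


def workload(apps: list[dict], n: int, p_cpu: float, p_gpu: float,
             p_peak: float, t0: float, t1: float) -> float:
    """Ratio of reserved operations (11) to available operations (12)."""
    total = sum(
        app_operations(a["r_core"], a["r_gpu"], n, p_cpu, p_gpu,
                       overlap(a["start"], a["end"], t0, t1))
        for a in apps
    )
    return total / (p_peak * (t1 - t0))


# Toy cluster: one 16-core CPU (16 units of performance) and one accelerator (84 units).
apps = [
    {"r_core": 8, "r_gpu": 0, "start": -5.0, "end": 5.0},  # only half of the run is inside T
    {"r_core": 0, "r_gpu": 1, "start": 0.0, "end": 10.0},  # runs for the whole interval
]
print(workload(apps, n=16, p_cpu=16.0, p_gpu=84.0, p_peak=100.0, t0=0.0, t1=10.0))
```
        </preformat>
        <p>Clipping each run to the interval before multiplying keeps applications that started before T0 or finish after T1 from inflating the estimate.</p>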
        <sec id="sec-1-11-1">
          <title>Appn+1</title>
        </sec>
        <sec id="sec-1-11-2">
          <title>Appn</title>
          <p>[Figure: execution of applications App1, App2, App3, App4, …, Appn, Appn+1 along the time axis t; the interval T is bounded by T0 and T1.]</p>
          <p>To obtain an indicator of the use of computing resources, it is proposed to introduce the task profiling coefficient
as an integral indicator of the quality of resource use (13).
where Kprof is profiling ratio.</p>
          <p>Such an assessment helps to avoid a situation in which an application does not use, or uses irrationally, the
resources requested from the computing cluster. The profiling coefficient is obtained by running a user
application under the control of a special debugging tool, a profiler, which determines the degree of
resource utilization, the execution time of individual code sections, bottlenecks, and memory usage problems.
Development packages include profilers both for code executed on central processors and for code executed on
graphics accelerators.</p>
          <p>Information on program profiling should be available both to the developer of a scientific application and to the
division operating the computing cluster. This is necessary to take measures to improve the efficiency of the program
code and increase the efficiency of the functioning of the hybrid HPC cluster as a whole.</p>
          <p>Applications with a high profiling coefficient improve the quality indicator of the workload of a
high-performance cluster. Therefore, such applications should be given a competitive advantage. This encourages users
to improve their calculation algorithms and to take into account the capabilities of the computing cluster. A classic
way of encouraging tasks with a high profiling coefficient is to introduce a system of dynamic priorities based on the
profiling coefficient.</p>
          <p>Dynamic priorities make it possible to change the priority of an application, within certain limits,
depending on its quality. This service policy is especially useful when the computing cluster is heavily loaded.
It improves the quality of resource use, reduces the workload, and gives an advantage to the applications that make
the most use of the cluster's resources.</p>
          <p>The decision to change the priority should be based on a comparison of the measured profiling
coefficient with the recommended one, which is determined by an expert. It is possible to set several threshold values
of the profiling coefficient, each with its own priority rule. For example, for two quality thresholds
(profiling coefficients K1, K2) that divide the set of applications into three quality subsets, “low”, “medium”, and
“high”, the dynamic priority can be calculated with a piecewise linear function (14).</p>
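          <p><disp-formula><tex-math>Pr_{dyn} = \begin{cases}
Pr_{base} \, (C_0 K_{prof} - C_0 K_1 + 1), \quad K_{prof} &lt; K_1 \\
Pr_{base} \, (C_1 K_{prof} - C_1 K_1 + 1), \quad K_1 \le K_{prof} &lt; K_2 \\
Pr_{base} \, (C_2 K_{prof} - C_2 K_2 + C_1 K_2 - C_1 K_1 + 1), \quad K_{prof} \ge K_2
\end{cases} \qquad (14)</tex-math></disp-formula>
where Prdyn is the dynamic application priority;</p>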
        </sec>
      </sec>
      <sec id="sec-1-12">
        <title>Prbase – basic application priority;</title>
      </sec>
      <sec id="sec-1-13">
        <title>Kprof – coefficient derived from application execution profiling;</title>
      </sec>
      <sec id="sec-1-14">
        <title>K1, K2 – expert profiling coefficients;</title>
      </sec>
      <sec id="sec-1-15">
        <title>C0, C1, C2 – expert change factors.</title>
        <p>Figure 2 shows an example of the dependence of the dynamic priority on the values of K and C with Prbase = 1.</p>
        <p>Thus, when the profiling coefficient is below K1, the priority decreases linearly relative to the
base value; when K1 is exceeded, the priority increases linearly. If K2 is exceeded, the priority grows
faster. The recommended profiling coefficients K and the coefficients C are determined by an expert method,
based on the characteristics of the functioning and loading of the computing cluster.</p>
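        <p>The piecewise linear rule (14) can be sketched in Python as shown below. The sketch assumes that the breakpoints of the three branches are K1 and K2, which is what makes the function continuous and matches the description above; the function name dynamic_priority and the sample thresholds and slopes are hypothetical.</p>
        <preformat>
```python
def dynamic_priority(k_prof: float, pr_base: float, k1: float, k2: float,
                     c0: float, c1: float, c2: float) -> float:
    """Piecewise linear dynamic priority following formula (14)."""
    if k_prof >= k2:    # "high" quality: growth with slope C2 above K2
        factor = c2 * (k_prof - k2) + c1 * (k2 - k1) + 1
    elif k_prof >= k1:  # "medium" quality: growth with slope C1 above K1
        factor = c1 * (k_prof - k1) + 1
    else:               # "low" quality: priority falls below the base value
        factor = c0 * (k_prof - k1) + 1
    return pr_base * factor


# Illustrative thresholds and slopes (assumptions chosen for the example).
for k in (0.0, 0.25, 0.5, 0.75, 1.0):
    pr = dynamic_priority(k, pr_base=1.0, k1=0.25, k2=0.75, c0=1.0, c1=1.0, c2=2.0)
    print(f"K_prof = {k:.2f} gives Pr_dyn = {pr:.2f}")
```
        </preformat>
        <p>Because adjacent branches agree at K1 and at K2, a small change in the measured profiling coefficient never causes a jump in priority; only the slope changes.</p>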
        <p>[Figure 2: dynamic priority Prdyn plotted against the profiling coefficient on an axis from 0 to 1.0, with the levels Prbase and 1.0 and the slope C0 marked.]</p>
        <p>In the early stages of cluster operation, when technologies and algorithms are still being debugged, the priority
should be changed as little as possible, and the requirements for grading factors should not be too high. Therefore,
the values of C, which determine the slope of the straight lines, should be chosen close to zero, so that priorities
change only slightly.</p>
        <p>As the workload on the computing cluster increases and more accurate task management is needed, the values of
C can be increased. This leads to a more significant change in priority relative to the base value when Kprof
deviates significantly from K1.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>The proposed methodology for estimating the workload of a hybrid HPC cluster makes it possible to determine how
fully and efficiently the resources of the hybrid cluster are used. On the basis of the results obtained, it is possible
to determine ROI indicators, plan the work of the cluster, and determine the need for modernization.</p>
      <p>The system of dynamic priorities makes it possible to control the quality of resource utilization of hybrid
high-performance computing clusters when they run different types of applications from various fields of science and
technology.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>The research is partially supported by the Russian Foundation for Basic Research (project 18-29-03100).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>