<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supercomputer application integral characteristics analysis for the whole queued job collection of large-scale HPC systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>D.A. Nikitenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vl.V. Voevodin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A.M. Teplov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S.A. Zhumatiy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vad.V. Voevodin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>K.S. Stefanov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>P.A. Shvets</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Computing Center M.V. Lomonosov Moscow State University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>30</lpage>
      <abstract>
<p>Efficient use and high output of any supercomputer depend on many factors. Controlling the utilization of granted resources is one of them, and it becomes especially important when many user projects run concurrently. Users should receive detailed information on the peculiarities of their executed jobs, and project managers should receive detailed information on resource utilization by project members through access to detailed job analysis. Unfortunately, such information is rarely available. We propose to close this gap with an approach to analyzing the integral characteristics of supercomputer applications over the whole queued job collection of large-scale HPC systems. The approach is based on managing and studying system monitoring data, building integral job characteristics, and revealing job categories and the peculiarities of single job runs.</p>
      </abstract>
      <kwd-group>
        <kwd>supercomputer</kwd>
        <kwd>efficiency</kwd>
        <kwd>system monitoring</kwd>
        <kwd>job categories</kwd>
        <kwd>integral job characteristics</kwd>
        <kwd>queued job collection</kwd>
        <kwd>job queue</kwd>
        <kwd>resource utilization control</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Securing efficient resource utilization of HPC systems is one of the most important and
challenging tasks today, given the rapid growth in scale and capability of modern
supercomputers [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. A variety of approaches aim at analyzing the efficient utilization
of certain HPC system components or of systems as a whole. Some of them are based on
system monitoring data analysis [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Approaches of this type impose especially strict requirements on
monitoring system implementation and configuration [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], as well as on the means of data storage and
access. At the same time, these approaches possess a number of fundamental advantages.
      </p>
      <p>First, the analyzed data reflects the real, physical utilization levels of HPC system components
and the corresponding resources.</p>
      <p>Second, filtering system monitoring data obtained from a known set of components over a known
period of time allows binding this data to certain jobs, and thus analyzing resource utilization history
and trends by application, user, project, partition, and so on.</p>
      <p>Third, a monitoring system that collects data from the whole machine can typically be configured
to induce acceptable overhead. This allows collecting data at a coarser granularity, when possible, that
is still sufficient for basic analysis of resource utilization by any and every job. To obtain more detailed
information on a certain job, the monitoring system should ideally support on-the-fly reconfiguration of
the data acquisition rate for specified sensor sets and sources. If it does not (a much less efficient
option), most monitoring systems can be restarted in a higher-granularity mode to record a certain job's
activity and returned to normal mode afterwards. Other options are available as well, for example, a
data aggregation scheme that is precise for, say, the first 30 seconds of job execution (to study short
jobs) and later switches to a coarse mode (for longer jobs). In any case, a number of techniques are
available for studying the behavior of a certain application.</p>
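      <p>The last option, precise aggregation for the first seconds of a job followed by a coarse mode, can
be sketched as follows. This is a minimal Python illustration; the class name and its parameters are our
own illustrative choices, not part of any of the monitoring systems mentioned in this paper:</p>
      <preformat>
```python
from statistics import mean

class AdaptiveAggregator:
    """Sketch of the two-phase aggregation idea: keep raw samples for the
    first `precise_window` seconds of a job, then switch to coarse averages
    over `coarse_step`-second buckets. All names are illustrative."""

    def __init__(self, precise_window=30, coarse_step=60):
        self.precise_window = precise_window
        self.coarse_step = coarse_step
        self.precise = []       # (t, value) pairs at full resolution
        self.bucket = []        # samples of the current coarse bucket
        self.bucket_start = None
        self.coarse = []        # (bucket_start, average) pairs

    def add(self, t, value):
        # Early phase: store every sample to allow studying short jobs.
        if self.precise_window > t:
            self.precise.append((t, value))
            return
        # Coarse phase: average samples over fixed-length buckets.
        if self.bucket_start is None:
            self.bucket_start = t
        if t - self.bucket_start >= self.coarse_step:
            self.coarse.append((self.bucket_start, mean(self.bucket)))
            self.bucket, self.bucket_start = [], t
        self.bucket.append(value)
```
      </preformat>
      <p>A real monitoring agent would flush the final partial bucket at job end; the sketch omits this for
brevity.</p>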
      <p>
        The existing methods and techniques based on system monitoring data analysis allow both
the analysis of the dynamic characteristics of certain application runs and of the peculiarities of
resource utilization within system partitions and systems as a whole [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. With a project-oriented workflow, when a
number of users run jobs as part of one applied research project, it is very useful to give administrators
and system managers a clear view of the distribution of resource utilization within the workgroup, so
that they can adjust permissions or the workflow inside the workgroup to meet the granted resource
limits [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Nevertheless, there is still a need for specialized tools and techniques to analyze the available
system monitoring data. In fact, such a tool is needed as a valuable addition to the set of approaches
implemented in the every-day practice of the MSU Supercomputer Center [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8-11</xref>
        ]: a tool for job
queue analysis based on system monitoring that would allow revealing job categories and grouping jobs
by various criteria, from membership in a user or project domain and other resource-manager-specific
characteristics to categories defined by the levels and peculiarities of HPC system resource utilization
or their combinations.
      </p>
      <p>As a basic technique for such grouping, a tagging system seems to be an adequate
option: a special tag is assigned to a job description as soon as the job's characteristics meet that tag's
criteria. Tagging principles are widely and successfully used for categorization and search
when managing huge collections of data on the Internet: news, videos, photos, notes, and so forth, which
is quite close to the challenge being tackled here.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Job categories and tagging</title>
      <p>As already mentioned, the combined analysis of system monitoring data and resource manager log
data allows binding raw system monitoring data to certain jobs. This provides the means to analyze
job dynamics as far as data granularity allows. To analyze the average rate of an application's resource
utilization, every dynamic characteristic can serve as the basis for calculating minimum, maximum,
average and median values. These values are often called integral job characteristics.</p>
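      <p>The derivation of the integral characteristics named above is straightforward; the following
Python sketch computes them for one job-bound monitoring time series (the dictionary keys are our own
naming, not the tool's schema):</p>
      <preformat>
```python
from statistics import mean, median

def integral_characteristics(samples):
    """Compute the integral characteristics named in the text (minimum,
    maximum, average, median) for one dynamic characteristic of one job.
    `samples` is the job-bound monitoring time series for that sensor."""
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": mean(samples),
        "median": median(samples),
    }
```
      </preformat>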
      <p>When one looks at the whole scope of executed jobs in order to analyze job queue structure and
application run sequences, to compare jobs, or even to search for outstanding single-job behavior, it
becomes obvious that tools providing the means to reveal job categories based on various criteria
would be very useful.</p>
      <p>This functionality can be implemented by introducing special tags. Every tag is based on its own
criteria, built from a single integral job characteristic or a combination of them, from resource-manager
job-related information, and from any other available data source. For example, tags can correspond to
certain average rates of resource utilization, job ownership, job duration, resource utilization
specifics, special execution modes, availability of detailed system monitoring data, and so forth.</p>
      <p>The approach provides the means for efficient grouping and filtering of the whole job queue history
collection by any improvised combination of the specified tags.</p>
      <p>Driven by experience in studying application efficiency and scalability based on system monitoring
data analysis, the authors propose introducing the following job categories in the first stage of
implementation. Tag names are designed to be self-explanatory; nevertheless, every
tag must have a detailed full-format description available.</p>
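      <p>A criteria-driven tag assignment of this kind can be sketched as follows. The thresholds mirror
some of the categories listed below; the field names (avg_cpu_user, avg_la, avg_flops_frac_peak) are
our own assumptions, not the tool's actual schema:</p>
      <preformat>
```python
# Hypothetical sketch: each tag maps to a predicate over a job's integral
# characteristics. Comparisons are written to keep this valid inside XML.
TAG_RULES = {
    "avg_CPU_user LOW":  lambda j: 20 >= j["avg_cpu_user"],
    "avg_CPU_user HIGH": lambda j: j["avg_cpu_user"] > 40,
    "avg_LA LOW":        lambda j: 1 > j["avg_la"],
    "avg_Flops HIGH":    lambda j: j["avg_flops_frac_peak"] > 0.10,
}

def assign_tags(job):
    """Return every tag whose criteria the job's characteristics meet."""
    return [tag for tag, crit in TAG_RULES.items() if crit(job)]
```
      </preformat>
      <p>For instance, a job with 10% average CPU_user and a Load Average of 0.5 would receive both the
avg_CPU_user LOW and avg_LA LOW tags.</p>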
      <sec id="sec-2-1">
        <title>2.1 System monitoring data based categories</title>
        <sec id="sec-2-1-1">
          <title>CPU utilization</title>
          <p>Tag name: avg_CPU_user LOW
Category: Low CPU user utilization.
Criteria: Average value of CPU_user does not exceed 20%.</p>
          <p>Tag name: avg_CPU_user HIGH
Category: High CPU user utilization.
Criteria: Average value of CPU_user exceeds 40%.</p>
          <p>Tag name: avg_CPU_idle TOO HIGH
Category: CPU is idle for a considerable time.
Criteria: Average CPU_idle value exceeds 25%.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Competition of processes for CPU cores</title>
          <p></p>
          <p>Tag name: avg_LA LOW
Category: User job is almost out of action, with almost no utilization of the CPU.
Criteria: Average Load Average is below 1.</p>
          <p>Tag name: avg_LA SINGLE CORE
Category: Only one process per node is active on average.
Criteria: Average Load Average is approximately 1.</p>
          <p>Tag name: avg_LA NORMAL
Category: Optimal competition of processes.
Criteria: Average Load Average is approximately equal to the number of CPU cores per node.</p>
          <p>Tag name: avg_LA HYPERTHREADED
Category: Normal process competition with hyperthreading on.
Criteria: Average Load Average is approximately equal to double the number of CPU cores
per node.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Floating point operations</title>
          <p></p>
          <p>Tag name: avg_Flops HIGH
Category: Intensive CPU floating point operations.
Criteria: Average floating point operation rate exceeds 10% of the theoretical CPU peak.</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>Interconnect activity</title>
          <p>Tag name: avg_IB_packages_num LOW
Category: Low number of inter-node data transmissions.
Criteria: Average packet send rate does not exceed 10<sup>3</sup> packets per second.</p>
          <p>Tag name: avg_IB_packages_size TOO LOW
Category: Small packet size.
Criteria: Average packet send rate exceeds 10<sup>3</sup> packets per second while the average data
transmission rate is below 2 kilobytes per second.</p>
          <p>Tag name: avg_IB_speed HIGH
Category: High data transmission intensity.
Criteria: Average data transmission rate is over 0.2 gigabytes per second and up to 1 gigabyte
per second.</p>
          <p>Tag name: avg_IB_speed TOO HIGH
Category: Very high data transmission intensity.
Criteria: Average data transmission rate exceeds 1 gigabyte per second.</p>
        </sec>
        <sec id="sec-2-2">
          <title>Memory utilization</title>
          <p>Tag name: avg_cache_L1/L3 TOO LOW
Category: Very low efficiency of cache hierarchy utilization.
Criteria: Ratio of the number of L1 misses to the number of L3 misses is below 5.</p>
          <p>Tag name: avg_cache_L1/L3 LOW
Category: Reduced efficiency of cache hierarchy utilization.
Criteria: Ratio of the number of L1 misses to the number of L3 misses is below 10.</p>
          <p>Tag name: avg_cache_L1/L3 HIGH
Category: Good efficiency of cache hierarchy utilization.
Criteria: Ratio of the number of L1 misses to the number of L3 misses exceeds 10.</p>
          <p>Tag name: avg_mem/cache_L1 LOW
Category: Reduced efficiency of L1 cache utilization.
Criteria: Ratio of the number of total memory operations to the number of L1 misses does not
exceed 15.</p>
          <p>Tag name: avg_memload HIGH
Category: Intensive memory operations.
Criteria: Average rate of memory operations exceeds 10<sup>9</sup> operations per second.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.2 Resource manager based categories</title>
        <p>Job execution status</p>
        <p>Tag name: job_status COMPLETED
Category: Job finished successfully.</p>
        <p>Tag name: job_status FAILED
Category: Job finished with an error in the program.</p>
        <p>Tag name: job_status CANCELED
Category: Job was cancelled by the user.</p>
        <p>Tag name: job_status TIMEOUT
Category: Job was cancelled for exceeding its time limit.</p>
        <p>Tag name: job_status NODE_FAIL
Category: Job finished with a system error.</p>
        <sec id="sec-2-3-1">
          <title>Job submission details</title>
          <p>Tag name: job_time_limit CUSTOM
Category: Requested time limit is custom.</p>
          <p>Tag name: job_start_script CUSTOM
Category: Job batch file is custom.</p>
          <p>Tag name: job_cores_requested FEW
Category: Not all available CPU cores per node were requested.</p>
          <p>Tag name: job_cores_requested SINGLE
Category: Just a single CPU core per node was requested.</p>
          <p>Tag name: job_MPI INTEL
Category: MPI type used: Intel MPI.</p>
          <p>Tag name: job_MPI OpenMPI
Category: MPI type used: OpenMPI.</p>
          <p>Tag name: job_nnodes SINGLE
Category: Job used a single node.</p>
          <p>Tag name: job_nnodes FEW
Category: Job used from 2 up to 8 nodes.</p>
          <p>Tag name: job_nnodes MANY
Category: Job used 8 nodes and above.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>System-dependent peculiarities and partition usage</title>
          <p>Illustrated by the example of “Lomonosov” supercomputer partitions.</p>
          <p>Tag name: job_partition REGULAR4
Category: Job allocated to REGULAR4 partition.</p>
          <p>Tag name: job_partition REGULAR6
Category: Job allocated to REGULAR6 partition.</p>
          <p>Tag name: job_partition HDD4
Category: Job allocated to HDD4 partition.</p>
          <p>Tag name: job_partition HDD6
Category: Job allocated to HDD6 partition.</p>
          <p>Tag name: job_partition SMP
Category: Job allocated to SMP partition.</p>
          <p>Tag name: job_partition GPU
Category: Job allocated to GPU partition.</p>
          <p>Tag name: job_partition TEST
Category: Job allocated to TEST partition.</p>
          <p>Tag name: job_partition GPUTEST
Category: Job allocated to GPUTEST partition.</p>
          <p>Tag name: job_partition EXCEPT TEST
Category: Job allocated to a regular or high-priority partition.</p>
          <p>Tag name: job_priority HIGH
Category: Job allocated to partitions with a higher priority (queues reg4prio, gpu_p,
dedicated6).</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>Matching partition specifics</title>
          <p>Tag name: job_accel GPU
Category: User application uses accelerators. Accelerator type: GPU.</p>
          <p>Tag name: job_accel GPU UNUSED
Category: Job is run on a GPU partition, but never uses GPUs.</p>
          <p>Tag name: job_disks UNUSED
Category: Job is run on an HDD-equipped partition, but never uses local I/O.</p>
          <p>Tag name: job_disks TOO LOW
Category: Job is run on an HDD-equipped partition, but its I/O rate is very low.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.3 Other categories</title>
        <p>Beyond the tags that can be assigned automatically, it is possible to introduce manually-set tags.
This is useful when the criteria cannot be determined automatically. Two manual
setting options are available. The first, most typical, is selecting a tag from the list of known tags, or
introducing a new one with a human-readable and formal description. The second is pushing certain
tags, such as “higher system monitoring rate for the job”, via the command line when submitting a job.</p>
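        <p>One possible convention for the second option is to pack the tags into the job comment field at
submission time, e.g. sbatch --comment="tags: job_behavior ITERATIVE; job_sw GROMACS", and have
the harvester extract them afterwards. The convention and parser below are our own assumption for
illustration, not the tool's actual interface:</p>
        <preformat>
```python
def parse_submitted_tags(comment):
    """Extract manually pushed tags from a job's comment string.
    Expected (assumed) format: 'tags: TAG; TAG; ...'."""
    prefix = "tags:"
    if not comment.startswith(prefix):
        return []
    body = comment[len(prefix):]
    return [t.strip() for t in body.split(";") if t.strip()]
```
        </preformat>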
        <p>This applies, for instance, to general job descriptions characterizing the type of data processing, as
it is usually known a priori or determined in the course of job behavior study by a specialist:
job_behavior DATA MINING,
job_behavior MASTER-SLAVE,
job_behavior COMMUNICATION,
job_behavior ITERATIVE, etc.</p>
        <p>In the same manner typical anomalies encountered during analysis course can be specified:
job_bug DEADLOCK,
job_bug DATA RACE, etc.</p>
        <p>
          It is very useful to specify whether a widely used algorithm implementation or software package is
used. This can provide a great contribution to scalability and algorithm-studying projects, like
AlgoWiki [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]:
job_sw VASP,
job_sw FIREFLY,
job_sw GROMACS, etc.
        </p>
        <p>If a detailed report on job efficiency analysis or issues is available, or a specific standard report like
JobDigest is available, it is useful to mark this with another tag, for example:
job_analized,
job_analized JobDigest.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Implementation details</title>
      <p>
        We aim to build a tool that can be deployed at any supercomputer center with
minimal effort. We currently support the Slurm [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Cleo[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] resource managers and Ganglia [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
Collectd[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], Clustrx[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], DiMMon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (most promising) monitoring systems.
      </p>
      <p>
        For the derivation of integral job characteristics and for tagging, PostgreSQL is used as the data
store for coarsened system monitoring data and for job information saved from the resource managers.
The saved job info is processed in JavaScript using jQuery with jQuery UI [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and TagIt [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>A tag can be assigned to a job only if it is already declared in the tag description table. This table
includes the tag id, name, human-readable description, criteria (a specification of the SQL request for
automatic processing), comments, and an availability flag that can be set only by an administrator. Any
user can suggest introducing a new tag, but it becomes available only after administrator approval.
Information on the new tag's author is saved in the comments attribute, along with the user's tag
description suggestion and motivation.</p>
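      <p>The tag description table can be sketched as follows. Column names are our own assumptions
based on the attributes listed above, and SQLite stands in for the production PostgreSQL store only to
keep the sketch self-contained:</p>
      <preformat>
```python
import sqlite3

# Minimal sketch of the tag description table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tag_description (
        tag_id      INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        description TEXT,
        criteria    TEXT,    -- SQL fragment used for automatic assignment
        comments    TEXT,    -- includes the author of a suggested tag
        available   INTEGER DEFAULT 0  -- set to 1 only by an administrator
    )""")

# A user-suggested tag starts out unavailable; the suggestion and its
# motivation are recorded in the comments attribute.
conn.execute(
    "INSERT INTO tag_description (name, description, criteria, comments) "
    "VALUES (?, ?, ?, ?)",
    ("avg_CPU_user LOW", "Low CPU user utilization.",
     "20 >= avg_cpu_user", "suggested by a user: helps spot idle jobs"))

# Administrator approval makes the tag visible to everyone.
conn.execute("UPDATE tag_description SET available = 1 WHERE name = ?",
             ("avg_CPU_user LOW",))
```
      </preformat>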
      <p>All tags can be assigned in two ways: automatically or manually. Any tag set by mistake or in error
can be manually removed from a job.</p>
      <sec id="sec-3-1">
        <title>Automatic mode</title>
        <p>In this mode, tags are automatically assigned:</p>
        <list list-type="bullet">
          <list-item><p>to all finished jobs, according to SQL-based criteria over the saved integral job
characteristics data, information from the resource manager and other available saved data;</p></list-item>
          <list-item><p>as a result of running a special script that processes the whole saved job collection.</p></list-item>
        </list>
        <p>In this mode, a special attribute indicates that the tag was set in the package (automatic) mode of
tag assignment.</p>
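        <p>The package-mode pass can be sketched as follows: each approved tag's SQL criteria is run over
the saved job table, and matching jobs receive the tag with a flag recording the automatic mode. The
schema and criteria strings are assumptions; SQLite again stands in for PostgreSQL:</p>
        <preformat>
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, avg_cpu_user REAL);
    CREATE TABLE job_tags (job_id INTEGER, tag TEXT, auto INTEGER);
    INSERT INTO jobs VALUES (1, 12.0), (2, 55.0), (3, 18.5);
""")

# (tag, criteria) pairs as they would come from the tag description table.
tags = [("avg_CPU_user LOW", "20 >= avg_cpu_user"),
        ("avg_CPU_user HIGH", "avg_cpu_user > 40")]

for tag, criteria in tags:
    # criteria comes from the administrator-approved tag table, so it is
    # interpolated directly; user input must never reach this string.
    conn.execute(
        f"INSERT INTO job_tags SELECT job_id, ?, 1 FROM jobs WHERE {criteria}",
        (tag,))

low = [r[0] for r in conn.execute(
    "SELECT job_id FROM job_tags WHERE tag = 'avg_CPU_user LOW' "
    "ORDER BY job_id")]
```
        </preformat>
        <p>Here jobs 1 and 3 receive the LOW tag and job 2 the HIGH tag, each with auto = 1.</p>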
      </sec>
      <sec id="sec-3-2">
        <title>Manual mode</title>
        <p>Manual tag assignment is usually done by a user, project manager or administrator in the
following cases:</p>
        <list list-type="bullet">
          <list-item><p>as a result of the analysis of a certain job (specifying the algorithm implemented, etc.);</p></list-item>
          <list-item><p>as a result of specifying the tag via the command line when submitting a job;</p></list-item>
          <list-item><p>for any tag in a user-specific tag space (marking out important job runs as part of a
project, etc.).</p></list-item>
        </list>
        <p>In this mode, an attribute identifying the tag author is set, which also allows finding jobs marked
as part of a certain project or by a certain user.</p>
        <p>A user-specific tag space consists of regular tags and custom user tags. Tags manually assigned
by a user are visible only within the scope of the project and to system administrators. The members of
other projects see their own tag spaces, and the general tag space is available to all users.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Use cases</title>
      <p>Real-life use cases are, of course, very diverse. In this section we share our
experience of every-day usage of the proposed technique, as part of the developed tool's approbation at
the Supercomputer Center of Moscow State University, on a few examples that give a general idea of it.</p>
      <sec id="sec-4-1">
        <title>4.1 Revealing jobs, users and projects that practice inappropriate resource utilization</title>
        <p>One problem in the every-day practice of a large-scale supercomputer center, with a number of
heterogeneous resources and a considerable number of users competing for them, is unacceptably low
efficiency or inappropriate resource utilization. This has higher priority for specific,
limited resources, such as compute nodes equipped with specialized accelerators, local disks, extra
memory or other hardware and software that is critical for some applications; at the same time,
these nodes can still be used by applications that do not need the specific resources the
nodes possess. Such nodes usually have high potential for resource-demanding specific applications,
and in large systems like “Lomonosov” they are usually managed as a separate partition with special
queuing options that allow submitting jobs to the appropriate partition. This is vital for projects whose
computations are only feasible due to the advantages of such partitions, so by queuing to the desired
partition users get a guarantee that their application will have all necessary resources at its disposal.</p>
        <p>Nevertheless, analysis of the whole job collection for such partitions shows that there are
numerous job runs that do not use any of the partition's facilities. Of course, algorithm peculiarities
can sometimes lead to resource usage of totally different intensity, but further analysis usually shows
that the majority of suspicious jobs never use any benefit of such partitions. The reasons vary, but
usually it is the shorter wait time in the queue.</p>
        <p>This can be seen on GPU partitions with user job runs that never use GPUs. A slightly different
situation is seen on HDD-equipped nodes with absent or extremely low disk usage rates; finally,
single-process applications that don't benefit from multiple CPU cores can be seen on almost any
partition, regardless of hardware and software.</p>
        <p>It is important to find the root cause of such application behavior; as soon as the reason is
found and changes are applied by the user or administrator, the ratio of such jobs can be lowered,
which immediately raises HPC system efficiency and overall throughput.</p>
        <p>The most popular reasons are:</p>
        <list list-type="bullet">
          <list-item><p>Problems inside the application, program or algorithm. The user is sure that he needs the
resources, but in practice the application doesn't utilize them at all, or utilizes them at extremely low
rates.</p></list-item>
          <list-item><p>Problems of the HPC system. The declared resources are not available on the nodes.</p></list-item>
          <list-item><p>Inappropriate job allocation. This can be either a mistake or cheating for a lower job
waiting time.</p></list-item>
        </list>
        <p>Regardless of the real reason, these job runs lead to higher wait times for the jobs that really need
the specific resources.</p>
        <p>The search for such jobs can be automated using integral job characteristics and some of the
introduced tags.</p>
        <p>For example, to filter the jobs allocated to the GPU partition that make no use of accelerators, one
can use the tags job_partition GPU and job_accel GPU UNUSED at the same time. Next, one can cut
off jobs allocated to the test partition as being of no interest. The remaining jobs that carry the
job_status COMPLETED tag probably do not need GPUs at all, as they finished successfully with no
registered GPU usage. At this point two options are possible: either it is a mistake (by the user or the
system), or it was done intentionally by the user to reduce the job wait time, since the wait time in
specific partitions is sometimes less than in regular ones.</p>
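        <p>The filtering step above amounts to set operations over per-job tag sets. The following sketch
illustrates it on an in-memory dictionary standing in for the real PostgreSQL-backed store; the job ids
and helper name are ours:</p>
        <preformat>
```python
# Per-job tag sets (illustrative data).
job_tags = {
    101: {"job_partition GPU", "job_accel GPU UNUSED",
          "job_status COMPLETED"},
    102: {"job_partition GPU", "job_status COMPLETED"},
    103: {"job_partition GPUTEST", "job_accel GPU UNUSED",
          "job_status COMPLETED"},
    104: {"job_partition GPU", "job_accel GPU UNUSED",
          "job_status FAILED"},
}

def filter_jobs(job_tags, require, exclude=()):
    """Jobs carrying every tag in `require` and none in `exclude`."""
    return sorted(j for j, tags in job_tags.items()
                  if tags.issuperset(require) and not tags & set(exclude))

# Completed GPU-partition jobs that never touched the GPUs,
# excluding test-partition runs.
suspects = filter_jobs(
    job_tags,
    require=["job_partition GPU", "job_accel GPU UNUSED",
             "job_status COMPLETED"],
    exclude=["job_partition GPUTEST", "job_partition TEST"])
```
        </preformat>
        <p>On this sample data, only job 101 survives the filter and is flagged for follow-up.</p>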
        <p>A very similar situation is seen for HDD-equipped nodes. Jobs tagged with job_partition HDD4
or job_partition HDD6 together with job_disks UNUSED or job_disks TOO LOW can potentially be
executed successfully on regular partitions. Note the case of very low, rather than zero, disk
utilization: such disk operations might easily be replaced with network file system operations with
minimal additional overhead, or even without any.</p>
        <p>For jobs tagged with job_nnodes SINGLE and avg_LA SINGLE CORE or avg_LA LOW, it is
quite reasonable to ask why they were submitted to the supercomputer at all. Such jobs use a single
node and a single core (or just a few processes per node) and could potentially run well on a desktop.
Unfortunately, such jobs are encountered very often.</p>
        <p>Users who submit the types of jobs mentioned above should be contacted to figure out the reasons
for the revealed inappropriate and inefficient resource utilization, and the problems found should be
resolved. If cheating is detected, or it is proven that the executed jobs do not really need HPC
resources, quotas for the corresponding user accounts and projects can be reduced, up to the point of
blocking.</p>
        <p>Let us take a look at one real-life example. Figure 1 illustrates the filtered list of jobs allocated to
regular partitions with the avg_LA SINGLE CORE tag automatically assigned. It is clearly seen that
the jobs have a low Load Average close to 1, as filtered by the tag, while at the same time having very
low CPU_user. Note that this is not a test partition, and all jobs run on a single node, grabbing 8 cores
on the regular4 and hdd4 partitions!</p>
        <p>Fig 1. Filtered single-process jobs found in a real job queue in various regular partitions</p>
        <p>A close look at the owner of the longest job, which was cancelled by timeout, shows that this user
always runs such single-node, even single-process, jobs regardless of partition (Figure 2).</p>
        <p>Fig 2. Jobs of a certain user appear to be all single-process</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Finding jobs with high Flops intensity and high efficiency</title>
        <p>Apart from finding problem cases, there is the task of finding well-optimized jobs that utilize HPC
resources with high efficiency. The owners of these jobs are usually experienced users, highly
qualified in parallel programming and in the fine tuning of software, which secures a highly efficient
supercomputer load. Contact with such users is very important, first and foremost to learn the
techniques they use and share them with novice users, contributing to the FAQ/wiki sections of the
helpdesk, and so on. Such experienced users are good candidates to be invited to give public lectures,
present at seminars and join other educational activities.</p>
        <p>As a rule, most jobs with high floating point intensity are tagged with the avg_Flops HIGH and
avg_CPU_user HIGH tags, and the most efficient applications in terms of memory utilization are
tagged avg_cache_L1/L3 HIGH. Jobs filtered by these criteria usually have good data locality; they are
well-balanced and show high performance.</p>
        <p>Deeper analysis of such jobs allows revealing optimal command line and compiler options for a
variety of categories of standard applications and algorithm implementations. Once such a job is
confirmed to be a well-optimized, typical example of the usage of a software package or algorithm
implementation, a proper tag corresponding to that category can be set (like job_sw VASP). This
provides the means for comparative analysis of similar jobs. It can also serve as a good basis for more
detailed analysis of the whole job collection, revealing inefficient applications and users that use
resources inefficiently.</p>
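        <p>Once such category tags are in place, a simple comparative baseline can be computed per
software tag, against which a new run of the same package can be judged. The helper name and job
record layout below are hypothetical:</p>
        <preformat>
```python
from statistics import mean

def tag_baseline(jobs, tag, field):
    """Mean of the integral characteristic `field` over all jobs carrying
    `tag`; `jobs` is a list of dicts with a 'tags' set. Returns None when
    no job carries the tag."""
    values = [j[field] for j in jobs if tag in j["tags"]]
    return mean(values) if values else None

# Illustrative job records: flops_pct is the average floating point rate
# as a percentage of the theoretical peak.
jobs = [
    {"tags": {"job_sw VASP"},    "flops_pct": 10},
    {"tags": {"job_sw VASP"},    "flops_pct": 30},
    {"tags": {"job_sw GROMACS"}, "flops_pct": 50},
]
```
        </preformat>
        <p>Here the VASP baseline is 20% of peak, so a new VASP run well below that figure would be a
candidate for closer inspection.</p>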
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Finding applications with special need for large amounts of memory</title>
        <p>Many users of the supercomputer complex run applications that are resource-demanding with
respect to the amount of available memory per process. Such applications are usually effective enough,
but are often scaled down in various ways to fit the available memory, for example by reducing the
number of MPI processes per node, and so on.</p>
        <p>Such applications usually run on a considerable number of nodes, with LoadAvg values below the
number of CPU cores per node. These jobs are usually tagged with avg_memload HIGH and with node
and core usage tags such as avg_LA SINGLE CORE, job_nnodes FEW or job_nnodes MANY.</p>
        <p>If such a job is found in the six-core-CPU “Regular 6” partition (tagged with job_partition
REGULAR6), even moving its allocation to the “Regular 4” partition can be an optimization, reducing
the number of idle CPU cores per node by 4 (2*6 - 2*4).</p>
        <p>Some of these applications can also benefit from moving to hybrid MPI+OpenMP or MPI+Cilk
models. If such a model cannot be applied, some of the applications can be reallocated to the SMP
partition, where much larger amounts of memory are available.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4 Revealing categories of issues and inefficient behavior</title>
        <p>The accumulation of statistics and knowledge on the problems of parallel applications is one of
the most important components of HPC center job collection analysis. The ability to tag analyzed jobs
with the implementation issues found is a useful feature for this purpose, as are tags corresponding to
inefficient use of computing resources, hardware problems or other features found in the course of
analyzing job execution characteristics.</p>
        <p>When analyzing the inefficient or abnormal behavior of a single run, or of a sequence of jobs
based on a certain software package, it is often necessary to contact the user, the application owner,
who can provide additional information on the program details: the algorithm implementation used,
the program architecture and structure, the computing model, and so on, up to dependencies on input
data and command line options. All this information should be recorded to aid further analysis of
similar applications and categories.</p>
        <p>Whenever an application run is analyzed, it is also useful to record and tag the system software used: first of all the math libraries, the compiler and its options, the MPI implementation, and so on. This provides the basis for comparative analysis of similar jobs or sequences of jobs. If differences in behavior are found, one can then investigate their origin in depth: the user application's reaction, the system software configuration, etc.</p>
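        <p>For instance, once compiler, MPI and math library tags are recorded per run, comparative analysis begins with grouping runs by identical software stacks; the tag values below are illustrative assumptions, not the tool's exact vocabulary:</p>
        <preformat>
```python
# Sketch: group runs of the same package by their system-software tags,
# so that behavioral differences can be traced to the software stack.
from collections import defaultdict

runs = [
    {"id": 1, "package": "app_X", "tags": ("icc-15", "IntelMPI", "MKL")},
    {"id": 2, "package": "app_X", "tags": ("gcc-4.9", "OpenMPI", "MKL")},
    {"id": 3, "package": "app_X", "tags": ("icc-15", "IntelMPI", "MKL")},
]

def group_by_software(runs):
    # Runs sharing an identical tag tuple form one comparison group.
    groups = defaultdict(list)
    for run in runs:
        groups[run["tags"]].append(run["id"])
    return dict(groups)
```
        </preformat>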
        <p>Common issues such as data races, deadlocks and so forth can be marked with special tags (job_bug DATA RACE, job_bug DEADLOCK, etc.). This helps in the further analysis of other jobs: strange program behavior can be compared against already analyzed profiles marked as having specific issues. Once similar behavior is found, it can be the key to resolving the problems of the job originally under analysis.</p>
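        <p>Such a comparison can be sketched as a nearest-neighbor match of a suspicious job's characteristics against reference profiles already tagged with known issues; the feature vectors and the distance metric here are illustrative assumptions only:</p>
        <preformat>
```python
# Sketch: find the tagged reference profile closest to a suspicious job's
# characteristic vector (e.g. normalized CPU load, lock wait, I/O intensity).
import math

reference = {
    "job_bug DATA RACE": [0.9, 0.1, 0.4],
    "job_bug DEADLOCK": [0.05, 0.95, 0.1],
}

def closest_issue(profile, reference):
    # Euclidean distance between two characteristic vectors.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Return the issue tag whose reference profile is nearest.
    return min(reference, key=lambda tag: dist(profile, reference[tag]))
```
        </preformat>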
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future work</title>
      <p>
        Near-future plans include implementation of the Octoshell [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] module for full project-oriented workflow support and authentication, thus making the proposed service accessible to every user. We expect it to be ready by the middle of 2016. By that time we also plan to extend the supported tag set and adjust the criteria for existing tags where needed.
      </p>
      <p>To sum up, a user-friendly and effective technique for filtering, grouping and further analysis of the whole queued job collection of large-scale HPC systems, based on system monitoring and resource manager data, has been proposed and implemented. The developed tool is used in the everyday practice of the Supercomputer Center of Lomonosov Moscow State University, providing the means for effective analysis of any and every user application run. This invaluable collection of information on all finished jobs has already been growing in 24/7 mode for several months.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The work was funded in part by the Russian Foundation for Basic Research, grants №16–07–00972А and №13–07–00786A, and by the Ministry of Education and Science of the Russian Federation, Agreement No. 14.607.21.0006 (unique identifier RFMEFI60714X0006).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Top50 Supercomputers of Russia and CIS: http://top50.supercomputers.ru/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Top500 Supercomputer Sites: http://top500.org/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Antonov A., Zhumatiy S., Nikitenko D., Stefanov K., Teplov A., Shvets P. Analysis of dynamic characteristics of job stream on supercomputer system // Numerical Methods and Programming, 2013. Vol. 14, No. 2, P. 104-108.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Safonov A., Kostenetskiy P., Borodulin K., Melekhin F. A monitoring system for supercomputers of SUSU // Russian Supercomputing Days International Conference, Moscow, Russian Federation, 28-29 September, 2015, Proceedings. CEUR Workshop Proceedings, 2015. Vol. 1482, P. 662-666.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Stefanov K. et al. Dynamically Reconfigurable Distributed Modular Monitoring System for Supercomputers (DiMMon) // Procedia Computer Science / Elsevier B.V., 2015. Vol. 66, P. 625-634.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Nikitenko D. Complex approach to performance analysis of supercomputer systems based on system monitoring data // Numerical Methods and Programming, 2014. Vol. 15, P. 85-97.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Voevodin V., Zhumatiy S., Nikitenko D. Octoshell: Large Supercomputer Complex Administration System // Russian Supercomputing Days International Conference, Moscow, Russian Federation, 28-29 September, 2015, Proceedings. CEUR Workshop Proceedings, 2015. Vol. 1482, P. 69-83.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Voevodin Vl., Antonov A., Bryzgalov P., Nikitenko D., Zhumatiy S., Sobolev S., Stefanov K., Voevodin Vad. Practice of "Lomonosov" Supercomputer // Open Systems, 2012. No. 7, P. 36-39.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Zhumatiy S., Nikitenko D. Approach to flexible supercomputers management // International supercomputing conference "Scientific Services &amp; Internet: all parallelism edges", Novorossiysk, Russian Federation, 23-28 September, 2013, Proceedings. MSU, 2013. P. 296-300.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Voevodin Vl. Supercomputer situational screen // Open Systems, 2014. No. 3, P. 36-39.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Shvets P., Antonov A., Nikitenko D., Sobolev S., Stefanov K., Voevodin Vad., Voevodin V., Zhumatiy S. An Approach for Ensuring Reliable Functioning of a Supercomputer Based on a Formal Model // 11th Int. Conference on Parallel Processing and Applied Mathematics, Krakow, Poland, 6-9 September, 2015. Proceedings.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Voevodin V., Antonov A., Dongarra J. AlgoWiki: an Open Encyclopedia of Parallel Algorithmic Features // Supercomputing Frontiers and Innovations, 2015. Vol. 2, No. 1, P. 4-18.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. SLURM workload manager: http://slurm.schedmd.com/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>14. Cleo cluster batch system: http://sourceforge.net/projects/cleo-bs/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. Ganglia Monitoring System: http://ganglia.sourceforge.net/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. Collectd - The system statistics collection daemon: https://collectd.org/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>17. Clustrx. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>18. jQuery &amp; jQuery UI: http://jqueryui.com/. Accessed 15.02.2015.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>19. TagIt: http://aehlke.github.io/tag-it/. Accessed 15.02.2015.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>