<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>One Approach to Computational Load Balancing within the Node of Hybrid Computing System</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Keldysh Institute of Applied Mathematics</institution>
          ,
          <addr-line>Miusskaya sq., 4, 125047, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper considers the distribution of computations within a single node of a hybrid computing system for application programs with computation-intensive operations. Two methods are proposed: a method for static distribution of computations, and a method for automatic balancing of the computational load during program execution. The latter periodically analyzes the CPU load created by the running program and decides whether the computational load needs to be redistributed. The proposed methods are implemented in an application program that solves a gas dynamics problem using the computing resources of a multicore central processor and graphics accelerators. The program was run with various data distributions, both with and without automatic balancing of the computational load, and the results were analyzed.</p>
      </abstract>
      <kwd-group>
        <kwd>parallel programming</kwd>
        <kwd>programming automation</kwd>
        <kwd>computational load balancing</kwd>
        <kwd>hybrid computing architectures</kwd>
        <kwd>NORMA language</kwd>
        <kwd>automatic program generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The modern world poses a huge number of different problems whose solution
requires significant computing power. Computation-intensive problems arise in
science and industry, in business, and in individual use. Typical examples of
such resource-intensive problems are the numerical solution of equations of mathematical
physics (e.g., modeling the processes occurring in a nuclear reactor) and the modeling of
physical, chemical and biological processes. New challenges of this kind are
constantly emerging. Modern computing systems intended for such problems provide
parallel computing facilities. Therefore, to be efficient on such systems, a program
must be parallel.</p>
      <p>
        There are various methods to automate the development of parallel programs.
Monographs [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] are devoted to this subject. They rigorously formulate the
mathematical basis for the joint study of parallel numerical methods and parallel computing
systems, and investigate the task of mapping a program onto the architecture of a
parallel computer. The automatic mapping of a given sequential program onto a
parallel computing system is stated there, in general form, as an NP-complete problem. This
fact explains why there is still no practical, satisfactory universal method for the
development of parallel programs.
      </p>
      <p>The theoretical difficulties arising in the development of parallel programs are
aggravated by the constant development and growing complexity of computing system
architectures. New architectural possibilities, on the one hand, provide new potential for accelerating
computations and, on the other hand, raise the problem of utilizing this potential by
developing methods and programming tools suited to these new possibilities.</p>
      <p>In addition to a general-purpose central processing unit (CPU), modern computing
systems typically contain additional computing units designed for fast and
energy-efficient massively parallel computation, in which the same operations are applied to a large amount of data.
Examples of such computing units include graphics processing units (GPUs)
and Xeon Phi accelerators. To be efficient, a parallel program must keep all the
computing units at its disposal continuously supplied with data for calculations. It should
also synchronize computations where necessary when accessing
shared data, while minimizing the idle time of computing units during synchronization and during access to
other software and hardware resources. If some computing units process the
data allocated to them faster than others and then stand idle
waiting for synchronization, the data being processed need to be redistributed between the
computing units while the program is running.</p>
      <p>Solving the problem of distributing data between the computing units is
called computational load balancing. If this problem has to be solved periodically
during program execution, it is called dynamic balancing of the
computational load. The efficiency of the program as a whole depends to a large extent on
how well this problem is solved.</p>
      <p>
        Research on programming methods for new architectures
and on implementing these methods in language tools for parallel programming
is very active and is supported by manufacturers of computer systems. A fairly
complete classification of architectures, methods and tools of parallel programming is
presented on the site [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is devoted, in particular, to parallel computing
technologies. Among the approaches already implemented, one is worth noting. It is based on a
quite reasonable combination of a parallelizing compiler and hints from the
programmer, given in the form of special directives, for example [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
        Another constructive approach to the problems of developing parallel
programs has also been worked out [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It determines the limits of automatic parallelization of a
particular program and provides facilities for automatic generation of an efficient parallel
program. This approach uses the non-procedural NORMA language [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as a
programming language.
      </p>
      <p>
        This article proposes and considers a method of organizing automatic dynamic
computational load balancing within a single node of a hybrid computing system
that has one or more CPUs and one or more additional computing units. The
distribution of computational load between the nodes of the computing system is
not considered in this work. The presented method was designed for use in the
NORMA programming system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] when compiling programs for computer systems with
graphics accelerators (GPUs). However, the method itself is universal and does not depend
on the specific type of hybrid computing system or on the software used. The
authors believe it can be successfully applied in the development of parallel programs with
computation-intensive operations, both in manual programming and in
automated program generation.
      </p>
      <p>The proposed method was successfully tested on a gas dynamics program with
computation-intensive operations. It was tested on parallel systems with a hybrid
architecture (with NVIDIA GPUs).</p>
    </sec>
    <sec id="sec-2">
      <title>Static computational load distribution method</title>
      <p>This section considers the distribution of computational load within a single node of a
hybrid computing system. Each such node has one or more central
processing units (CPUs). Since all modern CPUs are multi-core and have access to all
the RAM of the node, it does not matter how many of them are in the node: the
application program and the operating system always treat them all together as a
single multi-core processor. The node also contains one or more special computing units
(accelerators: GPUs, Xeon Phi, or perhaps some other kind).
Their number does matter, since each such computing unit can access and
process data only in its own memory.</p>
      <p>An efficient parallel program should use all available computing power.
Accordingly, in a node of a hybrid computing system, the entire amount of computational
work must be distributed in some way between the CPU and the accelerators.
Calculations on the CPU should be performed using multithreaded programming
technologies, such as OpenMP. Calculations on an accelerator should be performed
using a technology available for that type of accelerator, such as NVIDIA CUDA for
NVIDIA GPUs.</p>
      <p>
        The process of automatic static distribution of computations between the CPU and GPUs
when compiling programs written in the NORMA language is described in detail in
papers [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. The methods and ideas outlined in these articles can be applied to any
parallel program and to any other type of accelerator. In short, these methods are as
follows. In the NORMA language, an operator describing some calculations can be
defined on a domain. A domain is an analogue of the mathematical concept of a
grid. Thus, the operator describes a set of identical calculations performed at the points
(grid nodes) of a domain of any type. As a rule, calculations at each point of the domain
do not depend on the values calculated at other points by the same operator at the
same iteration step. In that case, the calculations done by one operator are independent at
each domain point and can be performed in parallel.
      </p>
      <p>To distribute such calculations between the CPU and the accelerators, it is proposed that
each such operator be executed both on the CPU and on each accelerator available in the
system. The entire domain of such an operator (or rather, the array of points of this
domain) is distributed among the computing units, and each computing unit executes the
operator only for the points assigned to it. For a computing unit to execute
the operator for its points, the arrays of variables defined
on these domain points, both those required for the calculations and those calculated
by the operator, must also be physically distributed between the memory
of the CPU and the memory of each accelerator. In effect, distributing
computations comes down to distributing data and then having each computing unit
perform synchronized calculations over its part of
the common data.</p>
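      <p>The idea can be illustrated with a minimal sketch, written in plain Python rather than in NORMA, OpenMP or CUDA. A pointwise operator is applied slice by slice, and each slice stands for the part of the domain held in the memory of one computing unit. The names split_domain, run_distributed and square_plus_one, as well as the slice sizes, are assumptions made for this illustration only and are not taken from the NORMA system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, List, Sequence, Tuple


def split_domain(n_points: int, sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Cut the index range [0, n_points) into consecutive half-open slices."""
    if sum(sizes) != n_points:
        raise ValueError("slice sizes must add up to the number of domain points")
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


def run_distributed(operator: Callable[[float], float],
                    data: Sequence[float],
                    sizes: Sequence[int]) -> List[float]:
    """Apply a pointwise operator slice by slice, as separate units would."""
    result: List[float] = [0.0] * len(data)
    for lo, hi in split_domain(len(data), sizes):
        # Each unit holds only its own slice of the input and output arrays.
        local_in = list(data[lo:hi])
        local_out = [operator(x) for x in local_in]
        result[lo:hi] = local_out
    return result


def square_plus_one(x: float) -> float:
    """Hypothetical pointwise operator: the value at a point depends on that point only."""
    return x * x + 1.0
```
]]></preformat>
      <p>Because the operator at each point does not depend on other points, the result assembled from the slices coincides with the result of applying the operator to the whole array, whatever slice sizes are chosen.</p>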
      <p>The following approach is proposed for distributing such variables. First,
the entire amount of data to be processed is divided into two parts, in general unequal: the area
processed by the CPU and the area processed by the accelerator (or accelerators). The sizes
of the areas are chosen according to the expected ratio of the CPU performance to the
total performance of the accelerators. Then, if there are several accelerators, their area
is divided into the corresponding number of equal subareas, one per
accelerator. An example of such a distribution is shown in Fig. 1.</p>
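      <p>The sizing rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name static_distribution, the use of abstract performance numbers, and the way a remainder is spread over the first subareas are all hypothetical and are not prescribed by the method.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple


def static_distribution(n_points: int, cpu_perf: float, gpu_perf_total: float,
                        n_gpus: int) -> Tuple[int, List[int]]:
    """Return (CPU area size, list of GPU subarea sizes) in domain points."""
    if n_gpus < 1 or cpu_perf < 0 or gpu_perf_total <= 0:
        raise ValueError("need at least one accelerator and positive performance")
    # The CPU area is proportional to the CPU's share of the expected performance.
    cpu_share = cpu_perf / (cpu_perf + gpu_perf_total)
    cpu_size = round(n_points * cpu_share)
    gpu_size = n_points - cpu_size
    # Split the GPU area equally; the first `extra` subareas get one more point.
    base, extra = divmod(gpu_size, n_gpus)
    subareas = [base + 1 if i < extra else base for i in range(n_gpus)]
    return cpu_size, subareas
```
]]></preformat>
      <p>For instance, with 10 points, an assumed CPU-to-accelerators performance ratio of 1 to 4 and three accelerators, the sketch assigns 2 points to the CPU area and splits the remaining 8 points into subareas of 3, 3 and 2 points.</p>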
      <p>[Fig. 1. Distribution of data into the CPU area and the GPU area; the GPU area is divided into subarea GPU1, subarea GPU2 and subarea GPU3.]</p>
      <p>Each computing unit performs calculations on the data that fall within its
area (subarea). In addition, data can be transferred between areas and
subareas where necessary.</p>
    </sec>
    <sec id="sec-3">
      <title>Dynamic computational load distribution method</title>
      <p>The boundary between the CPU area and the accelerator area can be fixed, or it can be
moved while the program runs. By changing the area boundary periodically,
the program performs dynamic balancing of the computational load, which is
discussed below. When the size of the GPU area changes, the size of each
accelerator's subarea is recalculated accordingly.</p>
      <p>To assess whether the position of the area boundary needs to be adjusted,
the overall efficiency of calculations under the current distribution of the
computational load is estimated periodically (for example, at every n-th iteration step). As
a result, it is decided whether to keep the current position of the
area boundary or to shift it by some amount in one direction or the other. In case of a shift,
the data that now fall into a different area or subarea must be redistributed, and from then on these
data are processed by another computing unit.</p>
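      <p>The periodic check can be pictured as a simple control loop. The sketch below is a hedged illustration, not the code of the program described in this paper: the function run_with_balancing and its callback parameters do_step, measure_cpu_load and choose_shift are hypothetical names, and the actual data redistribution is only marked by a comment.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable


def run_with_balancing(n_steps: int, check_every: int, gpu_share: float,
                       do_step: Callable[[float], None],
                       measure_cpu_load: Callable[[], float],
                       choose_shift: Callable[[float, float], float]) -> float:
    """Run the iterations, re-evaluating the area boundary every `check_every` steps.

    `gpu_share` is the GPU area size as a fraction of all data (0.0 .. 1.0).
    Returns the GPU share in effect after the last step.
    """
    for step in range(1, n_steps + 1):
        do_step(gpu_share)
        if step % check_every == 0:
            load = measure_cpu_load()               # load since the previous check
            shift = choose_shift(load, gpu_share)   # > 0 grows the GPU area
            if shift != 0.0:
                # A real program would redistribute the affected data here.
                gpu_share = min(1.0, max(0.0, gpu_share + shift))
    return gpu_share
```
]]></preformat>
      <p>Passing the measurement and the decision as callbacks keeps the loop independent of how the load is obtained and of how large a shift is chosen.</p>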
    </sec>
    <sec id="sec-4">
      <title>Method for determining whether the load distribution needs to be adjusted</title>
      <p>To determine the overall efficiency of the calculations, it is proposed to use a
method based on evaluating the program's use of the CPU resource. In developing
this method, we assume that a program solving a computational problem
should be constantly engaged in computing and should fully consume the resources of
the computing units. There may be some synchronization points at which individual
processes and threads wait for other processes and threads, but the waiting time
should be quite short compared to the computing time. If the program periodically
waits for external events and spends considerable time in standby, this
method cannot be applied to it.</p>
      <p>So, ideally, a computing program should create a 100% CPU load. If the program
is hybrid and performs calculations on the accelerator along with calculations on the
CPU, and the accelerator processes the data allocated to it more slowly than the
CPU processes its own, the program will wait at synchronization points for the
accelerator to finish. As a result, the CPU load will be less than 100%. It is easy
and quick for the program to obtain information about its consumption of the CPU
resource. In UNIX-family operating systems, for example, this is done with the
clock_gettime(...) call. A call with the CLOCK_REALTIME parameter gives the
system (wall-clock) time, and a call with the CLOCK_PROCESS_CPUTIME_ID parameter gives the
processor time consumed by the running program. If these two values are
sampled at the beginning and at the end of a period, the CPU load created by the program during this period can be
calculated by the following formula:</p>
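      <p><disp-formula><tex-math><![CDATA[\mathrm{CPUload} = \frac{t_{\mathrm{CLOCK\_PROCESS\_CPUTIME\_ID}}}{N_{\mathrm{thread}} \cdot t_{\mathrm{CLOCK\_REALTIME}}} \times 100\%,]]></tex-math></disp-formula>
where each t is the time elapsed over the period on the corresponding clock, and N<sub>thread</sub> is the number of CPU threads running.</p>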
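      <p>A minimal sketch of this measurement is given below. It uses time.clock_gettime from the Python standard library, which wraps the same system call and is available on Unix systems only. The names cpu_load_percent and CpuLoadMeter are hypothetical and introduced for this illustration; the program described in this paper is not written in Python.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import time


def cpu_load_percent(cpu_seconds: float, wall_seconds: float, n_threads: int) -> float:
    """CPU load in percent: CPU time / (threads * wall-clock time) * 100."""
    if wall_seconds <= 0 or n_threads < 1:
        raise ValueError("need a positive interval and at least one thread")
    return cpu_seconds / n_threads / wall_seconds * 100.0


class CpuLoadMeter:
    """Measures the CPU load of the current process between successive calls."""

    def __init__(self, n_threads: int) -> None:
        self.n_threads = n_threads
        self._cpu = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
        self._wall = time.clock_gettime(time.CLOCK_REALTIME)

    def sample(self) -> float:
        """Return the load since the previous sample (or since construction)."""
        cpu = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
        wall = time.clock_gettime(time.CLOCK_REALTIME)
        load = cpu_load_percent(cpu - self._cpu, wall - self._wall, self.n_threads)
        self._cpu, self._wall = cpu, wall
        return load
```
]]></preformat>
      <p>The process CPU time is summed over all threads of the process, which is why the formula divides by the number of threads. CLOCK_REALTIME follows the text above; if adjustments of the system clock are a concern, CLOCK_MONOTONIC could be used for the elapsed time instead.</p>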
      <p>The next issue is how to interpret the resulting CPU load. Recall that ideally a
computing program should have a CPU load of 100%. But the load will also be 100% if the
accelerator processes its data faster than the CPU and the program never
waits for the accelerator to finish. To be able to detect this situation, and to
allow the program to spend some time at synchronization points with other processes,
it is suggested that the acceptable CPU load be an empirical value of 95%. In
other words, the CPU is allowed to wait for the accelerator, but only for a
short time, no more than 5% of the total. If the CPU
load is higher than the acceptable value, the CPU area must be reduced
(and the accelerator area increased accordingly). If, on the contrary, it is lower than the
acceptable value, the CPU area must be increased.</p>
      <p>However, shifting the area boundary starts a data redistribution process, which
can take considerable time to complete. It is therefore highly desirable to avoid
constant switching between decreasing and increasing the areas. For this reason,
the acceptable CPU load is proposed to be not a single value but a small range,
for example, from 85% to 95%.</p>
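      <p>The resulting three-way decision can be written down directly. The sketch below assumes the example range of 85% to 95% given above; the names Action and decide are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from enum import Enum


class Action(Enum):
    KEEP = "keep the current distribution"
    GROW_CPU_AREA = "increase the CPU area"
    SHRINK_CPU_AREA = "reduce the CPU area"


def decide(cpu_load: float, low: float = 85.0, high: float = 95.0) -> Action:
    """Choose an action from the measured CPU load (percent) and the range [low, high]."""
    if cpu_load < low:
        # The CPU waits for the accelerator: give the CPU more data.
        return Action.GROW_CPU_AREA
    if cpu_load > high:
        # The CPU never waits, so the accelerator may be idle: give the CPU less data.
        return Action.SHRINK_CPU_AREA
    return Action.KEEP
```
]]></preformat>
      <p>Loads inside the range, including its end points, leave the distribution unchanged, so small fluctuations of the load do not trigger data redistribution.</p>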
      <p>If the CPU load is estimated periodically (for example, at every n-th iteration
step) over the time interval elapsed since the previous estimation, one can decide
whether to keep the current data distribution (if the CPU load is within the acceptable
range), to increase the CPU area (if the CPU load is below the range), or to reduce the
CPU area (if the CPU load is above the range). It is also important to determine how
far the area boundary should be shifted. Clearly, the more the
CPU load differs from the acceptable range, the more the area boundary should be moved.
On the other hand, the boundary should be moved carefully when the size of one
area greatly exceeds the size of the other, since even a small boundary shift can then
significantly alter the amount of data processed in the smaller area, which in turn can fundamentally
change the balance of computational load between the CPU and the accelerator. A
sudden change in the ratio of area sizes may cause a reverse change at the
next step. If that response is also sharp, it will cause constant oscillation, which
should be avoided, as explained earlier. Therefore, near the extreme values of
the relative size of the areas, the algorithm should
shift the boundary only by a small amount.</p>
      <p>As a result, the algorithm that determines the magnitude of the area boundary shift can
be described by a function of two variables: the deviation from the acceptable CPU load
and the current relative size of the areas.</p>
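      <p>The paper does not fix a particular form of this function, so the following is only one possible sketch. It assumes that the shift grows linearly with the deviation of the CPU load from the 85% to 95% range and is scaled by the size of the smaller area, with a small floor so that the boundary can still leave an extreme position. The function name boundary_shift and the parameters gain and min_scale are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def boundary_shift(cpu_load: float, gpu_share: float,
                   low: float = 85.0, high: float = 95.0,
                   gain: float = 0.5, min_scale: float = 0.02) -> float:
    """Signed change of the GPU share (fraction of all data); > 0 grows the GPU area."""
    if not 0.0 <= gpu_share <= 1.0:
        raise ValueError("gpu_share must be within [0, 1]")
    if cpu_load > high:
        deviation = (cpu_load - high) / 100.0    # CPU overloaded: grow the GPU area
    elif cpu_load < low:
        deviation = (cpu_load - low) / 100.0     # CPU waits: shrink the GPU area
    else:
        return 0.0
    # Damping: the smaller area sets the scale, so shifts shrink near 0% and 100%.
    scale = max(min_scale, min(gpu_share, 1.0 - gpu_share))
    shift = gain * deviation * scale
    # Never move the boundary outside the data.
    return max(-gpu_share, min(1.0 - gpu_share, shift))
```
]]></preformat>
      <p>With these assumed defaults, the same deviation produces a smaller shift when one area is much larger than the other, which is the cautious behaviour near the extreme values argued for above.</p>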
      <p>
        The proposed method can be expressed as an algorithm, is suitable for a wide range of
computational tasks, and does not depend on the characteristics of a particular
computing system. Therefore, it can be used for automatic balancing of the computational
load. It is planned to implement it in the NORMA compiler [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The main goal is
for the compiler to automatically generate all the code necessary to determine
the CPU load, to decide whether to shift the boundary of the areas, and to redistribute
the data being processed. In the meantime, the method has been implemented manually in
an application program for solving gas dynamics problems, and the next section gives the results of
its application.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Results of applying the proposed method</title>
      <p>The method of automatic balancing of computational load was implemented in a
hybrid application program that solves a gas dynamics problem. The program uses MPI
to engage several nodes of a distributed computer system, OpenMP
for multicore CPU computing within a single node, and NVIDIA CUDA
for GPU computing.</p>
      <p>
        The results below were obtained on the K-100 computational cluster [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] using the Intel compiler
15.0.0, the nvcc compiler version 6.5, and Intel MPI Library 5.0 Update 1. The program
starts 4 MPI processes, each running on its own cluster node with 12 CPU
cores and 3 GPUs. Tests were conducted with different numbers of GPUs, from 1 to
3, and the method worked properly regardless of the number of GPUs. However, because
this program is well suited to GPU calculations, the share of CPU calculations was
very small. Therefore, the data below are given for runs using only 1 GPU, so that the
CPU's contribution to the overall computation is more noticeable and the
processes taking place are easier to see. The program was also run on the K-60 computational
cluster [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with more powerful GPUs. The method of automatic balancing of the
computational load also worked well there, but the share of calculations on the CPU was
even smaller: 4% with only 1 GPU.
      </p>
      <p>Fig. 2 shows the program's execution time for different values of the GPU area
size. The first 10 columns correspond to runs with
a fixed area boundary and without automatic balancing of the computational
load. In these runs the GPU area size is set from 100% (when the CPU is not used at
all) down to 83%. The last 4 columns are runs with automatic computational load
balancing and different initial GPU area sizes: 100%, 75%, 50% and 0%.</p>
      <p>The diagram shows that the shortest execution time is achieved when the GPU area size is
set to 87% (and the CPU area size, accordingly, to 13%). With further small increases of the
CPU area size, the program's execution time begins to grow rapidly, in fact in proportion
to the CPU area size, because it is the CPU's working time that begins to
determine the running time of the entire program.</p>
      <p>Of particular interest are the columns that correspond to runs with
automatic computational load balancing. They show that if the initial distribution of
the computational load is chosen roughly correctly (d100, where the initial GPU area
size is 100%, and d75, where it is 75%), the program's total
execution time is close to the ideal. But if the initial distribution is chosen
poorly (d50 and, in particular, d0), the program spends considerable time reaching the
proper distribution.</p>
      <p>Fig. 3 shows how the area sizes change with the iteration step for
runs with automatic computational load balancing. The
computational load was analyzed and adjusted at every 100th iteration
step.</p>
      <p>[Fig. 3. Area size versus iteration step; horizontal axis: step of the iteration (×100).]</p>
      <p>The graphs show that the d100 and d75 runs reach the ideal distribution
by the 4th adjustment. The ideal distribution for this program on this hardware
can be seen in Fig. 2 and is taken to be a GPU area size of 87%. The d50 run
spends more steps on adjustment, and as a result its total execution time is
noticeably longer. The d0 run moves cautiously away from the zero GPU
area size for a very long time; it then passes through the intermediate values of the distribution
much faster and finally reaches the same ideal distribution,
87% for the GPU area size. But this takes 150 adjustments, and considerable time
is wasted.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>The proposed method of automatic computational load balancing, despite its
simplicity, can be successfully used when solving computation-intensive problems on hybrid
computing systems. Tests have shown that the described implementation of the method
brings the program to its ideal distribution of the computational load, and that when the
load changes slightly the method makes it possible to adapt to such changes
quickly. The method depends neither on the hardware of the hybrid
computing system nor on the software chosen for solving the applied problem.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Voevodin</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          :
          <article-title>Matematicheskie modeli i metody v parallelnykh protsessakh</article-title>
          .
          <source>Nauka</source>
          , Moscow (
          <year>1986</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Voevodin</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voevodin</surname>
            ,
            <given-names>Vl.V.</given-names>
          </string-name>
          :
          <article-title>Parallelnye vychisleniia</article-title>
          . BKhV-Peterburg, S. Peterburg (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Informational Analytical Center, http://parallel.ru/index_eng.html, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. OpenACC, http://openacc.org, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. DVM-system, http://www.keldysh.ru/dvm, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Sistema NORMA, http://www.keldysh.ru/pages/norma, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Andrianov</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baranova</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bugerya</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gladkova</surname>
            ,
            <given-names>E.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efimkin</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          :
          <article-title>Iazyk NORMA</article-title>
          .
          <source>Preprinty IPM im. M.V. Keldysha</source>
          (
          <year>2019</year>
          ), ISSN 2071-2898 (Print), ISSN 2071-2901 (Online), No
          <volume>132</volume>
          , 48 p., doi:10.20948/prepr-2019-132, http://library.keldysh.ru/preprint.asp?id=2019-132
          , last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Andrianov</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baranova</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bugerya</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efimkin</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          :
          <article-title>Raspredelenie vychislenii v gibridnykh vychislitelnykh sistemakh pri transliatsii programm na iazyke NORMA</article-title>
          .
          <source>Vychislitelnye metody i programmirovanie</source>
          (
          <year>2019</year>
          ), ISSN 1726-3522, M.: NIVTs MGU im. M.V. Lomonosova, Vol.
          <volume>20</volume>
          , № 3, P.
          <fpage>224</fpage>
          -
          <lpage>236</lpage>
          , DOI: 10.26089/NumMet.v20r321, http://num-meth.srcc.msu.ru/zhurnal/tom_2019/pdf/v20r321.pdf, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Andrianov</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baranova</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bugerya</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efimkin</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          :
          <article-title>Metody raspredeleniia vychislenii pri avtomaticheskom rasparallelivanii neprotsedurnykh spetsifikatsii</article-title>
          .
          <source>Superkompiuternye dni v Rossii: Trudy mezhdunarodnoi konferentsii</source>
          . 23-24 sentiabria 2019 g., g. Moskva. Pod. red. Vl.V. Voevodina. M.: MAKS Press (
          <year>2019</year>
          ), ISBN 978-5-317-06007-7, e-ISBN 978-5-317-06244-6, P.
          <fpage>59</fpage>
          -
          <lpage>70</lpage>
          , DOI: 10.29003/m680.RussianSCDays, URL: http://russianscdays.org/files/2019/pdf/59.pdf, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Andrianov</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bugerya</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Efimkin</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koludarov</surname>
            ,
            <given-names>P.I.</given-names>
          </string-name>
          :
          <article-title>Modulnaia arkhitektura kompiliatora iazyka Norma+</article-title>
          . M.:
          <source>Preprint IPM im. M.V. Keldysha RAN</source>
          (
          <year>2011</year>
          ), No
          <volume>64</volume>
          , 16 p., http://keldysh.ru/papers/2011/prep64/prep2011_64.pdf, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          Tsentr kollektivnogo polzovaniia IPM im. M.V. Keldysha RAN, http://ckp.kiam.ru/?hard, last accessed
          <year>2020</year>
          /11/25.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>