<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HYBRID ALGORITHM NEWTON METHOD FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS WITH BLOCK JACOBI MATRIX</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>V.M. Glushkov Institute of Cybernetics of NAS of Ukraine</institution>
          ,
          <addr-line>40, Acad. Glushkov Av., 03187, Kyiv tel. (066)-367-85-40</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>V.M. Glushkov Institute of Cybernetics of National Academy of Sciences of Ukraine</institution>
          ,
          <addr-line>Kyiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Systems of nonlinear equations often arise when modelling processes of different nature. These can be both independent problems describing physical processes and also problems arising at the intermediate stage of solving more complex mathematical problems. Usually, these are high-order tasks with a big count of un-knows, that better take into account the local features of processes or things that are modelled. Also, more accurate discrete models allow for more accurate solutions. Usually, the matrices of such problems have a sparse structure. Often the structure of sparse matrices is one of next: band, profile, blockdiagonal with bordering, etc. In many cases, the matrices of the discrete problems are symmetric and positively defined or half-defined. The solution of systems of nonlinear equations is performed mainly by iterative methods based on the Newton method, which has a high convergence rate (quadratic) near the solution, provided that the initial approximation lies in the area of gravity of the solution. In this case, the method requires, at each iteration, to calculates the Jacobi matrix and to further solving systems of linear algebraic equations. As a consequence, the complexity of one iteration is O(n3 ) . Using parallel computations in the step of the solving of systems of linear algebraic equations greatly accelerates the process of finding the solution of systems of nonlinear equations. In the paper, a new method for solving systems of nonlinear high-order equations with the Jacobi block matrix is proposed. The basis of the new method is to combine the classical algorithm of the Newton method with an efficient small-tile algorithm for solving systems of linear equations with sparse matrices. The times of solving the systems of nonlinear equations of different orders on the nodes of the SKIT supercomputer are given.</p>
      </abstract>
      <kwd-group>
        <kwd>systems of nonlinear equations</kwd>
        <kwd>hybrid algorithm</kwd>
        <kwd>sparse matrices</kwd>
        <kwd>systems of linear algebraic equations</kwd>
        <kwd>high-performance computing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Systems of nonlinear equations (SNE) often arise when modelling processes of different nature. These can be both
stand-alone problems that describe physical processes and problems that arise in the intermediate stage of solving more
complex mathematical problems. As a rule, these are problems of ultrahigh orders that allow better consideration to take
into account the local characteristics of the process or phenomenon (occurrence) being modelled. Also, more accurate
discrete models allow for more accurate approximations.</p>
      <p>On the other hand, the matrices of such problems have a sparse structure. Most often it is band, profile,
blockdiagonal with bordering, etc. In many cases, the matrices of discrete problems are symmetric and positively defined or
half -definite.</p>
      <p>The solution SNE is carried out (Systems of nonlinear equations are solved) mainly by iterative methods based on
Newton's method, which has a high convergence rate (quadratic) near the solution, provided that the initial
approximation lies in the region of convergence of the solution. In this case, the method requires at each iteration the
computation of the Jacobi matrix and further solution of systems of linear algebraic equations (SLAE). As a
consequence, the complexity of one iteration is O(n3 ) .</p>
      <p>Parallelizing the calculation of the SLAE solution greatly accelerates the process of finding the SNE solution.</p>
    </sec>
    <sec id="sec-2">
      <title>Statement of the problem with approximate initial data</title>
      <sec id="sec-2-1">
        <title>Let a system of n nonlinear equations be given</title>
        <p>
          Problem (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) is some approximation to the exact system of nonlinear equations (x)=0, and for these vector
functions the inequality holds:
        </p>
        <p>f u    u   
 fi 
If A   </p>
        <p> x j i, j1
for any n-dimensional vector u.</p>
        <p>
          To solve problem (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) initial approximation x0 and required accuracy to get an approximation to the solution of
the system  are given and the area in which the solution is sought is determined D  ai  xi  bi , i  1, 2,, n.
Herewith the initial approximation belongs to a given area x 0  D . The lower index in the formulas denote the
component numbers of the vectors, and the upper index will denote the iteration numbers.
        </p>
        <p>n</p>
        <p>
          is the Jacobi matrix of system (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) (or some approximation to it), then the iterative process of
Newton's method of finding the solution at a given initial approximation is written in the form
where wk   xk 1  xk  correction, k = 0, 1, ... – is the iteration number. The next approximation to the solution is
calculated by the formula:
        </p>
        <p>
          As can be seen from formula (
          <xref ref-type="bibr" rid="ref2">2</xref>
          ) it is necessary to solve a system of linear algebraic equations, calculating the
values of the vector function and the Jacobi matrix, at each iteration.
        </p>
        <p>
          To obtain a solution of the system of nonlinear equations (
          <xref ref-type="bibr" rid="ref1">1</xref>
          ) with a given accuracy xk   x   , the iterative process
must be completed if the condition
        </p>
        <p>Ak wk    f x k  
xk 1  xk   wk 
f x (k 1)  </p>
        <p>
          
A(k 1) 1
,
where A1(k ) is a matrix inverse to the Jacobi matrix calculated on the k-th iteration, and the relationship
will be
iterations f x k     , and after its execution, we check conditions (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ).
denoted as  [1]. Since to check condition (
          <xref ref-type="bibr" rid="ref4">4</xref>
          ) at each iteration it is necessary to rotate the Jacobi matrix, which requires a
significant number of arithmetic operations, then on the first iterations check the more economical condition of the end of
        </p>
        <p>
          After completing the iterative process, the error of the obtained approximation to the solution of the problem
with approximate data relative to the exact solution of the system with accurate data is calculated:

A1k 
(
          <xref ref-type="bibr" rid="ref2">2</xref>
          )
(
          <xref ref-type="bibr" rid="ref3">3</xref>
          )
(
          <xref ref-type="bibr" rid="ref4">4</xref>
          )
xk  x    Ak1  ,
where x is the exact solution of the exact SNE.
        </p>
        <p>It follows that when solving systems of nonlinear equations, most of the arithmetic operations are performed
when calculating the values of the vector function, solving the corresponding SLAE and calculating the inverse matrix
to find the decoupling estimate. In the general case, the complexity of Newton's iterative method is O(n 3 ) . To reduce
the execution time of one iteration for solving SLAE, we will use a small-tile hybrid factorization algorithm, where  f x (k ) 
is the right part of SLAE.</p>
        <p>Small-tile hybrid factorization algorithm of the sparse block-diagonal matrix with a border</p>
      </sec>
      <sec id="sec-2-2">
        <title>Consider the problem of solving a system of linear algebraic equations</title>
        <p>~
A~~x  b
with the symmetric positive-definite sparse matrix of order n.</p>
        <p>The ideological prerequisite for the use of hybrid calculations in the processing of sparse matrices of arbitrary
structure is the preliminary rearrangement of the structure of the original matrix by the method of parallel sections,
which leads the matrix to a block-diagonal view with a border [2, 3]</p>
        <p>D22
D33




0
0
D p1 p1
C pp1
C 2 p  ,
C3 p 
 </p>
        <p>
C p1 p </p>
        <p></p>
        <p>D pp 
x  PT ~x, b  PT b~ ,
where P – permutation matrix, and blocks Dii, Cpi, Cip, i  1, p  1 , Dpp retain a sparse structure, р – the number of
diagonal blocks in the matrix, р ≥ 3.</p>
        <p>It is natural for the method of parallel sections that the size of the last diagonal block is much smaller than the
orders of the other diagonal blocks. Also, the structure of the block-diagonal matrix with a border allows to carry out
parallel and independent processing of blocks. Dii, Cpi, Cip, i  1, p  1 .</p>
        <p>Given the structure of the resulting matrix for data storage, it is advisable to use a block distribution. Moreover,
the number of blocks should be chosen taking into account the number of processors and graphics accelerators that are
involved in the calculations.</p>
        <p>In Fig. 1 shows the profile of matrix fill by nonzero elements. As a result of the application of the method of
parallel sections, the structure of the matrix takes the form shown in Fig. 2.</p>
        <p>Consider a small-tile hybrid factorization algorithm of a sparse block-diagonal matrix with a border A. This
algorithm better takes into account the profile or sparse structure of diagonal blocks, blocks with a border.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Divide matrix A into blocks of dimension с×с.</title>
        <p>Further, to factorization a block-diagonal matrix, we apply the algorithm proposed in [4] for dense matrices.</p>
      </sec>
      <sec id="sec-2-4">
        <title>To factorization the matrix during the k-th step, we use the following relations</title>
        <p>
          (
          <xref ref-type="bibr" rid="ref6">6</xref>
          )
(
          <xref ref-type="bibr" rid="ref7">7</xref>
          )
(
          <xref ref-type="bibr" rid="ref8">8</xref>
          )
where the dimensions of the blocks
A11 – с×с,
        </p>
        <sec id="sec-2-4-1">
          <title>A21 – (n-kс)с, A22 – (n-kс)(n-kс), blocks and take into account the</title>
          <p>structure of the blocks Dii, Cpi, Dpp.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>From here we get the algorithm by which the decomposition is carried out during the k step</title>
        <p>
          Note that the implementation (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ) - (
          <xref ref-type="bibr" rid="ref10">10</xref>
          ) at each step modifies only the blocks Dii, Cpi, i  1, p  1 , Dpp.
        </p>
        <p>To implement the algorithm we will use the following data distribution: in GPU(i) i  1, p 1 are containing
blocks Dii, Сpi and block A(pip) the same size as Dрр; in GPU(p) the Dpp block is stored.</p>
        <p>In fig. 3 show the block distribution of data at the k-th step of factorization of the block-diagonal matrix with a
border, taking into account the above-proposed decomposition.</p>
        <p>The parallelization of triangular factorization calculations is that the factorization of blocks A11 and the
modification of A21 and A22 can be done independently in each CPU(i) and GPU(і) i  1, p  1 .</p>
        <p>A11  L11 * LT11 ;</p>
        <p> T 1
L21  A21 * L11</p>
        <p>;
~
A22  A22  L21 * LT21.
 in GPU(i), i  1, p  1 we modify the column of blocks L21 :
 in GPU(i), i  1, p  1 we modify the blocks of the matrix A22 with A(pip) by the formula:
 In the CPU(p), using multi-assembly, we modify D pp :</p>
        <p>A11  L11 * LT11 ;
L21  A21 LT11 1 ;
~
A22  A22  L21LT21 .</p>
        <p>p1
D pp  D pp   A(pip) .</p>
        <p>*
i1</p>
        <p>
          (
          <xref ref-type="bibr" rid="ref9">9</xref>
          )
(
          <xref ref-type="bibr" rid="ref10">10</xref>
          )
(11)
We factorize the block D*pp , thereby completing the process of factorization of the matrix.
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>We will assume that the orders of all diagonal blocks are approximately equal to where s is the order of the last diagonal block. Then the following theorems are valid. Theorem 1. The number of operations, which performed on one GPU to find the factorization of a sparse blockdiagonal matrix with a border is estimated by the value</title>
        <p>qi  q 
n  s
p  1</p>
        <p>,
N p 
q
Proof. We introduce the following notation l  the number of rows of tiles in the diagonal block.</p>
        <p>c</p>
        <p>
          Since (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ) - (11) are performed in parallel and independently in all p-1 pairs of CPU and GPU and the maximum
number of operations falls on step (11), the estimate of the number of operations performed on one GPU is determined
by the complexity of step (11).
        </p>
      </sec>
      <sec id="sec-2-7">
        <title>The number of operations required to perform (11) can be estimated by the value</title>
      </sec>
      <sec id="sec-2-8">
        <title>Let's write the first term from (12)</title>
      </sec>
      <sec id="sec-2-9">
        <title>Here</title>
      </sec>
      <sec id="sec-2-10">
        <title>We substitute the value of l in the formula and obtain</title>
      </sec>
      <sec id="sec-2-11">
        <title>From here</title>
      </sec>
      <sec id="sec-2-12">
        <title>Let's write the second term from (12).</title>
        <p> l q  k 2
N p  2c 
 k1 2
k1
l
  q  kcs  l
s2 
2 
.</p>
        <p>l q  k 2
2c
k1 2</p>
        <p> l   l l l 
= c  q  k 2  = c  q 2   2qkc   k 2c 2  .</p>
        <p>
 k1   k1 k1 k1 
l l l l l
 q 2  lq 2 ,  2qkc  2qc k  qcl 2 ,  k 2c 2  c 2  k 2 
k1
k1
k1
k1
k1
c 2l 3
3
.
k1
l q 2  q3
c</p>
        <p>l
,  2qkc 
k1
q3 , l k 2c 2  q3
c 3c
k1</p>
        <p>.</p>
        <p>l q  k 2
2c
k1 2
 q3c  q3
3c 3
.</p>
        <p>l  l l  
2c sq  kc  2cs  q   kc   2cs ql 
k 1  k 1 k 1  
l 2c   q 2</p>
        <p>  2cs c 
2 
q 2 c   q 2 
  2cs 2c </p>
        <p>  sq 2 .
2c 2 
(15)</p>
      </sec>
      <sec id="sec-2-13">
        <title>The last term in (12) can be neglected because the order of the last diagonal block is small.</title>
      </sec>
      <sec id="sec-2-14">
        <title>Substituting (14) and (15) in (13) we obtain</title>
        <p>N p 
q3  sq 2  q 2 q  3s .
3 3</p>
      </sec>
      <sec id="sec-2-15">
        <title>Theorem proved.</title>
        <p>Theorem 2. Acceleration (speed up) of small-tile hybrid algorithm LLT - decomposition of a sparse
blockdiagonal matrix with a border A, is</p>
        <p>
S p   p  11 


3( p  1)s 2   2 p
2q 2 q  3s  opp    p  1
 4sq2c   p4c1s  opg  ,
1
where c is the size of the tile
(13)
(12)
(14</p>
        <p>Proof. To prove the theorem, we will use the methodology for estimating the efficiency and acceleration of
hybrid algorithms, given in [5].</p>
        <p>We will find the amount of data that are transferred between processors during the execution of the algorithm.
Because the ring communication topology is used and the processors exchange s 2 machine words, the total amount of
 р 1s2</p>
        <p>2</p>
        <p>Before execution of a multi-collection operation in all CPU and GPU pairs except the last one must exchange
data of volume s 2 . A similar exchange also executed in the last pair of CPU and GPU.</p>
        <p>Also in the process of factorization of diagonal blocks Dii all pairs of CPU and GPU except the last one exchange
data of volume 2qc. The last pair of CPU and GPU exchanges data of volume 2sc.</p>
        <p>The total amount of data exchanged by all CPU and GPU pairs is equal to 2qc p 1  ps 2  2sc .</p>
        <sec id="sec-2-15-1">
          <title>We will use the value of N p from Theorem 1 when calculating T1 and T р .</title>
          <p>T1 
 p  1N p
no
t g ,</p>
          <p>Tp 
tg 
topp  2qc p 1  ps2  2sctopg .</p>
          <p>Substituting all the values found in the formula for calculating the acceleration factor Sp, we obtain
.</p>
          <p>topp  2qc p 1  ps 2  2sctopg</p>
          <p>S p 
 p 1s 2</p>
          <p>2
S p 
1
1
1   p 1s 2
N p  2</p>
          <p>
 opp  2qc p 1  ps 2  2sc opg </p>
          <p>
 p 1
3( p 1)s 2   2 p
2q 2 q  3s  opp    p 1
 4sq2c   p4c1s  opg </p>
        </sec>
      </sec>
      <sec id="sec-2-16">
        <title>Dividing the numerator and denominator by</title>
      </sec>
      <sec id="sec-2-17">
        <title>We take out for brackets</title>
      </sec>
      <sec id="sec-2-18">
        <title>Theorem proved. According to the above algorithms, programs were developed and the following numerical experiments were performed.</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Numerical experiments</title>
      <p>Computational experiments to solving the SNE were performed on the nodes of the supercomputer SKIT [6]
with the following characteristics:</p>
      <sec id="sec-3-1">
        <title>1. 2 CPUs Intel Xeon E5-2600 with a frequency of 2.6 GHz;</title>
      </sec>
      <sec id="sec-3-2">
        <title>2. integrated with the general data storage of a cluster complex of 200 TB;</title>
      </sec>
      <sec id="sec-3-3">
        <title>3. data network between Infiniband FDR 56 Gbps;</title>
        <p>4. 128 GB of RAM;</p>
      </sec>
      <sec id="sec-3-4">
        <title>5. 3 accelerators NVidia Tesla M2075.</title>
        <p>With a given initial approximation x0 =0,5 in the region D   1000  xi  1000, i  0,1,2,, n  1 the
following system of nonlinear equations with the Jacobi block matrix was solved by Newton's method:
- the first block contains the following m equations:</p>
        <p>4x12  x22  xm21  xm2  3  0 ,
 2x j 1x j  5x 2j  x 2j1  xm j 1  xm2 j  xm j 1  3  0 ,
- following N-2 (K=2,…, N-1) blocks each contain m equations:
 2xKm2m j xKmm j  xKm2m2  5xK2mm j  xK2mm j 1  xK2m j  xKm j 1  3  0 , j = 1,
xKm2m j1  2xKm2m j xKmm j  xKm2m j1 
 2xKm2m j xKmm j  6xK2mm j  xK2mm j 1  xKm j 1  xK2m j  xKm j 1  4  0 ,
j = 2, , m-1
xKmm1  2xKmm xKm  2xKm1xKm  6xK2m  xKmm1  xK2mm  3  0
- last m equations:
 2x(N 2)m1x(N 1)m1  x(N 2)m2  5x(2N 1)m1  x(2N 1)m2  3  0 ,</p>
        <p>x(N 2)m j 1  2x(N 2)m j x(N 1)m j  x(N 2)m j1 
 2x(N 1)m j x(N 1)m j  6x(2N 1)m j  x(2N 1)m j1  3  0 ,j = 2, , m-1</p>
        <p>x(N 1)m1  2x(N 1)m xNm  2xNm1xNm  6xN2m  3  0
where m is the number of rows in the block, N is the number of blocks. Then N  (n 1) / m 1 or n=Nm.</p>
        <p>When software implementation of algorithms on computers of hybrid architecture it is advisable to use the functions of
optimized software libraries. In particular, such libraries include Intel MKL [7], CUBLAS [8].</p>
        <p>The following library's functions are used to implement the main computational stages of algorithms:
 dpotrf – finding LLT decomposition of a dense matrix. The function is performed on the CPU;
 cublasDsyrk – performs s-rank modification of a dense matrix. The function is performed on the GPU;
 cublasDgemm – finds the multiplication of two dense matrices. The function is performed on the GPU;
 cublasDtrsm – solving a triangular system with many right parts.. The function is performed on the GPU.</p>
        <p>The algorithm uses functions from the MPI [9] and CUDA libraries [10] to perform communications:
 Mpi_reduce – global reduction function with saving the result in the specified processor.
 cudaMemcpyAsync – a function that performs asynchronous data copying between the CPU and the GPU.</p>
        <p>Running the function in multiple cudaStream streams reduces the execution time of the program because
copying is performed against the background of calculations and does not cause a stop in the calculations.
Calculations on the GPU are also performed simultaneously in different cudaStream streams.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Here are some of the results.</title>
        <p>Table 1 shows the time (sec.) Execution of one iteration of Newton's method using a parallel variant (8 CPU) of
the fine-tile factorization algorithm of the block-diagonal matrix with a frame. Tile size is 128.</p>
        <p>15000
20000
25000
0.0393
0.0354</p>
        <p>From the table, we can conclude that the increase in the order of the system which is solving and the
transformation of the matrix for different numbers of processors contribute to a uniform load of processors by
calculations. Another factor that contributes to such a reduction in calculation time is the adjustment of the tile size.
With the right choice of tile size, the effect of caching calculations can be achieved, which can manifest itself as
superacceleration on certain parameter configurations.</p>
        <p>In fig. 4. the dependence of the time of execution of calculations of the hybrid variant (8 CPU + 1 GPU) of the
fine-tile algorithm on the number of flows and the size of the tile is shown. The order of the matrix is 10050.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>A new method for solving systems of large-orders nonlinear equations with Jacobi block matrices is considered.
Its ideological basis is a combination of the classical algorithm of Newton's method with an efficient fine-tile algorithm
for solving systems of linear equations with sparse matrices. The times of solving SNE of different orders on the nodes
of the SKIT supercomputer are given. A significant reduction in the solution time of systems of nonlinear equations is
obtained. This algorithm allows you to adjust the dimension of the block with which the calculations are performed at
each step of the algorithm. This can have a computational caching effect when the blocks are completely stored in the
GPU's fast memory. Also, this block structure allows you to work with inseparable data sets on the GPU, which reduces
the number of index operations and checks, which are quite expensive on a graphics accelerator.</p>
      <p>References:
About the authors:</p>
      <p>Place of work of the authors:</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nesterenko</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Khimich</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Yakovlev</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          (
          <year>2006</year>
          )
          <article-title>To the problem of solving of non-linear systems on multi-processor distributed memory computing system</article-title>
          .
          <source>Gerald of computer and information technologies. 10</source>
          . pp.
          <fpage>54</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Dzhordzh</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Chyslennoe reshenye bol shykh razrezhennykh system uravnenyy / A</article-title>
          . Dzhordzh, Dzh. Lyu. - M.:
          <string-name>
            <surname>Myr</surname>
          </string-name>
          ,
          <year>1984</year>
          . -
          <fpage>334</fpage>
          s.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Pysanetsky</surname>
            <given-names>S</given-names>
          </string-name>
          . Tekhnolohyya razrezhennykh matryts / Pysanetsky S. - M.:
          <string-name>
            <surname>Myr</surname>
          </string-name>
          ,
          <year>1988</year>
          . -
          <fpage>410</fpage>
          s.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Alfredo</given-names>
            <surname>Buttari</surname>
          </string-name>
          , Julien Langou, Jakub Kurzak, and
          <article-title>Jack Dongarra: A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures</article-title>
          .
          <source>Parallel Computing</source>
          , Volume
          <volume>35</volume>
          ,
          <string-name>
            <surname>Issue</surname>
            <given-names>1</given-names>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          38-
          <fpage>53</fpage>
          ,
          <year>2009</year>
          , ISSN:
          <fpage>0167</fpage>
          -
          <lpage>8191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>O.V.</given-names>
            <surname>Popov</surname>
          </string-name>
          .
          <article-title>Doslidzhennya efektyvnosti paralel nykh alhorytmiv dlya komp yuteriv hibrydnoyi arkhitektury. Materialy Mizhnarodnoyi naukovoyi shkoly-seminaru «Pytannya optymizatsiyi obchyslen (POO XLII)», (Zakarpat·s ka obl</article-title>
          .,
          <source>Mukachivs kyy r</source>
          .
          <article-title>-n, smt</article-title>
          .
          <source>Chynadiyevo</source>
          ,
          <volume>21</volume>
          -
          <fpage>25</fpage>
          veresnya
          <year>2015</year>
          ), Kyyiv:
          <year>2015</year>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>16</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Rezhym dostupu: http://icybcluster.org.ua</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Intel® Math Kernel Library (Intel® MKL) - Rezhym dostupu: https://software.intel.com/en-us/intel-mkl</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>CUDA</given-names>
            <surname>CUBLAS_Library - Santa Clara</surname>
          </string-name>
          : Nvidia,
          <year>2010</year>
          . -
          <fpage>254</fpage>
          c.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>W. Gropp</surname>
          </string-name>
          <article-title>Using MPI-2: Advanced Features of the Message</article-title>
          -Passing Interface / W. Gropp, E. Lusk, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Thakur</surname>
          </string-name>
          . - Cambridge: MIT Press,
          <year>1999</year>
          . - 382 p.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Boreskov</surname>
            <given-names>A.V.</given-names>
          </string-name>
          <article-title>Osnovy raboty s tekhnolohyey CUDA / Boreskov A</article-title>
          .V.,
          <string-name>
            <surname>Kharlamov</surname>
            <given-names>A</given-names>
          </string-name>
          .A. - M.: DMK Press,
          <year>2010</year>
          .
          <article-title>- 232 s.</article-title>
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Khimich</surname>
          </string-name>
          ,
          <article-title>Head's assistant, number of scientific publications in Ukrainian publications -</article-title>
          110, https://orcid.org/0000-0001-
          <fpage>9284</fpage>
          -139X.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>