=Paper=
{{Paper
|id=Vol-2866/ceur_208-217himich20
|storemode=property
|title=Hybrid Algorithm Newton Method for Solving Systems of Nonlinear Equations with Block Jacobi Matrix
|pdfUrl=https://ceur-ws.org/Vol-2866/ceur_208-217himich20.pdf
|volume=Vol-2866
|authors=Alexandr Khimich,Volodymyr Sydoruk,Alla Nesterenko
|dblpUrl=https://dblp.org/rec/conf/ukrprog/KhimichSN20
}}
==Hybrid Algorithm Newton Method for Solving Systems of Nonlinear Equations with Block Jacobi Matrix==
UDC 519.6

A.N. Khimich [0000-0001-9284-139X], V.A. Sydoruk [0000-0003-0210-6020], A.N. Nesterenko [0000-0001-6174-1812]

V.M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

'''Abstract.''' Systems of nonlinear equations often arise when modelling processes of different nature. These can be independent problems describing physical processes, as well as problems arising at an intermediate stage of solving more complex mathematical problems. Usually these are high-order problems with a large number of unknowns, which better capture the local features of the processes or objects being modelled; more accurate discrete models also allow for more accurate solutions. The matrices of such problems usually have a sparse structure, most often band, profile, or block-diagonal with a border. In many cases the matrices of the discrete problems are symmetric and positive definite or semi-definite. Systems of nonlinear equations are solved mainly by iterative methods based on Newton's method, which has a high (quadratic) convergence rate near the solution, provided that the initial approximation lies in the region of attraction of the solution. At each iteration the method requires computing the Jacobi matrix and then solving a system of linear algebraic equations; as a consequence, the complexity of one iteration is $O(n^3)$. Using parallel computations at the stage of solving the systems of linear algebraic equations greatly accelerates the process of finding the solution of systems of nonlinear equations. In this paper, a new method for solving high-order systems of nonlinear equations with a block Jacobi matrix is proposed. The basis of the new method is the combination of the classical algorithm of Newton's method with an efficient small-tile algorithm for solving systems of linear equations with sparse matrices. The times of solving systems of nonlinear equations of different orders on the nodes of the SKIT supercomputer are given.

'''Key words:''' systems of nonlinear equations, hybrid algorithm, sparse matrices, systems of linear algebraic equations, high-performance computing.

===Introduction===

Systems of nonlinear equations (SNE) often arise when modelling processes of different nature. These can be stand-alone problems that describe physical processes, as well as problems that arise at an intermediate stage of solving more complex mathematical problems. As a rule, these are problems of ultrahigh order, which make it possible to take into account the local characteristics of the process or phenomenon being modelled; more accurate discrete models also yield more accurate approximations. On the other hand, the matrices of such problems have a sparse structure, most often band, profile, or block-diagonal with a border. In many cases the matrices of discrete problems are symmetric and positive definite or semi-definite. SNE are solved mainly by iterative methods based on Newton's method, which has a high (quadratic) convergence rate near the solution, provided that the initial approximation lies in the region of convergence.
At each iteration the method requires the computation of the Jacobi matrix and the subsequent solution of a system of linear algebraic equations (SLAE); as a consequence, the complexity of one iteration is $O(n^3)$. Parallelizing the solution of the SLAE greatly accelerates the process of finding the SNE solution.

===Statement of the problem with approximate initial data===

Let a system of $n$ nonlinear equations be given:

: $f(x) = 0$, (1)

where $x = (x_1, x_2, \ldots, x_n)^T$ is the $n$-dimensional vector of the desired solution and $f(x) = (f_1(x), f_2(x), \ldots, f_n(x))^T$ is the $n$-dimensional vector function.

Problem (1) is an approximation to the exact system of nonlinear equations $\varphi(x) = 0$, and for these vector functions the inequality $\|f(u) - \varphi(u)\| \le \delta$ holds for any $n$-dimensional vector $u$.

To solve problem (1), an initial approximation $x^0$ and the required accuracy $\varepsilon$ of the approximation to the solution are given, and the domain in which the solution is sought is defined: $D = \{a_i \le x_i \le b_i,\ i = 1, 2, \ldots, n\}$, with $x^0 \in D$. Below, lower indices denote the component numbers of vectors, and upper indices denote iteration numbers.

If $A = \left(\frac{\partial f_i}{\partial x_j}\right)_{i,j=1}^{n}$ is the Jacobi matrix of system (1) (or some approximation to it), then the iterative process of Newton's method for a given initial approximation is written in the form

: $A^k w^k = -f(x^k)$, (2)

where $w^k = x^{k+1} - x^k$ is the correction and $k = 0, 1, \ldots$ is the iteration number. The next approximation to the solution is calculated by the formula

: $x^{k+1} = x^k + w^k$. (3)

As can be seen from formula (2), at each iteration it is necessary to calculate the values of the vector function and the Jacobi matrix and to solve a system of linear algebraic equations. To obtain a solution of system (1) with the given accuracy, the iterative process is terminated when the condition

: $\|(A^{(k+1)})^{-1}\| \cdot \|f(x^{(k+1)})\| \le \varepsilon$ (4)

is satisfied, where $(A^{(k)})^{-1}$ is the matrix inverse to the Jacobi matrix calculated at the $k$-th iteration [1]. Since checking condition (4) at each iteration requires inverting the Jacobi matrix, which takes a significant number of arithmetic operations, at the first iterations the more economical termination condition $\|f(x^k)\| \le \varepsilon_1$ is checked, and only after it is satisfied is condition (4) checked. After completing the iterative process, the error of the obtained approximation to the solution of the problem with approximate data relative to the exact solution of the system with exact data is estimated:

: $\|x^k - \tilde{x}^*\| \le \|(A^k)^{-1}\|(\varepsilon + \delta),$

where $\tilde{x}^*$ is the exact solution of the exact SNE.

It follows that when solving systems of nonlinear equations, most of the arithmetic operations are spent on calculating the values of the vector function, solving the corresponding SLAE, and calculating the inverse matrix for the error estimate. In the general case, the complexity of Newton's iterative method is $O(n^3)$. To reduce the execution time of one iteration, for solving the SLAE we will use a small-tile hybrid factorization algorithm, where $-f(x^{(k)})$ is the right-hand side of the SLAE.
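For illustration, below is a minimal sequential NumPy sketch of iteration (2)–(3) with the two-stage termination test described above: the cheap residual check is evaluated at every iteration, and the more expensive condition (4) only after it holds. This is only a dense-matrix illustration of the scheme, not the authors' hybrid implementation (which solves the SLAE with the small-tile algorithm); the function name and the tolerance names eps and eps1 are ours.

```python
import numpy as np

def newton_sne(f, jac, x0, eps=1e-10, eps1=1e-8, max_iter=50):
    """Newton iteration (2)-(3) with the two-stage stopping test:
    the cheap test ||f(x_k)|| <= eps1 is checked first, and only after
    it holds is condition (4), ||A_k^{-1}|| * ||f(x_k)|| <= eps,
    evaluated (it requires inverting the Jacobi matrix)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        A = jac(x)                         # Jacobi matrix A_k
        w = np.linalg.solve(A, -f(x))      # A_k w_k = -f(x_k)    (2)
        x = x + w                          # x_{k+1} = x_k + w_k  (3)
        if np.linalg.norm(f(x)) <= eps1:   # economical preliminary test
            inv_norm = np.linalg.norm(np.linalg.inv(jac(x)))
            if inv_norm * np.linalg.norm(f(x)) <= eps:  # condition (4)
                return x, k + 1
    return x, max_iter

# toy 2x2 system: x0^2 + x1^2 - 4 = 0, x0 - x1 = 0
f = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0] - x[1]])
jac = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
x, iters = newton_sne(f, jac, np.array([0.5, 0.5]))
print(x, iters)   # converges to approx [sqrt(2), sqrt(2)]
```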
===Small-tile hybrid factorization algorithm for a sparse block-diagonal matrix with a border===

Consider the problem of solving a system of linear algebraic equations $\tilde{A}\tilde{x} = \tilde{b}$ with a symmetric positive-definite sparse matrix of order $n$.

The ideological prerequisite for the use of hybrid computations in the processing of sparse matrices of arbitrary structure is a preliminary rearrangement of the structure of the original matrix by the method of parallel sections, which brings the matrix to a block-diagonal form with a border [2, 3]:

: $A = P^T \tilde{A} P = \begin{pmatrix} D_{11} & 0 & 0 & \cdots & 0 & C_{1p} \\ 0 & D_{22} & 0 & \cdots & 0 & C_{2p} \\ 0 & 0 & D_{33} & \cdots & 0 & C_{3p} \\ \vdots & & & \ddots & & \vdots \\ 0 & 0 & 0 & \cdots & D_{p-1,p-1} & C_{p-1,p} \\ C_{p1} & C_{p2} & C_{p3} & \cdots & C_{p,p-1} & D_{pp} \end{pmatrix}, \qquad x = P^T \tilde{x}, \quad b = P^T \tilde{b},$

where $P$ is a permutation matrix, the blocks $D_{ii}$, $C_{pi}$, $C_{ip}$, $i = 1, \ldots, p-1$, and $D_{pp}$ retain a sparse structure, and $p \ge 3$ is the number of diagonal blocks in the matrix. It is natural for the method of parallel sections that the order of the last diagonal block is much smaller than the orders of the other diagonal blocks. The structure of the block-diagonal matrix with a border also allows parallel and independent processing of the blocks $D_{ii}$, $C_{pi}$, $C_{ip}$, $i = 1, \ldots, p-1$.

Given the structure of the resulting matrix, it is advisable to use a block distribution for data storage. Moreover, the number of blocks should be chosen taking into account the number of processors and graphics accelerators involved in the calculations. Fig. 1 shows the profile of the fill of the original matrix by nonzero elements; after applying the method of parallel sections, the structure of the matrix takes the form shown in Fig. 2.

Fig. 1. The profile of the original matrix

Fig. 2. The profile of the block-diagonal matrix with a border

As is well known, the most effective direct method for solving such a problem is the Cholesky method. The solution of the system consists of the following subtasks: the triangular decomposition (6) of the matrix of the system and the solution of the two SLAE (7) and (8) with triangular matrices:

: $A = LL^T$, (6)
: $Ly = b$, (7)
: $L^T x = y$. (8)

Consider a small-tile hybrid factorization algorithm for the sparse block-diagonal matrix with a border $A$. This algorithm better takes into account the profile or sparse structure of the diagonal blocks and the border blocks. Divide the matrix $A$ into tiles of dimension $c \times c$. To factorize the block-diagonal matrix, we then apply the algorithm proposed in [4] for dense matrices. At the $k$-th step of the factorization we use the relation

: $A_k = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix} \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix},$

where the dimensions of the blocks are $A_{11}$: $c \times c$, $A_{21}$: $(n-kc) \times c$, $A_{22}$: $(n-kc) \times (n-kc)$, and the blocks take into account the structure of $D_{ii}$, $C_{pi}$, $D_{pp}$. From here we obtain the algorithm by which the decomposition is carried out at the $k$-th step:

: $A_{11} = L_{11} L_{11}^T$; (9)
: $L_{21} = A_{21} (L_{11}^T)^{-1}$; (10)
: $\tilde{A}_{22} = A_{22} - L_{21} L_{21}^T$. (11)

Note that the implementation of (9)–(11) at each step modifies only the blocks $D_{ii}$, $C_{pi}$, $i = 1, \ldots, p-1$, and $D_{pp}$. To implement the algorithm we use the following data distribution: GPU($i$), $i = 1, \ldots, p-1$, holds the blocks $D_{ii}$, $C_{pi}$ and a block $A_{pp}^{(i)}$ of the same size as $D_{pp}$; GPU($p$) stores the block $D_{pp}$. The parallelism of the triangular factorization is that the factorization of the blocks $A_{11}$ and the modification of $A_{21}$ and $A_{22}$ can be done independently in each pair CPU($i$) and GPU($i$), $i = 1, \ldots, p-1$.
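To make this structure concrete, here is a minimal block-level NumPy/SciPy sketch of the $LL^T$ factorization of a block-diagonal matrix with a border: each diagonal block is factorized independently (the work the paper maps onto separate CPU($i$)/GPU($i$) pairs), the border blocks are updated as in (10), and the contributions $A_{pp}^{(i)} = L_{pi}L_{pi}^T$ are subtracted from the border block before it is factorized, as in (11). The sketch works with one dense tile per block; the paper's algorithm additionally subdivides every block into $c \times c$ tiles and keeps the blocks sparse. The function name and the random test matrix are ours.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, block_diag

def factor_bordered(D, C, Dpp):
    """LL^T factorization of A = [[diag(D_11..D_{p-1,p-1}), C^T], [C, D_pp]].

    D[i] are the diagonal blocks D_ii, C[i] the border blocks C_pi.  Each
    diagonal block is processed independently of the others, which is what
    allows the per-block work to be distributed over CPU/GPU pairs."""
    L, Lb, S = [], [], Dpp.astype(float).copy()
    for Dii, Cpi in zip(D, C):
        Lii = cholesky(Dii, lower=True)                    # (9)  D_ii = L_ii L_ii^T
        Lpi = solve_triangular(Lii, Cpi.T, lower=True).T   # (10) L_pi = C_pi L_ii^{-T}
        S -= Lpi @ Lpi.T                                   # (11) subtract A_pp^(i)
        L.append(Lii)
        Lb.append(Lpi)
    Lpp = cholesky(S, lower=True)   # factorize the modified border block D*_pp
    return L, Lb, Lpp

# quick check on a random matrix with p - 1 = 3 diagonal blocks
rng = np.random.default_rng(0)
q, s = 6, 2                             # diagonal-block order, border order
D = [np.eye(q) * 10 + 0.1 * rng.standard_normal((q, q)) for _ in range(3)]
D = [(Dii + Dii.T) / 2 for Dii in D]    # symmetrize; diagonal dominance keeps A SPD
C = [rng.standard_normal((s, q)) for _ in range(3)]
Dpp = np.eye(s) * 10

L, Lb, Lpp = factor_bordered(D, C, Dpp)
A = np.block([[block_diag(*D), np.hstack(C).T], [np.hstack(C), Dpp]])
Lfull = np.block([[block_diag(*L), np.zeros((3 * q, s))], [np.hstack(Lb), Lpp]])
assert np.allclose(Lfull @ Lfull.T, A)  # verifies A = L L^T
```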
Fig. 3 shows the block distribution of the data at the $k$-th step of the factorization of the block-diagonal matrix with a border, taking into account the decomposition proposed above.

Fig. 3. Data decomposition on GPU($i$)

At each step, all pairs CPU($i$) and GPU($i$), $i = 1, \ldots, p-1$, execute the following:
* in CPU($i$) we factorize $A_{11}$ with $D_{ii}$: $A_{11} = L_{11}L_{11}^T$;
* in GPU($i$) we modify the column of blocks $L_{21}$: $L_{21} = A_{21}(L_{11}^T)^{-1}$;
* in GPU($i$) we modify the blocks of the matrix $A_{22}$ together with $A_{pp}^{(i)}$ by the formula $\tilde{A}_{22} = A_{22} - L_{21}L_{21}^T$.

In CPU($p$), using a multi-assembly (reduction) operation, we modify $D_{pp}$:

: $D_{pp}^{*} = D_{pp} - \sum_{i=1}^{p-1} A_{pp}^{(i)}.$

We then factorize the block $D_{pp}^{*}$, thereby completing the factorization of the matrix.

We will assume that the orders of all diagonal blocks are approximately equal:

: $q_i \approx q = \frac{n - s}{p - 1},$

where $s$ is the order of the last diagonal block. Then the following theorems are valid.

'''Theorem 1.''' The number of operations performed on one GPU to find the factorization of a sparse block-diagonal matrix with a border is estimated by the value

: $N_p \approx \frac{q^3}{3} + sq^2 = \frac{q^2}{3}(q + 3s).$

'''Proof.''' We introduce the notation $l = q/c$ for the number of rows of tiles in a diagonal block. Since (9)–(11) are performed in parallel and independently in all $p-1$ pairs of CPU and GPU, and the maximum number of operations falls on step (11), the number of operations performed on one GPU is determined by the complexity of step (11). It can be estimated by the value

: $N_p = 2c\sum_{k=1}^{l}\frac{(q-kc)^2}{2} + 2c\sum_{k=1}^{l}s(q-kc) + \frac{s^2 l}{2}.$ (12)

Let us write out the first term of (12):

: $2c\sum_{k=1}^{l}\frac{(q-kc)^2}{2} = c\sum_{k=1}^{l}(q-kc)^2 = c\left(\sum_{k=1}^{l}q^2 - \sum_{k=1}^{l}2qkc + \sum_{k=1}^{l}k^2c^2\right).$ (13)

Here $\sum_{k=1}^{l}q^2 = lq^2$, $\sum_{k=1}^{l}2qkc = 2qc\sum_{k=1}^{l}k \approx qcl^2$, $\sum_{k=1}^{l}k^2c^2 = c^2\sum_{k=1}^{l}k^2 \approx \frac{c^2l^3}{3}$. Substituting $l = q/c$ into these expressions, we obtain $c\sum_{k=1}^{l}q^2 = q^3$, $c\sum_{k=1}^{l}2qkc \approx q^3$, $c\sum_{k=1}^{l}k^2c^2 \approx \frac{q^3}{3}$. From here

: $2c\sum_{k=1}^{l}\frac{(q-kc)^2}{2} \approx q^3 - q^3 + \frac{q^3}{3} = \frac{q^3}{3}.$ (14)

Let us write out the second term of (12):

: $2c\sum_{k=1}^{l}s(q-kc) = 2cs\left(\sum_{k=1}^{l}q - \sum_{k=1}^{l}kc\right) \approx 2cs\left(ql - \frac{cl^2}{2}\right) = 2sq^2 - sq^2 = sq^2.$ (15)

The last term of (12) can be neglected, because the order of the last diagonal block is small. Substituting (14) and (15) into (12), we obtain

: $N_p \approx \frac{q^3}{3} + sq^2.$

The theorem is proved.
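As a quick sanity check of this estimate, the short script below compares the operation count (12) with the closed form of Theorem 1 for sample values of the block order $q$, tile size $c$, and border order $s$; the parameter values are arbitrary illustrations, not taken from the paper.

```python
# Comparing the operation count (12) with the estimate of Theorem 1.
q, c, s = 4096, 128, 256          # block order, tile size, border order
l = q // c                        # number of rows of tiles in a diagonal block

exact = (2 * c * sum((q - k * c) ** 2 / 2 + s * (q - k * c)
                     for k in range(1, l + 1))
         + s ** 2 * l / 2)        # the full sum (12)
approx = q ** 3 / 3 + s * q ** 2  # N_p from Theorem 1

print(exact, approx, abs(exact - approx) / exact)
# the relative deviation (a few percent here) comes from the
# lower-order terms dropped in the proof
```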
'''Theorem 2.''' The speedup of the small-tile hybrid algorithm of the $LL^T$ decomposition of a sparse block-diagonal matrix with a border $A$ is

: $S_p = \frac{p-1}{1 + \dfrac{3(p-1)s^2}{2q^2(q+3s)}\left(\tau_{opp} + \left(\dfrac{2p}{p-1} + \dfrac{4qc}{s^2} + \dfrac{4c}{(p-1)s}\right)\tau_{opg}\right)},$

where $c$ is the size of the tile.

'''Proof.''' To prove the theorem, we use the methodology for estimating the efficiency and speedup of hybrid algorithms given in [5]. We first find the amount of data transferred between processors during the execution of the algorithm. Since the ring communication topology is used and the processors exchange $s^2/2$ machine words each, the total amount of data exchanged between the processors is $\frac{(p-1)s^2}{2}$.

Before the execution of the multi-assembly operation, all CPU and GPU pairs except the last one must exchange data of volume $s^2$; a similar exchange is also executed in the last pair of CPU and GPU. Also, in the process of the factorization of the diagonal blocks $D_{ii}$, all pairs of CPU and GPU except the last one exchange data of volume $2qc$, and the last pair exchanges data of volume $2sc$. The total amount of data exchanged by all CPU and GPU pairs is therefore

: $2qc(p-1) + ps^2 + 2sc.$

We use the value of $N_p$ from Theorem 1 when calculating $T_1$ and $T_p$:

: $T_1 = (p-1)N_p t_g^{no}, \qquad T_p = N_p t_g^{no} + \frac{(p-1)s^2}{2}t_{opp} + \left(2qc(p-1) + ps^2 + 2sc\right)t_{opg}.$

Substituting all the values found into the formula for calculating the speedup coefficient $S_p$, we obtain

: $S_p = \frac{(p-1)N_p t_g^{no}}{N_p t_g^{no} + \frac{(p-1)s^2}{2}t_{opp} + \left(2qc(p-1) + ps^2 + 2sc\right)t_{opg}}.$

Dividing the numerator and the denominator by $N_p t_g^{no}$ and denoting $\tau_{opp} = t_{opp}/t_g^{no}$, $\tau_{opg} = t_{opg}/t_g^{no}$, we obtain

: $S_p = \frac{p-1}{1 + \frac{(p-1)s^2}{2N_p}\tau_{opp} + \frac{2qc(p-1) + ps^2 + 2sc}{N_p}\tau_{opg}}.$

Taking $\frac{(p-1)s^2}{2}$ out of the brackets and substituting $N_p = \frac{q^2}{3}(q+3s)$, we obtain

: $S_p = \frac{p-1}{1 + \dfrac{3(p-1)s^2}{2q^2(q+3s)}\left(\tau_{opp} + \left(\dfrac{2p}{p-1} + \dfrac{4qc}{s^2} + \dfrac{4c}{(p-1)s}\right)\tau_{opg}\right)}.$

The theorem is proved.
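To make the estimate of Theorem 2 concrete, the snippet below evaluates the speedup formula for illustrative parameter values. Here $\tau_{opp}$ and $\tau_{opg}$ are the exchange-to-arithmetic time ratios $t_{opp}/t_g^{no}$ and $t_{opg}/t_g^{no}$; the numerical values are assumptions for illustration only, not measurements from the paper.

```python
# Evaluating the speedup estimate of Theorem 2.
def speedup(p, q, s, c, tau_opp, tau_opg):
    """S_p for p-1 CPU/GPU pairs, block order q, border order s, tile size c."""
    overhead = (3 * (p - 1) * s ** 2) / (2 * q ** 2 * (q + 3 * s)) * (
        tau_opp
        + (2 * p / (p - 1) + 4 * q * c / s ** 2 + 4 * c / ((p - 1) * s)) * tau_opg)
    return (p - 1) / (1 + overhead)

# illustrative values: communication 10-50x slower than one GPU operation
print(speedup(p=9, q=4096, s=256, c=128, tau_opp=50.0, tau_opg=10.0))
# ~7.97, i.e. close to the ideal p - 1 = 8: the communication overhead
# is small because it grows much more slowly than the q^3/3 arithmetic
```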
===Numerical experiments===

According to the above algorithms, programs were developed and the following numerical experiments were performed. The computational experiments on solving SNE were carried out on nodes of the SKIT supercomputer [6] with the following characteristics:
# 2 Intel Xeon E5-2600 CPUs with a frequency of 2.6 GHz;
# integration with the shared 200 TB data storage of the cluster complex;
# an Infiniband FDR 56 Gbps data network between the nodes;
# 128 GB of RAM;
# 3 NVidia Tesla M2075 accelerators.

With the initial approximation $x^0 = 0.5$ in the region $D = \{-1000 \le x_i \le 1000,\ i = 0, 1, 2, \ldots, n-1\}$, the following system of nonlinear equations with a block Jacobi matrix was solved by Newton's method:

– the first block contains the following $m$ equations:
: $4x_1^2 + x_2^2 + x_{m+1}^2 + x_{m+2} - 3 = 0,$
: $2x_{j-1}x_j + 5x_j^2 + x_{j+1}^2 + x_{m+j-1} + x_{m+j}^2 + x_{m+j+1} - 3 = 0, \quad j = 2, \ldots, m-1,$
: $2x_{m-1}x_m + 5x_m^2 + x_{2m-1} + x_{2m}^2 - 3 = 0;$

– the next $N-2$ blocks ($K = 2, \ldots, N-1$) contain $m$ equations each:
: $2x_{Km-2m+1}x_{Km-m+1} + x_{Km-2m+2} + 5x_{Km-m+1}^2 + x_{Km-m+2}^2 + x_{Km+1}^2 + x_{Km+2} - 3 = 0,$
: $x_{Km-2m+j-1} + 2x_{Km-2m+j}x_{Km-m+j} + x_{Km-2m+j+1} + 2x_{Km-m+j-1}x_{Km-m+j} + 6x_{Km-m+j}^2 + x_{Km-m+j+1}^2 + x_{Km+j-1} + x_{Km+j}^2 + x_{Km+j+1} - 4 = 0, \quad j = 2, \ldots, m-1,$
: $x_{Km-m-1}^2 + 2x_{Km-m}x_{Km} + 2x_{Km-1}x_{Km} + 6x_{Km}^2 + x_{Km+m-1} + x_{Km+m}^2 - 3 = 0;$

– the last block contains $m$ equations:
: $2x_{(N-2)m+1}x_{(N-1)m+1} + x_{(N-2)m+2} + 5x_{(N-1)m+1}^2 + x_{(N-1)m+2}^2 - 3 = 0,$
: $x_{(N-2)m+j-1} + 2x_{(N-2)m+j}x_{(N-1)m+j} + x_{(N-2)m+j+1} + 2x_{(N-1)m+j-1}x_{(N-1)m+j} + 6x_{(N-1)m+j}^2 + x_{(N-1)m+j+1}^2 - 3 = 0, \quad j = 2, \ldots, m-1,$
: $x_{(N-1)m-1}^2 + 2x_{(N-1)m}x_{Nm} + 2x_{Nm-1}x_{Nm} + 6x_{Nm}^2 - 3 = 0,$

where $m$ is the number of rows in a block and $N$ is the number of blocks; then $N = (n-1)/m + 1$, or $n = Nm$.

For the software implementation of the algorithms on computers of hybrid architecture, it is advisable to use the functions of optimized software libraries, in particular Intel MKL [7] and CUBLAS [8]. The following library functions are used to implement the main computational stages of the algorithms:
* dpotrf – computes the $LL^T$ decomposition of a dense matrix; executed on the CPU;
* cublasDsyrk – performs a symmetric rank-$k$ update of a dense matrix; executed on the GPU;
* cublasDgemm – multiplies two dense matrices; executed on the GPU;
* cublasDtrsm – solves a triangular system with multiple right-hand sides; executed on the GPU.

The algorithm uses functions from the MPI [9] and CUDA [10] libraries to perform communications:
* MPI_Reduce – a global reduction function that stores the result on the specified processor;
* cudaMemcpyAsync – performs asynchronous data copying between the CPU and the GPU. Running the function in multiple cudaStream streams reduces the execution time of the program, because copying is performed in the background of the calculations and does not stall them; calculations on the GPU are also performed simultaneously in different cudaStream streams.

Here are some of the results. Table 1 shows the execution time (in seconds) of one iteration of Newton's method using the parallel variant (8 CPUs) of the small-tile factorization algorithm of the block-diagonal matrix with a border; the tile size is 128.

Table 1. Execution time (sec.) of one iteration of Newton's method

{| class="wikitable"
! Number of threads !! Matrix order 15000 !! Matrix order 20000 !! Matrix order 25000
|-
| 1 || 0.184481 || 0.2459 || 0.3074
|-
| 2 || 0.16771 || 0.2236 || 0.2795
|-
| 3 || 0.0819 || 0.1108 || 0.1336
|-
| 4 || 0.0631 || 0.0842 || 0.0988
|-
| 5 || 0.0519 || 0.0630 || 0.0772
|-
| 6 || 0.0445 || 0.0539 || 0.0637
|-
| 7 || 0.0393 || 0.0474 || 0.0540
|-
| 8 || 0.0354 || 0.0427 || 0.0468
|}

From the table we can conclude that increasing the order of the system being solved and transforming the matrix for different numbers of processors contribute to a uniform computational load on the processors. Another factor that contributes to the reduction of the calculation time is the adjustment of the tile size: with the right choice of tile size, a caching effect can be achieved, which can manifest itself as superlinear speedup for certain parameter configurations.

Fig. 4 shows the dependence of the execution time of the hybrid variant (8 CPU + 1 GPU) of the small-tile algorithm on the number of threads and the size of the tile, for a matrix of order 10050.

Fig. 4. GPU computation time depending on tile size and number of threads

===Conclusions===

A new method for solving high-order systems of nonlinear equations with block Jacobi matrices is considered. Its ideological basis is the combination of the classical algorithm of Newton's method with an efficient small-tile algorithm for solving systems of linear equations with sparse matrices. The times of solving SNE of different orders on the nodes of the SKIT supercomputer are given; a significant reduction in the solution time of systems of nonlinear equations is obtained. The algorithm allows one to adjust the dimension of the block with which the calculations are performed at each step, which can produce a caching effect when the blocks fit completely in the fast memory of the GPU. The block structure also makes it possible to work with contiguous data sets on the GPU, which reduces the number of index operations and checks that are quite expensive on a graphics accelerator.

===References===

# Nesterenko, A.N., Khimich, A.N., Yakovlev, M.F. To the problem of solving of non-linear systems on multi-processor distributed memory computing system. Gerald of Computer and Information Technologies. 2006. No. 10. pp. 54–56.
# George, A., Liu, J. Numerical Solution of Large Sparse Systems of Equations. Moscow: Mir, 1984. 334 p. (in Russian).
# Pissanetzky, S. Sparse Matrix Technology. Moscow: Mir, 1988. 410 p. (in Russian).
# Buttari, A., Langou, J., Kurzak, J., Dongarra, J. A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures. Parallel Computing. 2009. Vol. 35, Issue 1. pp. 38–53. ISSN 0167-8191.
# Popov, O.V. A study of the efficiency of parallel algorithms for computers of hybrid architecture. Proceedings of the International Scientific School-Seminar "Optimization of Computations (POC XLII)" (Zakarpattia region, Mukachevo district, Chynadiyevo, September 21–25, 2015). Kyiv, 2015. p. 16. (in Ukrainian).
# SKIT supercomputer. Available at: http://icybcluster.org.ua
# Intel® Math Kernel Library (Intel® MKL). Available at: https://software.intel.com/en-us/intel-mkl
# CUDA CUBLAS Library. Santa Clara: Nvidia, 2010. 254 p.
# Gropp, W., Lusk, E., Thakur, R. Using MPI-2: Advanced Features of the Message-Passing Interface. Cambridge: MIT Press, 1999. 382 p.
# Boreskov, A.V., Kharlamov, A.A. Fundamentals of Working with CUDA Technology. Moscow: DMK Press, 2010. 232 p. (in Russian).
===About the authors===

A.N. Khimich, Head's assistant, number of scientific publications in Ukrainian publications – 110, https://orcid.org/0000-0001-9284-139X.

V.A. Sydoruk, Senior Research Assistant, number of scientific publications in Ukrainian publications – 30, http://orcid.org/0000-0003-0210-6020.

A.N. Nesterenko, Research Scientist, number of scientific publications in Ukrainian publications – 32, https://orcid.org/0000-0001-6174-1812.

Place of work of the authors: V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, 40 Acad. Glushkov Av., 03187, Kyiv; tel. (066)-367-85-40; e-mail: alla.nest1958@gmail.com.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).