Introduction

HYBRID ALGORITHM NEWTON METHOD FOR SOLVING SYSTEMS OF NONLINEAR EQUATIONS WITH BLOCK JACOBI MATRIX

0 V.M. Glushkov Institute of Cybernetics of NAS of Ukraine , 40, Acad. Glushkov Av., 03187, Kyiv tel. (066)-367-85-40 1 V.M. Glushkov Institute of Cybernetics of National Academy of Sciences of Ukraine , Kyiv , Ukraine

2020

0000 0001

Systems of nonlinear equations often arise when modelling processes of different nature. These can be both independent problems describing physical processes and also problems arising at the intermediate stage of solving more complex mathematical problems. Usually, these are high-order tasks with a big count of un-knows, that better take into account the local features of processes or things that are modelled. Also, more accurate discrete models allow for more accurate solutions. Usually, the matrices of such problems have a sparse structure. Often the structure of sparse matrices is one of next: band, profile, blockdiagonal with bordering, etc. In many cases, the matrices of the discrete problems are symmetric and positively defined or half-defined. The solution of systems of nonlinear equations is performed mainly by iterative methods based on the Newton method, which has a high convergence rate (quadratic) near the solution, provided that the initial approximation lies in the area of gravity of the solution. In this case, the method requires, at each iteration, to calculates the Jacobi matrix and to further solving systems of linear algebraic equations. As a consequence, the complexity of one iteration is O(n3 ) . Using parallel computations in the step of the solving of systems of linear algebraic equations greatly accelerates the process of finding the solution of systems of nonlinear equations. In the paper, a new method for solving systems of nonlinear high-order equations with the Jacobi block matrix is proposed. The basis of the new method is to combine the classical algorithm of the Newton method with an efficient small-tile algorithm for solving systems of linear equations with sparse matrices. The times of solving the systems of nonlinear equations of different orders on the nodes of the SKIT supercomputer are given.

systems of nonlinear equations hybrid algorithm sparse matrices systems of linear algebraic equations high-performance computing

Introduction

Systems of nonlinear equations (SNE) often arise when modelling processes of different nature. These can be both stand-alone problems that describe physical processes and problems that arise in the intermediate stage of solving more complex mathematical problems. As a rule, these are problems of ultrahigh orders that allow better consideration to take into account the local characteristics of the process or phenomenon (occurrence) being modelled. Also, more accurate discrete models allow for more accurate approximations.

On the other hand, the matrices of such problems have a sparse structure. Most often it is band, profile, blockdiagonal with bordering, etc. In many cases, the matrices of discrete problems are symmetric and positively defined or half -definite.

The solution SNE is carried out (Systems of nonlinear equations are solved) mainly by iterative methods based on Newton's method, which has a high convergence rate (quadratic) near the solution, provided that the initial approximation lies in the region of convergence of the solution. In this case, the method requires at each iteration the computation of the Jacobi matrix and further solution of systems of linear algebraic equations (SLAE). As a consequence, the complexity of one iteration is O(n3 ) .

Parallelizing the calculation of the SLAE solution greatly accelerates the process of finding the SNE solution.

Statement of the problem with approximate initial data Let a system of n nonlinear equations be given

Problem ( 1 ) is some approximation to the exact system of nonlinear equations (x)=0, and for these vector functions the inequality holds:

f u    u     fi  If A   

 x j i, j1 for any n-dimensional vector u.

To solve problem ( 1 ) initial approximation x0 and required accuracy to get an approximation to the solution of the system  are given and the area in which the solution is sought is determined D  ai  xi  bi , i  1, 2,, n. Herewith the initial approximation belongs to a given area x 0  D . The lower index in the formulas denote the component numbers of the vectors, and the upper index will denote the iteration numbers.

is the Jacobi matrix of system ( 1 ) (or some approximation to it), then the iterative process of Newton's method of finding the solution at a given initial approximation is written in the form where wk   xk 1  xk  correction, k = 0, 1, ... – is the iteration number. The next approximation to the solution is calculated by the formula:

As can be seen from formula ( 2 ) it is necessary to solve a system of linear algebraic equations, calculating the values of the vector function and the Jacobi matrix, at each iteration.

To obtain a solution of the system of nonlinear equations ( 1 ) with a given accuracy xk   x   , the iterative process must be completed if the condition

Ak wk    f x k   xk 1  xk   wk  f x (k 1)  

 A(k 1) 1 , where A1(k ) is a matrix inverse to the Jacobi matrix calculated on the k-th iteration, and the relationship will be iterations f x k     , and after its execution, we check conditions ( 4 ). denoted as  [1]. Since to check condition ( 4 ) at each iteration it is necessary to rotate the Jacobi matrix, which requires a significant number of arithmetic operations, then on the first iterations check the more economical condition of the end of

After completing the iterative process, the error of the obtained approximation to the solution of the problem with approximate data relative to the exact solution of the system with accurate data is calculated:  A1k  ( 2 ) ( 3 ) ( 4 ) xk  x    Ak1  , where x is the exact solution of the exact SNE.

It follows that when solving systems of nonlinear equations, most of the arithmetic operations are performed when calculating the values of the vector function, solving the corresponding SLAE and calculating the inverse matrix to find the decoupling estimate. In the general case, the complexity of Newton's iterative method is O(n 3 ) . To reduce the execution time of one iteration for solving SLAE, we will use a small-tile hybrid factorization algorithm, where  f x (k )  is the right part of SLAE.

Small-tile hybrid factorization algorithm of the sparse block-diagonal matrix with a border

Consider the problem of solving a system of linear algebraic equations

~ A~~x  b with the symmetric positive-definite sparse matrix of order n.

The ideological prerequisite for the use of hybrid calculations in the processing of sparse matrices of arbitrary structure is the preliminary rearrangement of the structure of the original matrix by the method of parallel sections, which leads the matrix to a block-diagonal view with a border [2, 3]

D22 D33     0 0 D p1 p1 C pp1 C 2 p  , C3 p   

 C p1 p 



D pp  x  PT ~x, b  PT b~ , where P – permutation matrix, and blocks Dii, Cpi, Cip, i  1, p  1 , Dpp retain a sparse structure, р – the number of diagonal blocks in the matrix, р ≥ 3.

It is natural for the method of parallel sections that the size of the last diagonal block is much smaller than the orders of the other diagonal blocks. Also, the structure of the block-diagonal matrix with a border allows to carry out parallel and independent processing of blocks. Dii, Cpi, Cip, i  1, p  1 .

Given the structure of the resulting matrix for data storage, it is advisable to use a block distribution. Moreover, the number of blocks should be chosen taking into account the number of processors and graphics accelerators that are involved in the calculations.

In Fig. 1 shows the profile of matrix fill by nonzero elements. As a result of the application of the method of parallel sections, the structure of the matrix takes the form shown in Fig. 2.

Consider a small-tile hybrid factorization algorithm of a sparse block-diagonal matrix with a border A. This algorithm better takes into account the profile or sparse structure of diagonal blocks, blocks with a border.

Divide matrix A into blocks of dimension с×с.

Further, to factorization a block-diagonal matrix, we apply the algorithm proposed in [4] for dense matrices.

To factorization the matrix during the k-th step, we use the following relations

( 6 ) ( 7 ) ( 8 ) where the dimensions of the blocks A11 – с×с,

A21 – (n-kс)с, A22 – (n-kс)(n-kс), blocks and take into account the

structure of the blocks Dii, Cpi, Dpp.

From here we get the algorithm by which the decomposition is carried out during the k step

Note that the implementation ( 9 ) - ( 10 ) at each step modifies only the blocks Dii, Cpi, i  1, p  1 , Dpp.

To implement the algorithm we will use the following data distribution: in GPU(i) i  1, p 1 are containing blocks Dii, Сpi and block A(pip) the same size as Dрр; in GPU(p) the Dpp block is stored.

In fig. 3 show the block distribution of data at the k-th step of factorization of the block-diagonal matrix with a border, taking into account the above-proposed decomposition.

The parallelization of triangular factorization calculations is that the factorization of blocks A11 and the modification of A21 and A22 can be done independently in each CPU(i) and GPU(і) i  1, p  1 .

A11  L11 * LT11 ;

 T 1 L21  A21 * L11

; ~ A22  A22  L21 * LT21.  in GPU(i), i  1, p  1 we modify the column of blocks L21 :  in GPU(i), i  1, p  1 we modify the blocks of the matrix A22 with A(pip) by the formula:  In the CPU(p), using multi-assembly, we modify D pp :

A11  L11 * LT11 ; L21  A21 LT11 1 ; ~ A22  A22  L21LT21 .

p1 D pp  D pp   A(pip) .

* i1

( 9 ) ( 10 ) (11) We factorize the block D*pp , thereby completing the process of factorization of the matrix.

We will assume that the orders of all diagonal blocks are approximately equal to where s is the order of the last diagonal block. Then the following theorems are valid. Theorem 1. The number of operations, which performed on one GPU to find the factorization of a sparse blockdiagonal matrix with a border is estimated by the value

qi  q  n  s p  1

, N p  q Proof. We introduce the following notation l  the number of rows of tiles in the diagonal block.

Since ( 9 ) - (11) are performed in parallel and independently in all p-1 pairs of CPU and GPU and the maximum number of operations falls on step (11), the estimate of the number of operations performed on one GPU is determined by the complexity of step (11).

The number of operations required to perform (11) can be estimated by the value Let's write the first term from (12) Here We substitute the value of l in the formula and obtain From here Let's write the second term from (12).

 l q  k 2 N p  2c   k1 2 k1 l   q  kcs  l s2  2  .

l q  k 2 2c k1 2

 l   l l l  = c  q  k 2  = c  q 2   2qkc   k 2c 2  .

  k1   k1 k1 k1  l l l l l  q 2  lq 2 ,  2qkc  2qc k  qcl 2 ,  k 2c 2  c 2  k 2  k1 k1 k1 k1 k1 c 2l 3 3 . k1 l q 2  q3 c

l ,  2qkc  k1 q3 , l k 2c 2  q3 c 3c k1

l q  k 2 2c k1 2  q3c  q3 3c 3 .

l  l l   2c sq  kc  2cs  q   kc   2cs ql  k 1  k 1 k 1   l 2c   q 2

  2cs c  2  q 2 c   q 2    2cs 2c 

  sq 2 . 2c 2  (15)

The last term in (12) can be neglected because the order of the last diagonal block is small. Substituting (14) and (15) in (13) we obtain

N p  q3  sq 2  q 2 q  3s . 3 3

Theorem proved.

Theorem 2. Acceleration (speed up) of small-tile hybrid algorithm LLT - decomposition of a sparse blockdiagonal matrix with a border A, is

 S p   p  11    3( p  1)s 2   2 p 2q 2 q  3s  opp    p  1  4sq2c   p4c1s  opg  , 1 where c is the size of the tile (13) (12) (14

Proof. To prove the theorem, we will use the methodology for estimating the efficiency and acceleration of hybrid algorithms, given in [5].

We will find the amount of data that are transferred between processors during the execution of the algorithm. Because the ring communication topology is used and the processors exchange s 2 machine words, the total amount of  р 1s2

Before execution of a multi-collection operation in all CPU and GPU pairs except the last one must exchange data of volume s 2 . A similar exchange also executed in the last pair of CPU and GPU.

Also in the process of factorization of diagonal blocks Dii all pairs of CPU and GPU except the last one exchange data of volume 2qc. The last pair of CPU and GPU exchanges data of volume 2sc.

The total amount of data exchanged by all CPU and GPU pairs is equal to 2qc p 1  ps 2  2sc .

We will use the value of N p from Theorem 1 when calculating T1 and T р .

T1   p  1N p no t g ,

Tp  tg  topp  2qc p 1  ps2  2sctopg .

Substituting all the values found in the formula for calculating the acceleration factor Sp, we obtain .

topp  2qc p 1  ps 2  2sctopg

S p   p 1s 2

2 S p  1 1 1   p 1s 2 N p  2

  opp  2qc p 1  ps 2  2sc opg 

  p 1 3( p 1)s 2   2 p 2q 2 q  3s  opp    p 1  4sq2c   p4c1s  opg 

Dividing the numerator and denominator by We take out for brackets Theorem proved. According to the above algorithms, programs were developed and the following numerical experiments were performed. Numerical experiments

Computational experiments to solving the SNE were performed on the nodes of the supercomputer SKIT [6] with the following characteristics:

1. 2 CPUs Intel Xeon E5-2600 with a frequency of 2.6 GHz; 2. integrated with the general data storage of a cluster complex of 200 TB; 3. data network between Infiniband FDR 56 Gbps;

4. 128 GB of RAM;

5. 3 accelerators NVidia Tesla M2075.

With a given initial approximation x0 =0,5 in the region D   1000  xi  1000, i  0,1,2,, n  1 the following system of nonlinear equations with the Jacobi block matrix was solved by Newton's method: - the first block contains the following m equations:

4x12  x22  xm21  xm2  3  0 ,  2x j 1x j  5x 2j  x 2j1  xm j 1  xm2 j  xm j 1  3  0 , - following N-2 (K=2,…, N-1) blocks each contain m equations:  2xKm2m j xKmm j  xKm2m2  5xK2mm j  xK2mm j 1  xK2m j  xKm j 1  3  0 , j = 1, xKm2m j1  2xKm2m j xKmm j  xKm2m j1   2xKm2m j xKmm j  6xK2mm j  xK2mm j 1  xKm j 1  xK2m j  xKm j 1  4  0 , j = 2, , m-1 xKmm1  2xKmm xKm  2xKm1xKm  6xK2m  xKmm1  xK2mm  3  0 - last m equations:  2x(N 2)m1x(N 1)m1  x(N 2)m2  5x(2N 1)m1  x(2N 1)m2  3  0 ,

x(N 2)m j 1  2x(N 2)m j x(N 1)m j  x(N 2)m j1   2x(N 1)m j x(N 1)m j  6x(2N 1)m j  x(2N 1)m j1  3  0 ,j = 2, , m-1

x(N 1)m1  2x(N 1)m xNm  2xNm1xNm  6xN2m  3  0 where m is the number of rows in the block, N is the number of blocks. Then N  (n 1) / m 1 or n=Nm.

When software implementation of algorithms on computers of hybrid architecture it is advisable to use the functions of optimized software libraries. In particular, such libraries include Intel MKL [7], CUBLAS [8].

The following library's functions are used to implement the main computational stages of algorithms:  dpotrf – finding LLT decomposition of a dense matrix. The function is performed on the CPU;  cublasDsyrk – performs s-rank modification of a dense matrix. The function is performed on the GPU;  cublasDgemm – finds the multiplication of two dense matrices. The function is performed on the GPU;  cublasDtrsm – solving a triangular system with many right parts.. The function is performed on the GPU.

The algorithm uses functions from the MPI [9] and CUDA libraries [10] to perform communications:  Mpi_reduce – global reduction function with saving the result in the specified processor.  cudaMemcpyAsync – a function that performs asynchronous data copying between the CPU and the GPU.

Running the function in multiple cudaStream streams reduces the execution time of the program because copying is performed against the background of calculations and does not cause a stop in the calculations. Calculations on the GPU are also performed simultaneously in different cudaStream streams.

Here are some of the results.

Table 1 shows the time (sec.) Execution of one iteration of Newton's method using a parallel variant (8 CPU) of the fine-tile factorization algorithm of the block-diagonal matrix with a frame. Tile size is 128.

15000 20000 25000 0.0393 0.0354

From the table, we can conclude that the increase in the order of the system which is solving and the transformation of the matrix for different numbers of processors contribute to a uniform load of processors by calculations. Another factor that contributes to such a reduction in calculation time is the adjustment of the tile size. With the right choice of tile size, the effect of caching calculations can be achieved, which can manifest itself as superacceleration on certain parameter configurations.

In fig. 4. the dependence of the time of execution of calculations of the hybrid variant (8 CPU + 1 GPU) of the fine-tile algorithm on the number of flows and the size of the tile is shown. The order of the matrix is 10050.

Conclusions

A new method for solving systems of large-orders nonlinear equations with Jacobi block matrices is considered. Its ideological basis is a combination of the classical algorithm of Newton's method with an efficient fine-tile algorithm for solving systems of linear equations with sparse matrices. The times of solving SNE of different orders on the nodes of the SKIT supercomputer are given. A significant reduction in the solution time of systems of nonlinear equations is obtained. This algorithm allows you to adjust the dimension of the block with which the calculations are performed at each step of the algorithm. This can have a computational caching effect when the blocks are completely stored in the GPU's fast memory. Also, this block structure allows you to work with inseparable data sets on the GPU, which reduces the number of index operations and checks, which are quite expensive on a graphics accelerator.

References: About the authors:

Place of work of the authors:

1. Nesterenko , A.N. & Khimich , A.N. & Yakovlev , M.F. ( 2006 ) To the problem of solving of non-linear systems on multi-processor distributed memory computing system . Gerald of computer and information technologies. 10 . pp. 54 - 56 .

2. Dzhordzh

. Chyslennoe reshenye bol shykh razrezhennykh system uravnenyy / A . Dzhordzh, Dzh. Lyu. - M.: Myr , 1984 . - 334 s.

3. Pysanetsky

. Tekhnolohyya razrezhennykh matryts / Pysanetsky S. - M.: Myr , 1988 . - 410 s.

Alfredo

Buttari , Julien Langou, Jakub Kurzak, and Jack Dongarra: A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures . Parallel Computing , Volume 35 , Issue

, P. 38- 53 , 2009 , ISSN: 0167 - 8191 .

O.V.

Popov . Doslidzhennya efektyvnosti paralel nykh alhorytmiv dlya komp yuteriv hibrydnoyi arkhitektury. Materialy Mizhnarodnoyi naukovoyi shkoly-seminaru «Pytannya optymizatsiyi obchyslen (POO XLII)», (Zakarpat·s ka obl ., Mukachivs kyy r . -n, smt . Chynadiyevo , 21 - 25 veresnya 2015 ), Kyyiv: 2015 , S. 16 .

6. Rezhym dostupu: http://icybcluster.org.ua

7. Intel® Math Kernel Library (Intel® MKL) - Rezhym dostupu: https://software.intel.com/en-us/intel-mkl

CUDA

CUBLAS_Library - Santa Clara : Nvidia, 2010 . - 254 c.

9. W. Gropp Using MPI-2: Advanced Features of the Message -Passing Interface / W. Gropp, E. Lusk, and

Thakur . - Cambridge: MIT Press, 1999 . - 382 p.

10. Boreskov

A.V.

Osnovy raboty s tekhnolohyey CUDA / Boreskov A .V., Kharlamov

.A. - M.: DMK Press, 2010 . - 232 s.

A.N.

Khimich , Head's assistant, number of scientific publications in Ukrainian publications - 110, https://orcid.org/0000-0001- 9284 -139X.