Parallel Algorithms for Solving Linear Systems with Block-Fivediagonal Matrices on Multi-Core CPU

Elena Akimova⋆ and Dmitry Belousov

1 Krasovskii Institute of Mathematics and Mechanics, Yekaterinburg, Russia
2 Ural Federal University, Yekaterinburg, Russia

Abstract. For solving systems of linear algebraic equations with block-fivediagonal matrices arising in geoelectrics and diffusion problems, the parallel square root method, the conjugate gradient method with preconditioner, the conjugate gradient method with regularization, and the parallel matrix sweep algorithm are proposed, and some of them are implemented numerically on a multi-core Intel CPU. The efficiency and optimization of the parallel algorithms are investigated, and a problem with quasi-model data is solved.

Keywords: parallel algorithms · block-fivediagonal SLAE · direct and iterative numerical methods · multi-core CPU

1 Introduction

Systems of linear algebraic equations (SLAE) with block-fivediagonal matrices arise when solving some mathematical modelling problems, in particular, geoelectrics and diffusion problems. The geoelectrics problems are very important for investigating the heterogeneity of the Earth's crust; such a problem is described by the differential Poisson equation. After a finite-difference approximation, the two- and three-dimensional lateral logging problems are reduced to solving ill-conditioned systems with large-scale block-fivediagonal matrices [1]. Another important problem is the multicomponent diffusion problem, in which it is necessary to know the concentration distribution of the diffusing components at every moment of time. This problem is also reduced to solving a SLAE with a block-fivediagonal matrix after approximating a system of differential equations [2]. In both problems, the matrix has a special structure, which we describe below.

In this paper, for solving SLAE with block-fivediagonal matrices, direct and iterative parallel algorithms are designed: the conjugate gradient method with preconditioner (PCGM), the conjugate gradient method with regularization (PCGR), the square root method (PSRM), and the parallel matrix sweep algorithm (PMSA).

⋆ This work was partially supported by the Program of UrB RAS (project no. 15-7-1-13) and by the Russian Foundation for Basic Research (project no. 15-01-00629 a).

The PCGM, PCGR, and PSRM algorithms are implemented on a multi-core Intel processor; the parallel matrix sweep algorithm will be implemented on a multi-core Intel processor later.

The efficiency and optimization of the parallel algorithms are investigated. The problem with quasi-model data is solved.

2 Parallel Algorithms for Solving SLAE

We consider the system

$$
\begin{cases}
C_0 \bar{Y}_0 - D_0 \bar{Y}_1 + E_0 \bar{Y}_2 = \bar{F}_0, \\
-B_1 \bar{Y}_0 + C_1 \bar{Y}_1 - D_1 \bar{Y}_2 + E_1 \bar{Y}_3 = \bar{F}_1, \\
A_i \bar{Y}_{i-2} - B_i \bar{Y}_{i-1} + C_i \bar{Y}_i - D_i \bar{Y}_{i+1} + E_i \bar{Y}_{i+2} = \bar{F}_i, \quad i = 2, \ldots, N-1, \\
A_N \bar{Y}_{N-2} - B_N \bar{Y}_{N-1} + C_N \bar{Y}_N - D_N \bar{Y}_{N+1} = \bar{F}_N, \\
A_{N+1} \bar{Y}_{N-1} - B_{N+1} \bar{Y}_N + C_{N+1} \bar{Y}_{N+1} = \bar{F}_{N+1},
\end{cases}
\tag{1}
$$

where $\bar{Y}_i$ are unknown $n$-vectors, $\bar{F}_i$ are given $n$-vectors, and $A_i, B_i, C_i, D_i, E_i$ are given $n \times n$ matrices.

As mentioned in the introduction, after a finite-difference approximation, the geoelectrics and diffusion problems can be reduced to solving SLAE with block-fivediagonal matrices with the structure presented in Fig. 1: the central diagonal blocks themselves have three nonzero diagonals. This special case of SLAE (1) arises when solving diffusion problems with higher-order schemes and geoelectrics problems.

Fig. 1. Block-fivediagonal matrix
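To make the block structure of system (1) concrete, the following C sketch applies the block-fivediagonal operator to a block vector, i.e., computes the left-hand side of (1). It is a minimal illustration under assumptions made here, not the authors' code: the blocks are stored as dense row-major n × n arrays indexed by block row, with NULL standing for the blocks absent in the first and last two block rows.

/* Sketch: applying the block-fivediagonal operator of system (1),
 * i.e. computing F = A*Y for the matrix in Fig. 1.
 * Illustrative assumptions (not from the paper): each block is a
 * dense n-by-n row-major array; A, B, C, D, E are indexed by block
 * row i, with NULL for the blocks missing near the boundaries. */

/* y += sign * M * x, where M is a dense n-by-n block */
static void block_gemv(double sign, const double *M, const double *x,
                       double *y, int n) {
    for (int r = 0; r < n; ++r) {
        double s = 0.0;
        for (int c = 0; c < n; ++c)
            s += M[r * n + c] * x[c];
        y[r] += sign * s;
    }
}

/* F_i = A_i Y_{i-2} - B_i Y_{i-1} + C_i Y_i - D_i Y_{i+1} + E_i Y_{i+2} */
void apply_block_fivediagonal(const double **A, const double **B,
                              const double **C, const double **D,
                              const double **E, const double *Y,
                              double *F, int nblocks, int n) {
    #pragma omp parallel for /* block rows are independent */
    for (int i = 0; i < nblocks; ++i) {
        double *Fi = &F[i * n];
        for (int r = 0; r < n; ++r) Fi[r] = 0.0;
        if (A[i]) block_gemv(+1.0, A[i], &Y[(i - 2) * n], Fi, n);
        if (B[i]) block_gemv(-1.0, B[i], &Y[(i - 1) * n], Fi, n);
        block_gemv(+1.0, C[i], &Y[i * n], Fi, n);
        if (D[i]) block_gemv(-1.0, D[i], &Y[(i + 1) * n], Fi, n);
        if (E[i]) block_gemv(+1.0, E[i], &Y[(i + 2) * n], Fi, n);
    }
}

This matrix-vector product is the main kernel of the iterative methods considered below.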
For solving SLAE with block-fivediagonal matrices, parallel algorithms based on the conjugate gradient method with preconditioner (PCGM), the conjugate gradient method with regularization (PCGR), the parallel square root method (PSRM), and the parallel matrix sweep algorithm (PMSA) are proposed.

2.1 Parallel Conjugate Gradient Method with Regularization

One of the fast iterative algorithms for solving a SLAE with a symmetric positive definite matrix is the conjugate gradient method (CGM) [3]. We write system (1) in the form

$$ Ax = b. \tag{2} $$

When solving geoelectrics problems with non-uniformly scaled coefficients, the system is ill-conditioned, and we use regularization to suppress the effect of rounding errors. The regularized system has the form

$$ \tilde{A}x = b, \qquad \tilde{A} = A + \alpha E, \tag{3} $$

where α is the parameter of regularization. The CGM for system (3) has the following form:

$$
\begin{aligned}
& r^0 = b - \tilde{A}x^0, \qquad p^0 = r^0; \\
& x^{k+1} = x^k + \alpha_k p^k, \qquad \alpha_k = \frac{\|r^k\|^2}{(\tilde{A}p^k, p^k)}, \qquad r^{k+1} = r^k - \alpha_k \tilde{A}p^k, \\
& p^{k+1} = r^{k+1} + \beta_k p^k, \qquad \beta_k = \frac{\|r^{k+1}\|^2}{\|r^k\|^2}.
\end{aligned}
\tag{4}
$$

The condition for stopping the CGM iterative process is $\|\tilde{A}x^k - b\| / \|b\| < \varepsilon$.

2.2 Parallel Conjugate Gradient Method with Preconditioner

The introduction of a preconditioner is applied to accelerate the convergence of the CGM iterative process. Preconditioning consists in the fact that the initial system of equations $Ax = b$ is replaced by the system

$$ C^{-1}Ax = C^{-1}b, \tag{5} $$

for which the iterative method converges essentially faster. The condition for the choice of the preconditioner C is the following:

$$ \mathrm{cond}(\tilde{A}) \ll \mathrm{cond}(A), \qquad \mathrm{cond}(\tilde{A}) = \frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}}, \qquad \mathrm{cond}(A) = \frac{\lambda_{\max}}{\lambda_{\min}}, \tag{6} $$

where $\mathrm{cond}(A)$ and $\mathrm{cond}(\tilde{A})$ are the condition numbers of the matrices $A$ and $\tilde{A} = C^{-1}A$; $\lambda_{\max}, \tilde{\lambda}_{\max}$ and $\lambda_{\min}, \tilde{\lambda}_{\min}$ are the largest and smallest eigenvalues of the matrices $A$ and $\tilde{A}$, respectively.

For system of equations (1), the conjugate gradient method with preconditioner C has the form:

$$
\begin{aligned}
& r^0 = b - Ax^0, \qquad p^0 = C^{-1}r^0, \qquad z^0 = p^0; \\
& x^{k+1} = x^k + \alpha_k p^k, \qquad \alpha_k = \frac{(r^k, z^k)}{(Ap^k, p^k)}, \qquad r^{k+1} = r^k - \alpha_k Ap^k, \\
& z^{k+1} = C^{-1}r^{k+1}, \qquad p^{k+1} = z^{k+1} + \beta_k p^k, \qquad \beta_k = \frac{(r^{k+1}, z^{k+1})}{(r^k, z^k)}.
\end{aligned}
\tag{7}
$$

The condition for stopping the iterative process with preconditioner is $\|Ax^k - b\| / \|b\| < \varepsilon$.

Parallelization of the PCGR and PCGM is based on splitting the matrix A by horizontal lines into m blocks; the solution vector z and the right-hand side vector b of the SLAE are divided into m parts such that n = m × L, where n is the dimension of the system of equations, m is the number of processors, and L is the number of lines in a block (Fig. 2). At every iteration, each processor computes its part of the solution vector. In the matrix-vector multiplication (A by z), each processor multiplies its part of the lines of the matrix A by the vector z. The host CPU (master) is responsible for transferring data and also calculates its own part of the solution vector.

Fig. 2. Distribution of data among processors
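The following sketch illustrates iteration (7) combined with the row-block splitting of Fig. 2, with OpenMP threads playing the role of the processors. It is a sketch under assumptions made here, not the authors' implementation: the matrix is stored densely for brevity, the preconditioner C is taken to be the diagonal of A (a Jacobi preconditioner, since the paper does not specify C), and the stopping test uses the recursively updated residual r^k, which equals b − Ax^k in exact arithmetic.

#include <math.h>
#include <stdlib.h>

static double dot(const double *u, const double *v, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
}

/* each thread multiplies its block of rows of A by p (Fig. 2) */
static void matvec(const double *A, const double *p, double *Ap, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A[i * n + j] * p[j];
        Ap[i] = s;
    }
}

void pcgm(const double *A, const double *b, double *x, int n,
          double eps, int max_it) {
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);
    double nb = sqrt(dot(b, b, n));
    matvec(A, x, Ap, n);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];           /* r^0 = b - Ax^0 */
    for (int i = 0; i < n; ++i) p[i] = z[i] = r[i] / A[i*n+i]; /* p^0 = z^0 = C^{-1}r^0 */
    double rz = dot(r, z, n);
    for (int it = 0; it < max_it; ++it) {
        matvec(A, p, Ap, n);
        double alpha = rz / dot(Ap, p, n);                     /* alpha_k */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];                              /* x^{k+1} */
            r[i] -= alpha * Ap[i];                             /* r^{k+1} */
        }
        if (sqrt(dot(r, r, n)) / nb < eps) break;              /* stopping criterion */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) z[i] = r[i] / A[i*n+i];    /* z^{k+1} = C^{-1}r^{k+1} */
        double rz1 = dot(r, z, n);
        double beta = rz1 / rz;                                /* beta_k */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i]; /* p^{k+1} */
        rz = rz1;
    }
    free(r); free(z); free(p); free(Ap);
}

Taking C as the identity reduces this sketch to the plain CGM of (4); in practice, a dense array would of course be replaced by the block-fivediagonal storage described above.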
2.3 Parallel Square Root Method

For solving SLAE (1), it is also possible to use the square root method [5]. This method is based on the decomposition of the symmetric positive definite matrix A into the product $A = S^T S$, where S is an upper triangular matrix with positive elements on the main diagonal and $S^T$ is the transposed matrix. The square root method consists in the sequential solution of two systems of equations with triangular matrices:

$$ S^T y = b, \qquad Sz = y. \tag{8} $$

To solve systems (8), we use the recursion formulas

$$
y_1 = \frac{b_1}{s_{11}}, \qquad y_i = \frac{b_i - \sum\limits_{k=1}^{i-1} s_{ki} y_k}{s_{ii}}, \qquad i = 2, 3, \ldots, n;
$$
$$
z_n = \frac{y_n}{s_{nn}}, \qquad z_i = \frac{y_i - \sum\limits_{k=i+1}^{n} s_{ik} z_k}{s_{ii}}, \qquad i = n-1, n-2, \ldots, 1.
\tag{9}
$$

The main idea of parallelizing the square root method for multiprocessor computers with shared memory is the parallel computation of the elements $s_{ij}$, $j = i, \ldots, n$, of each $i$-th line of the matrix S. The $i$-th line is divided into m parts so that $n - i = m \times L_i$, where i is the line number, n is the dimension of the system, m is the number of processors, and $L_i$ is the number of line items evaluated by each processor (Fig. 3), as described in [4].

Fig. 3. Distribution of the i-th line among processors
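The following sketch illustrates the scheme of Fig. 3: the decomposition $A = S^T S$, with the elements $s_{ij}$, $j = i+1, \ldots, n$, of the i-th line computed in parallel, followed by the recursion formulas (9). The dense row-major storage and the function names are illustrative assumptions, not taken from the paper.

#include <math.h>

/* A = S^T S, S upper triangular with positive diagonal;
 * only the upper triangle of s is written and read */
void psrm_factor(const double *a, double *s, int n) {
    for (int i = 0; i < n; ++i) {
        double d = a[i * n + i];
        for (int k = 0; k < i; ++k) d -= s[k * n + i] * s[k * n + i];
        s[i * n + i] = sqrt(d);
        /* the elements s_ij, j = i+1..n, of the i-th line are
         * independent and are distributed among the threads (Fig. 3) */
        #pragma omp parallel for
        for (int j = i + 1; j < n; ++j) {
            double t = a[i * n + j];
            for (int k = 0; k < i; ++k) t -= s[k * n + i] * s[k * n + j];
            s[i * n + j] = t / s[i * n + i];
        }
    }
}

/* recursion formulas (9): S^T y = b (forward), S z = y (backward) */
void psrm_solve(const double *s, const double *b,
                double *y, double *z, int n) {
    for (int i = 0; i < n; ++i) {
        double t = b[i];
        for (int k = 0; k < i; ++k) t -= s[k * n + i] * y[k];
        y[i] = t / s[i * n + i];
    }
    for (int i = n - 1; i >= 0; --i) {
        double t = y[i];
        for (int k = i + 1; k < n; ++k) t -= s[i * n + k] * z[k];
        z[i] = t / s[i * n + i];
    }
}

The triangular solves in (9) are inherently sequential in i, so the parallel gain comes from the factorization stage, which dominates the work.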
2.4 Constructing the Parallel Matrix Sweep Algorithm

Let us consider system (1). When constructing the parallel algorithm, we choose the unknown vectors $\bar{Y}_K, \bar{Y}_{K+1}$, $K = 0, M, \ldots, N$, as parameters connecting vertically the unknown values on the grid in L subdomains (Fig. 4), as described in [5].

Fig. 4. Dividing the initial domain P into L subdomains

Let us introduce the operator

$$ \Delta \bar{Y}_i \equiv A_i \bar{Y}_{i-2} - B_i \bar{Y}_{i-1} + C_i \bar{Y}_i - D_i \bar{Y}_{i+1} + E_i \bar{Y}_{i+2}. $$

Similarly to $\Delta \bar{Y}_i$, we define the operators $\Delta \bar{U}_i$, $\Delta \bar{V}_i$, $\Delta \bar{P}_i$, $\Delta \bar{Q}_i$, $\Delta \bar{W}_i$.

A reduced system of equations with respect to the parameters $\bar{Y}_K, \bar{Y}_{K+1}$ is constructed as follows. In the L subdomains defined by the intervals $(K, K+M+1)$, $K = 0, M, \ldots, N-M$, we consider the following problems for $i = K+2, \ldots, K+M-1$, where $\bar{e}_j$ denotes the j-th unit n-vector and $\bar{0}$ the zero n-vector:

$$ \Delta \bar{U}_i^{\,j} = 0, \quad \bar{U}_K^{\,j} = \bar{e}_j, \quad \bar{U}_{K+1}^{\,j} = \bar{0}, \quad \bar{U}_{K+M}^{\,j} = \bar{0}, \quad \bar{U}_{K+M+1}^{\,j} = \bar{0}, \quad j = 1, \ldots, n; \tag{10} $$

$$ \Delta \bar{V}_i^{\,j} = 0, \quad \bar{V}_K^{\,j} = \bar{0}, \quad \bar{V}_{K+1}^{\,j} = \bar{e}_j, \quad \bar{V}_{K+M}^{\,j} = \bar{0}, \quad \bar{V}_{K+M+1}^{\,j} = \bar{0}, \quad j = 1, \ldots, n; \tag{11} $$

$$ \Delta \bar{P}_i^{\,j} = 0, \quad \bar{P}_K^{\,j} = \bar{0}, \quad \bar{P}_{K+1}^{\,j} = \bar{0}, \quad \bar{P}_{K+M}^{\,j} = \bar{e}_j, \quad \bar{P}_{K+M+1}^{\,j} = \bar{0}, \quad j = 1, \ldots, n; \tag{12} $$

$$ \Delta \bar{Q}_i^{\,j} = 0, \quad \bar{Q}_K^{\,j} = \bar{0}, \quad \bar{Q}_{K+1}^{\,j} = \bar{0}, \quad \bar{Q}_{K+M}^{\,j} = \bar{0}, \quad \bar{Q}_{K+M+1}^{\,j} = \bar{e}_j, \quad j = 1, \ldots, n; \tag{13} $$

$$ \Delta \bar{W}_i = \bar{F}_i, \quad \bar{W}_K = \bar{0}, \quad \bar{W}_{K+1} = \bar{0}, \quad \bar{W}_{K+M} = \bar{0}, \quad \bar{W}_{K+M+1} = \bar{0}. \tag{14} $$

The following theorem is proved in [5].

Theorem. If $\bar{U}_i^1, \ldots, \bar{U}_i^n$ are solutions of (10), $\bar{V}_i^1, \ldots, \bar{V}_i^n$ are solutions of (11), $\bar{P}_i^1, \ldots, \bar{P}_i^n$ are solutions of (12), $\bar{Q}_i^1, \ldots, \bar{Q}_i^n$ are solutions of (13), $\bar{W}_i$ are solutions of (14), and $\bar{Y}_i$ are solutions of the given problem (1) on $(K, K+M+1)$, then, by the superposition principle [5], we have

$$ \bar{Y}_i = \left[\bar{U}_i^1\, \bar{U}_i^2 \ldots \bar{U}_i^n\right] \bar{Y}_K + \left[\bar{V}_i^1\, \bar{V}_i^2 \ldots \bar{V}_i^n\right] \bar{Y}_{K+1} + \left[\bar{P}_i^1\, \bar{P}_i^2 \ldots \bar{P}_i^n\right] \bar{Y}_{K+M} + \left[\bar{Q}_i^1\, \bar{Q}_i^2 \ldots \bar{Q}_i^n\right] \bar{Y}_{K+M+1} + \bar{W}_i, \tag{15} $$

where the square brackets denote the $n \times n$ matrices whose columns are the corresponding basis solutions.

Substituting relations (15) into the given system (1) at the points $K, K+1$, $K = 0, M, \ldots, N$, we get the reduced system with respect to the vector-parameters $\bar{Y}_K, \bar{Y}_{K+1}$, which has a smaller dimension:

$$ \tilde{A}\bar{Y} = \bar{F}. \tag{16} $$

Reduced system (16) is a system of vector equations with a block-sevendiagonal matrix of coefficients; in each line, one of the seven vector elements on the left or right of the main diagonal is zero. After finding the vector-parameters $\bar{Y}_K, \bar{Y}_{K+1}$, the other required unknown vectors are expressed through the vector-parameters and are found in each of the L subdomains independently by formula (15).

The direct parallel matrix sweep algorithm for solving vector system (1) with block-fivediagonal matrices consists in solving the following chain of problems:

$$ (10)\text{--}(14) \longrightarrow (16) \longrightarrow (15). \tag{17} $$

Problems (10)–(14) and system of equations (16) can be solved by the matrix sweep algorithms (Gaussian elimination methods) for solving systems of equations with block-fivediagonal and block-sevendiagonal matrices, respectively. The formulas of the matrix sweep algorithms for solving SLAEs with block-fivediagonal and block-sevendiagonal matrices are deduced similarly to the formulas of the corresponding scalar sweep algorithms [6].

The stable parallel matrix sweep algorithm for solving SLAE with block-fivediagonal matrices can be effectively implemented on parallel computing systems with distributed memory with L processors (their number is equal to the number of subdomains): problems (10)–(14) and the unknown vectors $\bar{Y}_i$ ($i = K+2, \ldots, K+M-1$) inside each subdomain are computed on the L processors independently. In addition, the parallel matrix sweep algorithm can be effectively implemented on multi-core processors and graphics processors. A sketch of the recovery step (15) is given below.
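Assuming, purely for illustration, that the basis solutions of (10)–(13) at an interior point i are stored as dense n × n row-major matrices whose columns are the vectors $\bar{U}_i^j$, $\bar{V}_i^j$, $\bar{P}_i^j$, $\bar{Q}_i^j$, the recovery of $\bar{Y}_i$ by formula (15) amounts to four matrix-vector products and a vector addition; the storage layout and names below are assumptions made here, not the authors' implementation.

/* Sketch of the recovery step (15) at one interior point i of a
 * subdomain: Y_i = U_i Y_K + V_i Y_{K+1} + P_i Y_{K+M}
 *                + Q_i Y_{K+M+1} + W_i. */
void recover_Yi(const double *U, const double *V, const double *P,
                const double *Q, const double *W,
                const double *YK, const double *YK1,
                const double *YKM, const double *YKM1,
                double *Yi, int n) {
    for (int r = 0; r < n; ++r) {
        double s = W[r];                          /* W_i term of (15) */
        for (int c = 0; c < n; ++c)
            s += U[r * n + c] * YK[c]  + V[r * n + c] * YK1[c]
               + P[r * n + c] * YKM[c] + Q[r * n + c] * YKM1[c];
        Yi[r] = s;
    }
}

The loop over the interior points of a subdomain, and over the subdomains themselves, parallelizes trivially, which is what makes step (15) of chain (17) well suited to L independent processors.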
3 Numerical Experiments

The parallel algorithms were implemented on a multi-core Intel processor using the OpenMP technology and the Windows API development tools.

With the help of the parallel preconditioned conjugate gradient method and the square root method, we solved the problem of finding the potential distribution in a conducting medium with a known quasi-model solution. The source data and the quasi-model solution of the problem were provided by the Department of Borehole Geophysics, Institute of Geology and Geophysics, Siberian Branch of RAS (Novosibirsk).

After discretization, the problem is reduced to solving a SLAE with an ill-conditioned symmetric positive definite block-fivediagonal matrix of dimension 134862 × 134862 with 247 square blocks.

The numerical solution of the problem is compared with the quasi-model solution by means of calculating the relative error

$$ \sigma = \left\|\bar{Y}^M - \bar{Y}^N\right\| / \left\|\bar{Y}^M\right\|, \tag{18} $$

where $\bar{Y}^M$ is the quasi-model solution of the problem and $\bar{Y}^N$ is the numerical solution of the problem.

The condition $\|Ax^k - b\| / \|b\| < \varepsilon$ is chosen as a stopping criterion for the iterative PCGM. A priori, we find the condition number of the original matrix:

$$ \mathrm{cond}(A) = \frac{\lambda_{\max}}{\lambda_{\min}} \approx 1.3 \cdot 10^{11}, \qquad \lambda_{\max} = 1.4 \cdot 10^6, \qquad \lambda_{\min} = 1.1 \cdot 10^{-5} > 0, \tag{19} $$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the matrix A.

In the case of solving the problem by the preconditioned PCGM, to verify condition (6), we find the condition number of the matrix $\tilde{A}$:

$$ \mathrm{cond}(\tilde{A}) = \frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}} \approx 4.1 \cdot 10^9 < \mathrm{cond}(A). \tag{20} $$

The problem was solved by the parallel conjugate gradient method with preconditioner, the parallel conjugate gradient method with regularization, and the parallel square root method. The numerical solution of the problem coincides with the quasi-model solution with accuracy $\sigma_{PCGM} \approx 10^{-7}$, $\sigma_{PCGR} \approx 2 \cdot 10^{-7}$, $\sigma_{PSRM} \approx 6 \cdot 10^{-7}$. For the quasi-model data, the numerical solution of the problem is presented in Fig. 5.

Fig. 5. Numerical solution of the problem

The computation times of solving the SLAE in the potential distribution problem are presented in Table 1. The computing system, installed in the Department of Ill-Posed Problems of Analysis and Applications of the Institute of Mathematics and Mechanics UrB RAS, is based on the 4-core processor Intel Core i5-750. Note that the time for solving the problem by the CGM without preconditioner on one core of the Intel Core i5-750 for σ = 10^{-3} was 30 minutes. Table 1 shows that the preconditioner essentially decreases the time of solving the problem.

Table 1. Computation times of solving the SLAE on the 4-core Intel Core i5

Method   Number of cores   Tm, sec (Windows API)   Tm, sec (OpenMP)
PCGM            1                   165                   145
                2                   142                   127
                4                   118                   101
PCGR            1                  1805                  1804
                2                  1490                  1450
                4                  1288                  1230
PSRM            1                    26                    18
                2                    18                    13
                4                     9                     7

At the beginning, for a multi-core processor with shared memory, the PCGM, PCGR, and PSRM parallelization was implemented by means of operating system (OS) threads using the Windows API development tools. For the parallel implementation of the computing block of the program, several threads were created; each of these threads was executed on an OS "logical processor" and computed its portion of the data. At the end of each computing block, barrier synchronization of the threads was performed.

In order to optimize the program and reduce the computation time, the OpenMP technology was used. Automatic parallelization of loops was carried out by the OpenMP library using special compiler directives: the interval of size L of the loop variable i was divided into m parts, and each thread of the process calculated its own part of the data of size p = L/m (Fig. 2), as the sketch below illustrates.

So, the PSRM is the fastest method: the computation time for solving the SLAE on the 4-core Intel CPU is reduced to several seconds (in comparison with 30 minutes by the CGM without preconditioner).
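A minimal sketch of this optimization: a computing block of the kind described above is parallelized by a single OpenMP directive, and the implicit barrier at the end of the parallel region replaces the manual barrier synchronization of the Windows API version. The function and its body (the vector update $x^{k+1} = x^k + \alpha_k p^k$ from iterations (4) and (7)) are illustrative, not taken from the paper.

/* Sketch: a computing block parallelized with one OpenMP directive
 * instead of manually created Windows API threads. */
#include <omp.h>

void computing_block(double *x, const double *p, double alpha, int L) {
    /* the interval of size L of the loop variable i is divided among
     * the m threads; static scheduling gives each thread a contiguous
     * part of size about L/m */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < L; ++i)
        x[i] += alpha * p[i];
    /* the implicit barrier at the end of the parallel region replaces
     * the explicit barrier synchronization of the thread version */
}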
4 Conclusion

For solving SLAE with block-fivediagonal matrices arising in geoelectrics and diffusion problems, the parallel conjugate gradient methods (with preconditioner and with regularization) and the parallel square root method are proposed and numerically implemented on a multi-core Intel processor. The efficiency of the parallel algorithms is investigated and optimized for solving the problem with quasi-model data. The calculation results show that the use of the parallel PCGR, PSRM, and PCGM with preconditioner allows us to solve problems with ill-conditioned matrices on a multi-core CPU rather effectively.

In future work, we intend to compare the parallel matrix sweep algorithm with the other methods and to implement and optimize it for GPUs.

References

1. Dashevsky, J.A., Surodina, I.V., Epov, M.I.: Quasi-three-dimensional mathematical modelling of diagrams of axisymmetric direct current probes in anisotropic profiles. Siberian J. of Industrial Mathematics 5(3), 76–91 (2002)
2. Gorbachev, I.I., Popov, V.V., Akimova, E.N.: Computer simulation of the diffusion interaction between carbonitride precipitates and austenitic matrix with allowance for the possibility of variation of their composition. The Physics of Metals and Metallography 102(1), 18–28 (2006). http://link.springer.com/article/10.1134/S0031918X06070039
3. Faddeev, V.K., Faddeeva, V.N.: Computational Methods of Linear Algebra. Gos. Izdat. Fiz.-Mat. Lit., Moscow (1963)
4. Akimova, E.N., Belousov, D.V.: Parallel algorithms for solving linear systems with block-tridiagonal matrices on multi-core CPU with GPU. Journal of Computational Science 3(6), 445–449 (2012). http://www.sciencedirect.com/science/article/pii/S1877750312000932
5. Akimova, E.N.: A parallel matrix sweep algorithm for solving linear system with block-fivediagonal matrices. AIP Conf. Proc. 1648, 850028 (2015); Rhodes, Greece, 22–28 Sept. 2014. http://scitation.aip.org/content/aip/proceeding/aipcp/10.1063/1.4913083
6. Samarskii, A.A., Nikolaev, E.S.: Methods for Solving the Grid Equations. Nauka, Moscow (1978)