Parallel Algorithms for Solving Linear Systems with Block-Fivediagonal Matrices on Multi-Core CPU

Elena Akimova⋆ and Dmitry Belousov

1 Krasovskii Institute of Mathematics and Mechanics, Yekaterinburg, Russia
2 Ural Federal University, Yekaterinburg, Russia

Abstract. For solving systems of linear algebraic equations with block-fivediagonal matrices arising in geoelectrics and diffusion problems, the parallel square root method, the conjugate gradient method with preconditioner, the conjugate gradient method with regularization, and the parallel matrix sweep algorithm are proposed, and some of them are implemented numerically on a multi-core Intel CPU. The efficiency and optimization of the parallel algorithms are investigated, and a problem with quasi-model data is solved.

Keywords: parallel algorithms · block-fivediagonal SLAE · direct and iterative numerical methods · multi-core CPU

1 Introduction

Systems of linear algebraic equations (SLAE) with block-fivediagonal matrices arise when solving some mathematical modelling problems, in particular, geoelectrics and diffusion problems. The geoelectrics problems are very important for investigating the heterogeneity of the Earth's crust; such a problem is described by the differential Poisson equation. After a finite-difference approximation, the two- and three-dimensional lateral logging problems are reduced to solving ill-conditioned systems with large-scale block-fivediagonal matrices [1]. Another important problem is the multicomponent diffusion problem, in which it is necessary to know the concentration distribution of the diffusing components at every moment of time. This problem is also reduced to solving a SLAE with a block-fivediagonal matrix after approximating a system of differential equations [2]. In both problems, the matrix has a special structure, which we describe below.

In this paper, for solving SLAE with block-fivediagonal matrices, direct and iterative parallel algorithms are designed: the conjugate gradient method with preconditioner (PCGM), the conjugate gradient method with regularization (PCGR), the square root method (PSRM), and the parallel matrix sweep algorithm (PMSA).

⋆ This work was partially supported by the Program of UrB RAS (project no. 15-7-1-13) and by the Russian Foundation for Basic Research (project no. 15-01-00629 a).

The PCGM, PCGR, and PSRM algorithms are implemented on a multi-core Intel processor; the parallel matrix sweep algorithm will be implemented on a multi-core Intel processor later.

The efficiency and optimization of the parallel algorithms are investigated. The problem with quasi-model data is solved.

2 Parallel Algorithms for Solving SLAE

We consider the system

$$
\begin{cases}
C_0 \bar{Y}_0 - D_0 \bar{Y}_1 + E_0 \bar{Y}_2 = \bar{F}_0, \\
-B_1 \bar{Y}_0 + C_1 \bar{Y}_1 - D_1 \bar{Y}_2 + E_1 \bar{Y}_3 = \bar{F}_1, \\
A_i \bar{Y}_{i-2} - B_i \bar{Y}_{i-1} + C_i \bar{Y}_i - D_i \bar{Y}_{i+1} + E_i \bar{Y}_{i+2} = \bar{F}_i, \quad i = 2, \ldots, N-1, \\
A_N \bar{Y}_{N-2} - B_N \bar{Y}_{N-1} + C_N \bar{Y}_N - D_N \bar{Y}_{N+1} = \bar{F}_N, \\
A_{N+1} \bar{Y}_{N-1} - B_{N+1} \bar{Y}_N + C_{N+1} \bar{Y}_{N+1} = \bar{F}_{N+1},
\end{cases}
\tag{1}
$$

where $\bar{Y}_i$ are unknown $n$-vectors, $\bar{F}_i$ are given $n$-vectors, and $A_i, B_i, C_i, D_i, E_i$ are given $n \times n$ matrices.

As mentioned in the introduction, after a finite-difference approximation, the geoelectrics and diffusion problems can be reduced to solving SLAE with block-fivediagonal matrices with the structure presented in Fig. 1: the central diagonal blocks themselves have three nonzero diagonals. This special case of SLAE (1) arises when solving diffusion problems with higher-order schemes and geoelectrics problems.

Fig. 1. Block-fivediagonal matrix
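To make the block structure of system (1) concrete, the following C sketch applies the block-fivediagonal operator to a block vector, i.e., computes the left-hand side of (1). It is a minimal illustration under assumptions made here, not the authors' code: the blocks are stored as dense row-major n × n arrays indexed by block row, with NULL standing for the blocks absent in the first and last two block rows.

/* Sketch: applying the block-fivediagonal operator of system (1),
 * i.e. computing F = A*Y for the matrix in Fig. 1.
 * Illustrative assumptions (not from the paper): each block is a
 * dense n-by-n row-major array; A, B, C, D, E are indexed by block
 * row i, with NULL for the blocks missing near the boundaries. */

/* y += sign * M * x, where M is a dense n-by-n block */
static void block_gemv(double sign, const double *M, const double *x,
                       double *y, int n) {
    for (int r = 0; r < n; ++r) {
        double s = 0.0;
        for (int c = 0; c < n; ++c)
            s += M[r * n + c] * x[c];
        y[r] += sign * s;
    }
}

/* F_i = A_i Y_{i-2} - B_i Y_{i-1} + C_i Y_i - D_i Y_{i+1} + E_i Y_{i+2} */
void apply_block_fivediagonal(const double **A, const double **B,
                              const double **C, const double **D,
                              const double **E, const double *Y,
                              double *F, int nblocks, int n) {
    #pragma omp parallel for /* block rows are independent */
    for (int i = 0; i < nblocks; ++i) {
        double *Fi = &F[i * n];
        for (int r = 0; r < n; ++r) Fi[r] = 0.0;
        if (A[i]) block_gemv(+1.0, A[i], &Y[(i - 2) * n], Fi, n);
        if (B[i]) block_gemv(-1.0, B[i], &Y[(i - 1) * n], Fi, n);
        block_gemv(+1.0, C[i], &Y[i * n], Fi, n);
        if (D[i]) block_gemv(-1.0, D[i], &Y[(i + 1) * n], Fi, n);
        if (E[i]) block_gemv(+1.0, E[i], &Y[(i + 2) * n], Fi, n);
    }
}

This matrix-vector product is the main kernel of the iterative methods considered below.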
For solving SLAE with block-fivediagonal matrices, parallel algorithms based on the conjugate gradient method with preconditioner (PCGM), the conjugate gradient method with regularization (PCGR), the parallel square root method (PSRM), and the parallel matrix sweep algorithm (PMSA) are proposed.

2.1 Parallel Conjugate Gradient Method with Regularization

One of the fast iterative algorithms for solving a SLAE with a symmetric positive definite matrix is the conjugate gradient method (CGM) [3]. We write system (1) in the form

$$ Ax = b. \tag{2} $$

When solving geoelectrics problems with non-uniformly scaled coefficients, the system is ill-conditioned, and we use regularization to suppress the effect of rounding errors. The regularized system has the form

$$ \tilde{A}x = b, \qquad \tilde{A} = A + \alpha E, \tag{3} $$

where α is the parameter of regularization. The CGM for system (3) has the following form:

$$
\begin{aligned}
& r^0 = b - \tilde{A}x^0, \qquad p^0 = r^0; \\
& x^{k+1} = x^k + \alpha_k p^k, \qquad \alpha_k = \frac{\|r^k\|^2}{(\tilde{A}p^k, p^k)}, \qquad r^{k+1} = r^k - \alpha_k \tilde{A}p^k, \\
& p^{k+1} = r^{k+1} + \beta_k p^k, \qquad \beta_k = \frac{\|r^{k+1}\|^2}{\|r^k\|^2}.
\end{aligned}
\tag{4}
$$

The condition for stopping the CGM iterative process is $\|\tilde{A}x^k - b\| / \|b\| < \varepsilon$.

2.2 Parallel Conjugate Gradient Method with Preconditioner

The introduction of a preconditioner is applied to accelerate the convergence of the CGM iterative process. Preconditioning consists in the fact that the initial system of equations $Ax = b$ is replaced by the system

$$ C^{-1}Ax = C^{-1}b, \tag{5} $$

for which the iterative method converges essentially faster. The condition for the choice of the preconditioner C is the following:

$$ \mathrm{cond}(\tilde{A}) \ll \mathrm{cond}(A), \qquad \mathrm{cond}(\tilde{A}) = \frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}}, \qquad \mathrm{cond}(A) = \frac{\lambda_{\max}}{\lambda_{\min}}, \tag{6} $$

where $\mathrm{cond}(A)$ and $\mathrm{cond}(\tilde{A})$ are the condition numbers of the matrices $A$ and $\tilde{A} = C^{-1}A$; $\lambda_{\max}, \tilde{\lambda}_{\max}$ and $\lambda_{\min}, \tilde{\lambda}_{\min}$ are the largest and smallest eigenvalues of the matrices $A$ and $\tilde{A}$, respectively.

For system of equations (1), the conjugate gradient method with preconditioner C has the form:

$$
\begin{aligned}
& r^0 = b - Ax^0, \qquad p^0 = C^{-1}r^0, \qquad z^0 = p^0; \\
& x^{k+1} = x^k + \alpha_k p^k, \qquad \alpha_k = \frac{(r^k, z^k)}{(Ap^k, p^k)}, \qquad r^{k+1} = r^k - \alpha_k Ap^k, \\
& z^{k+1} = C^{-1}r^{k+1}, \qquad p^{k+1} = z^{k+1} + \beta_k p^k, \qquad \beta_k = \frac{(r^{k+1}, z^{k+1})}{(r^k, z^k)}.
\end{aligned}
\tag{7}
$$

The condition for stopping the iterative process with preconditioner is $\|Ax^k - b\| / \|b\| < \varepsilon$.

Parallelization of the PCGR and PCGM is based on splitting the matrix A by horizontal lines into m blocks; the solution vector z and the right-hand side vector b of the SLAE are divided into m parts such that n = m × L, where n is the dimension of the system of equations, m is the number of processors, and L is the number of lines in a block (Fig. 2). At every iteration, each processor computes its part of the solution vector. In the matrix-vector multiplication (A by z), each processor multiplies its part of the lines of the matrix A by the vector z. The host CPU (master) is responsible for transferring data and also calculates its own part of the solution vector.

Fig. 2. Distribution of data among processors
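The following sketch illustrates iteration (7) combined with the row-block splitting of Fig. 2, with OpenMP threads playing the role of the processors. It is a sketch under assumptions made here, not the authors' implementation: the matrix is stored densely for brevity, the preconditioner C is taken to be the diagonal of A (a Jacobi preconditioner, since the paper does not specify C), and the stopping test uses the recursively updated residual r^k, which equals b − Ax^k in exact arithmetic.

#include <math.h>
#include <stdlib.h>

static double dot(const double *u, const double *v, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; ++i) s += u[i] * v[i];
    return s;
}

/* each thread multiplies its block of rows of A by p (Fig. 2) */
static void matvec(const double *A, const double *p, double *Ap, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j) s += A[i * n + j] * p[j];
        Ap[i] = s;
    }
}

void pcgm(const double *A, const double *b, double *x, int n,
          double eps, int max_it) {
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *Ap = malloc(n * sizeof *Ap);
    double nb = sqrt(dot(b, b, n));
    matvec(A, x, Ap, n);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];           /* r^0 = b - Ax^0 */
    for (int i = 0; i < n; ++i) p[i] = z[i] = r[i] / A[i*n+i]; /* p^0 = z^0 = C^{-1}r^0 */
    double rz = dot(r, z, n);
    for (int it = 0; it < max_it; ++it) {
        matvec(A, p, Ap, n);
        double alpha = rz / dot(Ap, p, n);                     /* alpha_k */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];                              /* x^{k+1} */
            r[i] -= alpha * Ap[i];                             /* r^{k+1} */
        }
        if (sqrt(dot(r, r, n)) / nb < eps) break;              /* stopping criterion */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) z[i] = r[i] / A[i*n+i];    /* z^{k+1} = C^{-1}r^{k+1} */
        double rz1 = dot(r, z, n);
        double beta = rz1 / rz;                                /* beta_k */
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i]; /* p^{k+1} */
        rz = rz1;
    }
    free(r); free(z); free(p); free(Ap);
}

Taking C as the identity reduces this sketch to the plain CGM of (4); in practice, a dense array would of course be replaced by the block-fivediagonal storage described above.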
2.3 Parallel Square Root Method

For solving SLAE (1), it is also possible to use the square root method [5]. This method is based on the decomposition of the symmetric positive definite matrix A into the product $A = S^T S$, where S is an upper triangular matrix with positive elements on the main diagonal and $S^T$ is the transposed matrix. The square root method consists in the sequential solution of two systems of equations with triangular matrices:

$$ S^T y = b, \qquad Sz = y. \tag{8} $$

To solve systems (8), we use the recursion formulas

$$
y_1 = \frac{b_1}{s_{11}}, \qquad y_i = \frac{b_i - \sum\limits_{k=1}^{i-1} s_{ki} y_k}{s_{ii}}, \qquad i = 2, 3, \ldots, n;
$$
$$
z_n = \frac{y_n}{s_{nn}}, \qquad z_i = \frac{y_i - \sum\limits_{k=i+1}^{n} s_{ik} z_k}{s_{ii}}, \qquad i = n-1, n-2, \ldots, 1.
\tag{9}
$$

The main idea of parallelizing the square root method for multiprocessor computers with shared memory is the parallel computation of the elements $s_{ij}$, $j = i, \ldots, n$, of each $i$-th line of the matrix S. The $i$-th line is divided into m parts so that $n - i = m \times L_i$, where i is the line number, n is the dimension of the system, m is the number of processors, and $L_i$ is the number of line items evaluated by each processor (Fig. 3), as described in [4].

Fig. 3. Distribution of the i-th line among processors
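The following sketch illustrates the scheme of Fig. 3: the decomposition $A = S^T S$, with the elements $s_{ij}$, $j = i+1, \ldots, n$, of the i-th line computed in parallel, followed by the recursion formulas (9). The dense row-major storage and the function names are illustrative assumptions, not taken from the paper.

#include <math.h>

/* A = S^T S, S upper triangular with positive diagonal;
 * only the upper triangle of s is written and read */
void psrm_factor(const double *a, double *s, int n) {
    for (int i = 0; i < n; ++i) {
        double d = a[i * n + i];
        for (int k = 0; k < i; ++k) d -= s[k * n + i] * s[k * n + i];
        s[i * n + i] = sqrt(d);
        /* the elements s_ij, j = i+1..n, of the i-th line are
         * independent and are distributed among the threads (Fig. 3) */
        #pragma omp parallel for
        for (int j = i + 1; j < n; ++j) {
            double t = a[i * n + j];
            for (int k = 0; k < i; ++k) t -= s[k * n + i] * s[k * n + j];
            s[i * n + j] = t / s[i * n + i];
        }
    }
}

/* recursion formulas (9): S^T y = b (forward), S z = y (backward) */
void psrm_solve(const double *s, const double *b,
                double *y, double *z, int n) {
    for (int i = 0; i < n; ++i) {
        double t = b[i];
        for (int k = 0; k < i; ++k) t -= s[k * n + i] * y[k];
        y[i] = t / s[i * n + i];
    }
    for (int i = n - 1; i >= 0; --i) {
        double t = y[i];
        for (int k = i + 1; k < n; ++k) t -= s[i * n + k] * z[k];
        z[i] = t / s[i * n + i];
    }
}

The triangular solves in (9) are inherently sequential in i, so the parallel gain comes from the factorization stage, which dominates the work.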
2.4 Constructing the Parallel Matrix Sweep Algorithm

Let us consider system (1). When constructing the parallel algorithm, we choose the unknown vectors $\bar{Y}_K, \bar{Y}_{K+1}$, $K = 0, M, \ldots, N$, as parameters connecting vertically the unknown values on the grid in L subdomains (Fig. 4), as described in [5].

Fig. 4. Dividing the initial domain P into L subdomains

Let us introduce the operator

$$ \Delta \bar{Y}_i \equiv A_i \bar{Y}_{i-2} - B_i \bar{Y}_{i-1} + C_i \bar{Y}_i - D_i \bar{Y}_{i+1} + E_i \bar{Y}_{i+2}. $$

Similarly to $\Delta \bar{Y}_i$, we define the operators $\Delta \bar{U}_i$, $\Delta \bar{V}_i$, $\Delta \bar{P}_i$, $\Delta \bar{Q}_i$, $\Delta \bar{W}_i$.

A reduced system of equations with respect to the parameters $\bar{Y}_K, \bar{Y}_{K+1}$ is constructed as follows. In the L subdomains defined by the intervals $(K, K+M+1)$, $K = 0, M, \ldots, N-M$, we consider the following problems for $i = K+2, \ldots, K+M-1$, where $\bar{e}_j$ denotes the j-th unit n-vector and $\bar{0}$ the zero n-vector:

$$ \Delta \bar{U}_i^{\,j} = 0, \quad \bar{U}_K^{\,j} = \bar{e}_j, \quad \bar{U}_{K+1}^{\,j} = \bar{0}, \quad \bar{U}_{K+M}^{\,j} = \bar{0}, \quad \bar{U}_{K+M+1}^{\,j} = \bar{0}, \quad j = 1, \ldots, n; \tag{10} $$

$$ \Delta \bar{V}_i^{\,j} = 0, \quad \bar{V}_K^{\,j} = \bar{0}, \quad \bar{V}_{K+1}^{\,j} = \bar{e}_j, \quad \bar{V}_{K+M}^{\,j} = \bar{0}, \quad \bar{V}_{K+M+1}^{\,j} = \bar{0}, \quad j = 1, \ldots, n; \tag{11} $$

$$ \Delta \bar{P}_i^{\,j} = 0, \quad \bar{P}_K^{\,j} = \bar{0}, \quad \bar{P}_{K+1}^{\,j} = \bar{0}, \quad \bar{P}_{K+M}^{\,j} = \bar{e}_j, \quad \bar{P}_{K+M+1}^{\,j} = \bar{0}, \quad j = 1, \ldots, n; \tag{12} $$

$$ \Delta \bar{Q}_i^{\,j} = 0, \quad \bar{Q}_K^{\,j} = \bar{0}, \quad \bar{Q}_{K+1}^{\,j} = \bar{0}, \quad \bar{Q}_{K+M}^{\,j} = \bar{0}, \quad \bar{Q}_{K+M+1}^{\,j} = \bar{e}_j, \quad j = 1, \ldots, n; \tag{13} $$

$$ \Delta \bar{W}_i = \bar{F}_i, \quad \bar{W}_K = \bar{0}, \quad \bar{W}_{K+1} = \bar{0}, \quad \bar{W}_{K+M} = \bar{0}, \quad \bar{W}_{K+M+1} = \bar{0}. \tag{14} $$

The following theorem is proved in [5].

Theorem. If $\bar{U}_i^1, \ldots, \bar{U}_i^n$ are solutions of (10), $\bar{V}_i^1, \ldots, \bar{V}_i^n$ are solutions of (11), $\bar{P}_i^1, \ldots, \bar{P}_i^n$ are solutions of (12), $\bar{Q}_i^1, \ldots, \bar{Q}_i^n$ are solutions of (13), $\bar{W}_i$ are solutions of (14), and $\bar{Y}_i$ are solutions of the given problem (1) on $(K, K+M+1)$, then, by the superposition principle [5], we have

$$ \bar{Y}_i = \left[\bar{U}_i^1\, \bar{U}_i^2 \ldots \bar{U}_i^n\right] \bar{Y}_K + \left[\bar{V}_i^1\, \bar{V}_i^2 \ldots \bar{V}_i^n\right] \bar{Y}_{K+1} + \left[\bar{P}_i^1\, \bar{P}_i^2 \ldots \bar{P}_i^n\right] \bar{Y}_{K+M} + \left[\bar{Q}_i^1\, \bar{Q}_i^2 \ldots \bar{Q}_i^n\right] \bar{Y}_{K+M+1} + \bar{W}_i, \tag{15} $$

where the square brackets denote the $n \times n$ matrices whose columns are the corresponding basis solutions.

Substituting relations (15) into the given system (1) at the points $K, K+1$, $K = 0, M, \ldots, N$, we get the reduced system with respect to the vector-parameters $\bar{Y}_K, \bar{Y}_{K+1}$, which has a smaller dimension:

$$ \tilde{A}\bar{Y} = \bar{F}. \tag{16} $$

Reduced system (16) is a system of vector equations with a block-sevendiagonal matrix of coefficients; in each line, one of the seven vector elements on the left or right of the main diagonal is zero. After finding the vector-parameters $\bar{Y}_K, \bar{Y}_{K+1}$, the other required unknown vectors are expressed through the vector-parameters and are found in each of the L subdomains independently by formula (15).

The direct parallel matrix sweep algorithm for solving vector system (1) with block-fivediagonal matrices consists in solving the following chain of problems:

$$ (10)\text{--}(14) \longrightarrow (16) \longrightarrow (15). \tag{17} $$

Problems (10)–(14) and system of equations (16) can be solved by the matrix sweep algorithms (Gaussian elimination methods) for solving systems of equations with block-fivediagonal and block-sevendiagonal matrices, respectively. The formulas of the matrix sweep algorithms for solving SLAEs with block-fivediagonal and block-sevendiagonal matrices are deduced similarly to the formulas of the corresponding scalar sweep algorithms [6].

The stable parallel matrix sweep algorithm for solving SLAE with block-fivediagonal matrices can be effectively implemented on parallel computing systems with distributed memory with L processors (their number is equal to the number of subdomains): problems (10)–(14) and the unknown vectors $\bar{Y}_i$ ($i = K+2, \ldots, K+M-1$) inside each subdomain are computed on the L processors independently. In addition, the parallel matrix sweep algorithm can be effectively implemented on multi-core processors and graphics processors. A sketch of the recovery step (15) is given below.
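Assuming, purely for illustration, that the basis solutions of (10)–(13) at an interior point i are stored as dense n × n row-major matrices whose columns are the vectors $\bar{U}_i^j$, $\bar{V}_i^j$, $\bar{P}_i^j$, $\bar{Q}_i^j$, the recovery of $\bar{Y}_i$ by formula (15) amounts to four matrix-vector products and a vector addition; the storage layout and names below are assumptions made here, not the authors' implementation.

/* Sketch of the recovery step (15) at one interior point i of a
 * subdomain: Y_i = U_i Y_K + V_i Y_{K+1} + P_i Y_{K+M}
 *                + Q_i Y_{K+M+1} + W_i. */
void recover_Yi(const double *U, const double *V, const double *P,
                const double *Q, const double *W,
                const double *YK, const double *YK1,
                const double *YKM, const double *YKM1,
                double *Yi, int n) {
    for (int r = 0; r < n; ++r) {
        double s = W[r];                          /* W_i term of (15) */
        for (int c = 0; c < n; ++c)
            s += U[r * n + c] * YK[c]  + V[r * n + c] * YK1[c]
               + P[r * n + c] * YKM[c] + Q[r * n + c] * YKM1[c];
        Yi[r] = s;
    }
}

The loop over the interior points of a subdomain, and over the subdomains themselves, parallelizes trivially, which is what makes step (15) of chain (17) well suited to L independent processors.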
3 Numerical Experiments

The parallel algorithms were implemented on a multi-core Intel processor using the OpenMP technology and the Windows API development tools.

With the help of the parallel preconditioned conjugate gradient method and the square root method, we solved the problem of finding the potential distribution in a conducting medium with a known quasi-model solution. The source data and the quasi-model solution of the problem were provided by the Department of Borehole Geophysics, Institute of Geology and Geophysics, Siberian Branch of RAS (Novosibirsk).

After discretization, the problem is reduced to solving a SLAE with an ill-conditioned symmetric positive definite block-fivediagonal matrix of dimension 134862 × 134862 with 247 square blocks.

The numerical solution of the problem is compared with the quasi-model solution by means of calculating the relative error

$$ \sigma = \left\|\bar{Y}^M - \bar{Y}^N\right\| / \left\|\bar{Y}^M\right\|, \tag{18} $$

where $\bar{Y}^M$ is the quasi-model solution of the problem and $\bar{Y}^N$ is the numerical solution of the problem.

The condition $\|Ax^k - b\| / \|b\| < \varepsilon$ is chosen as a stopping criterion for the iterative PCGM. A priori, we find the condition number of the original matrix:

$$ \mathrm{cond}(A) = \frac{\lambda_{\max}}{\lambda_{\min}} \approx 1.3 \cdot 10^{11}, \qquad \lambda_{\max} = 1.4 \cdot 10^6, \qquad \lambda_{\min} = 1.1 \cdot 10^{-5} > 0, \tag{19} $$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of the matrix A.

In the case of solving the problem by the preconditioned PCGM, to verify condition (6), we find the condition number of the matrix $\tilde{A}$:

$$ \mathrm{cond}(\tilde{A}) = \frac{\tilde{\lambda}_{\max}}{\tilde{\lambda}_{\min}} \approx 4.1 \cdot 10^9 < \mathrm{cond}(A). \tag{20} $$

The problem was solved by the parallel conjugate gradient method with preconditioner, the parallel conjugate gradient method with regularization, and the parallel square root method. The numerical solution of the problem coincides with the quasi-model solution with accuracy $\sigma_{PCGM} \approx 10^{-7}$, $\sigma_{PCGR} \approx 2 \cdot 10^{-7}$, $\sigma_{PSRM} \approx 6 \cdot 10^{-7}$. For the quasi-model data, the numerical solution of the problem is presented in Fig. 5.

Fig. 5. Numerical solution of the problem

The computation times of solving the SLAE in the potential distribution problem are presented in Table 1. The computing system, installed in the Department of Ill-Posed Problems of Analysis and Applications of the Institute of Mathematics and Mechanics UrB RAS, is based on the 4-core processor Intel Core i5-750. Note that the time for solving the problem by the CGM without preconditioner on one core of the Intel Core i5-750 for σ = 10^{-3} was 30 minutes. Table 1 shows that the preconditioner essentially decreases the time of solving the problem.

Table 1. Computation times of solving the SLAE on the 4-core Intel Core i5

Method   Number of cores   Tm, sec (Windows API)   Tm, sec (OpenMP)
PCGM            1                   165                   145
                2                   142                   127
                4                   118                   101
PCGR            1                  1805                  1804
                2                  1490                  1450
                4                  1288                  1230
PSRM            1                    26                    18
                2                    18                    13
                4                     9                     7

At the beginning, for a multi-core processor with shared memory, the PCGM, PCGR, and PSRM parallelization was implemented by means of operating system (OS) threads using the Windows API development tools. For the parallel implementation of the computing block of the program, several threads were created; each of these threads was executed on an OS "logical processor" and computed its portion of the data. At the end of each computing block, barrier synchronization of the threads was performed.

In order to optimize the program and reduce the computation time, the OpenMP technology was used. Automatic parallelization of loops was carried out by the OpenMP library using special compiler directives: the interval of size L of the loop variable i was divided into m parts, and each thread of the process calculated its own part of the data of size p = L/m (Fig. 2), as the sketch below illustrates.

So, the PSRM is the fastest method: the computation time for solving the SLAE on the 4-core Intel CPU is reduced to several seconds (in comparison with 30 minutes by the CGM without preconditioner).
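A minimal sketch of this optimization: a computing block of the kind described above is parallelized by a single OpenMP directive, and the implicit barrier at the end of the parallel region replaces the manual barrier synchronization of the Windows API version. The function and its body (the vector update $x^{k+1} = x^k + \alpha_k p^k$ from iterations (4) and (7)) are illustrative, not taken from the paper.

/* Sketch: a computing block parallelized with one OpenMP directive
 * instead of manually created Windows API threads. */
#include <omp.h>

void computing_block(double *x, const double *p, double alpha, int L) {
    /* the interval of size L of the loop variable i is divided among
     * the m threads; static scheduling gives each thread a contiguous
     * part of size about L/m */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < L; ++i)
        x[i] += alpha * p[i];
    /* the implicit barrier at the end of the parallel region replaces
     * the explicit barrier synchronization of the thread version */
}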
4 Conclusion

For solving SLAE with block-fivediagonal matrices arising in geoelectrics and diffusion problems, the parallel conjugate gradient methods (with preconditioner and with regularization) and the parallel square root method are proposed and numerically implemented on a multi-core Intel processor. The efficiency of the parallel algorithms is investigated and optimized for solving the problem with quasi-model data. The calculation results show that the use of the parallel PCGR, PSRM, and PCGM with preconditioner allows us to solve problems with ill-conditioned matrices on a multi-core CPU rather effectively.

In future work, we intend to compare the parallel matrix sweep algorithm with the other methods and to implement and optimize it for GPUs.

References

1. Dashevsky, J.A., Surodina, I.V., Epov, M.I.: Quasi-three-dimensional mathematical modelling of diagrams of axisymmetric direct current probes in anisotropic profiles. Siberian J. of Industrial Mathematics 5(3), 76–91 (2002)
2. Gorbachev, I.I., Popov, V.V., Akimova, E.N.: Computer simulation of the diffusion interaction between carbonitride precipitates and austenitic matrix with allowance for the possibility of variation of their composition. The Physics of Metals and Metallography 102(1), 18–28 (2006). http://link.springer.com/article/10.1134/S0031918X06070039
3. Faddeev, V.K., Faddeeva, V.N.: Computational Methods of Linear Algebra. Gos. Izdat. Fiz.-Mat. Lit., Moscow (1963)
4. Akimova, E.N., Belousov, D.V.: Parallel algorithms for solving linear systems with block-tridiagonal matrices on multi-core CPU with GPU. Journal of Computational Science 3(6), 445–449 (2012). http://www.sciencedirect.com/science/article/pii/S1877750312000932
5. Akimova, E.N.: A parallel matrix sweep algorithm for solving linear system with block-fivediagonal matrices. AIP Conf. Proc. 1648, 850028 (2015); Rhodes, Greece, 22–28 Sept. 2014. http://scitation.aip.org/content/aip/proceeding/aipcp/10.1063/1.4913083
6. Samarskii, A.A., Nikolaev, E.S.: Methods for Solving the Grid Equations. Nauka, Moscow (1978)