=Paper=
{{Paper
|id=Vol-2608/paper48
|storemode=property
|title=On parallel algebraic multilevel preconditioner with approximate inversion by basis matrix method
|pdfUrl=https://ceur-ws.org/Vol-2608/paper48.pdf
|volume=Vol-2608
|authors=Vsevolod Bohaienko
|dblpUrl=https://dblp.org/rec/conf/cmis/Bohaienko20
}}
==On parallel algebraic multilevel preconditioner with approximate inversion by basis matrix method==
V.O. Bohaienko [0000-0002-3317-9022]

V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, Glushkov ave., 40, Kyiv, 03187, Ukraine

sevab@ukr.net
Abstract. The paper presents a parallel calculation scheme for distributed memory systems for a modification of the AMLI preconditioner that performs incomplete inversions of matrix blocks by a technique that produces an approximate inverse of a matrix through an orthogonalization procedure, referred to as the incomplete basis matrix method. Theoretical estimates of the algorithm's performance and the results of its experimental testing are presented. The results of an experimental analysis of the influence of the preconditioner's parameters on the convergence of the iterative algorithm are given. The characteristics of the proposed algorithm are compared with those of known algorithms implemented in the hypre software package. Tests on matrices obtained from a finite element discretization of a soil stress-strain state modelling problem showed that the proposed algorithm performs better for significantly ill-conditioned linear systems when solving them on a small number of computational resources.

Keywords. parallel algorithm, distributed memory systems, linear algebraic systems, preconditioner, algebraic multilevel iteration method
1 Introduction
The need to solve linear algebraic systems arises in mathematical modelling of most physical processes, in particular when modelling soil stress-strain state. In many cases, linear systems have ill-conditioned sparse matrices of large size. In these situations, iterative methods, such as the conjugate gradient method (CG) or the biconjugate gradient stabilized method (BiCGstab), are most commonly used [1].
The main approach to improving the convergence and the accuracy when solving ill-conditioned linear systems is the use of preconditioners [1, 2] - matrices whose multiplication by the linear system matrix decreases its condition number. The most commonly used methods for constructing preconditioners are incomplete matrix decomposition (e.g. ILU decomposition [1]), incomplete matrix inversion (e.g. polynomial preconditioners [3]), Geometric and Algebraic Multigrid (AMG) [4] methods, the Algebraic Multilevel Iteration method (AMLI) [5, 6], and wavelet preconditioners [7, 8].
The so-called basis matrix method [9, 10] is one of the methods developed to analyse
ill-conditioned and rectangular linear systems. This method iteratively constructs the inverse of a given matrix using an orthogonalization procedure. Based on it, an incomplete inversion algorithm has been created that can be used as a preconditioner. It was further combined [11] with the AMLI preconditioner and applied to solving elasticity problems.
The speed of the AMLI preconditioner, similarly to the multigrid ones, in particular the AMG, is low compared to the incomplete decomposition and the incomplete inversion preconditioners, and its application takes most of the time spent obtaining solutions of linear systems. This makes the development of parallel schemes for these algorithms crucial. While the AMLI is shown to have good scalability on vector and parallel-vector machines [12, 13], its parallel implementations on cluster systems are poorly studied.
2 Preconditioner based on the AMLI and the basis matrix method
The AMLI preconditioner [5] is based on the representation of an input matrix in 2×2 block form with its subsequent LU factorization:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ A_{21} A_{11}^{-1} & I \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ 0 & S \end{pmatrix} \quad (1)$$

where $S = A_{22} - A_{21} A_{11}^{-1} A_{12}$ is the Schur complement of $A$.
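As a quick illustration (ours, not from the paper), the factorization (1) can be checked numerically with NumPy on a random, well-conditioned matrix:

```python
# A minimal sketch (ours) verifying the block LU factorization (1) and the
# Schur complement on a random 2x2-block matrix.
import numpy as np

rng = np.random.default_rng(0)
k = 4                                        # illustrative block size
A = rng.standard_normal((2 * k, 2 * k)) + 2 * k * np.eye(2 * k)
A11, A12 = A[:k, :k], A[:k, k:]
A21, A22 = A[k:, :k], A[k:, k:]

S = A22 - A21 @ np.linalg.inv(A11) @ A12     # Schur complement of A
L = np.block([[np.eye(k),                np.zeros((k, k))],
              [A21 @ np.linalg.inv(A11), np.eye(k)]])
U = np.block([[A11,              A12],
              [np.zeros((k, k)), S]])
assert np.allclose(A, L @ U)                 # the factorization (1) holds
```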
After approximating the Schur complement with a polynomial and recursively applying the expansion (1) to the matrix $A_{22}$, the following algorithm for constructing a preconditioner that approximates the matrix $A^{-1}$ is obtained:

$$S^{(k-1)^{-1}} \approx \left[ I - P_v\!\left( M^{(k-1)^{-1}} A_{22}^{(k-1)} \right) \right] A_{22}^{(k-1)^{-1}}, \qquad M^{(k)} = \begin{pmatrix} I & 0 \\ A_{21}^{(k)} A_{11}^{(k)^{-1}} & I \end{pmatrix} \begin{pmatrix} A_{11}^{(k)} & A_{12}^{(k)} \\ 0 & S^{(k-1)} \end{pmatrix}. \quad (2)$$
For the matrix $A$ of size $\dim A = [KL, KL]$ that is divided into $L \times L$ blocks $A_{ij}$, $i, j = 1, \ldots, L$, of size $\dim A_{ij} = [K, K]$, we have

$$A_{11}^{(k)} = A_{L-k, L-k}, \quad \dim A_{11}^{(k)} = [K, K],$$
$$A_{12}^{(k)} = \left( A_{L-k, L-k+1}, \ldots, A_{L-k, L} \right), \quad \dim A_{12}^{(k)} = [K, kK],$$
$$A_{21}^{(k)} = \left( A_{L-k+1, L-k}, \ldots, A_{L, L-k} \right), \quad \dim A_{21}^{(k)} = [kK, K],$$
$$A_{22}^{(k)} = \begin{pmatrix} A_{L-k+1, L-k+1} & \cdots & A_{L-k+1, L} \\ \cdots & \cdots & \cdots \\ A_{L, L-k+1} & \cdots & A_{LL} \end{pmatrix}, \quad \dim A_{22}^{(k)} = [kK, kK],$$
$$\dim S^{(k-1)} = \dim A_{22}^{(k)}, \quad S^{(0)} = A_{LL}, \quad \dim M^{(k)} = [(k+1)K, (k+1)K], \quad k = 1, \ldots, L-1,$$

and $M^{(L-1)}$ is an approximate inverse of $A$.
The AMLI algorithm can be optimally stabilized by choosing the degree of an approximating polynomial [12]. However, the applicability of this theory to stress-strain problems described by hyperbolic differential equations has not been proved. As such problems are of our specific interest, we consider the simplest case of $P_v(t) = P_1(t) = 1 - t$. Then the Schur complement approximation becomes $S^{(k-1)} \approx M^{(k-1)}$ and the recursive scheme (2) simplifies to
$$M^{(k)} = \begin{pmatrix} I & 0 \\ A_{21}^{(k)} A_{11}^{(k)^{-1}} & I \end{pmatrix} \begin{pmatrix} A_{11}^{(k)} & A_{12}^{(k)} \\ 0 & M^{(k-1)} \end{pmatrix}. \quad (3)$$
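The one-line calculation behind this simplification (ours, following the definitions above) makes the collapse of the Schur complement approximation explicit:

```latex
% With P_1(t) = 1 - t, the approximation of S^{(k-1)^{-1}} in (2) becomes
\left[ I - P_1\!\left( M^{(k-1)^{-1}} A_{22}^{(k-1)} \right) \right] A_{22}^{(k-1)^{-1}}
  = M^{(k-1)^{-1}} A_{22}^{(k-1)} \, A_{22}^{(k-1)^{-1}}
  = M^{(k-1)^{-1}},
% i.e. S^{(k-1)} is approximated by M^{(k-1)}, which yields (3).
```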
In the scheme (3), the matrices $A_{11}^{(k)^{-1}}$ can be approximated by incomplete decomposition methods such as the ILU factorization [14] or the incomplete Gram-Schmidt method [15]. To make the scheme more efficient when solving ill-conditioned linear systems, we propose to approximate them by the incomplete basis matrix method, which is specifically developed for this case and is described in detail in [16].
Summarizing, the sequential computational scheme for the preconditioning procedure (3) is as follows. The matrix $A$ is divided into $(N+1) \times (N+1)$ blocks in the following way:

$$A = \begin{pmatrix} A_{11}^{(0)} & A_{12,1}^{(0)} & \cdots & A_{12,N}^{(0)} \\ A_{21,1}^{(0)} & \cdots & \cdots & \cdots \\ \cdots & \cdots & A_{11}^{(N-1)} & A_{12,N}^{(N-1)} \\ A_{21,N}^{(0)} & \cdots & A_{21,N}^{(N-1)} & A_{11}^{(N)} \end{pmatrix},$$
$$A_{12}^{(i)} = \left( A_{12,i+1}^{(i)}, \ldots, A_{12,N}^{(i)} \right), \quad A_{21}^{(i)} = \left( A_{21,i+1}^{(i)}, \ldots, A_{21,N}^{(i)} \right).$$
Considering the application of the preconditioner (3) as the solution of the linear system $M^{(N)} x = b$, $x = (x_0, \ldots, x_N)$, $b = (b_0, \ldots, b_N)$, it can be performed by a two-stage procedure, a V-cycle, consisting of a ``direct`` and a ``reverse`` part.
On the ``direct`` part of the procedure, the values of the following vectors are sequentially computed:

$$w_0 = A_{11}^{(0)} b_0, \quad w_i = b_i - A_{21}^{(i)} A_{11}^{(i)} w_{i-1}, \quad i = 1, \ldots, N. \quad (4)$$

The ``reverse`` part consists in the sequential computation of the resulting vector:

$$x_N = A_{11}^{(N)} w_N, \quad x_i = w_i - A_{11}^{(i)} A_{12}^{(i)} (x_{i+1}, \ldots, x_N), \quad i = 0, \ldots, N-1. \quad (5)$$
In (4), (5), the matrices $A_{11}^{(i)}$ denote the approximate inverses of the corresponding blocks obtained by the incomplete basis matrix method [16].
As the initial matrix is sparse, we use the compressed sparse row (CSR) storage scheme for all the matrices.
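For illustration, the following sketch applies the V-cycle (4), (5) read as block forward/backward substitution. Here Ainv[i] stands for the approximate inverse $A_{11}^{(i)}$ and A[i][j] for the off-diagonal blocks of the partitioning; the code and all names are ours, a minimal sketch rather than the paper's implementation:

```python
# A minimal sketch (ours) of the V-cycle (4), (5) as block forward/backward
# substitution. Ainv[i] ~ approximate inverse of the i-th diagonal block;
# A[i][j] is the (i, j) off-diagonal block (CSR) or None if empty.
import numpy as np

def amli_vcycle(Ainv, A, b):
    N = len(Ainv) - 1
    # ``direct`` part, cf. (4): eliminate the lower block triangle
    w = [bi.copy() for bi in b]
    for j in range(N):
        t = Ainv[j] @ w[j]                    # approximate A11^(j)^-1 w_j
        for i in range(j + 1, N + 1):
            if A[i][j] is not None:           # a block of A21^(j)
                w[i] -= A[i][j] @ t
    # ``reverse`` part, cf. (5): back substitution, upper block triangle
    x = [None] * (N + 1)
    x[N] = Ainv[N] @ w[N]
    for i in range(N - 1, -1, -1):
        r = w[i].copy()
        for j in range(i + 1, N + 1):
            if A[i][j] is not None:           # a block of A12^(i)
                r -= A[i][j] @ x[j]
        x[i] = Ainv[i] @ r
    return np.concatenate(x)
```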
The approximation accuracy is controlled during the iterations of the algorithm by calculating $\varepsilon_1 = \max_i \left| 1 - (A \bar{A})_{ii} \right|$ over all the processed rows, where $\bar{A}$ denotes the current approximate inverse. When $\varepsilon_1$ becomes bigger than a given threshold value, we perform a complete inversion. As the resulting matrices aren't always dense, we also use the CSR storage scheme when performing complete inversions.
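A small sketch of this control (ours; the function names and threshold handling are illustrative):

```python
# A sketch (ours) of the accuracy control: eps_1 measures how far the
# diagonal of A @ Abar deviates from identity; above the threshold the
# block is re-processed by a complete inversion, kept in CSR form.
import numpy as np
import scipy.sparse as sp

def needs_complete_inversion(A11, Abar, threshold):
    d = (A11 @ Abar).diagonal()         # diagonal of A * approx inverse
    return np.max(np.abs(1.0 - d)) > threshold

def complete_inverse_csr(A11):
    return sp.csr_matrix(np.linalg.inv(A11.toarray()))
```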
3 Parallel implementation of the preconditioner
To make the AMLI algorithm combined with incomplete inversions of matrix blocks by the basis matrix method run in parallel on distributed memory systems, the following data partitioning scheme can be applied. Each of the row blocks

$$\left( A_{21,i}^{(0)}, \ldots, A_{21,i}^{(i-1)}, A_{11}^{(i)}, A_{12,i+1}^{(i)}, \ldots, A_{12,N}^{(i)} \right)$$

is distributed over a system of $P$ processes by blocks of rows. Thus, each process stores blocks of rows of the submatrices $A_{11}^{(i)}$, $A_{12}^{(i)}$, $A_{21,j}^{(i)}$ and can perform the multiplications that arise while applying the preconditioner (3) in parallel.
The computational scheme that we use for multiplying a sparse matrix $A$ of size $M \times M$ by a vector $v$ has the following form (a code sketch follows the list):
1. The process $p$ stores the block $B_p = [r_{p0}, r_{p1}]$, $\bigcup_{i=0}^{P} B_i = [0, \ldots, M]$, $B_i \cap B_j = \emptyset$, $i \neq j$, of the matrix $A$ rows and the corresponding elements of the vector $v$;
2. Denote by $C_p$ the set of columns of the matrix $A$ in which non-zero elements are present within the row block $B_p$;
3. The process $p$ sends the values of the elements of the vector $v$ from the index set $C_i \cap B_p$ to the processes $i$, $C_i \cap B_p \neq \emptyset$, and receives the values of the elements from the index set $C_p \cap B_j$ from the processes $j$, $C_p \cap B_j \neq \emptyset$;
4. After the synchronization performed in step 3, each of the processes independently multiplies its block of rows of the matrix $A$ by the vector $v$. The result of this multiplication has the same partitioning as the vector $v$.
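The sketch below (ours, not the paper's implementation) renders steps 1-4 with mpi4py, assuming each rank owns a contiguous row range given by an offsets array:

```python
# A minimal mpi4py sketch (ours) of steps 1-4: rank p owns rows
# [offsets[p], offsets[p+1]) of a CSR matrix and the matching slice of v;
# step 3 exchanges the foreign v-entries referenced by the local columns;
# step 4 is the local row-block matvec.
from mpi4py import MPI
import numpy as np

def dist_matvec(comm, A_rows, v_local, offsets):
    rank, P = comm.Get_rank(), comm.Get_size()
    cols = np.unique(A_rows.indices)         # the column set C_p
    v_full = np.zeros(offsets[-1])
    v_full[offsets[rank]:offsets[rank + 1]] = v_local
    for r in range(P):                       # pairwise exchanges, step 3
        if r == rank:
            continue
        want = cols[(cols >= offsets[r]) & (cols < offsets[r + 1])]
        they_want = comm.sendrecv(want, dest=r, source=r)
        v_full[want] = comm.sendrecv(v_full[they_want], dest=r, source=r)
    return A_rows @ v_full                   # step 4
```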
When performing the ``direct`` part (4) of the AMLI algorithm, where vector multiplications by the matrices $A_{21}^{(i)}$ are organized as successive multiplications by the submatrices $A_{21,j}^{(i)}$, the synchronizations are done once according to the fill-in pattern of the matrix $A_{21}^{(i)}$. Similarly, when the non-changeable component $x_i$ of the result vector $x$ is multiplied by the submatrices $A_{12,i}^{(j)}$, $j < i$, during the ``reverse`` part (5), the synchronization can be performed only once, according to the combined fill-in pattern of these submatrices.
The matrices $A_{11}^{(i)}$ can be inverted in parallel in the following way (a load-balancing sketch follows the list):
- The process $p$ handles the sequence $k_{p1} \le k < k_{p2}$ of the matrices $A_{11}^{(k)}$. The sequences are constructed to make the processes as evenly loaded as possible. This is ensured by assigning approximately the same number of non-zero elements of the processed matrices $A_{11}^{(k)}$ to each of the processes.
- The processes exchange the values of the elements of the matrices $A_{11}^{(i)}$ to make the matrices $A_{11}^{(k)}$, $k_{p1} \le k < k_{p2}$, completely stored in the memory of the process $p$: the process $p$ sends the row blocks $B_p$ of the matrices $A_{11}^{(k)}$, $k_{i1} \le k < k_{i2}$, to the processes $i \neq p$ and receives the row blocks $B_i$ of the matrices $A_{11}^{(k)}$, $k_{p1} \le k < k_{p2}$, from them.
- Each of the processes independently performs a complete or incomplete inversion of its matrices.
- The processes exchange the values of the elements of the matrices $A_{11}^{(i)}$ to restore the initial data partitioning: the process $p$ sends the row blocks $B_i$ of the matrices $A_{11}^{(k)}$, $k_{p1} \le k < k_{p2}$, to the processes $i \neq p$ and receives the row blocks $B_p$ of the matrices $A_{11}^{(k)}$, $k_{i1} \le k < k_{i2}$, from them.
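The load-balancing rule from the first step can be sketched as follows (ours; a simple greedy split by cumulative non-zero counts, one of many possible realizations):

```python
# A sketch (ours) of the load-balancing rule above: split the sequence of
# A11^(k) matrices into contiguous ranges [k_p1, k_p2) so that each
# process receives roughly the same number of non-zero elements.
import numpy as np

def balance_ranges(nnz_per_block, P):
    total = np.sum(nnz_per_block)
    bounds, acc, p = [0], 0, 1
    for k, nnz in enumerate(nnz_per_block):
        acc += nnz
        if p < P and acc >= p * total / P:
            bounds.append(k + 1)          # close the range of process p-1
            p += 1
    bounds += [len(nnz_per_block)] * (P + 1 - len(bounds))
    return [(bounds[q], bounds[q + 1]) for q in range(P)]
```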
Since the data partitioning of the blocks of rows within the matrices $A_{12}^{(i)}$, $A_{21,i}^{(j)}$ may differ for different values of $i$, an additional synchronization should be performed when vectors are multiplied by these matrices. Denote by $B_{pi}$ the block of rows from the submatrices $A_{12}^{(i)}$, $A_{21,i}^{(j)}$ that is handled by the process $p$. Then, before multiplying a vector $v$ with the partitioning $B_i = (B_{0i}, \ldots, B_{Pi})$ by the matrix $A_{21}^{(i)}$, we need to synchronize its elements to make each of the processes store the block $B_{pi} \cup \ldots \cup B_{pN}$. To do so, the process $p$ sends the block $B_{pi} \cap B_{rj}$ of the elements of the vector to the process $r$ if $B_{pi} \cap B_{rj} \neq \emptyset$, $j \geq i$. Then, the process $p$ receives a corresponding block $B_{pj} \cap B_{ri}$ from the process $r$ if $B_{pj} \cap B_{ri} \neq \emptyset$, $j \geq i$.
Similar procedures should be performed when a vector $v = (v_{i+1}, \ldots, v_N)$ is multiplied by the matrix $A_{12}^{(i)}$, for each of its components. When the non-changeable component $x_j$ is multiplied by the matrices $A_{12}^{(i)}$ during the ``reverse`` part (5) of the AMLI algorithm, the synchronization can be performed once, before calculating the result of the first multiplication operation.
When the preconditioner is used within an iterative method for solving a linear algebraic system, we perform the calculation of the matrices $A_{11}^{(i)}$ and of the sets of elements that need synchronization once, on the first iteration. The execution of these operations will further be denoted as the ``initialization procedure``.
4 Estimation of the algorithms' performance
We estimate the performance of the considered algorithms on the base of the following assumptions:
- the matrices $A_{11}^{(i)}$ have the same fill-in factor $k_{11}$, and the matrices $A_{12,j}^{(i)}$, $A_{21,j}^{(i)}$ the same fill-in factor $k_2$, which does not depend on the block size;
- the percentage $k_f$ of the matrices $A_{11}^{(i)}$ for which the full inversion procedure is applied does not depend on the block size;
- the inverse matrices of $A_{11}^{(i)}$ are completely filled;
- the number of elements that each of the processes must synchronize before performing a parallel multiplication is considered proportional (with the coefficient of proportionality equal to $k_s$) to the number of non-zero elements in the block of matrix rows; we assume that each of the processes performs only one pair of data exchange operations during the synchronization;
- the submatrices are considered evenly distributed across the system.
Under these assumptions, the time the sequential algorithm spends to perform the calculations (4), (5) can be estimated as

$$T_1(n, N) = k_p \left[ \frac{n^2}{N} \left( 2\left(k_f + (1 - k_f) k_{11}\right) + (N - 1) k_2 \right) + 4n + n(N - 1) \right] \quad (6)$$

where $k_p$ is the performance coefficient. The minimal time is achieved here for the number of blocks

$$N = N_m = \sqrt{\left( 2\left(k_f + (1 - k_f) k_{11}\right) - k_2 \right) n}. \quad (7)$$
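For a concrete feel of the model, here is a small evaluation sketch (ours, under our reading of (6) and (7); the coefficient values are placeholders, in the paper they are fitted by least squares):

```python
# A sketch (ours) evaluating the sequential model: T1(n, N) from (6) and
# the minimizing block count N_m from (7). Coefficients are placeholders.
import numpy as np

def T1(n, N, kp, kf, k11, k2):
    c = kf + (1 - kf) * k11
    return kp * ((n**2 / N) * (2 * c + (N - 1) * k2) + 4 * n + n * (N - 1))

def Nm(n, kf, k11, k2):
    return np.sqrt((2 * (kf + (1 - kf) * k11) - k2) * n)

print(Nm(107856, kf=0.5, k11=0.3, k2=0.05))  # placeholder coefficients
```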
The time spent on the computation of the matrices $A_{11}^{(k)}$, under the assumption that the computational complexity of one step of the incomplete basis matrix method is proportional to the maximal number of non-zero elements in the rows of the matrices $A_{11}^{(k)}$, can be estimated as follows:

$$T_{i1}(n, N) = N k_{p2} \left( k_f \frac{n^3}{N^3} + k_{11} \frac{n^2}{N^2} \right) \quad (8)$$

where $k_{p2}$ is the performance coefficient.
We evaluate the computation time of the parallel algorithm using the linear model of a pairwise data exchange operation:

$$T_s(n) = n / k_e + k_a \quad (9)$$

where $k_e$ is the bandwidth, $n$ is the data size, and $k_a$ is the latency.
Taking (9) into account, the total time spent by the parallel algorithm on performing the calculations (4), (5) can be estimated as

$$T_P(n, N) = \frac{1}{P} T_1(n, N) + 2 \frac{n k_s}{k_e} \left( \left( (1 - k_f) k_{11} + k_2 \right) \frac{N - 1}{N P} + k_f \frac{P - 1}{P} \right) + \left( 2N + (2 - k_f) P - 2 k_f - 2 \right) k_a \quad (10)$$

where $P$ is the number of processes. The minimal time spent on data exchanges for a fixed $N$ is achieved when the number of processes is equal to

$$P_m(N) = \sqrt{\frac{2 n k_s \left( (1 - k_f) k_{11} + k_2 \right) (N - 1)}{(2 - k_f) k_e k_a N}}. \quad (11)$$
When doing parallel computations, the time spent to calculate the matrices $A_{11}^{(k)}$ can be estimated, with the assumption that each of the processes performs incomplete or complete inversion of $N/P$ submatrices (we consider this number to be an integer), as

$$T_{iP}(n, N) = \frac{1}{P} T_{i1}(n, N) + (P - 1) \left( \frac{k_{11} n^2}{N^2 P k_e} + k_a + \left( (1 - k_f) k_{11} + k_f \right) \frac{n^2}{N P k_e} + k_a \right). \quad (12)$$
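The parallel counterpart of the earlier evaluation sketch (ours, under our reading of (11); all coefficient values are again placeholders):

```python
# A sketch (ours) of the process count P_m(N) from (11) that minimizes
# the exchange time for a fixed number of blocks N.
import numpy as np

def Pm(N, n, ks, kf, k11, k2, ke, ka):
    return np.sqrt(2 * n * ks * ((1 - kf) * k11 + k2) * (N - 1)
                   / ((2 - kf) * ke * ka * N))
```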
5 Experimental analysis of the algorithm’s performance
We apply the proposed parallel modification of the preconditioner to solve linear systems arising while modelling the stress-strain state of the soil under a dam, using the model of dynamic consolidation of saturated soils as described in [11].
The two-dimensional model has the form of a system of hyperbolic differential equations with respect to 8 unknown functions - the horizontal and the vertical components $u_w$, $v_w$, $u_{sk}$, $v_{sk}$ of the water and soil skeleton displacement vectors and their first derivatives. The boundary and initial conditions, the solution domain, and the parameter values used are given in [11]. The problem was discretised by the Galerkin finite element method on a rectangular mesh with respect to the space variables and by the Crank-Nicolson scheme with respect to the time variable.
While forming linear systems, their rows were ordered to form blocks related to a single unknown function. To minimize the number of synchronizations, we use the Cuthill-McKee ordering of mesh nodes (a SciPy sketch follows below). Performing the diagonal scaling of rows, we obtain matrices that have the following properties: non-symmetry; loosely coupled block structure; high fill-in factor; high condition number due to the different speeds of the simulated processes. As the considered model of soil stress-strain state has 8 unknown functions, the matrices can be easily represented as 2x2 block matrices. Such properties make these matrices an appropriate testing case for the AMLI preconditioner, which is built upon the idea of block matrix subdivision and thus can be efficient for matrices that have loosely coupled diagonal blocks in their structure. The usage of the incomplete basis matrix method here is beneficial because the matrices contain significantly ill-conditioned blocks, and other methods like incomplete LU decomposition can be unstable in this case.
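SciPy ships the reverse variant of the Cuthill-McKee reordering; a sketch of applying it to a CSR matrix (ours, on a random matrix rather than the paper's mesh):

```python
# A sketch (ours): bandwidth-reducing reordering of a sparse CSR matrix
# with SciPy's reverse Cuthill-McKee implementation.
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(1000, 1000, density=0.01, format='csr') \
    + sp.eye(1000, format='csr')
perm = reverse_cuthill_mckee(A, symmetric_mode=False)
A_perm = A[perm][:, perm]        # reordered matrix with reduced bandwidth
```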
The performance and the convergence of the considered algorithm were experimentally studied on the SCIT-3 cluster of the Institute of Cybernetics of the National Academy of Sciences of Ukraine. Eight-core nodes based on Intel Xeon 5345 processors with 2 GB RAM per core were used.
The algorithm was tested on the matrix which arises on the first time step of the length of 2 hours. The matrix of this linear system, further denoted as SD, has the size equal to $107856 \times 107856$ with an average number of non-zero elements in a row equal to 23.42. Its condition number estimated by the LSMR algorithm [17] has an order of $10^{12}$.
The linear system was solved using the BiCGStab [1] algorithm with the proposed modification of the AMLI preconditioner. The BiCGStab was used because of the non-symmetric nature of the matrix. The submatrices $A_{11}^{(k)}$ were completely inverted when $\varepsilon_1$ exceeded the threshold value of 0.05. This resulted in the complete inversion of about 50% of the submatrices. The acceptable accuracy of the calculations (here and further, the value below which the residual has to be reduced on BiCGStab iterations) was equal to $10^{-9}$.
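The overall solver loop can be sketched with SciPy's BiCGStab, exposing the preconditioner as a LinearOperator (ours; precond_apply is assumed to run the V-cycle (4), (5), e.g. the amli_vcycle sketch from Section 2):

```python
# A sketch (ours) of the solver loop: BiCGStab preconditioned by the
# V-cycle wrapped in a LinearOperator.
import scipy.sparse.linalg as spla

def solve_sd(A, b, precond_apply, tol=1e-9):
    M = spla.LinearOperator(A.shape, matvec=precond_apply)
    # 'rtol' in SciPy >= 1.12; older versions use the 'tol' keyword
    x, info = spla.bicgstab(A, b, M=M, rtol=tol)
    return x, info                       # info == 0 on convergence
```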
The use of the BiCGStab algorithm by itself, or with the procedure of a diagonal scaling of rows, did not allow obtaining solutions of the required accuracy. The computation time of the BiCGStab used without any preconditioner was equal to 140 ms per iteration.
To experimentally assess the accuracy of the estimates (6) and (7) of the execution time of a single preconditioner application on a single processor core, without the initialization procedure, the linear system SD was solved with a variable size of the blocks into which the matrix is split. The values of the estimates' coefficients were, here and further, found using the least squares method on the base of the measured execution time (a fitting sketch follows below). The block size which minimizes the computation time was between 125 and 500, which corresponds to the calculated value of the theoretical estimate (7) equal to 179. The obtained experimental data confirm the properties of the algorithm reflected in the estimate (6), whose accuracy was not worse than 25%.
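Since $T_1$ in (6) is linear in $k_p$, the fit reduces to one least-squares solve; a sketch (ours, with made-up placeholder measurements, not the paper's data):

```python
# A sketch (ours) of fitting the model coefficient k_p from measured
# (N, time) pairs. All numbers below are placeholders.
import numpy as np

n = 107856                                   # size of the SD system

def bracket(N, kf=0.5, k11=0.3, k2=0.05):    # the bracket of (6), sans k_p
    return (n**2 / N) * (2 * (kf + (1 - kf) * k11) + (N - 1) * k2) \
           + 4 * n + n * (N - 1)

N_vals = np.array([216.0, 431.0, 863.0])     # placeholder block counts
t_meas = np.array([41.0, 38.0, 44.0])        # placeholder timings, ms
k_p = np.linalg.lstsq(bracket(N_vals)[:, None], t_meas, rcond=None)[0][0]
```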
The performance estimate (10) was verified in the similar case for the parallel algorithm by measuring the time spent on data exchanges between processes. The estimation error here remains within 30% for the block size ranging from 250 to 3000 and the number of processes ranging from 12 to 24. In other cases, the error increases. For a fixed block size, the optimal number of processes that ensures the minimal time spent on exchanges was, in most cases, slightly lower than the values calculated by the estimate (11).
The time spent on the calculations of the matrices $A_{11}^{(k)}$ (the initialization procedure) depending on the block size in the case of the sequential algorithm was measured for the block size ranging from 125 to 4000. Then, the time was assessed by the estimate (8). The accuracy of the estimate was within 12% for block sizes less than 3000, while a block size increase leads to an increase of the error. These inaccuracies may be caused by a change in the percentage of fully inverted submatrices. In the case of the parallel algorithm, the corresponding time was measured and estimated by (12). The general behaviour of the estimate (12) coincides with the measured time (Fig. 1), but the estimation errors were high.

Fig. 1. Time, ms, spent on the calculations of the $A_{11}^{(k)}$ matrices and its estimate in the case of the parallel algorithm
Further, we studied the influence of the size of the blocks into which the matrix is split on the iterative method's convergence. The number of iterations required to achieve the $10^{-9}$ order of accuracy was measured along with the overall time spent by the BiCGstab algorithm and the proposed preconditioner running on a single processor core. Here, an increase of the block size leads to a decrease of the fill-in factor of the matrices $A_{11}^{(k)}$. As the incomplete basis matrix method performs better in terms of accuracy on highly filled sparse matrices, the decrease of the fill-in factor results in an increase of the number of completely inverted matrices. This accelerates the convergence of the iterative algorithm (Fig. 2) along with a significant increase of its running time. The block size for which the required accuracy is reached in the minimal time was in the range from 75 to 125 elements.
The decrease of the number of fully inverted submatrices $A_{11}^{(k)}$ reduces the time spent on preconditioner initialization and thus significantly decreases the running time of the parallel algorithm in the case when a small number of computational resources is used. On the other hand, the speed-up of the initialization operation is higher for the higher values of $k_f$, so the time spent on initialization becomes close to equal for the different values of $k_f$ when the number of processes increases. The time spent on the parallel calculations (4), (5) behaves likewise. A smaller number of completely inverted submatrices, in general, accelerates this operation, but the speed-up decreases with the increase of the number of computing resources involved. All this, given that the convergence is faster for a larger number of completely inverted submatrices, leads to the situation when reducing $k_f$ is effective only when a small number of processes is involved.
Fig. 2. The logarithm of the residual at the iterations of the algorithm for different percentages of completely inverted submatrices
The efficiency of the proposed parallel preconditioner was compared with the efficiency of known algorithms implemented in the hypre software package v.2.10.0b [18]. The only combination of an iterative method and a preconditioner implemented in the hypre which allowed obtaining solutions of the $10^{-9}$ accuracy order was the Gmres [1] with the AMG preconditioner [4] (we will further denote this pair as the Gmres+AMG). The total time needed to obtain the solution of the linear system SD with the $10^{-9}$ order of accuracy is given in Fig. 3. Here the proposed parallel preconditioner was used with the block size equal to 250 and 500. It was faster than the Gmres+AMG when the number of processes was less than 20 with the block size equal to 500. However, the scalability of the proposed parallel algorithm was lower compared with the Gmres+AMG.
Fig. 3. Computation time, ms, depending on the number of processes
6 Conclusions
The parallel computation scheme was constructed for the AMLI preconditioner with incomplete inversion of matrix blocks using the so-called basis matrix method, and theoretical estimates of its performance were obtained and experimentally verified. For the linear system obtained while modelling the stress-strain state of soil, which has the size equal to $107856 \times 107856$ and a condition number of the order of $10^{12}$, the maximal achieved speed-up was 24% when scaling the algorithm on 8 processes for the block size equal to 500 elements. The low scalability of the algorithm here can be explained by the large number of data exchange operations that must be performed before each matrix-vector multiplication.
We compared the characteristics of the proposed preconditioner with those of the AMG used as a preconditioner for the Gmres algorithm implemented in the hypre software package. The scalability of the proposed algorithm was generally worse, but it had better performance for the considered significantly ill-conditioned linear system when solving it on less than 20 processor cores.
References
1. Saad, Y.: Iterative methods for sparse linear systems, 2nd edition. Society for Industrial and Applied Mathematics (2003). doi:10.1137/1.9780898718003
2. Chen, K.: Matrix Preconditioning Techniques and Applications. Cambridge University
Press (2005). doi: 10.1017/CBO9780511543258
3. Saad, Y.: Practical use of polynomial preconditionings for the conjugate gradient method.
SIAM J. Sci. Stat. Comp. 6, 865-881 (1985). doi:10.1137/0906059
4. Shapira, Y.: Matrix-Based Multigrid: Theory and Applications. Springer Science & Busi-
ness Media (2003).
5. Mense, C., Nabben, R.: On algebraic multilevel methods for non-symmetric systems -
convergence results. Electron. Trans. Numer. Anal. 30, 323-345 (2008).
6. Bayramov, N.R., Kraus, J.K.: Multigrid methods for convection–diffusion problems dis-
cretized by a monotone scheme. Comput. Method. Appl. M. 317, 723-745 (2017).
doi:10.1016/j.cma.2017.01.004
7. Chan, T., Tang, W., Wan, W.: Wavelet sparse approximate inverse preconditioners. BIT
Numerical Mathematics 37, 644-660 (1997). doi:10.1007/BF02510244
8. Bianchi, D., Buccini, A.: Generalized Structure Preserving Preconditioners for Frame-
Based Image Deblurring. Mathematics 8(4), 468 (2020). doi:10.3390/math8040468
9. Kudin, V.I., Liashko, S.I., Kharytonenko, N.V., Iatsenko, Iu.P.: Analysis of the properties
of a linear system using the method of artificial basis matrices. Cybernetics and Systems
Analysis 43(4), 563-570 (2007). doi:10.1007/s10559-007-0081-3
10. Bogaienko, V.O., Kudin, V.I., Skopetsky, V.V.: Aspects of the organization of computations using the basis-matrix method. Cybernetics and Systems Analysis 48(4), 30-34 (2012). doi:10.1007/s10559-012-9441-8
11. Bohaienko, V.O.: Using Basis Matrix Method with AMLI Preconditioner for Solving Elas-
ticity Problems. In: Proceedings of the 4th International Scientific Conference of Students
and Young Scientists ``Theoretical and Applied Aspects of Cybernetics``; Kyiv, Ukraine,
pp.24-28 (2014).
12. Axelsson, O., Barker, V.A., Neytcheva, M., Polman, B.: Solving the Stokes Problem on a
Massively Parallel Computer. Math Model Anal. 6, 7-27 (2001).
13. Li, Zh., Wu, Sh., Xu, J., Zhang, Ch.: Toward Cost-Effective Reservoir Simulation Solvers
on GPUs. Adv. Appl. Math. Mech. 8(6), 971-991 (2016). doi:10.4208/aamm.2015.m1138
14. Saad, Y.: Schur Complement Preconditioners for Distributed General Sparse Linear Sys-
tems. In: Domain Decomposition Methods in Science and Engineering XVI, Lecture Notes
in Computational Science and Engineering, 55, pp. 127-138 (2007).
15. Liu, Q.H., Zhang, F.D.: Incomplete hyperbolic Gram-Schmidt-based preconditioners for
the solution of large indefinite least squares problems. J. Comput. Appl. Math. 250, 210-
216 (2013). doi:10.1016/j.cam.2013.02.016
16. Bohaienko, V.O., Kudin, V.I.: Building preconditioners using basis matrix method. Inter-
national Journal ``Information Content and Processing`` 1(2), 182-187 (2014).
17. Fong, D.C.-L., Saunders, M.: LSMR: An Iterative Algorithm for Sparse Least-Squares
Problems. SIAM J. Sci. Comput. 33(5), 2950-2971 (2011). doi:10.1137/10079687X
18. Falgout, R.D., Yang, U.M.: hypre: A Library of High Performance Preconditioners. In: Sloot, P.M.A., Hoekstra, A.G., Tan, C.J.K., Dongarra, J.J. (eds.) Lecture Notes in Computer Science, vol. 2331, pp. 632-641. Springer, Wien (2002).