        Parallel Substructuring Method With Memory Cost Limits*

         Nikita Nedozhogin1, Sergey Kopysov1,2, and Alexander Novikov1

     1 Institute of Mechanics, Udmurt Federal Research Center UB RAS, Izhevsk, Russia
     2 Udmurt State University, Izhevsk, Russia
       nedozhogin@inbox.ru

     * The reported study was partially funded by RFBR according to the research
       project 17-01-00402-a.



        Abstract. In decomposition methods, the memory cost of solving the
        interface problem grows significantly as the number of subdomains
        increases. The substructuring method reduces the number of iterations
        needed to solve the system. At the same time, the global boundary
        stiffness matrix occupies more memory than the global stiffness matrix,
        which limits the maximum size of the problem to which the method can be
        applied. Different approaches exist for reducing the memory cost and its
        limitations at the stages of constructing and solving the interface
        system. A layer-by-layer approach to partitioning a triangulated,
        multiply connected domain into connected subdomains without branching of
        inner boundaries is presented. It makes it possible to avoid conflicts
        in concurrent assembly-type operations without synchronization or
        critical sections. A parallel algorithm for constructing the global
        boundary stiffness matrix with distributed storage of the matrix is
        considered for implementations based on OpenMP and MPI technologies.
        This approach not only relaxes the limits on the maximum size of the
        solved problem, but also resolves shared-memory access conflicts by
        increasing the number of independent parallel tasks.

        Keywords: Parallel algorithms · Decomposition method · Substructur-
        ing method · OpenMP · MPI


1     Introduction
The substructuring method is a non-overlapping domain decomposition method [1].
The global boundary stiffness matrix has a smaller dimension, but more nonzero
elements, than the global stiffness matrix [2]. Due to these properties, even
sparse matrix storage formats require considerable memory. Block [3] and
hierarchical [4] methods with low-rank approximation have been developed
intensively, driven by the current trend of decreasing memory per processor
core. In this paper, a parallel substructuring method with distributed storage
of the matrix is considered. The distributed
construction makes it possible to apply this approach not only in the classical
substructuring method, but also in hierarchical or block variants, or when
constructing a preconditioner based on the substructuring method.
    The second memory limitation is access speed. To accelerate computations, it
is necessary to reduce the number of simultaneous accesses to the same memory
area, both for reading and for writing. To increase the parallelism of the
method, the mesh cells must be ordered in such a way that independent threads
simultaneously work with their own, non-overlapping sections of memory. For this
purpose we propose a layer-by-layer partitioning method, by which any
unstructured mesh can be ordered.
    The paper presents the results of a parallel version of the substructuring
method implemented with OpenMP and MPI technologies. The potential for
parallelization with CUDA technology for computations on graphics accelerators
is also considered.


2     Layer-by-layer Partitioning

We generalize the ordering of a 3D unstructured mesh T into subdomains without
branching of internal boundaries using layer-by-layer partitioning. We define in
T some set of cells T* ⊂ T and assume L1 = T*. We then construct the layers
Lj = {σ ∈ T | Lj ∩ Lc ≠ ∅ when j = c ± 1}, j ∈ [2, m − 1], according to
Algorithm 1. Here m is the number of layers and σ is a cell of the mesh T.
    Algorithm 1: Partitioning of an unstructured mesh T into layers.
     Input data: T — unstructured mesh of cells σ ∈ T; T* — given set {σ} ⊂ T.
     Result: {Lj} — set of layers of cells.
 1 Define the cells corresponding to the vertices of T;
 2 ∀σ ∈ T find the cells having a common vertex;
   /* Form the layers of mesh cells                                           */
 3 Let j = 1 and layer Lj = T*;
   while Σ_{c=1}^{j} |Lc| < |T| do
 4     Construct layer Lj+1 of cells of T;
 5     Let j ← j + 1;

    The result of Algorithm 1 is an ordered set of m layers Lj, in which the
layers adjacent to Lj have numbers j − 1 and j + 1. The resulting layers are
discrete analogs of a domain of R³ of type (g, 2), g > 0. All vertices of the
cells lie on the surface bounding the layer. In general, a layer Lj consists of
a set of sublayers {Lj^(c)}, where each sublayer Lj^(c) is a set of cells of Lj
having at least one intersection with a cell in Lj. If a layer Lj consists of a
single sublayer Lj^(c), we say that the layer Lj is connected.
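
    To make the construction concrete, the following C++ sketch implements
Algorithm 1 under simple assumptions (the mesh is given as a cell-to-vertex
incidence list; the function and variable names are illustrative and not taken
from the authors' code):

    #include <cstddef>
    #include <vector>

    // Build layers L_1..L_m of mesh cells: L_1 is the seed set, and each next
    // layer collects the still-unassigned cells sharing a vertex with the
    // previous layer (Algorithm 1). Cells are indexed 0..n-1; cell_vertices[c]
    // lists the vertex ids of cell c.
    std::vector<std::vector<int>> build_layers(
        const std::vector<std::vector<int>>& cell_vertices,
        std::size_t num_vertices,
        const std::vector<int>& seed_cells)
    {
        // Steps 1-2: for every vertex, collect the cells incident to it.
        std::vector<std::vector<int>> vertex_to_cells(num_vertices);
        for (int c = 0; c < static_cast<int>(cell_vertices.size()); ++c)
            for (int v : cell_vertices[c])
                vertex_to_cells[v].push_back(c);

        std::vector<char> assigned(cell_vertices.size(), 0);
        std::vector<std::vector<int>> layers;

        // Step 3: L_1 = T*.
        layers.push_back(seed_cells);
        for (int c : seed_cells) assigned[c] = 1;

        std::size_t total = seed_cells.size();
        while (total < cell_vertices.size()) {           // sum |L_c| < |T|
            std::vector<int> next;
            for (int c : layers.back())                  // cells of L_j
                for (int v : cell_vertices[c])           // their vertices
                    for (int nb : vertex_to_cells[v])    // neighbouring cells
                        if (!assigned[nb]) { assigned[nb] = 1; next.push_back(nb); }
            if (next.empty()) break;                     // disconnected remainder
            total += next.size();
            layers.push_back(std::move(next));           // L_{j+1}
        }
        return layers;
    }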
    Thus, the problem of forming subdomains of an unstructured mesh
{Ti : ⋃i Ti = T} based on the obtained partitioning into layers Lj is as follows:

find unions of the layers Lj forming a set of subdomains {Ti} that satisfy
certain restrictions. One of the conditions is the connectivity of the resulting
subdomains Ti. The number of layers included in a subdomain is bounded below by
two, which is a sufficient condition for the existence of vertices in Ti that do
not belong to the boundary of Ti.
     In order to construct a connected union of layers in a mesh subdomain, it is
necessary to find the sublayers and their connections. Algorithm 2 searches for
the sublayers and determines their connections using a dual graph Gd(V, E) of
the cells of Lj connected along faces. Then the connected components of
Gd(V, E) are found. The sets of mesh cells corresponding to the connected
components of Gd(V, E) are denoted as sublayers Lj^(c), c = 1, 2, . . . , κj,
where κj = |{Lj^(c)}| is the number of sublayers in the layer Lj, and
κ* = max_j κj.




 Algorithm 2: The union of the layers {Lj} into connected subdomains {Ti}.
     Input data: {Lj} — set of the layers of cells.
     Result: {Ti} — set of subdomains of the mesh T.
   /* Find sublayers in each layer Lj                                         */
 1 for Lj, ∀j ∈ [1, m] do
 2     Construct the dual graph Gd(V, E) of the links of the cells of Lj along
       the faces;
 3     Define the connected components of Gd(V, E);
 4     Assign sublayers Lj^(c) to the connected components of Gd(V, E);
 5 Construct the graph of sublayers G(V, E);
   /* Form subgraphs Gi of the graph G(V, E)                                  */
 6 i = 1, m* = 1, mark(Lj) = 0, ∀j ∈ [1, m];
 7 while m* < m do
       /* Find an unmarked layer with κj = κ*                                 */
 8     while κ* > 1 do
 9         Adjoin adjacent layers to the initial layer of the subdomain;
10         Define κ* — the number of connected components of Gi;
11         j ← j + 1;
12     i ← i + 1;
   if v ∈ VG ∧ deg(v) = 1 ∧ v ⇔ Lj^(c) then
       Adjoin {v} ⇔ Lj to Gj ⇔ Lj−1, j ∈ [1, i];
   Partition/order the mesh T, assuming Ti ⇔ Gi.



    Algorithm 2 extends the layer-by-layer partitioning [5] to multiply connected
domains. Partitioning without branching on the basis of the neighborhood relation
thus reduces to applying Algorithm 1 for the layer-by-layer partitioning of the
triangulation T and then uniting the layers Lj, ∀j ∈ [1, m], into the subdomains
Ti by Algorithm 2.
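
    Steps 1-4 of Algorithm 2 amount to finding the connected components of the
dual graph of a layer. A minimal C++ sketch is given below, assuming the
face-neighbour lists restricted to the layer are already available (the names
are illustrative, not the authors' code):

    #include <queue>
    #include <vector>

    // Steps 1-4 of Algorithm 2: split a layer into sublayers, i.e. connected
    // components of the dual graph whose vertices are the cells of the layer
    // and whose edges link cells sharing a face. neighbours[c] lists, for
    // cell c of the layer, the indices of its face-neighbours inside the same
    // layer. Returns component[c] = index of the sublayer containing cell c.
    std::vector<int> find_sublayers(const std::vector<std::vector<int>>& neighbours)
    {
        const int n = static_cast<int>(neighbours.size());
        std::vector<int> component(n, -1);
        int num_components = 0;

        for (int start = 0; start < n; ++start) {
            if (component[start] != -1) continue;       // already assigned
            // Breadth-first search over the dual graph from an unvisited cell.
            std::queue<int> q;
            q.push(start);
            component[start] = num_components;
            while (!q.empty()) {
                int c = q.front(); q.pop();
                for (int nb : neighbours[c])
                    if (component[nb] == -1) {
                        component[nb] = num_components;
                        q.push(nb);
                    }
            }
            ++num_components;                           // one more sublayer L_j^(c)
        }
        return component;
    }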

3   Substructuring Method
Let the mesh domain Ω be split into nΩ non-overlapping subdomains
Ω = ⋃_{i=1}^{nΩ} Ωi^(1), where Ωi^(1) ∩ Ωj^(1) = ∅ when i ≠ j. Formally unite the
subdomains, preserving the structure of the decomposition (interior nodes remain
interior, boundary nodes remain boundary). We introduce a second-level
decomposition with the number of subdomains equal to the number of computational
modules np, such that Ω = ⋃_{k=1}^{np} Ωk^(2) and
Ωk^(2) = ⋃_{i = k nΩ/np}^{(k+1) nΩ/np} Ωi^(1).
    For the first-level decomposition, the system of equations is formed in such
a way that, with the unknowns grouped into those of the interior and the boundary
nodes, it takes the form

    \begin{pmatrix}
      {}^{1}A_{II}^{(1)} & 0                  & \cdots & {}^{1}A_{IB}^{(1)} \\
      0                  & {}^{2}A_{II}^{(1)} & \cdots & {}^{2}A_{IB}^{(1)} \\
      \cdots             & \cdots             & \cdots & \cdots             \\
      {}^{1}A_{BI}^{(1)} & {}^{2}A_{BI}^{(1)} & \cdots & A_{BB}^{(1)}
    \end{pmatrix}
    \begin{pmatrix}
      {}^{1}u_{I}^{(1)} \\ {}^{2}u_{I}^{(1)} \\ \cdots \\ u_{B}
    \end{pmatrix}
    =
    \begin{pmatrix}
      {}^{1}f_{I}^{(1)} \\ {}^{2}f_{I}^{(1)} \\ \cdots \\ f_{B}
    \end{pmatrix}.
    In this system, the unknowns corresponding to the boundary nodes are found
from the solution of the system S^(1) uB = f̃B. Here

    \tilde{f}_B = \sum_{i=1}^{n_\Omega}
        \left( f_B - {}^{i}A_{BI}^{(1)} ({}^{i}A_{II}^{(1)})^{-1} {}^{i}f_I \right)

is the right-hand-side vector and S^(1) is the global boundary stiffness matrix
(also known as the Schur complement matrix [2]), equal to the sum of the local
boundary stiffness matrices

    S_i^{(1)} = A_{BB}^{(1)} - {}^{i}A_{BI}^{(1)} ({}^{i}A_{II}^{(1)})^{-1} {}^{i}A_{IB}^{(1)},
    \qquad
    S^{(1)} = \sum_{i=1}^{n_\Omega} S_i^{(1)},

where the index i corresponds to the subdomains of the first level.
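
    For reference, the Schur system above is obtained by block elimination of
the interior unknowns; a compact single-subdomain derivation (standard material,
added only for clarity, with the left subdomain index omitted):

    % Block system of one subdomain with interior (I) and boundary (B) unknowns:
    %   A_{II} u_I + A_{IB} u_B = f_I,
    %   A_{BI} u_I + A_{BB} u_B = f_B.
    % The first row gives u_I = A_{II}^{-1} (f_I - A_{IB} u_B); substituting it
    % into the second row yields the local boundary (Schur complement) system
    \left( A_{BB} - A_{BI} A_{II}^{-1} A_{IB} \right) u_B
        = f_B - A_{BI} A_{II}^{-1} f_I .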


4   Construction of the Global Boundary Stiffness Matrix
The global boundary stiffness matrices Sk^(2) corresponding to the second-level
subdomains are padded with zeros up to the size of the global matrix S^(1), so
that S^(1) = Σ_{k=1}^{np} Sk^(2). Note that the matrices Sk^(2) do not depend on
each other during the formation stage. This excludes conflicts when reading
element data and when writing the results of summing the local matrices Si^(1).
    To reduce RAM costs, each computational module forms one row l of a matrix
Si^(1) at a time. The order of constructing the rows is given by the subdomains.
The parallel algorithm of construction for np computational modules using OpenMP
technology, shown below for the example of two second-level subdomains, is as
follows (note that np ≤ nΩ):


 Algorithm 3: Parallel algorithm of the global boundary stiffness matrix
 construction S^(1) = Σ_{k=1}^{np} Sk^(2) for OpenMP
   // The two blocks below are executed concurrently by two computational
   // modules.
   // Forming of S1^(2)
   for i = 1 . . . nΩ/np do
 1     Compute the inverse matrix ^i(A_II^(1))^{-1};
       // Loop parallelized by OpenMP
       for l = 1 . . . nBi do
 2         v = ^iA_BI^(1)(l, ·);
 3         v = v ^i(A_II^(1))^{-1};
 4         Si^(1)(l, ·) = ^iA_BB^(1)(l, ·) − v ^iA_IB^(1);
 5         Record the Si^(1)(l, ·) elements in the corresponding row of the
           matrix S1^(2);
   // Forming of S2^(2)
   for i = nΩ/np + 1 . . . nΩ do
       Compute the inverse matrix ^i(A_II^(1))^{-1};
       // Loop parallelized by OpenMP
       for l = 1 . . . nBi do
           v = ^iA_BI^(1)(l, ·);
           v = v ^i(A_II^(1))^{-1};
           Si^(1)(l, ·) = ^iA_BB^(1)(l, ·) − v ^iA_IB^(1);
           Record the Si^(1)(l, ·) elements in the corresponding row of the
           matrix S2^(2);
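
    A minimal dense C++/OpenMP sketch of the row loop of Algorithm 3 is given
below. It is illustrative only: the actual implementation works with CSR-stored
sparse blocks, and the matrix layout and the solver callback are assumptions.

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Dense row-major matrix stored as a flat array: a[r * cols + c].
    using Dense = std::vector<double>;

    // Form the local boundary stiffness matrix
    //   S_i = A_BB - A_BI * A_II^{-1} * A_IB
    // row by row (steps 2-4 of Algorithm 3). nI and nB are the numbers of
    // interior and boundary unknowns of subdomain i; apply_AII_inv(v, w)
    // returns w = A_II^{-1} v (LU, Sherman-Morrison, ...). A_II is symmetric,
    // so w^T equals v^T A_II^{-1}, as required by step 3.
    Dense local_schur_rows(const Dense& A_BB, const Dense& A_BI, const Dense& A_IB,
                           int nI, int nB,
                           const std::function<void(const std::vector<double>&,
                                                    std::vector<double>&)>& apply_AII_inv)
    {
        Dense S(static_cast<std::size_t>(nB) * nB, 0.0);

        // The rows are independent: this is the loop parallelized by OpenMP,
        // and each thread writes only its own rows of S.
        #pragma omp parallel for schedule(dynamic)
        for (int l = 0; l < nB; ++l) {
            std::vector<double> v(nI), w(nI);
            for (int j = 0; j < nI; ++j)                 // v = A_BI(l, :)
                v[j] = A_BI[static_cast<std::size_t>(l) * nI + j];
            apply_AII_inv(v, w);                         // w = A_II^{-1} v
            for (int c = 0; c < nB; ++c) {               // S(l, :) = A_BB(l, :) - w^T A_IB
                double s = A_BB[static_cast<std::size_t>(l) * nB + c];
                for (int j = 0; j < nI; ++j)
                    s -= w[j] * A_IB[static_cast<std::size_t>(j) * nB + c];
                S[static_cast<std::size_t>(l) * nB + c] = s;
            }
        }
        return S;
    }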

    Based on the computational experiment (see Table 1), the run-time costs on
each subdomain rank as follows, from largest to smallest:

1. Inversion of the matrix ^i(A_II^(1)) (step 1 of Algorithm 3);
2. Recording the row l of the local matrix Si^(1)(l, ·) into the corresponding
   row of the matrix S1^(2) (step 5 of Algorithm 3);
3. Taking a row from the matrix, the matrix-vector multiplications and the
   vector difference (steps 2-4 of Algorithm 3, denoted by MatVec).

     Taking a row from the matrix, the matrix-vector multiplications and the
vector difference (see steps 2-4 of Algorithm 3) are parallelized using OpenMP
technology, which for these operations gives an acceleration close to linear. In
the future, it is also planned to parallelize these operations for graphics
accelerators using CUDA technology.
    Recording the row l of the local matrix Si^(1)(l, ·) into the corresponding
row of the matrix Sk^(2) (see step 5 of Algorithm 3) largely depends on the
matrix storage format used. The matrices Sk^(2) are formed in the compressed CSR
format because of their large size. Using arrays of vectors (the C++ container
std::vector) reduces memory costs to a minimum, but the complexity of inserting
new elements into such a storage format is O(nB), where nB is the number of
equations corresponding to the boundary nodes. To reduce the cost of inserting
new row elements, we use sorted associative containers (the C++ container
std::map) and unsorted ones (the C++ container std::unordered_map). In the
sorted container, the complexity of finding and inserting a new element is
O(log nB) (due to its red-black tree structure), but the memory cost increases
significantly because two additional pointers are stored per element. The
unsorted container performs these operations with complexity from O(1) to O(nB)
(depending on the hash function) and requires less memory than std::map, which
makes it the preferred choice for constructing the matrices Sk^(2).
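
    The row accumulation described above can be sketched as follows, with one
std::unordered_map per global boundary row that is later flushed into CSR arrays
(the container choice follows the paper; the data layout and function names are
assumptions):

    #include <cstddef>
    #include <unordered_map>
    #include <vector>

    // Accumulator for one global boundary row: column index -> summed value.
    using RowAccumulator = std::unordered_map<int, double>;

    // Add the contribution of a local Schur row into the global row; average
    // insertion/lookup is O(1), so summing the local matrices S_i^(1) into
    // S_k^(2) stays cheap even for long rows.
    void add_local_row(RowAccumulator& global_row,
                       const std::vector<int>& local_cols,
                       const std::vector<double>& local_vals)
    {
        for (std::size_t j = 0; j < local_cols.size(); ++j)
            global_row[local_cols[j]] += local_vals[j];
    }

    // Flush the accumulated rows into CSR arrays (row_ptr, col_ind, values).
    // Columns stay unsorted within a row; sort here if the solver requires it.
    void rows_to_csr(const std::vector<RowAccumulator>& rows,
                     std::vector<int>& row_ptr,
                     std::vector<int>& col_ind,
                     std::vector<double>& values)
    {
        row_ptr.assign(rows.size() + 1, 0);
        for (std::size_t r = 0; r < rows.size(); ++r)
            row_ptr[r + 1] = row_ptr[r] + static_cast<int>(rows[r].size());
        col_ind.resize(static_cast<std::size_t>(row_ptr.back()));
        values.resize(static_cast<std::size_t>(row_ptr.back()));
        for (std::size_t r = 0; r < rows.size(); ++r) {
            int k = row_ptr[r];
            for (const auto& [col, val] : rows[r]) {
                col_ind[static_cast<std::size_t>(k)] = col;
                values[static_cast<std::size_t>(k)] = val;
                ++k;
            }
        }
    }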



          Table 1. Characteristic formation times in one subdomain, sec.

                       Compute ^i(A_II)^{-1}    MatVec,            Assembly of
                         LU          SM         see Algorithm 3    S^(1)
          nΩ = 512     95.4972     41.69505     0.340893           1.53655
          nΩ = 1024     4.07287     1.947       0.130277           0.6955



   To invert the matrix at step 1 of Algorithm 3, the LU factorization (LU) and
Sherman-Morrison (SM) methods [6] were considered.

    For a given non-singular matrix A there exists a decomposition into an upper
triangular matrix U and a lower triangular matrix L. If the matrices A, U, L are
invertible, then A^{-1} = U^{-1} L^{-1}. In constructing L and U, only products
with nonzero elements of the parent matrix were accumulated, since the
factorization was carried out for matrices stored in the CSR format. Only a
sequential CPU implementation of the LU factorization was considered. This
approach showed the best results for small matrices; the features of the
algorithm do not allow effective use of all the capabilities of the massively
parallel GPU architecture.

  A parallel version of the Sherman-Morrison method implemented using
OpenMP technology was considered.


    Algorithm 4: Parallel algorithm of the Sherman-Morrison method for
    (A_II^(1))^{-1}
 1   x, y — vectors of dimension n;
     for k = 1 . . . n do
         // OpenMP
 2       x^T = A_II^(1)(k, ·) − e_k;
         // OpenMP
 3       y = x^T A_inv;
         // Loop parallelized by OpenMP over i
         for i = 1 . . . n do
             for j = 1 . . . n do
                 A_inv(j, i) = A_inv(j, i) − y_i A_inv(j, k)/(y_k + 1);

 4   (A_II^(1))^{-1} = A_inv;
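
    A dense C++/OpenMP sketch of Algorithm 4 is given below. It is illustrative
only: it starts from the identity matrix and applies one Sherman-Morrison
rank-one update per row of A; the layout and names are assumptions, and all
intermediate denominators y_k + 1 are assumed to be nonzero.

    #include <cstddef>
    #include <vector>

    // Dense row-major n x n matrix: a[r * n + c].
    using Dense = std::vector<double>;

    // Invert A by n Sherman-Morrison updates (Algorithm 4): starting from
    // Ainv = I, row k of A is introduced as the rank-one correction
    // e_k (A(k,:) - e_k)^T, and Ainv is updated accordingly.
    Dense sherman_morrison_inverse(const Dense& A, int n)
    {
        Dense Ainv(static_cast<std::size_t>(n) * n, 0.0);
        for (int i = 0; i < n; ++i) Ainv[static_cast<std::size_t>(i) * n + i] = 1.0;

        std::vector<double> x(n), y(n), colk(n);
        for (int k = 0; k < n; ++k) {
            // Step 2: x^T = A(k, :) - e_k.
            #pragma omp parallel for
            for (int j = 0; j < n; ++j)
                x[j] = A[static_cast<std::size_t>(k) * n + j] - (j == k ? 1.0 : 0.0);

            // Step 3: y = x^T * Ainv (row vector times matrix).
            #pragma omp parallel for
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int j = 0; j < n; ++j)
                    s += x[j] * Ainv[static_cast<std::size_t>(j) * n + i];
                y[i] = s;
            }

            // Snapshot the old column k of Ainv: it is read for every i and
            // overwritten when i == k.
            for (int j = 0; j < n; ++j)
                colk[j] = Ainv[static_cast<std::size_t>(j) * n + k];

            // Rank-one correction of the inverse; the columns i are independent.
            const double denom = y[k] + 1.0;
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j)
                    Ainv[static_cast<std::size_t>(j) * n + i] -= y[i] * colk[j] / denom;
        }
        return Ainv;
    }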

    The LU factorization is faster than the Sherman-Morrison method in sequential
implementations; nevertheless, it does not parallelize well. The parallel version
of the Sherman-Morrison method provided a parallel acceleration of about two
times. To reduce the number of operations and the computational load, the matrix
^i(A_II^(1))^{-1} is stored in its entirety until the construction of Si^(1) in
the current subdomain is completed, since the number of nonzero elements of the
inverse matrix is not known in advance. To solve the system of equations on the
interior subdomain nodes, ^iA_II^(1) ^iu_I = ^if_I − ^iA_IB^(1) uB, Krylov
subspace methods [7] are used.
     In the future, parallelization on the GPU is planned. To increase the
parallelism, the order of formation of the local boundary stiffness matrices
will be given by the layer-by-layer partitioning. Thus, a block of parallel GPU
threads will process one row of the matrix Si^(1). In total, it will be possible
to generate κi blocks that write to different places of the same array without
overlap and conflicts, thanks to the layer-by-layer ordering.




5    Solving the S^(1) uB = f̃B System



The implementation of the conjugate gradient method for solving the system
S^(1) uB = f̃B on one of the np computing nodes is shown in Algorithm 5. Here the
system matrix is represented in the form S^(1) = Σ_{k=1}^{np} Sk^(2); the index k
corresponding to the number of the computational module is omitted.


 Algorithm 5: Algorithm of the preconditioned conjugate gradient method for a
 distributed matrix
 1 u, r, p, q, z ∈ R^{nB};
 2 i = 0;
 3 r0 = f̃B;
 4 u0 ← 0;
 5 z0 = M r0; // M ∼ (S^(1))^{-1} — preconditioner
 6 p0 = z0;
 7 ρ0 = (r0, z0); // sync point
   while ||ri||_2 / ||f̃B||_2 > ε do
 8     Assemble pi; // using MPI::Allreduce
 9     qi = Sk^(2) pi; // np independent operations, one on each MPI process
10     αi = (ri, zi)/(qi, pi); // MPI::Allreduce leads to a sync point
11     ui+1 = ui + αi pi;
12     ri+1 = ri − αi qi;
13     zi+1 = M ri+1; // M ∼ (S^(1))^{-1} — preconditioner
14     ρi+1 = (ri+1, zi+1); // MPI::Allreduce leads to a sync point
15     βi+1 = ρi+1/ρi;
16     pi+1 = zi+1 + βi+1 pi;
17     i = i + 1;



    The matrix-vector product (see steps 8-9 of Algorithm 5) qi = S^(1) pi is
performed in two steps: first, the vector pi is assembled from all MPI
processes, then np independent operations (qk)i = Sk^(2) pi are carried out. The
result of the products is a set of vectors qk such that qk ∈ R^{nB} and
q = Σ_{k=1}^{np} qk. Summation of the vector q is not required: the vector is
stored in parts on the various compute nodes.


    Parallelization of the vector operations is realized as follows: the operands
are the parts of the vectors consisting of the components corresponding to the
mesh nodes of the subdomain Ωk^(2) of each parallel MPI process. Assembly of the
result vector is not required; each MPI process works with its local vectors,
and synchronization occurs only when the scalar products (r, z), (q, p), (r, r)
are computed. The scalar products are performed in two steps: first, local
scalar products of the vector parts are computed in the parallel MPI processes,
then the local sums are summed (implicit synchronization occurs at this step).
Inside an MPI process, the vector operations are parallelized using OpenMP
technology with the number of threads equal to the number of CPU cores on the
computational node.
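
    The two communication patterns described above can be sketched in C++ with
MPI as follows (the CSR layout and the function names are assumptions; the C API
MPI_Allreduce is used in place of the deprecated MPI::Allreduce bindings):

    #include <mpi.h>
    #include <vector>

    // Local part S_k^(2) in CSR format; the rows cover the full boundary
    // dimension nB, but only the entries touched by subdomain k are stored.
    struct CsrMatrix {
        std::vector<int>    row_ptr;   // size nB + 1
        std::vector<int>    col_ind;
        std::vector<double> values;
    };

    // Step 8 of Algorithm 5: assemble p from all processes. Each process holds
    // its additive contribution in p_local; the summed vector is needed by all.
    void assemble_vector(std::vector<double>& p_local)
    {
        MPI_Allreduce(MPI_IN_PLACE, p_local.data(),
                      static_cast<int>(p_local.size()),
                      MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    // Step 9 of Algorithm 5: q_k = S_k^(2) * p, computed independently on each
    // MPI process; the global q = sum_k q_k is never formed explicitly.
    std::vector<double> local_matvec(const CsrMatrix& Sk, const std::vector<double>& p)
    {
        std::vector<double> q(Sk.row_ptr.size() - 1, 0.0);
        #pragma omp parallel for
        for (int r = 0; r < static_cast<int>(q.size()); ++r)
            for (int j = Sk.row_ptr[r]; j < Sk.row_ptr[r + 1]; ++j)
                q[r] += Sk.values[j] * p[Sk.col_ind[j]];
        return q;
    }

    // Two-step scalar product: a local partial sum, then MPI_Allreduce
    // (the implicit synchronization point mentioned in the text).
    double dot(const std::vector<double>& a, const std::vector<double>& b)
    {
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < static_cast<int>(a.size()); ++i)
            local += a[i] * b[i];
        double global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }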


   For parallelization on the GPU, it is planned to use an approach similar to
that presented in [2].

6    Results
Numerical experiments were carried out on a test problem of elasticity theory.
The computational domain had the form of a parallelepiped, and the mesh
consisted of 539000 hexahedral cells. Eight OpenMP threads run inside each MPI
process. Table 2 presents the results of experiments for decompositions into 512
and 1024 subdomains using 2, 4 and 6 MPI processes. The presented results were
obtained on 6 computational nodes, each of which contains two Intel Xeon E5-2609
processors and 64 GB of RAM. For nΩ = 512 the problem took approximately 100 GB
of RAM; for nΩ = 1024, approximately 80 GB of RAM.

       Table 2. Time of construction and solution in the substructuring method.

                 Assembly Σ Si^(1)     S^(1) uB = f̃B      ^iA_II^(1) ^iu_I = ^if_I − ^iA_IB^(1) uB
       np      LU, sec.    SM, sec.    #     Time, sec.    #     Time, sec.
                                       nΩ = 512
       2       34954.6     14936.8     2324  1640.37       43    4.72359
       4       18086.6      7615.15    2325  1391.23       43    2.60635
       6       11713.4      5014.66    2325  1498.37       43    1.9185
                                       nΩ = 1024
       2        5128.47     2818.16    2600  1464.09       35    3.15815
       4        2622.52     1411.97    2600  1397.77       35    2.10205
       6        1802.32      965.803   2600  1469.18       35    1.74885



    The considered algorithm significantly relaxes the memory limits of the
substructuring method. The partitioning of the matrix S^(1) and its distributed
storage make it possible to realize coarse-grained parallelism at the formation
stage of the global boundary stiffness matrix, to reduce exchanges, and to
exclude simultaneous access to the same memory cells during the addition of the
local matrices. At the same time, increasing the number of involved
computational modules yields an acceleration close to linear at the matrix
construction stage. At the solver stage, the distributed storage of the matrix
S^(1) allows solving systems whose dimension is proportional to the total memory
of the involved computational modules. Scalability is limited by assembling the
results of the scalar products of vectors and depends only on the restrictions
on the speed of data exchange over the network.


References
1. Toselli A., Widlund O.B.: Domain Decomposition Methods — Algorithms and The-
   ory. Springer Series in Computational Mathematics, Vol. 34 (2005)
2. Kopysov S.P., Kuzmin I.M., Nedozhogin N.S., Novikov A.K., Sagdeeva Y. A.: Hybrid
   Multi-GPU solver based on Schur complement method. Lecture Notes in Computer
   Science, vol. 7979, pp. 65-79 (2013)

3. Amestoy P.R., Buttari A., L'Excellent J.-Y., Mary T.: On the Complexity of the
   Block Low-Rank Multifrontal Factorization. SIAM Journal on Scientific
   Computing, vol. 39(4), pp. A1710-A1740 (2017)
4. Chen C., Pouransari H., Rajamanickam S., Boman E.G., Darve E.: A distributed-
   memory hierarchical solver for general sparse linear systems. Parallel Computing,
   vol. 74, pp. 49-65 (2018)
5. Novikov A.K., Piminova N.K., Kopysov S.P., Sagdeeva Y.A.: Layer-by-layer ordering
   in parallel finite composition on shared-memory multiprocessors. IOP Conf. Ser.:
   Mater. Sci. Eng., vol. 158(1), p.158 (2016)
6. Sherman J., Morrison W.J.: Adjustment of an Inverse Matrix Corresponding to a
   Change in One Element of a Given Matrix. Ann. Math. Statist., vol. 21(1), pp.
   124-127 (1950)
7. Saad Y.: Iterative Methods for Sparse Linear Systems, Boston: PWS Publishing
   Company, p. 447 (1996)