=Paper=
{{Paper
|id=Vol-2281/paper-06
|storemode=property
|title=Parallel Substructuring Method With Memory Cost Limits
|pdfUrl=https://ceur-ws.org/Vol-2281/paper-06.pdf
|volume=Vol-2281
|authors=Nikita Nedozhogin,Sergey Kopysov,Alexander Novikov
}}
==Parallel Substructuring Method With Memory Cost Limits==
Nikita Nedozhogin¹, Sergey Kopysov¹,², and Alexander Novikov¹

¹ Institute of Mechanics, Udmurt Federal Research Center UB RAS, Izhevsk, Russia
² Udmurt State University, Izhevsk, Russia
nedozhogin@inbox.ru

Abstract. In decomposition methods, the memory cost of solving the interface problem grows significantly as the number of subdomains increases. The substructuring method reduces the number of iterations required to solve the system. At the same time, the global boundary stiffness matrix occupies more memory than the global stiffness matrix, which restricts the maximum size of problems to which the method can be applied. Different approaches exist for reducing the memory cost and its limitations at the stages of constructing and solving the interface system. A layer-by-layer approach to partitioning a triangulated multiply connected domain into connected subdomains without branching of inner boundaries is presented. It makes it possible to avoid conflicts in concurrent assembly-type operations without synchronization or critical sections. A parallel algorithm for constructing the global boundary stiffness matrix with distributed storage of the matrix is considered, implemented with OpenMP and MPI. This approach not only relaxes the limits on the maximum size of the solved problem but also resolves shared-memory access conflicts by increasing the number of independent parallel tasks.

Keywords: Parallel algorithms · Decomposition method · Substructuring method · OpenMP · MPI

The reported study was partially funded by RFBR according to the research project 17-01-00402-a.

1 Introduction

The substructuring method is a non-overlapping domain decomposition method [1]. The global boundary stiffness matrix, although of smaller dimension, has more nonzero elements than the global stiffness matrix [2]. Because of these properties, even sparse matrix storage formats require considerable memory. Block [3] and hierarchical [4] methods with low-rank approximation have received great development due to the current trend of decreasing memory per processor core. In this paper, a parallel substructuring method with distributed storage of the matrix is considered. The distributed construction makes it possible to apply this approach not only in the classical substructuring method, but also in hierarchical or block variants, as well as for constructing a preconditioner based on the substructuring method.

The second memory limitation is access speed. To accelerate the computation, the number of simultaneous accesses to the same memory area, both for reading and for writing, must be reduced. To increase the parallelism of the method, the mesh cells must be ordered in such a way that independent threads work simultaneously with their own, non-overlapping sections of memory. For this purpose we propose a layer-by-layer partitioning method, by means of which any unstructured mesh can be ordered. The paper presents the results of a parallel version of the substructuring method implemented with OpenMP and MPI.
The potential for parallelization with CUDA for computation on graphics accelerators is also considered.

2 Layer-by-layer Partitioning

We generalize the ordering of a 3D unstructured mesh T into subdomains without branching of internal boundaries using layer-by-layer partitioning. We define in T some set of cells T* ⊂ T, assuming L_1 = T*. We construct the layers

    L_j = {σ ∈ T | L_j ∩ L_c ≠ ∅ when j = c ± 1},  j ∈ [2, m−1],

according to Algorithm 1. Here m is the number of layers and σ is a cell of the mesh T.

Algorithm 1: Partitioning of an unstructured mesh T into layers.
  Input data: T — unstructured mesh of cells σ ∈ T; T* — given set {σ} ⊂ T.
  Result: {L_j} — set of layers of cells.
  1  Define the cells corresponding to the vertices of T;
  2  For all σ ∈ T find the cells having a common vertex;
     /* Form the layers of mesh cells */
  3  Let j = 1 and layer L_j = T*;
     while Σ_{c=1..j} |L_c| < |T| do
  4      Construct layer L_{j+1} of cells of T;
  5      Let j ← j + 1;

The result of Algorithm 1 is an ordered set of m layers L_j, in which the layers adjacent to L_j have numbers j − 1 and j + 1. The resulting layers are discrete analogs of a domain of R³ of type (g, 2), g > 0. All the vertices of the cells lie on the surface bounding the layer. In general, a layer L_j consists of a set of sublayers {L_j^(c)}, where each sublayer L_j^(c) is a set of cells of L_j having at least one intersection with a cell in L_j. If a layer L_j consists of one sublayer L_j^(c), we say that the layer L_j is connected.

Thus, the problem of forming subdomains of an unstructured mesh {T_i : ∪_i T_i = T} based on the obtained partitioning into layers L_j is as follows: find a union of the layers L_j into a set of subdomains {T_i} satisfying certain restrictions. One of the conditions is the connectivity of the resulting subdomains T_i. The number of layers included in a subdomain is bounded below by two layers, which is a sufficient condition for the existence of vertices in T_i that do not belong to the boundary of T_i.

In order to construct a connected union of layers in a subdomain of the mesh, it is necessary to find the sublayers and their connections. Algorithm 2 searches for the sublayers and determines their connections using a dual graph G_d(V, E) of the cells of L_j connected along faces. The connected components of G_d(V, E) are then found. The sets of mesh cells corresponding to the connected components of G_d(V, E) are denoted as sublayers L_j^(c), c = 1, 2, ..., κ_j, where κ_j = |{L_j^(c)}| is the number of sublayers in layer L_j and κ* = max_j κ_j.

Algorithm 2: Union of the layers {L_j} into connected subdomains {T_i}.
  Input data: {L_j} — set of layers of cells.
  Result: {T_i} — set of subdomains of the mesh T.
     /* Find the sublayers in each layer L_j */
  1  for L_j, ∀j ∈ [1, m] do
  2      Construct the dual graph G_d(V, E) of the links of the cells of L_j along the faces;
  3      Define the connected components of G_d(V, E);
  4      Assign the sublayers L_j^(c) to the connected components of G_d(V, E);
  5  Construct the graph of sublayers G(V, E);
     /* Form the subgraphs G_i of the graph G(V, E) */
  6  i = 1, m* = 1, mark(L_j) = 0, ∀j ∈ [1, m];
  7  while m* < m do
         /* Find an unmarked layer with κ_j = κ* */
  8      while κ* > 1 do
  9          Adjoin adjacent layers to the initial layer of the subdomain;
 10          Define κ* — the number of connected components of G_i;
 11          j ← j + 1;
 12      i ← i + 1;
     if v ∈ V_G ∧ deg(v) = 1 ∧ v ⇔ L_j^(c) then adjoin {v} ⇔ L_j to G_j ⇔ L_{j−1}, j ∈ [1, i];
     Partition/order the mesh T, assuming T_i ⇔ G_i.

Algorithm 2 extends the layer-by-layer partitioning [5] to multiply connected domains. Partitioning without branching on the basis of the neighborhood relation reduces to applying Algorithm 1 for the layer-by-layer partitioning of the triangulation T and then uniting the layers L_j, ∀j ∈ [1, m], by Algorithm 2 into the subdomains T_i.
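The layer construction in Algorithm 1 amounts to a breadth-first sweep over the cell-vertex adjacency. A minimal C++ sketch is given below; the Mesh structure and its vertex_neighbors field are hypothetical stand-ins for the adjacency built in Steps 1-2 of Algorithm 1 and are not part of the paper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical mesh representation: for each cell, the indices of the cells
// sharing at least one vertex with it (Steps 1-2 of Algorithm 1).
struct Mesh {
    std::vector<std::vector<std::size_t>> vertex_neighbors;
    std::size_t num_cells() const { return vertex_neighbors.size(); }
};

// Layer-by-layer partitioning (Algorithm 1): L1 is the seed set T*, and layer
// L_{j+1} collects the not yet assigned cells adjacent by vertex to layer L_j.
std::vector<std::vector<std::size_t>>
build_layers(const Mesh& mesh, const std::vector<std::size_t>& seed_cells)
{
    std::vector<std::vector<std::size_t>> layers;
    std::vector<bool> assigned(mesh.num_cells(), false);

    layers.push_back(seed_cells);                    // L1 = T*
    for (std::size_t c : seed_cells) assigned[c] = true;
    std::size_t assigned_count = seed_cells.size();

    while (assigned_count < mesh.num_cells()) {      // sum |L_c| < |T|
        std::vector<std::size_t> next;               // layer L_{j+1}
        for (std::size_t c : layers.back())
            for (std::size_t nb : mesh.vertex_neighbors[c])
                if (!assigned[nb]) {
                    assigned[nb] = true;
                    next.push_back(nb);
                }
        if (next.empty()) break;                     // remaining cells unreachable from T*
        assigned_count += next.size();
        layers.push_back(std::move(next));
    }
    return layers;                                   // ordered set {L_j}, m = layers.size()
}
```

The sublayers needed by Algorithm 2 can then be obtained by running a connected-components search on the face-adjacency graph restricted to each returned layer.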
3 Substructuring Method

Let the mesh domain Ω be split into n_Ω non-overlapping subdomains, Ω = ∪_{i=1..n_Ω} Ω_i^(1), where Ω_i^(1) ∩ Ω_j^(1) = ∅ when i ≠ j. We formally unite the subdomains, preserving the structure of the decomposition (interior nodes remain interior, boundary nodes remain boundary). We introduce a second-level decomposition with the number of subdomains equal to the number of computational modules n_p, such that Ω = ∪_{k=1..n_p} Ω_k^(2) and Ω_k^(2) = ∪_{i = k·n_Ω/n_p}^{(k+1)·n_Ω/n_p} Ω_i^(1).

For the first-level decomposition, the system of equations is formed in such a way that, with the unknowns split into those corresponding to interior and boundary nodes, it takes the form

\[
\begin{pmatrix}
{}^{1}A_{II}^{(1)} & 0 & \cdots & {}^{1}A_{IB}^{(1)} \\
0 & {}^{2}A_{II}^{(1)} & \cdots & {}^{2}A_{IB}^{(1)} \\
\vdots & \vdots & \ddots & \vdots \\
{}^{1}A_{BI}^{(1)} & {}^{2}A_{BI}^{(1)} & \cdots & A_{BB}^{(1)}
\end{pmatrix}
\begin{pmatrix} {}^{1}u_I \\ {}^{2}u_I \\ \vdots \\ u_B \end{pmatrix}
=
\begin{pmatrix} {}^{1}f_I^{(1)} \\ {}^{2}f_I^{(1)} \\ \vdots \\ f_B^{(1)} \end{pmatrix}.
\]

In this system, the unknowns corresponding to the boundary nodes are found from the solution of the system S^(1) u_B = f̃_B. Here

\[
\tilde{f}_B = \sum_{i=1}^{n_\Omega} \bigl( f_B^{(1)} - {}^{i}A_{BI}^{(1)}\, {}^{i}(A_{II}^{(1)})^{-1}\, {}^{i}f_I^{(1)} \bigr)
\]

is the right-hand-side vector and S^(1) is the global boundary stiffness matrix (also known as the Schur complement matrix [2]), which is the sum of the local boundary stiffness matrices

\[
S_i^{(1)} = A_{BB}^{(1)} - {}^{i}A_{BI}^{(1)}\, {}^{i}(A_{II}^{(1)})^{-1}\, {}^{i}A_{IB}^{(1)}, \qquad S^{(1)} = \sum_{i=1}^{n_\Omega} S_i^{(1)},
\]

where the index i corresponds to the subdomains of the first level.

4 Constructing the Global Boundary Stiffness Matrix

The global boundary stiffness matrices S_k^(2) corresponding to the second-level subdomains are padded with zeros up to the size of the global matrix S^(1), so that S^(1) = Σ_{k=1..n_p} S_k^(2). Note that the matrices S_k^(2) do not depend on each other during the formation stage. This excludes conflicts when reading element data and when writing the results of summing the local matrices S_i^(1).

To reduce the RAM cost, each computational module forms one row l of the matrix S_i^(1) at a time. The order of construction of the rows is given by the subdomains. The parallel algorithm of the construction for n_p computational modules using OpenMP, on the example of two subdomains, is as follows (note that n_p ≤ n_Ω):

Algorithm 3: Parallel construction of the global boundary stiffness matrix S^(1) = Σ_{k=1..n_p} S_k^(2) with OpenMP.
     // Forming of S_1^(2)
     for i = 1 ... n_Ω/n_p do
  1      Compute the inverse matrix ^i(A_II^(1))^(-1);
         // Loop parallelized by OpenMP
         for l = 1 ... n_Bi do
  2          v = ^iA_BI^(1)(l, :);
  3          v = v ^i(A_II^(1))^(-1);
  4          S_i^(1)(l, :) = ^iA_BB^(1)(l, :) − v ^iA_IB^(1);
  5          Record the elements of S_i^(1)(l, :) in the corresponding row of the matrix S_1^(2);
     // Forming of S_2^(2)
     for i = n_Ω/n_p ... n_Ω do
         Compute the inverse matrix ^i(A_II^(1))^(-1);
         // Loop parallelized by OpenMP
         for l = 1 ... n_Bi do
             v = ^iA_BI^(1)(l, :);
             v = v ^i(A_II^(1))^(-1);
             S_i^(1)(l, :) = ^iA_BB^(1)(l, :) − v ^iA_IB^(1);
             Record the elements of S_i^(1)(l, :) in the corresponding row of the matrix S_2^(2);
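A simplified C++/OpenMP sketch of Steps 1-5 of Algorithm 3 for a single subdomain is shown below. For readability it uses a small dense DenseMatrix helper and assumes the inverse ^i(A_II^(1))^(-1) has already been computed (Step 1); the paper keeps these matrices in CSR format, so this is only an illustration of the row-wise Schur complement formula, not the actual implementation.

```cpp
#include <cstddef>
#include <vector>
#include <omp.h>

// Small dense helper, row-major: a(i,j) = data[i*cols + j].
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> data;
    double  operator()(std::size_t i, std::size_t j) const { return data[i*cols + j]; }
    double& operator()(std::size_t i, std::size_t j)       { return data[i*cols + j]; }
};

// Local boundary stiffness matrix S_i = A_BB - A_BI * inv(A_II) * A_IB,
// formed row by row; the loop over the boundary rows l is parallelized
// with OpenMP as in Algorithm 3. A_II_inv is the precomputed inverse.
DenseMatrix local_schur(const DenseMatrix& A_BB, const DenseMatrix& A_BI,
                        const DenseMatrix& A_II_inv, const DenseMatrix& A_IB)
{
    const std::ptrdiff_t nB = static_cast<std::ptrdiff_t>(A_BB.rows);
    const std::ptrdiff_t nI = static_cast<std::ptrdiff_t>(A_II_inv.rows);
    DenseMatrix S{A_BB.rows, A_BB.cols, std::vector<double>(A_BB.rows * A_BB.cols, 0.0)};

    #pragma omp parallel for
    for (std::ptrdiff_t l = 0; l < nB; ++l) {
        // Steps 2-3: v = A_BI(l,:) * inv(A_II)
        std::vector<double> v(nI, 0.0);
        for (std::ptrdiff_t j = 0; j < nI; ++j)
            for (std::ptrdiff_t k = 0; k < nI; ++k)
                v[j] += A_BI(l, k) * A_II_inv(k, j);
        // Step 4: S(l,:) = A_BB(l,:) - v * A_IB
        for (std::ptrdiff_t j = 0; j < nB; ++j) {
            double acc = 0.0;
            for (std::ptrdiff_t k = 0; k < nI; ++k)
                acc += v[k] * A_IB(k, j);
            S(l, j) = A_BB(l, j) - acc;
        }
        // Step 5 (omitted here): accumulate S(l,:) into the corresponding
        // row of the distributed matrix S_k^(2).
    }
    return S;
}
```

How the resulting rows are stored in S_k^(2) is discussed below.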
Based on the computing experiment (see Table 1), the run-time costs in each subdomain have the following order (top-down):

1. Inversion of the matrix ^i(A_II^(1)) (Step 1 of Algorithm 3);
2. Recording of the row l of the local matrix S_i^(1)(l, :) in the corresponding row of the matrix S_k^(2) (Step 5 of Algorithm 3);
3. Operations of taking a row from the matrix, the matrix-vector multiplications, and the differences of the vectors (Steps 2-4 of Algorithm 3, denoted MatVec).

Taking a row from the matrix, the matrix-vector multiplications, and the differences of the vectors (see Steps 2-4 of Algorithm 3) are parallelized using OpenMP, which for these operations gave an acceleration close to linear. In the future, it is also planned to parallelize these operations for use on graphics accelerators with CUDA.

Recording of the row l of the local matrix S_i^(1)(l, :) into the corresponding row of the matrix S_k^(2) (see Step 5 of Algorithm 3) largely depends on the matrix storage format used. The matrices S_k^(2) are formed in the compressed CSR format because of their large size. Using arrays of vectors (the C++ container std::vector) reduces the memory cost to a minimum, but the complexity of adding new elements to such a storage format is O(n_B), where n_B is the number of equations corresponding to the boundary nodes. To reduce the cost of adding new elements of a row, we considered a sorted associative container (the C++ container std::map) and an unsorted one (the C++ container std::unordered_map). In the sorted container, the complexity of finding and adding a new element is O(ln n_B) (due to its red-black tree structure), but it significantly increases the memory cost, since additional pointers are stored per element. The unsorted container has complexity from O(1) to O(n_B) for these operations (depending on the hash function) and requires less memory than std::map, which makes it optimal for use in constructing the matrices S_k^(2).

Table 1. Characteristic formation times in one subdomain, sec.

              Compute ^i(A_II)^(-1)      MatVec (Steps 2-4      Assembly of S^(1)
              LU           SM            of Algorithm 3)
  n_Ω = 512   95.4972      41.69505      0.340893               1.53655
  n_Ω = 1024  4.07287      1.947         0.130277               0.6955

To invert the matrix at Step 1 of Algorithm 3, LU-factorization (LU) and the Sherman-Morrison method (SM) [6] were considered.

For a given non-singular matrix A there exists a decomposition into an upper triangular matrix U and a lower triangular matrix L. If the matrices A, U, L are invertible, then A^(-1) = U^(-1) L^(-1). In the construction of L and U, only products with nonzero elements of the parent matrix were summed, since the factorization was carried out for matrices stored in the CSR format. Only a sequential CPU implementation of the LU-factorization was considered: the features of this algorithm do not allow effective use of all the capabilities of the massively parallel GPU architecture. This approach showed the best results for small matrices.

A parallel version of the Sherman-Morrison method implemented with OpenMP was considered as an alternative.

Algorithm 4: Parallel Sherman-Morrison method for (A_II^(1))^(-1).
  1  x, y — vectors of dimension n;
     for k = 1 ... n do
         // OpenMP
  2      x^T = A_II^(1)(k, :) − e_k;
         // OpenMP
  3      y = x^T A_inv;
         // Loop parallelized by OpenMP over i
         for i = 1 ... n do
             for j = 1 ... n do
                 A_inv(j, i) = A_inv(j, i) − y_i A_inv(j, k)/(y_k + 1);
  4  (A_II^(1))^(-1) = A_inv;

In the sequential implementation, LU-factorization is faster than the Sherman-Morrison method; however, LU-factorization does not parallelize well. The parallel version of the Sherman-Morrison method provided a parallel acceleration of about two.
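Below is a hedged C++/OpenMP sketch of the update loop of Algorithm 4 for a dense row-major matrix. It follows the rank-one update formula stated in the algorithm, stores column k of A_inv in a temporary so that the update uses the pre-update values, and assumes the denominators 1 + y_k stay nonzero; it is an illustration only, not the paper's CSR-based implementation.

```cpp
#include <cstddef>
#include <vector>
#include <omp.h>

// Sherman-Morrison style inversion of a dense n x n matrix A (row-major,
// a(i,j) = A[i*n + j]).  Starting from Ainv = I, step k replaces row k of the
// implicitly inverted matrix by row k of A and applies the rank-one correction
//   Ainv <- Ainv - (Ainv e_k)(x^T Ainv) / (1 + (x^T Ainv)_k),  x^T = A(k,:) - e_k.
std::vector<double> sherman_morrison_inverse(const std::vector<double>& A, std::size_t n)
{
    std::vector<double> Ainv(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) Ainv[i*n + i] = 1.0;     // Ainv = I

    std::vector<double> x(n), y(n), col_k(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j)                      // x^T = A(k,:) - e_k
            x[j] = A[k*n + j] - (j == k ? 1.0 : 0.0);

        #pragma omp parallel for                                 // y = x^T * Ainv
        for (std::ptrdiff_t j = 0; j < static_cast<std::ptrdiff_t>(n); ++j) {
            double s = 0.0;
            for (std::size_t m = 0; m < n; ++m) s += x[m] * Ainv[m*n + j];
            y[j] = s;
        }

        const double denom = 1.0 + y[k];                         // assumed nonzero
        for (std::size_t j = 0; j < n; ++j) col_k[j] = Ainv[j*n + k];  // save column k

        #pragma omp parallel for                                 // rank-one update
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
            for (std::size_t j = 0; j < n; ++j)
                Ainv[j*n + i] -= y[i] * col_k[j] / denom;
    }
    return Ainv;
}
```

As in Algorithm 4, the two inner loops parallelize naturally over columns, which is what gives the method its advantage over LU-factorization despite the larger sequential cost.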
To reduce the number of operations and the computational load, the matrix ^i(A_II^(1))^(-1) is stored in its entirety until the construction of S_i^(1) in the current subdomain is completed, since it is not known in advance how many non-zero elements the inverse matrix contains. To solve the system of equations on the inner subdomain nodes, ^iA_II^(1) ^iu_I = ^if_I − ^iA_IB^(1) u_B, Krylov subspace methods [7] are used.

In the future, parallelization on the GPU is planned. To increase the parallelism, the order of formation of the local boundary stiffness matrices will be given by the layer-by-layer partitioning. Thus, a block of parallel GPU threads will process one row of the matrix S_i^(1). In total it will be possible to generate κ_i blocks that can write to different places of the same array without overlap and conflicts, owing to the layer-by-layer ordering.

5 Solving the S^(1) u_B = f̃_B System

The implementation of the conjugate gradient method for solving the system S^(1) u_B = f̃_B on one of the n_p computing nodes is shown in Algorithm 5. Here the system matrix is represented in the form S^(1) = Σ_{k=1..n_p} S_k^(2); the index k corresponding to the number of the computational module is omitted.

Algorithm 5: Preconditioned conjugate gradient method for the distributed matrix.
  1  u, r, p, q, z ∈ R^{n_B};
  2  i = 0;
  3  r_0 = f̃_B;
  4  u_0 ← 0;
  5  z_0 = M r_0;                        // M ≈ (S^(1))^(-1) -- preconditioner
  6  p_0 = z_0;
  7  ρ_0 = (r_0, z_0);                   // sync point
     while ||r_i||_2 / ||f||_2 > ε do
  8      Assemble p_i;                   // using MPI::Allreduce
  9      q_i = S^(2) p_i;                // k independent operations on each MPI process
 10      α_i = (r_i, z_i)/(q_i, p_i);    // MPI::Allreduce leads to a sync point
 11      u_{i+1} = u_i + α_i p_i;
 12      r_{i+1} = r_i − α_i q_i;
 13      z_{i+1} = M r_{i+1};            // M ≈ (S^(1))^(-1) -- preconditioner
 14      ρ_{i+1} = (r_{i+1}, z_{i+1});   // MPI::Allreduce leads to a sync point
 15      β_{i+1} = ρ_{i+1}/ρ_i;
 16      p_{i+1} = r_{i+1} + β_{i+1} p_i;
 17      i = i + 1;

In the matrix-vector product (see Steps 8-9 of Algorithm 5), the operation q_i = S^(1) p_i is performed in two steps: first the vector p_i is assembled from all MPI processes, and then n_p independent operations (q_k)_i = S_k^(2) p_i are carried out. The result of the n_p products is a set of vectors q_k such that q_k ∈ R^{n_B} and q = Σ_{k=1..n_p} q_k. Summation of the vector q is not required: the vector is stored in parts on the various compute nodes.

Parallelization of the operations with vectors is realized as follows: the operands are the parts of the vectors consisting of the components corresponding to the mesh nodes of the subdomain Ω_k^(2) of each parallel MPI process. Assembly of the result vector is not required; each MPI process works with its local vectors, and synchronization occurs when the scalar products (r, z), (q, p), (r, r) are computed. Scalar products are performed in two steps: first the local scalar products over the vector parts are computed in the parallel MPI processes, and then the local sums are summed (at this step, implicit synchronization occurs). Inside each MPI process, the vector operations are parallelized with OpenMP, the number of threads being equal to the number of CPU cores of the computational node. For parallelization on the GPU, it is planned to use an approach similar to that presented in paper [2].
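The two communication patterns of Algorithm 5 can be sketched as follows. The LocalSchurBlock type, its dense spmv, and the convention that each process passes its own contributions to p (zeros elsewhere) are illustrative assumptions; the paper stores S_k^(2) in CSR format and distributes the vectors over the subdomains Ω_k^(2).

```cpp
#include <cstddef>
#include <vector>
#include <mpi.h>

// Illustrative stand-in for the local block S_k^(2) (dense here; CSR in the paper).
struct LocalSchurBlock {
    std::size_t n_B;                 // number of boundary unknowns
    std::vector<double> S;           // row-major n_B x n_B block
    std::vector<double> spmv(const std::vector<double>& p) const {
        std::vector<double> y(n_B, 0.0);
        for (std::size_t i = 0; i < n_B; ++i)
            for (std::size_t j = 0; j < n_B; ++j)
                y[i] += S[i*n_B + j] * p[j];
        return y;
    }
};

// Steps 8-9 of Algorithm 5: assemble p from all processes with MPI_Allreduce,
// then apply only the local block; the result q stays distributed (no global sum).
std::vector<double> distributed_matvec(const LocalSchurBlock& Sk,
                                       const std::vector<double>& p_local_contrib,
                                       MPI_Comm comm)
{
    std::vector<double> p(Sk.n_B, 0.0);
    MPI_Allreduce(p_local_contrib.data(), p.data(), static_cast<int>(Sk.n_B),
                  MPI_DOUBLE, MPI_SUM, comm);
    return Sk.spmv(p);               // n_p independent local products (q_k) = S_k^(2) p
}

// Scalar product over a distributed vector: local partial sums first (OpenMP),
// then one MPI_Allreduce -- the implicit synchronization point of Algorithm 5.
double distributed_dot(const std::vector<double>& r_local,
                       const std::vector<double>& z_local, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(r_local.size()); ++i)
        local += r_local[i] * z_local[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```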
6 Results

Numerical experiments were carried out on a test problem of the theory of elasticity. The computational domain had the form of a parallelepiped. The mesh consisted of 539000 hexahedral cells. Eight OpenMP threads were run inside each MPI process. Table 2 presents the results of the experiment for 512 and 1024 subdomains using 2, 4 and 6 MPI processes. The presented results were obtained on 6 computational nodes, each of which contains two Intel Xeon E5-2609 processors and 64 GB of RAM. For n_Ω = 512 the problem took approximately 100 GB of RAM, and for n_Ω = 1024 approximately 80 GB.

Table 2. Time of construction and solution in the substructuring method.

              Construction Σ S_i        Solve S^(1) u_B = f̃_B     Solve ^iA_II ^iu_I = ^if_I − ^iA_IB u_B
  n_p         LU, sec.     SM, sec.     Iterations   Time, sec.   Iterations   Time, sec.
  n_Ω = 512
  2           34954.6      14936.8      2324         1640.37      43           4.72359
  4           18086.6      7615.15      2325         1391.23      43           2.60635
  6           11713.4      5014.66      2325         1498.37      43           1.9185
  n_Ω = 1024
  2           5128.47      2818.16      2600         1464.09      35           3.15815
  4           2622.52      1411.97      2600         1397.77      35           2.10205
  6           1802.32      965.803      2600         1469.18      35           1.74885

The considered algorithm significantly reduces the memory limits of the substructuring method. The partitioning of the matrix S^(1) and its distributed storage make it possible to realize coarse-grained parallelism at the formation stage of the global boundary stiffness matrix, to reduce exchanges, and to exclude synchronous access to the same memory cell during the addition of the local matrices. At the same time, increasing the number of involved computational modules permits an acceleration close to linear at the matrix construction stage. At the solver stage, the distributed storage of the matrix S^(1) allows solving systems with a dimension proportional to the total memory of the involved computational modules. Scalability is limited by the assembly of the results of the scalar products of vectors and depends only on the speed of data exchange over the network.

References

1. Toselli, A., Widlund, O.B.: Domain Decomposition Methods — Algorithms and Theory. Springer Series in Computational Mathematics, vol. 34 (2005)
2. Kopysov, S.P., Kuzmin, I.M., Nedozhogin, N.S., Novikov, A.K., Sagdeeva, Y.A.: Hybrid Multi-GPU solver based on Schur complement method. Lecture Notes in Computer Science, vol. 7979, pp. 65-79 (2013)
3. Amestoy, P.R., Buttari, A., L'Excellent, J.-Y., Mary, T.: On the Complexity of the Block Low-Rank Multifrontal Factorization. SIAM Journal on Scientific Computing, vol. 39(4), pp. A1710-A1740 (2017)
4. Chen, C., Pouransari, H., Rajamanickam, S., Boman, E.G., Darve, E.: A distributed-memory hierarchical solver for general sparse linear systems. Parallel Computing, vol. 74, pp. 49-65 (2018)
5. Novikov, A.K., Piminova, N.K., Kopysov, S.P., Sagdeeva, Y.A.: Layer-by-layer ordering in parallel finite composition on shared-memory multiprocessors. IOP Conf. Ser.: Mater. Sci. Eng., vol. 158(1), p. 158 (2016)
6. Sherman, J., Morrison, W.J.: Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix. Ann. Math. Statist., vol. 21(1), pp. 124-127 (1950)
7. Saad, Y.: Iterative Methods for Sparse Linear Systems. Boston: PWS Publishing Company, p. 447 (1996)