=Paper=
{{Paper
|id=Vol-1482/539
|storemode=property
|title=Using multifrontal hierarchically solver and HPC systems for 3D Helmholtz problem
|pdfUrl=https://ceur-ws.org/Vol-1482/539.pdf
|volume=Vol-1482
}}
==Using multifrontal hierarchically solver and HPC systems for 3D Helmholtz problem==
Sergey Solovyev (1), Dmitry Vishnevsky (1), Hongwei Liu (2)

(1) Institute of Petroleum Geology and Geophysics SB RAS; (2) EXPEC ARC, Geophysics Technology, Saudi Aramco

The research described was partially supported by CRDF grant RUE1-30034-NO-13 and RFBR grants 14-01-31340, 14-05-31222, 14-05-00049, 15-35-20022.

We present a multifrontal hierarchically semi-separable solver for forward modeling of the 3D acoustic Helmholtz problem. Our frequency-domain solver combines two efficient approaches. First, it uses an optimal 27-point finite-difference scheme to decrease numerical dispersion and reduce the required discretization of the model from 15 to about 4 points per wavelength. Second, it uses a supernodal multifrontal method based on low-rank approximation and a hierarchically semi-separable (HSS) structure to improve performance, decrease memory usage, and make the solver practical for the realistic 3D model sizes required by full-waveform inversion. We validated the new solver and tested its performance on a 3D synthetic model. Performance and OpenMP scalability of the solver were compared with Intel MKL PARDISO.

1. Introduction

Frequency-domain modelling and inversion are at the heart of many modern geophysical tools, such as full-waveform inversion of seismic data. Here we focus on developing an efficient algorithm for solving the Helmholtz problem in 3D heterogeneous media, a prerequisite for acoustic full-waveform inversion of marine and land seismic data. We present an algorithm based on two features: 1) a 27-point finite-difference approximation scheme to minimize dispersion errors, and 2) a low-rank approximation technique combined with a hierarchically semi-separable (HSS) structure in the multifrontal direct solver to decrease the computational resources required. Perfectly matched layer (PML) boundary conditions are used in this study. Employing a 27-point scheme instead of the simpler 7-point one allows decreasing the required model discretization from 15 to 4 points per wavelength while keeping the dispersion error below 1%. The quality of the low-rank compression is improved by using a cross-approximation (CA) approach [6]. We validate the algorithm with numerical tests and confirm the accuracy and performance improvements. The high accuracy and memory efficiency of the proposed algorithm are demonstrated using the 3D overthrust benchmark model. The current implementation requires three times fewer computational resources than other sparse direct solvers. As a result, the proposed algorithm significantly decreases the computational resources required for solving realistic 3D acoustic problems. Furthermore, the multifrontal solver can be efficiently parallelized on both shared-memory (via OpenMP) and distributed-memory (via MPI) systems.

2. Optimal 27-point finite-difference scheme

The scalar wave equation for a homogeneous isotropic medium in the frequency domain can be written as

\Delta u + \frac{(2\pi\nu)^2}{V^2} u = \delta(r - r_s)\, f \qquad (1)

where ν is the temporal frequency, V is the acoustic velocity, r_s denotes the source coordinates, and f denotes the source function. The standard finite-difference approximation on a parallelepiped grid uses a 7-point stencil.
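For concreteness, the sketch below (Python/SciPy, not the authors' code) assembles this standard 7-point discretization of equation (1) on a small uniform grid with a constant velocity and zero Dirichlet boundaries instead of PML; the grid size, step, velocity, and frequency are illustrative values only.

```python
import numpy as np
import scipy.sparse as sp

# Illustrative parameters (not taken from the paper).
nx = ny = nz = 20                    # grid points per axis
h = 30.0                             # grid step, m
v = 2500.0                           # constant acoustic velocity, m/s
freq = 4.0                           # frequency, Hz
k2 = (2.0 * np.pi * freq / v) ** 2   # (2*pi*nu / V)^2 from equation (1)

def d2(n):
    """1D second-derivative operator, standard 3-point stencil."""
    return sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n)) / h**2

def eye(n):
    return sp.identity(n)

# 3D 7-point Laplacian assembled from Kronecker products, plus the k^2 term.
lap = (sp.kron(sp.kron(d2(nx), eye(ny)), eye(nz))
       + sp.kron(sp.kron(eye(nx), d2(ny)), eye(nz))
       + sp.kron(sp.kron(eye(nx), eye(ny)), d2(nz)))
A = (lap + k2 * sp.identity(nx * ny * nz)).tocsc()

# Point source (the delta function on the right-hand side of equation (1)).
b = np.zeros(nx * ny * nz)
b[(nx // 2) * ny * nz + (ny // 2) * nz + nz // 2] = 1.0

print(A.shape, A.nnz)                # about 7 nonzero entries per interior row
```

With PML boundary conditions and the 27-point scheme described below, the matrix becomes complex-valued with 27 nonzero diagonals, but it keeps the same sparse, factorizable structure.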
The standard 7-point scheme yields a second-order approximation that exhibits noticeable dispersion errors in the frequency-domain solution (Figure 1).

Figure 1. Dispersion curves of the second-order approximation for the 7-point scheme. The numerical phase velocity Vph is normalized by the true velocity V and plotted versus 1/G, where G is the number of grid points per wavelength.

Dispersion analysis confirms that at least 15 points per wavelength are required to reduce the dispersion to a practically acceptable level of less than 1%. In real-world 3D geophysical problems this results in a huge computational grid and greatly increased computational time. There are various rotated schemes for decreasing the dispersion error in both the 2D [4] and 3D [8] cases. We suggest a different approach that is easier to implement but provides the same accuracy. To approximate the Laplace operator we use a combination of three stencils: the first uses the mid-face points, the second the mid-edge points, and the third the corner points (Figure 3).

Figure 3. Three stencils for approximating the second-order derivative along the X axis.

To approximate the wavenumber term we use the four stencils presented in Figure 4.

Figure 4. Four stencils for the approximation of the wavenumber.

As a result we obtain a universal scheme with seven parameters:

\gamma_1 \Delta_1 u + \gamma_2 \Delta_2 u + \gamma_3 \Delta_3 u + \frac{(2\pi\nu)^2}{V^2}\left[ w_1 u_{0,0,0} + \frac{w_2}{6}\sum_{St_2} u_{k_1,k_2,k_3} + \frac{w_3}{12}\sum_{St_3} u_{k_1,k_2,k_3} + \frac{w_4}{8}\sum_{St_4} u_{k_1,k_2,k_3} \right] \qquad (2)

where Δ1, Δ2, and Δ3 correspond to the three stencils from Figure 3, and St2, St3, and St4 are the sets of "bold" points of the stencils from Figure 4. Additionally, the coefficients must satisfy the constraints \sum_{i=1}^{3} \gamma_i = 1 and \sum_{i=1}^{4} w_i = 1.

Dispersion analysis gives an explicit formula for the phase velocity V_{ph} = V_{ph}(\gamma w; G, \varphi, \theta), where \gamma w = (\gamma_1, \gamma_2, \gamma_3, w_1, w_2, w_3, w_4), G is the number of points per wavelength, and (φ, θ) is the direction of propagation in spherical coordinates. The optimal parameters are found by minimizing the functional

J(\gamma w) = \iiint_{G \in [G_{\min}, G_{\max}],\ \varphi \in [0, \pi/2],\ \theta \in [0, \pi/4]} \left( V_{ph}(\gamma w)/V - 1 \right)^2 \, dG \, d\varphi \, d\theta \qquad (3)

which represents the average error over all propagation directions; the resulting optimal parameters γw are (0.656, 0.294, 0.050; 0.576, 0.187, 0.359, 0.122). Analysis of the numerical dispersion of this scheme shows that four points per wavelength are enough to achieve a dispersion error in the wave propagation velocity of about 0.5% (Figure 2).

Figure 2. Dispersion curves of the second-order approximation for the optimized 27-point scheme.

The proposed scheme, together with PML boundary conditions, leads to a complex sparse symmetric (non-Hermitian) 27-diagonal matrix. Solving such a system of linear algebraic equations (SLAE) is usually done by a direct approach, here using a nested dissection algorithm and the low-rank/HSS technique. For realistic problems the SLAE typically has many right-hand sides (RHS), corresponding to the number of physical sources. The most compute-intensive stage of the solver, the decomposition of the matrix, is performed once and then reused to solve the SLAE for the various RHS.

3. Multifrontal low-rank solver

The direct solver is based on a decomposition of the matrix A. The traditional way of improving the performance of the decomposition is row/column reordering, which corresponds to renumbering the points of the computational grid.
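As a small illustration of why reordering matters, the sketch below (Python/SciPy; it uses SuperLU's built-in COLAMD ordering rather than the nested dissection ordering discussed next, and a 2D Laplacian rather than the Helmholtz matrix) compares the number of nonzeros in the triangular factors with and without a fill-reducing permutation. All sizes are illustrative.

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Toy problem: 2D 5-point Laplacian on an n-by-n grid (illustrative size only).
n = 60
d2 = sp.diags([1.0, -2.0, 1.0], [-1, 0, 1], shape=(n, n))
A = sp.kronsum(d2, d2).tocsc()            # n^2-by-n^2 sparse matrix

# LU factorization with the natural (grid) ordering vs. a fill-reducing one.
lu_natural = splu(A, permc_spec="NATURAL")
lu_colamd = splu(A, permc_spec="COLAMD")

def factor_nnz(lu):
    return lu.L.nnz + lu.U.nnz            # nonzeros stored in the factors

print("A nnz:            ", A.nnz)
print("factors, natural: ", factor_nnz(lu_natural))
print("factors, COLAMD:  ", factor_nnz(lu_colamd))   # noticeably fewer
```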
One of the most effective methods is the so-called nested dissection algorithm [3], which uses a binary tree of "separator" sets to split the grid into unconnected subdomains. As a result, the columns of the matrix corresponding to grid nodes from different subdomains can be factorized independently; such a process is referred to as "multifrontal". Moreover, after this preliminary reordering the L-factor contains fewer non-zero elements than without it. The pattern of such an L-factor is shown in Figure 5 (left).

To improve the multifrontal process we use low-rank approximation, based on the fact that large off-diagonal blocks can be efficiently approximated by low-rank matrices [1], while diagonal blocks can be effectively represented in hierarchically semi-separable (HSS) format [9]. The corresponding pattern of the L-factor is presented in Figure 5 (right). Matrix operations in the multifrontal method are performed in low-rank arithmetic, so the computational resources can be significantly decreased [5]. To compress dense blocks into low-rank or HSS structures we use a panel modification of the cross-approximation (CA) approach [6]; a toy numerical illustration of this kind of compression is given before Table 1 below. The inversion step uses iterative refinement to achieve accuracy comparable with that of exact direct solvers.

Figure 5. Nested dissection reordering: pattern of the L-factor in exact arithmetic (left) and after low-rank approximation with HSS diagonal blocks (right).

4. Numerical experiments

Let us evaluate the performance and precision of our algorithm using a numerical test. Performance is measured on a single node with Intel Xeon E5-2690 v2 (Ivy Bridge EP) 3.00 GHz processors and 512 GB of RAM, using 8 threads. A realistic 3D velocity model (the 3D overthrust benchmark) is used. The velocity varies from 2300 m/s to 6000 m/s. The target frequencies range from 1 Hz to 8 Hz, with a source at a corner point. The spatial discretization is 30 m in all directions, and the size of the PML boundary varies from 25 grid points at 1 Hz to 10 grid points at 8 Hz. The size of the computational grid therefore varies from 351x351x201 to 321x321x171.

First, we compare the 7-point and 27-point schemes. We observe that at 10 grid points per wavelength (the 8 Hz test) the standard 7-point scheme delivers unacceptable results (Figure 6, left) with very high numerical dispersion, whereas the 27-point scheme provides good results (Figure 6, right). The wavefield from the 27-point scheme is nearly identical to a numerical solution computed with the Seiscope time-domain solver, which thus serves as a useful reference.

Figure 6. Real part of the wavefield error at 8 Hz computed with the 7-point scheme (left) and the optimized 27-point scheme (right). Note the unacceptable dispersion on the left, caused by insufficient discretization (10 points per wavelength).

Next, let us evaluate the computational resources required by the 27-point and 7-point schemes using conventional arithmetic and low-rank approximation (Table 1).
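Table 1 contrasts the memory needed for exact-arithmetic factors with that of the low-rank factors. As a toy illustration of where the saving comes from, the sketch below compresses a single numerically low-rank block with a truncated SVD (the actual solver uses cross-approximation and HSS, not SVD) and reports the storage ratio; the block size, rank, and tolerance are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense block that is numerically low-rank, standing in for a large
# off-diagonal block of the L-factor (sizes and rank are made up).
m, n, true_rank = 600, 600, 25
block = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))

# Truncated-SVD compression to a relative tolerance. The solver described in
# the paper uses cross-approximation and HSS instead of a full SVD.
tol = 1e-8
u, s, vt = np.linalg.svd(block, full_matrices=False)
r = int(np.sum(s > tol * s[0]))
U, V = u[:, :r] * s[:r], vt[:r, :]        # block is approximately U @ V

dense_storage = m * n
lowrank_storage = r * (m + n)
rel_err = np.linalg.norm(block - U @ V) / np.linalg.norm(block)
print(f"rank {r}: {lowrank_storage} vs {dense_storage} stored entries "
      f"({lowrank_storage / dense_storage:.1%}), relative error {rel_err:.1e}")
```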
Table 1. Memory usage for storing the L-factors (8 Hz) and solution times.

                          7-point     27-point
Exact arithmetic          848 GB      880 GB
Low-rank approximation    256 GB      272 GB
Factorization             26 339 s    28 741 s
Solution, 5 iterations    258 s       288 s

5. Parallel computations

To parallelize the computations we use both MPI and OpenMP approaches. The MPI parallelization of the factorization step is based on the elimination tree structure produced by the 3D nested dissection reordering: different nodes of the elimination tree can be factorized in parallel on different cluster nodes [7]. Note that the MPI version of the program is currently under development.

The OpenMP parallelization of our solver relies on the threaded BLAS and LAPACK functions from Intel MKL. We tested the OpenMP scalability and compared the performance of our solver with the direct solver PARDISO from Intel MKL (Table 2). The test case is a 7-point finite-difference approximation of the Helmholtz equation (12 Hz) with a PML layer in a cube. The number of grid points is 3.4*10^6 (151x151x151), so this problem can also be solved by a conventional direct solver on our system with 512 GB of RAM.

Table 2. Factorization time and solution time (application of the inverse LDL^t factors) for 10 right-hand sides.

                            PARDISO     Low-rank solver
Factorization (1 thread)    13 594 s    4 072 s
Factorization (8 threads)   2 387 s     1 941 s
Solution (1 thread)         66 s        28 s
Solution (8 threads)        32 s        14 s

The table shows a more than threefold performance gain of the single-threaded factorization of our solver compared with Intel MKL PARDISO. However, PARDISO scales better, so on 8 threads the factorization of our solver is only about 20% faster than PARDISO. The scalability of the solution step is similar for the low-rank solver and PARDISO, roughly a factor of two from 1 to 8 threads, so the solution step of our solver remains more than twice as fast as PARDISO. We plan to improve the parallelization of the factorization step by a two-level approach combining parallel low-rank compression of different blocks of the L-factors with the OpenMP parallelization of the MKL BLAS/LAPACK functions; a toy sketch of this idea is given below.
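The following sketch (Python, not the authors' implementation) illustrates the two-level idea under simplified assumptions: independent dense blocks, standing in for fronts from different nodes of the elimination tree, are compressed concurrently, while each compression itself calls into threaded LAPACK through NumPy. Block sizes, ranks, and thread counts are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def compress_block(block, tol=1e-8):
    """Truncated-SVD compression of one dense block (stand-in for CA/HSS)."""
    u, s, vt = np.linalg.svd(block, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))
    return u[:, :r] * s[:r], vt[:r, :]      # block is approximately U @ V

rng = np.random.default_rng(1)
# Independent dense blocks, standing in for fronts from different nodes of the
# elimination tree (sizes and ranks are made up).
blocks = [rng.standard_normal((500, 30)) @ rng.standard_normal((30, 500))
          for _ in range(8)]

# Outer level: one task per block. NumPy's LAPACK calls release the GIL, so
# the tasks can overlap; the inner level is the threading inside BLAS/LAPACK
# itself. In practice the BLAS thread count per task would be capped to avoid
# oversubscription.
with ThreadPoolExecutor(max_workers=4) as pool:
    factors = list(pool.map(compress_block, blocks))

print([U.shape[1] for U, V in factors])     # recovered ranks, about 30 each
```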
6. Conclusions

We developed a high-performance algorithm for the numerical solution of the Helmholtz problem in 3D heterogeneous media. The main advantage of the algorithm is the reduction of the number of grid points per wavelength enabled by the optimal 27-point finite-difference scheme, which results in significant memory savings. The second advantage is the use of an efficient multifrontal direct solver based on a low-rank approximation technique and a hierarchically semi-separable (HSS) structure. The high accuracy and memory efficiency of the proposed scheme are demonstrated on a realistic overthrust benchmark model, which cannot be handled by conventional direct solvers on current systems because of memory limitations. Overall, the proposed algorithm significantly reduces the required computational resources and enables solving the large-scale 3D acoustic problems needed for FWI. Performance benefits of the multifrontal solver over Intel MKL PARDISO were demonstrated, and ways of improving the OpenMP scalability and developing the MPI parallelization were outlined.

References

1. Chandrasekaran S., Dewilde P., Gu M., Somasunderam N. On the numerical rank of the off-diagonal blocks of Schur complements of discretized elliptic PDEs // SIAM J. Matrix Anal. Appl. 2010. 31(5). 2261–2290.
2. Collino F., Tsogka C. Application of the perfectly matched layer absorbing layer model to the linear elastodynamic problem in anisotropic heterogeneous media // Geophysics. 2001. 66. 294–307.
3. George A. Nested dissection of a regular finite element mesh // SIAM Journal on Numerical Analysis. 1973. 10, N 2. 345–363.
4. Jo C., Shin C., Suh J. An optimal 9-point finite-difference frequency-space 2-D scalar wave extrapolator // Geophysics. 1996. 61. 294–307.
5. Xia J. Robust and efficient multifrontal solver for large discretized PDEs // High-Performance Scientific Computing. 2012. 199–217.
6. Rjasanow S. Adaptive cross approximation of dense matrices // IABEM 2002, International Association for Boundary Element Methods, UT Austin, TX, USA, May 28–30, 2002.
7. Solovyev S.A. Application of the low-rank approximation technique in the Gauss elimination method for sparse linear systems // Vychisl. Metody Programm. 2014. 15. 441–460 (in Russian).
8. Wang S., de Hoop M.V., Xia J., Li X.S. Massively parallel structured multifrontal solver for time-harmonic elastic waves in 3D anisotropic media // Project Review, Geo-Mathematical Imaging Group. Vol. 1. Purdue University, West Lafayette, IN, 2012. 97–121.
9. Hackbusch W. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices // Computing. 1999. 62(2). 89–108.