Parallelization of a finite difference scheme for solving systems of 2D Sine-Gordon equations

I. G. Hristov^{1,2,a}, R. D. Hristova^{1,2}, S. N. Dimova^{1}

^1 University of Sofia "St. Kliment Ohridski", 5 James Bourchier Blvd, Sofia 1164, Bulgaria
^2 Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna 141980, Russia

E-mail: ^a christov_ivan@abv.bg

Systems of perturbed 2D Sine-Gordon equations coupled via a cyclic tridiagonal matrix are solved numerically by a second-order centered finite difference scheme. The systems are considered on rectangular domains. First, an OpenMP parallel program is realized and very good performance scalability inside one computational node is achieved. The tests on one computational node of the CPU platform of the IICT-BAS cluster, Bulgaria, show a speedup of about 11, and on the CPU part of one node of the HybriLIT cluster, JINR, Russia, a speedup of about 30. The OpenMP parallelization effect is then multiplied by a fully hybrid strategy: one MPI process per cluster node and OpenMP on the cores of the node. The hybrid MPI+OpenMP parallelization is realized as follows. A domain decomposition is made and ghost points are introduced to support the data exchange between MPI processes. Inside a subdomain the work is shared between OpenMP threads. At every mesh point a system with one and the same cyclic tridiagonal matrix is solved; solving the system with this matrix is the smallest piece of work distributed between threads. Only the master threads communicate, outside the parallel OpenMP region, with ordered MPI sends and receives. Computations on 12 nodes of the CPU platform of the IICT-BAS cluster show a significant speedup, up to 116.

Keywords: Systems of 2D Sine-Gordon equations, Finite difference method, OpenMP, MPI, Hybrid MPI+OpenMP parallelization

We gratefully acknowledge the opportunity to use the computational resources of the HybriLIT cluster and the IICT-BAS cluster. This work is supported by the National Science Fund of Bulgaria under Grant DFNI-I02/8/2014 and by the National Science Fund of BMSE under grant I02/9/2014.

© 2016 Ivan G. Hristov, Radoslava D. Hristova, Stefka N. Dimova

Introduction

The number of computational clusters in the world is rapidly increasing, together with the total number of cores per cluster. The standard parallel programming technologies MPI and OpenMP evolve together with the evolution of the computational resources. As a result, using serious computational resources nowadays becomes more accessible, and parallel programming becomes more portable and efficient. This gives the opportunity to investigate more easily large-scale mathematical models of practical interest and also to study theoretically many classical problems of mathematical physics in higher dimensions.

In this work we solve numerically a system of 2D perturbed Sine-Gordon equations. On the one hand, solving this system is important theoretically, because it is a generalization of the classical perturbed 1D Sine-Gordon equation. As a result of considering systems of equations, new 2D effects are expected and new types of solutions without analogues in the 1D case are possible. On the other hand, solving this system is of strong practical interest, because it models the phase dynamics in the so-called Intrinsic Josephson Junctions (IJJs) [Sakai, Bodin, Pedersen, 1993]. IJJs are a potential source of THz radiation. Powerful THz radiation from IJJs was first achieved experimentally in 2007 [Ozyuzer et al., 2007].
It was later understood that standing waves of plasma oscillations are built up in the cavity formed by the side surfaces of the crystal, exactly as in a laser. In 2008 standing waves were first found in a computer simulation [Lin, Hu, 2008]. Although the leading role of the excitation of cavity oscillations is well understood [Tsujimoto et al., 2010], the mechanism of the powerful THz radiation is still not fully explained. For further investigation of this mechanism, very long time simulations of a quasi-3D model (systems of 2D equations) are needed. In some cases the computational domain may be very large: 10^6-10^8 mesh points with 10^8-10^9 time steps. Thus the problem is quite large and serious computational resources are really needed.

This work concentrates mainly on the computational aspect of solving the problem. The goals of the work are:

1. To make an OpenMP realization of a difference scheme for solving systems of 2D perturbed Sine-Gordon equations.
2. To test the performance scalability of the OpenMP program on the CPU part of one computational node of the HybriLIT cluster and on one node of the CPU platform of the IICT-BAS cluster.
3. To multiply the parallelization effect by a domain decomposition and a fully hybrid strategy: one MPI process per cluster node and OpenMP on the cores of the node (provided good OpenMP performance scalability inside one node is achieved).
4. To test the performance scalability of the hybrid MPI+OpenMP program on multiple nodes of the CPU platform of the IICT-BAS cluster.

Mathematical model

We consider systems of 2D perturbed Sine-Gordon equations:

    S\,(\varphi_{tt} + \alpha\,\varphi_t + \sin\varphi - \gamma) = \Delta\varphi, \qquad (x, y) \in \Omega \subset R^2.    (1)

Here Δ is the 2D Laplace operator and Ω is a given domain in R^2. S is the N_eq × N_eq cyclic tridiagonal matrix

    S = \begin{pmatrix}
          1 & s & 0 & \cdots & 0 & s \\
          s & 1 & s & 0 & \cdots & 0 \\
            & \ddots & \ddots & \ddots & & \\
          0 & \cdots & 0 & s & 1 & s \\
          s & 0 & \cdots & 0 & s & 1
        \end{pmatrix},

where -0.5 < s < 0 and N_eq is the number of equations. The unknown is the column vector φ(x, y, t) = (φ_1, ..., φ_{N_eq})^T. Neumann boundary conditions are considered:

    \left.\frac{\partial\varphi}{\partial n}\right|_{\partial\Omega} = 0.    (2)

In (2) n denotes the exterior normal to the boundary ∂Ω. To close the problem (1), (2), appropriate initial conditions are posed. The choice of the initial conditions is not a trivial task and will be discussed in a future work. The model (1), (2) describes very well the phase dynamics in N_eq periodically stacked IJJs. The parameter s represents the inductive coupling between adjacent Josephson junctions, α is the dissipation parameter, and γ is the external current. All the units are normalized as in [Sakai, Bodin, Pedersen, 1993].

Numerical scheme

We follow the approach for the construction of the numerical scheme from [Kazacha, Serdyukova, 1993]. We solve the problem numerically in rectangular domains by using second-order central finite differences with respect to all derivatives in equation (1). Let τ be the step size in time, h the step size in both the x and y space directions, N_x the number of nodes in the x direction and N_y the number of nodes in the y direction. The mesh nodes are

    x_k = kh,\; k = 0, \ldots, N_x - 1, \qquad y_i = ih,\; i = 0, \ldots, N_y - 1, \qquad t_j = j\tau,\; j = 0, 1, \ldots,

and z^j_{k,i} ≈ φ(x_k, y_i, t_j) denotes the approximate solution. Writing z, ẑ and ž for the values on the current, the new and the old time levels, the finite difference equations that approximate system (1) at the interior mesh points are

    S\left(\frac{\hat z_{k,i} - 2 z_{k,i} + \check z_{k,i}}{\tau^2}
         + \alpha\,\frac{\hat z_{k,i} - \check z_{k,i}}{2\tau}
         + \sin z_{k,i} - \gamma\right)
    = \frac{z_{k+1,i} + z_{k-1,i} + z_{k,i+1} + z_{k,i-1} - 4\,z_{k,i}}{h^2}.

For every k = 1, ..., N_x - 2, i = 1, ..., N_y - 2 a system with the cyclic tridiagonal matrix S is solved with respect to ẑ^l_{k,i}, l = 1, ..., N_eq, using Gaussian elimination. The diagonal dominance of the matrix S (-0.5 < s < 0) ensures the stability of the algorithm. Because of the specific tridiagonal structure of S, only about 9 N_eq floating point operations are needed to solve one system.
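The elimination procedure itself is not listed in the paper. For illustration only, the following is a minimal Fortran sketch of one standard way to solve such a cyclic tridiagonal system: the Thomas algorithm combined with a Sherman-Morrison correction for the two corner elements. The routine name and interface are assumptions made for this example, and the operation count of this variant differs from that of the tailored Gaussian elimination described above.

    subroutine solve_cyclic_s(n, s, d, x)
      ! Illustrative sketch (not the authors' routine): solve S*x = d, where S is
      ! the n x n cyclic tridiagonal matrix with 1 on the main diagonal and s on
      ! the sub-diagonal, the super-diagonal and the two corners (-0.5 < s < 0),
      ! via the Thomas algorithm plus a Sherman-Morrison correction (n >= 3).
      implicit none
      integer, intent(in)  :: n
      real(8), intent(in)  :: s, d(n)
      real(8), intent(out) :: x(n)
      real(8) :: diag(n), u(n), y(n), q(n), fact
      ! Split S = B + u*v^T with v = (1, 0, ..., 0, -s) and gamma = -1:
      ! B is plainly tridiagonal, with modified first and last diagonal entries.
      diag    = 1.0d0
      diag(1) = 2.0d0          ! 1 - gamma
      diag(n) = 1.0d0 + s*s    ! 1 - s*s/gamma
      u       = 0.0d0
      u(1)    = -1.0d0         ! gamma
      u(n)    = s
      call thomas(n, s, diag, d, y)   ! B*y = d
      call thomas(n, s, diag, u, q)   ! B*q = u
      ! Sherman-Morrison correction: x = y - (v.y)/(1 + v.q) * q
      fact = (y(1) - s*y(n)) / (1.0d0 + q(1) - s*q(n))
      x = y - fact*q
    contains
      subroutine thomas(nn, offd, b, r, sol)
        ! Thomas algorithm for a tridiagonal system with constant off-diagonals.
        integer, intent(in)  :: nn
        real(8), intent(in)  :: offd, b(nn), r(nn)
        real(8), intent(out) :: sol(nn)
        real(8) :: cp(nn), dp(nn), m
        integer :: i
        cp(1) = offd / b(1)
        dp(1) = r(1) / b(1)
        do i = 2, nn
          m     = b(i) - offd*cp(i-1)
          cp(i) = offd / m
          dp(i) = (r(i) - offd*dp(i-1)) / m
        end do
        sol(nn) = dp(nn)
        do i = nn-1, 1, -1
          sol(i) = dp(i) - cp(i)*sol(i+1)
        end do
      end subroutine thomas
    end subroutine solve_cyclic_s

Since the matrix S is the same at every mesh point and every time step, in practice the factorization can be computed once and reused, which is what keeps the per-system cost proportional to N_eq with a small constant.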
To approximate the Neumann boundary conditions (2), third-order one-sided finite differences are used. At the left and the right boundaries of the rectangular domain the difference equations are

    \hat z_{0,i} = (18\,\hat z_{1,i} - 9\,\hat z_{2,i} + 2\,\hat z_{3,i})/11, \qquad
    \hat z_{N_x-1,i} = (18\,\hat z_{N_x-2,i} - 9\,\hat z_{N_x-3,i} + 2\,\hat z_{N_x-4,i})/11.

At the top and the bottom boundaries the difference equations are

    \hat z_{k,0} = (18\,\hat z_{k,1} - 9\,\hat z_{k,2} + 2\,\hat z_{k,3})/11, \qquad
    \hat z_{k,N_y-1} = (18\,\hat z_{k,N_y-2} - 9\,\hat z_{k,N_y-3} + 2\,\hat z_{k,N_y-4})/11.

The resulting approximation error of the scheme is O(h^2 + τ^2). The number of floating point operations per time step for the constructed difference scheme is O(N_x N_y N_eq). The stability condition of the scheme is

    \tau \le \frac{\sqrt{1 + 2s}\; h}{\sqrt{2}}.    (3)

In (3), 1 + 2s = λ_min, where λ_min is the minimal eigenvalue of S, and the √2 in the denominator corresponds to dimension 2. The conservativity of the scheme is checked numerically as in [Kazacha, Serdyukova, 1993].

By using the parallelization of the difference scheme described below, we carried out numerical experiments with real physical parameters. Our results are in good agreement with the numerical results from the classical works [Lin, Hu, 2008], [Hu, Lin, 2008]. As explained there, the powerful THz radiation in IJJs corresponds to a new type of solutions of problem (1), (2) with strong inductive coupling, i.e. s ≈ -0.5. The phase φ(x, y, t) in a particular equation (junction) is a sum of three terms: a term linear in time, vt (the resistive term), a static π-kink term φ_{π-kink}(x, y), and an oscillating term with zero average:

    \varphi(x, y, t) = v t + \varphi_{\pi\text{-}kink}(x, y) + \varphi_{osc}(x, y, t).

In addition, we succeeded in specifying the structure of the oscillating term. These results will be reported elsewhere.

Parallelization of the difference scheme

The OpenMP parallelization of the difference scheme from the previous section is made by a straightforward parallelization of all DO loops in our FORTRAN program using the OMP DO directive [Chapman, Jost, Van der Pas, 2008]:

    !$OMP DO [Clauses]
       Fortran DO loop to be executed in parallel
    !$OMP END DO
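The structure of the shared loop can be sketched as follows. This is only an illustration, not the authors' program: the routine, array and variable names are assumptions, the right-hand side is reduced to the discrete Laplacian to keep the sketch short, and solve_cyclic_s is the illustrative solver from the previous section. The point of the sketch is that one cyclic tridiagonal solve per mesh point is the smallest unit of work distributed among the threads.

    subroutine time_step_omp(neq, nx, ny, h, s, z, znext)
      implicit none
      integer, intent(in)  :: neq, nx, ny
      real(8), intent(in)  :: h, s
      real(8), intent(in)  :: z(neq, 0:nx-1, 0:ny-1)       ! current time level
      real(8), intent(out) :: znext(neq, 0:nx-1, 0:ny-1)   ! new time level
      integer :: i, k, l
      real(8) :: rhs(neq), znew(neq)
    !$OMP PARALLEL DO COLLAPSE(2) PRIVATE(i, k, l, rhs, znew) SCHEDULE(static)
      do i = 1, ny - 2
        do k = 1, nx - 2
          do l = 1, neq
            ! Placeholder right-hand side: the 5-point Laplacian of the current
            ! time level; the full scheme adds the remaining explicit terms here.
            rhs(l) = (z(l,k+1,i) + z(l,k-1,i) + z(l,k,i+1) + z(l,k,i-1) &
                      - 4.0d0*z(l,k,i)) / h**2
          end do
          ! Smallest piece of work per thread: one cyclic tridiagonal solve.
          call solve_cyclic_s(neq, s, rhs, znew)
          znext(:, k, i) = znew
        end do
      end do
    !$OMP END PARALLEL DO
      ! Boundary points (Neumann conditions) are handled separately.
    end subroutine time_step_omp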
Tests in double precision arithmetic are done on the CPU part of one computational node of the HybriLIT cluster [HybriLIT, 2016], part of the Multifunctional Information and Computing Complex of JINR [MICCOM, 2016]. The CPU part of one computational node of this cluster consists of 2 x Intel® Xeon® E5-2695 v2 processors (24 cores in total, 48 threads with hyper-threading). The results are given in Table 1.

Table 1

    Number of threads   Time (seconds)   Speedup   Parallel efficiency (%)
            1               549.97         1            100
            6                99.87         5.51          91.8
           12                52.79        10.42          86.8
           18                35.33        15.57          86.5
           24                26.54        20.72          86.3
           30                23.75        23.16          77.2
           36                21.51        25.57          71.0
           42                19.75        27.85          66.3
           48                18.28        30.09          62.7

The computational domain size is N_eq x N_x x N_y = 10 x 3000 x 3000 and N_time_steps = 300, where N_eq is the number of equations, N_x is the number of mesh points in the x direction, N_y is the number of mesh points in the y direction and N_time_steps is the number of steps in time. The threads in Table 1 are distributed with the scatter thread affinity [Chapman, Jost, Van der Pas, 2008]. The parallel efficiency when 24 OpenMP threads are distributed over the 24 physical cores is very good, about 86.3%. When the number of threads exceeds 24 and two threads run per core, the parallel efficiency decreases faster. Nevertheless, hyper-threading gives an additional speedup gain: the speedup reaches about 30 when all 48 hardware threads are used. The results for one node of the CPU platform of the IICT-BAS cluster are analogous. In the CPU platform of this cluster one node consists of 2 x Intel® Xeon® X5560 processors (8 cores, 16 threads with hyper-threading), and the maximal speedup, about 11, is achieved when all hardware threads are used.

Because very good scalability inside one computational node is achieved, it is natural to choose a fully hybrid MPI+OpenMP strategy [Rabenseifner, Hager, Jost, 2009] to multiply the parallelization effect when many computational nodes are used: one MPI process per cluster node and OpenMP on the cores of the node. We make a domain decomposition [Gropp, Lusk, Skjellum, 1999] with one MPI process working on one cluster node. Ghost points are introduced to support the data exchange between MPI processes. Inside a subdomain the work is shared between OpenMP threads. At every mesh point a system with the matrix S is solved, and solving this system is the smallest piece of work distributed between threads. Only the master threads communicate, outside the parallel OpenMP region, with ordered MPI sends and receives.
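As an illustration of how this communication could be organized, here is a minimal Fortran sketch of one time step of the hybrid version, assuming a 1D decomposition of the y direction into strips, one MPI process per node and one ghost row on each side of a strip. The routine names, the array layout and the use of MPI_Sendrecv (chosen here simply to keep the sketch deadlock-free) are assumptions made for this example, not the authors' code; the paper itself only states that the sends and receives are ordered.

    subroutine hybrid_step(comm, rank, nprocs, neq, nx, nyloc, h, s, z, znext)
      ! Illustrative sketch (not the authors' code): ghost-row exchange by the
      ! master thread, followed by the OpenMP-shared interior update.
      use mpi
      implicit none
      integer, intent(in)    :: comm, rank, nprocs, neq, nx, nyloc
      real(8), intent(in)    :: h, s
      real(8), intent(inout) :: z(neq, 0:nx-1, 0:nyloc+1)     ! ghost rows 0 and nyloc+1
      real(8), intent(out)   :: znext(neq, 0:nx-1, 0:nyloc+1)
      integer :: ierr, status(MPI_STATUS_SIZE), nsend, below, above

      nsend = neq*nx
      below = rank - 1
      above = rank + 1
      if (rank == 0)          below = MPI_PROC_NULL
      if (rank == nprocs - 1) above = MPI_PROC_NULL

      ! --- ghost-row exchange, executed by the master thread only -----------
      ! send the top interior row up, receive the bottom ghost row from below
      call MPI_Sendrecv(z(:,:,nyloc), nsend, MPI_DOUBLE_PRECISION, above, 1,   &
                        z(:,:,0),     nsend, MPI_DOUBLE_PRECISION, below, 1,   &
                        comm, status, ierr)
      ! send the bottom interior row down, receive the top ghost row from above
      call MPI_Sendrecv(z(:,:,1),       nsend, MPI_DOUBLE_PRECISION, below, 2, &
                        z(:,:,nyloc+1), nsend, MPI_DOUBLE_PRECISION, above, 2, &
                        comm, status, ierr)

      ! --- interior update of the strip, shared among the OpenMP threads ----
      ! (same loop structure as in the OpenMP sketch above, over nyloc+2 rows)
      call time_step_omp(neq, nx, nyloc + 2, h, s, z, znext)
    end subroutine hybrid_step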
We chose a time interval when 12 computational nodes of the CPU platform in the IICT-BAS cluster [IICT-BAS, 2016] were free and achieved a significant speedup, up to 116. The results are shown in Table 2.

Table 2

    Number of nodes   Time (seconds)   Speedup   Efficiency relative to one node (%)
           1              454.50         10.86             100
           2              229.34         21.52              99.1
           4              116.72         42.28              97.3
           8               59.25         83.31              95.9
          12               42.57        115.95              89.0

A larger computational domain is chosen for this test to ensure enough work for all 12 x 16 = 192 threads: N_eq x N_x x N_y = 10 x 3600 x 3600 ≈ 1.3·10^8 mesh points, N_time_steps = 150.

Conclusions

OpenMP and hybrid MPI+OpenMP parallelizations of a centered finite difference scheme for solving systems of perturbed Sine-Gordon equations are tested on the HybriLIT cluster and on the IICT-BAS cluster. The tests show a significant speedup, up to 116. A very good basis for further computationally intensive investigation of the mechanism of the powerful THz radiation in IJJs is thus provided.

References

Chapman B., Jost G., van der Pas R. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2008.
Gropp W., Lusk E., Skjellum A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999.
HybriLIT. Heterogeneous computing cluster HybriLIT of LIT JINR. 2016. URL: http://hybrilit.jinr.ru.
Hu X., Lin Sh. Three-dimensional phase-kink state in a thick stack of Josephson junctions and terahertz radiation // Physical Review B. 2008. Vol. 78, Issue 13. P. 134510.
IICT-BAS. Advanced Computing and Data Centre at IICT-BAS. 2016. URL: http://www.hpc.acad.bg/.
Kazacha G. S., Serdyukova S. I. Numerical investigation of the behaviour of solutions of the Sine-Gordon equation with a singularity for large t // Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki. 1993. Vol. 33, Issue 3. P. 417-427.
Lin Sh., Hu X. Possible dynamic states in inductively coupled intrinsic Josephson junctions of layered high-Tc superconductors // Physical Review Letters. 2008. Vol. 100, Issue 24. P. 247006.
MICCOM. Multifunctional Information and Computing Complex of JINR. 2016. URL: https://miccom.jinr.ru.
Ozyuzer L. et al. Emission of coherent THz radiation from superconductors // Science. 2007. Vol. 318, Issue 5854. P. 1291-1293.
Rabenseifner R., Hager G., Jost G. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes // 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. IEEE, 2009.
Sakai S., Bodin P., Pedersen N. F. Fluxons in thin-film superconductor-insulator superlattices // Journal of Applied Physics. 1993. Vol. 73, Issue 5. P. 2411-2418.
Tsujimoto M. et al. Geometrical resonance conditions for THz radiation from the intrinsic Josephson junctions in Bi2Sr2CaCu2O8+δ // Physical Review Letters. 2010. Vol. 105, Issue 3. P. 037005.