Parallelization of a finite difference scheme for solving systems of 2D Sine-Gordon equations

I. G. Hristov^{1,2,a}, R. D. Hristova^{1,2}, S. N. Dimova^{1}

^1 University of Sofia "St. Kliment Ohridski", 5 James Bourchier Blvd, Sofia 1164, Bulgaria
^2 Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna 141980, Russia

E-mail: ^a christov_ivan@abv.bg

Systems of perturbed 2D Sine-Gordon equations coupled via a cyclic tridiagonal matrix are solved numerically by a second-order centered finite difference scheme. The systems are considered on rectangular domains. First, an OpenMP parallel program is realized and very good performance scalability inside one computational node is achieved. The tests on one computational node of the CPU platform of the IICT-BAS cluster, Bulgaria, show a speedup of about 11, and on the CPU part of one node of the HybriLIT cluster, JINR, Russia, a speedup of about 30. The OpenMP parallelization effect is then multiplied by a fully hybrid strategy: one MPI process per cluster node and OpenMP on the cores of the node. The hybrid MPI+OpenMP parallelization is realized as follows. A domain decomposition is made and ghost points are introduced to support the data exchange between MPI processes. Inside a subdomain the work is shared between OpenMP threads. At every mesh point a system with one and the same cyclic tridiagonal matrix is solved; solving the system with this matrix is the smallest piece of work distributed between threads. Only the master threads communicate, outside the parallel OpenMP region, with ordered MPI sends and receives. Computations on 12 nodes of the CPU platform of the IICT-BAS cluster show a significant speedup, up to 116.

Keywords: Systems of 2D Sine-Gordon equations, Finite difference method, OpenMP, MPI, Hybrid MPI+OpenMP parallelization

We gratefully acknowledge the opportunity to use the computational resources of the HybriLIT cluster and the IICT-BAS cluster. This work is supported by the National Science Fund of Bulgaria under Grant DFNI-I02/8/2014 and by the National Science Fund of BMSE under grant I02/9/2014.

© 2016 Ivan G. Hristov, Radoslava D. Hristova, Stefka N. Dimova

Introduction

The number of computational clusters in the world is rapidly increasing, together with the total number of cores per cluster. The standard parallel programming technologies MPI and OpenMP evolve together with the evolution of the computational resources. As a result, using serious computational resources nowadays becomes more accessible, and parallel programming becomes more portable and efficient. This gives the opportunity to investigate more easily large-scale mathematical models of practical interest and also to study theoretically many classical problems of mathematical physics in higher dimensions.

In this work we solve numerically a system of 2D perturbed Sine-Gordon equations. On the one hand, solving this system is important theoretically, because it is a generalization of the classical perturbed 1D Sine-Gordon equation. As a result of considering systems of equations, new 2D effects are expected and new types of solutions without analogues in the 1D case are possible. On the other hand, solving this system is of strong practical interest, because it models the phase dynamics in the so-called Intrinsic Josephson Junctions (IJJs) [Sakai, Bodin, Pedersen, 1993]. IJJs are a potential source of THz radiation. Powerful THz radiation from IJJs was first achieved experimentally in 2007 [Ozyuzer et al., 2007].
It was later understood that standing waves of plasma oscillations are built up in the cavity formed by the side surfaces of the crystal, exactly as in a laser. In 2008 standing waves were first found in a computer simulation [Lin, Hu, 2008]. Although the leading role of the excitation of cavity oscillations is well understood [Tsujimoto et al., 2010], the mechanism of the powerful THz radiation is still not fully explained. For further investigation of this mechanism, very long time simulations of a quasi-3D model (systems of 2D equations) are needed. In some cases the computational domain may be very large: 10^6-10^8 mesh points with 10^8-10^9 time steps. Thus the problem is quite large and serious computational resources are really needed.

This work concentrates mainly on the computational aspect of solving the problem. The goals of the work are:

1. To make an OpenMP realization of a difference scheme for solving systems of 2D perturbed Sine-Gordon equations.
2. To test the performance scalability of the OpenMP program on the CPU part of one computational node of the HybriLIT cluster and on one node of the CPU platform of the IICT-BAS cluster.
3. To multiply the parallelization effect by a domain decomposition and a fully hybrid strategy: one MPI process per cluster node and OpenMP on the cores of the node (provided good OpenMP performance scalability inside one node is achieved).
4. To test the performance scalability of the hybrid MPI+OpenMP program on multiple nodes of the CPU platform of the IICT-BAS cluster.

Mathematical model

We consider systems of 2D perturbed Sine-Gordon equations:

    S\,(\varphi_{tt} + \alpha\,\varphi_t + \sin\varphi - \gamma) = \Delta\varphi, \qquad (x, y) \in \Omega \subset R^2.    (1)

Here Δ is the 2D Laplace operator and Ω is a given domain in R^2. S is the N_eq × N_eq cyclic tridiagonal matrix

    S = \begin{pmatrix}
          1 & s & 0 & \cdots & 0 & s \\
          s & 1 & s & 0 & \cdots & 0 \\
            & \ddots & \ddots & \ddots & & \\
          0 & \cdots & 0 & s & 1 & s \\
          s & 0 & \cdots & 0 & s & 1
        \end{pmatrix},

where -0.5 < s < 0 and N_eq is the number of equations. The unknown is the column vector φ(x, y, t) = (φ_1, ..., φ_{N_eq})^T. Neumann boundary conditions are considered:

    \left.\frac{\partial\varphi}{\partial n}\right|_{\partial\Omega} = 0.    (2)

In (2) n denotes the exterior normal to the boundary ∂Ω. To close the problem (1), (2), appropriate initial conditions are posed. The choice of the initial conditions is not a trivial task and will be discussed in a future work. The model (1), (2) describes very well the phase dynamics in N_eq periodically stacked IJJs. The parameter s represents the inductive coupling between adjacent Josephson junctions, α is the dissipation parameter, and γ is the external current. All the units are normalized as in [Sakai, Bodin, Pedersen, 1993].

Numerical scheme

We follow the approach for the construction of the numerical scheme from [Kazacha, Serdyukova, 1993]. We solve the problem numerically in rectangular domains by using second-order central finite differences with respect to all derivatives in equation (1). Let τ be the step size in time, h the step size in both the x and y space directions, N_x the number of nodes in the x direction and N_y the number of nodes in the y direction. The mesh nodes are

    x_k = kh,\; k = 0, \ldots, N_x - 1, \qquad y_i = ih,\; i = 0, \ldots, N_y - 1, \qquad t_j = j\tau,\; j = 0, 1, \ldots,

and z^j_{k,i} ≈ φ(x_k, y_i, t_j) denotes the approximate solution. Writing z, ẑ and ž for the values on the current, the new and the old time levels, the finite difference equations that approximate system (1) at the interior mesh points are

    S\left(\frac{\hat z_{k,i} - 2 z_{k,i} + \check z_{k,i}}{\tau^2}
         + \alpha\,\frac{\hat z_{k,i} - \check z_{k,i}}{2\tau}
         + \sin z_{k,i} - \gamma\right)
    = \frac{z_{k+1,i} + z_{k-1,i} + z_{k,i+1} + z_{k,i-1} - 4\,z_{k,i}}{h^2}.

For every k = 1, ..., N_x - 2, i = 1, ..., N_y - 2 a system with the cyclic tridiagonal matrix S is solved with respect to ẑ^l_{k,i}, l = 1, ..., N_eq, using Gaussian elimination. The diagonal dominance of the matrix S (-0.5 < s < 0) ensures the stability of the algorithm. Because of the specific tridiagonal structure of S, only about 9 N_eq floating point operations are needed to solve one system.
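The elimination procedure itself is not listed in the paper. For illustration only, the following is a minimal Fortran sketch of one standard way to solve such a cyclic tridiagonal system: the Thomas algorithm combined with a Sherman-Morrison correction for the two corner elements. The routine name and interface are assumptions made for this example, and the operation count of this variant differs from that of the tailored Gaussian elimination described above.

    subroutine solve_cyclic_s(n, s, d, x)
      ! Illustrative sketch (not the authors' routine): solve S*x = d, where S is
      ! the n x n cyclic tridiagonal matrix with 1 on the main diagonal and s on
      ! the sub-diagonal, the super-diagonal and the two corners (-0.5 < s < 0),
      ! via the Thomas algorithm plus a Sherman-Morrison correction (n >= 3).
      implicit none
      integer, intent(in)  :: n
      real(8), intent(in)  :: s, d(n)
      real(8), intent(out) :: x(n)
      real(8) :: diag(n), u(n), y(n), q(n), fact
      ! Split S = B + u*v^T with v = (1, 0, ..., 0, -s) and gamma = -1:
      ! B is plainly tridiagonal, with modified first and last diagonal entries.
      diag    = 1.0d0
      diag(1) = 2.0d0          ! 1 - gamma
      diag(n) = 1.0d0 + s*s    ! 1 - s*s/gamma
      u       = 0.0d0
      u(1)    = -1.0d0         ! gamma
      u(n)    = s
      call thomas(n, s, diag, d, y)   ! B*y = d
      call thomas(n, s, diag, u, q)   ! B*q = u
      ! Sherman-Morrison correction: x = y - (v.y)/(1 + v.q) * q
      fact = (y(1) - s*y(n)) / (1.0d0 + q(1) - s*q(n))
      x = y - fact*q
    contains
      subroutine thomas(nn, offd, b, r, sol)
        ! Thomas algorithm for a tridiagonal system with constant off-diagonals.
        integer, intent(in)  :: nn
        real(8), intent(in)  :: offd, b(nn), r(nn)
        real(8), intent(out) :: sol(nn)
        real(8) :: cp(nn), dp(nn), m
        integer :: i
        cp(1) = offd / b(1)
        dp(1) = r(1) / b(1)
        do i = 2, nn
          m     = b(i) - offd*cp(i-1)
          cp(i) = offd / m
          dp(i) = (r(i) - offd*dp(i-1)) / m
        end do
        sol(nn) = dp(nn)
        do i = nn-1, 1, -1
          sol(i) = dp(i) - cp(i)*sol(i+1)
        end do
      end subroutine thomas
    end subroutine solve_cyclic_s

Since the matrix S is the same at every mesh point and every time step, in practice the factorization can be computed once and reused, which is what keeps the per-system cost proportional to N_eq with a small constant.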
To approximate the Neumann boundary conditions (2), third-order one-sided finite differences are used. At the left and the right boundaries of the rectangular domain the difference equations are

    \hat z_{0,i} = (18\,\hat z_{1,i} - 9\,\hat z_{2,i} + 2\,\hat z_{3,i})/11, \qquad
    \hat z_{N_x-1,i} = (18\,\hat z_{N_x-2,i} - 9\,\hat z_{N_x-3,i} + 2\,\hat z_{N_x-4,i})/11.

At the top and the bottom boundaries the difference equations are

    \hat z_{k,0} = (18\,\hat z_{k,1} - 9\,\hat z_{k,2} + 2\,\hat z_{k,3})/11, \qquad
    \hat z_{k,N_y-1} = (18\,\hat z_{k,N_y-2} - 9\,\hat z_{k,N_y-3} + 2\,\hat z_{k,N_y-4})/11.

The resulting approximation error of the scheme is O(h^2 + τ^2). The number of floating point operations per time step for the constructed difference scheme is O(N_x N_y N_eq). The stability condition of the scheme is

    \tau \le \frac{\sqrt{1 + 2s}\; h}{\sqrt{2}}.    (3)

In (3), 1 + 2s = λ_min, where λ_min is the minimal eigenvalue of S, and the √2 in the denominator corresponds to dimension 2. The conservativity of the scheme is checked numerically as in [Kazacha, Serdyukova, 1993].

By using the parallelization of the difference scheme described below, we carried out numerical experiments with real physical parameters. Our results are in good agreement with the numerical results from the classical works [Lin, Hu, 2008], [Hu, Lin, 2008]. As explained there, the powerful THz radiation in IJJs corresponds to a new type of solutions of problem (1), (2) with strong inductive coupling, i.e. s ≈ -0.5. The phase φ(x, y, t) in a particular equation (junction) is a sum of three terms: a term linear in time, vt (the resistive term), a static π-kink term φ_{π-kink}(x, y), and an oscillating term with zero average:

    \varphi(x, y, t) = v t + \varphi_{\pi\text{-}kink}(x, y) + \varphi_{osc}(x, y, t).

In addition, we succeeded in specifying the structure of the oscillating term. These results will be reported elsewhere.

Parallelization of the difference scheme

The OpenMP parallelization of the difference scheme from the previous section is made by a straightforward parallelization of all DO loops in our FORTRAN program using the OMP DO directive [Chapman, Jost, Van der Pas, 2008]:

    !$OMP DO [Clauses]
       Fortran DO loop to be executed in parallel
    !$OMP END DO
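The structure of the shared loop can be sketched as follows. This is only an illustration, not the authors' program: the routine, array and variable names are assumptions, the right-hand side is reduced to the discrete Laplacian to keep the sketch short, and solve_cyclic_s is the illustrative solver from the previous section. The point of the sketch is that one cyclic tridiagonal solve per mesh point is the smallest unit of work distributed among the threads.

    subroutine time_step_omp(neq, nx, ny, h, s, z, znext)
      implicit none
      integer, intent(in)  :: neq, nx, ny
      real(8), intent(in)  :: h, s
      real(8), intent(in)  :: z(neq, 0:nx-1, 0:ny-1)       ! current time level
      real(8), intent(out) :: znext(neq, 0:nx-1, 0:ny-1)   ! new time level
      integer :: i, k, l
      real(8) :: rhs(neq), znew(neq)
    !$OMP PARALLEL DO COLLAPSE(2) PRIVATE(i, k, l, rhs, znew) SCHEDULE(static)
      do i = 1, ny - 2
        do k = 1, nx - 2
          do l = 1, neq
            ! Placeholder right-hand side: the 5-point Laplacian of the current
            ! time level; the full scheme adds the remaining explicit terms here.
            rhs(l) = (z(l,k+1,i) + z(l,k-1,i) + z(l,k,i+1) + z(l,k,i-1) &
                      - 4.0d0*z(l,k,i)) / h**2
          end do
          ! Smallest piece of work per thread: one cyclic tridiagonal solve.
          call solve_cyclic_s(neq, s, rhs, znew)
          znext(:, k, i) = znew
        end do
      end do
    !$OMP END PARALLEL DO
      ! Boundary points (Neumann conditions) are handled separately.
    end subroutine time_step_omp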
Tests in double precision arithmetic are done on the CPU part of one computational node of the HybriLIT cluster [HybriLIT, 2016], part of the Multifunctional Information and Computing Complex of JINR [MICCOM, 2016]. The CPU part of one computational node of this cluster consists of 2 x Intel® Xeon® E5-2695 v2 processors (24 cores in total, 48 threads with hyper-threading). The results are given in Table 1.

Table 1

    Number of threads   Time (seconds)   Speedup   Parallel efficiency (%)
            1               549.97         1            100
            6                99.87         5.51          91.8
           12                52.79        10.42          86.8
           18                35.33        15.57          86.5
           24                26.54        20.72          86.3
           30                23.75        23.16          77.2
           36                21.51        25.57          71.0
           42                19.75        27.85          66.3
           48                18.28        30.09          62.7

The computational domain size is N_eq x N_x x N_y = 10 x 3000 x 3000 and N_time_steps = 300, where N_eq is the number of equations, N_x is the number of mesh points in the x direction, N_y is the number of mesh points in the y direction and N_time_steps is the number of steps in time. The threads in Table 1 are distributed with the scatter thread affinity [Chapman, Jost, Van der Pas, 2008]. The parallel efficiency when 24 OpenMP threads are distributed over the 24 physical cores is very good, about 86.3%. When the number of threads exceeds 24 and two threads run per core, the parallel efficiency decreases faster. Nevertheless, hyper-threading gives an additional speedup gain: the speedup reaches about 30 when all 48 hardware threads are used. The results for one node of the CPU platform of the IICT-BAS cluster are analogous. In the CPU platform of this cluster one node consists of 2 x Intel® Xeon® X5560 processors (8 cores, 16 threads with hyper-threading), and the maximal speedup, about 11, is achieved when all hardware threads are used.

Because very good scalability inside one computational node is achieved, it is natural to choose a fully hybrid MPI+OpenMP strategy [Rabenseifner, Hager, Jost, 2009] to multiply the parallelization effect when many computational nodes are used: one MPI process per cluster node and OpenMP on the cores of the node. We make a domain decomposition [Gropp, Lusk, Skjellum, 1999] with one MPI process working on one cluster node. Ghost points are introduced to support the data exchange between MPI processes. Inside a subdomain the work is shared between OpenMP threads. At every mesh point a system with the matrix S is solved, and solving this system is the smallest piece of work distributed between threads. Only the master threads communicate, outside the parallel OpenMP region, with ordered MPI sends and receives.
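As an illustration of how this communication could be organized, here is a minimal Fortran sketch of one time step of the hybrid version, assuming a 1D decomposition of the y direction into strips, one MPI process per node and one ghost row on each side of a strip. The routine names, the array layout and the use of MPI_Sendrecv (chosen here simply to keep the sketch deadlock-free) are assumptions made for this example, not the authors' code; the paper itself only states that the sends and receives are ordered.

    subroutine hybrid_step(comm, rank, nprocs, neq, nx, nyloc, h, s, z, znext)
      ! Illustrative sketch (not the authors' code): ghost-row exchange by the
      ! master thread, followed by the OpenMP-shared interior update.
      use mpi
      implicit none
      integer, intent(in)    :: comm, rank, nprocs, neq, nx, nyloc
      real(8), intent(in)    :: h, s
      real(8), intent(inout) :: z(neq, 0:nx-1, 0:nyloc+1)     ! ghost rows 0 and nyloc+1
      real(8), intent(out)   :: znext(neq, 0:nx-1, 0:nyloc+1)
      integer :: ierr, status(MPI_STATUS_SIZE), nsend, below, above

      nsend = neq*nx
      below = rank - 1
      above = rank + 1
      if (rank == 0)          below = MPI_PROC_NULL
      if (rank == nprocs - 1) above = MPI_PROC_NULL

      ! --- ghost-row exchange, executed by the master thread only -----------
      ! send the top interior row up, receive the bottom ghost row from below
      call MPI_Sendrecv(z(:,:,nyloc), nsend, MPI_DOUBLE_PRECISION, above, 1,   &
                        z(:,:,0),     nsend, MPI_DOUBLE_PRECISION, below, 1,   &
                        comm, status, ierr)
      ! send the bottom interior row down, receive the top ghost row from above
      call MPI_Sendrecv(z(:,:,1),       nsend, MPI_DOUBLE_PRECISION, below, 2, &
                        z(:,:,nyloc+1), nsend, MPI_DOUBLE_PRECISION, above, 2, &
                        comm, status, ierr)

      ! --- interior update of the strip, shared among the OpenMP threads ----
      ! (same loop structure as in the OpenMP sketch above, over nyloc+2 rows)
      call time_step_omp(neq, nx, nyloc + 2, h, s, z, znext)
    end subroutine hybrid_step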
We chose a time interval when 12 computational nodes of the CPU platform in the IICT-BAS cluster [IICT-BAS, 2016] were free and achieved a significant speedup, up to 116. The results are shown in Table 2.

Table 2

    Number of nodes   Time (seconds)   Speedup   Efficiency relative to one node (%)
           1              454.50         10.86             100
           2              229.34         21.52              99.1
           4              116.72         42.28              97.3
           8               59.25         83.31              95.9
          12               42.57        115.95              89.0

A larger computational domain is chosen for this test to ensure enough work for all 12 x 16 = 192 threads: N_eq x N_x x N_y = 10 x 3600 x 3600 ≈ 1.3·10^8 mesh points, N_time_steps = 150.

Conclusions

OpenMP and hybrid MPI+OpenMP parallelizations of a centered finite difference scheme for solving systems of perturbed Sine-Gordon equations are tested on the HybriLIT cluster and on the IICT-BAS cluster. The tests show a significant speedup, up to 116. A very good basis for further computationally intensive investigation of the mechanism of the powerful THz radiation in IJJs is thus provided.

References

Chapman B., Jost G., van der Pas R. Using OpenMP: Portable Shared Memory Parallel Programming. MIT Press, 2008.
Gropp W., Lusk E., Skjellum A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999.
HybriLIT. Heterogeneous computing cluster HybriLIT of LIT JINR. 2016. URL: http://hybrilit.jinr.ru.
Hu X., Lin Sh. Three-dimensional phase-kink state in a thick stack of Josephson junctions and terahertz radiation // Physical Review B. 2008. Vol. 78, Issue 13. P. 134510.
IICT-BAS. Advanced Computing and Data Centre at IICT-BAS. 2016. URL: http://www.hpc.acad.bg/.
Kazacha G. S., Serdyukova S. I. Numerical investigation of the behaviour of solutions of the Sine-Gordon equation with a singularity for large t // Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki. 1993. Vol. 33, Issue 3. P. 417-427.
Lin Sh., Hu X. Possible dynamic states in inductively coupled intrinsic Josephson junctions of layered high-Tc superconductors // Physical Review Letters. 2008. Vol. 100, Issue 24. P. 247006.
MICCOM. Multifunctional Information and Computing Complex of JINR. 2016. URL: https://miccom.jinr.ru.
Ozyuzer L. et al. Emission of coherent THz radiation from superconductors // Science. 2007. Vol. 318, Issue 5854. P. 1291-1293.
Rabenseifner R., Hager G., Jost G. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes // 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. IEEE, 2009.
Sakai S., Bodin P., Pedersen N. F. Fluxons in thin-film superconductor-insulator superlattices // Journal of Applied Physics. 1993. Vol. 73, Issue 5. P. 2411-2418.
Tsujimoto M. et al. Geometrical resonance conditions for THz radiation from the intrinsic Josephson junctions in Bi2Sr2CaCu2O8+δ // Physical Review Letters. 2010. Vol. 105, Issue 3. P. 037005.