         Distributed and parallel version of the FDTD
                      simulation algorithm
                                                Giuseppe Amenta, Grazia Lo Sciuto
                                      Department of Electrical, Electronic and Informatics Engineering
                                                          University of Catania
                                                               Catania, Italy
                                                          glosciuto@dii.unict.it


   Abstract—This paper proposes an innovative algorithm for running the well-known Finite-Difference Time-Domain method in a distributed and parallel fashion. Given the dependences among data, care has been taken in how the proposed distributed version splits the involved data, so as to avoid a great deal of data exchange among different hosts and hence to minimize the communication overhead. Performance comparisons between the proposed solution and the sequential one have shown that improvements can be achieved by running the proposed version on several hosts. Moreover, the distributed version overcomes the problem of central memory exhaustion, typical of this method, which is due to the large matrices to be handled.
   Index Terms—FDTD, parallelization, distribution, algorithms, data dependence

                         I. INTRODUCTION

   The FDTD (Finite-Difference Time-Domain) method [15], [19], [13] provides, in its initial formulation, the discretization of fields on two interlaced orthogonal Cartesian grids. This method is expressed as an integral formulation describing the electromagnetic field by means of state variables, which are defined as integral quantities associated to well-defined geometric elements of the dual grids (closed path integrals along the lines and flows through the corresponding surfaces) [8], [7].
   The original formulation consists of a sequential algorithm executing a possibly huge number of iterations on a single core of a host. Distribution is highly desirable for solving the typical problems on the domain under investigation, i.e. electromagnetic fields, etc., due to the high amount of memory and computation required. However, a distributed version of the algorithm has to minimize the communication overhead due to data exchange among hosts. When the distributed version is available, an underlying platform can then be put to use for achieving real parallelism and distribution [1], [2], [12], [16].
   A proper analysis of the data dependencies among the several parts of the initial sequential algorithm is therefore needed to achieve a distributed version minimizing data exchanges. In this field, several automatic or semi-automatic techniques can assist in finding structural and data dependencies [5], [10], [11], [14], [9], [4]. To arrive at our proposed solution, we have put such techniques to use, as well as manually devising our distribution approach.
   This paper analyses the most important aspects of distributing the typical FDTD algorithm and shows the performance improvements obtained when the distributed version is used. The following sections of the paper describe, firstly, the mathematical background; secondly, the sequential version of the algorithm; thirdly, an initial optimization of the algorithm; and fourthly, the distributed and parallel version. Finally, conclusions are drawn.

   Copyright © 2017 held by the authors.

                         II. BACKGROUND

   The technique called FIT (Finite Integration Theory [3]) is a numerical method implemented by electromagnetic simulation software systems (such as CST Studio Suite [18]). Such a technique has been applied to arbitrary grids and corresponds to the FDTD on orthogonal hexagonal grids, when the time derivative is discretized by the leap-frog algorithm. The resulting algorithm can be further developed to take advantage of parallelization. We have explored the equations of the FDTD algorithm with the integral variables. For simplicity we take the one-dimensional case (which can be obviously extended) of a linearly polarized electromagnetic wave, with the field along the x direction and propagating along the z direction. The equations relevant to the discussion here are Maxwell's curl equations:

   -\partial_t B = \nabla \times E
    \partial_t D = \nabla \times H

   These equations can be integrated on dual meshes at interlaced time intervals, based on the leap-frog scheme shown in Fig. 1. This method leads to the following equations, representing an iterative scheme:

   \tilde{D}_x^{n+\frac{1}{2}}(k) = \tilde{D}_x^{n-\frac{1}{2}}(k) - \frac{\Delta t\, c_0}{\Delta z}\left[ H_y^{n}\left(k+\tfrac{1}{2}\right) - H_y^{n}\left(k-\tfrac{1}{2}\right) \right]

   \tilde{E} = \frac{1}{\varepsilon_r}\,\tilde{D}

   \tilde{B}_y^{n+1}\left(k+\tfrac{1}{2}\right) = \tilde{B}_y^{n}\left(k+\tfrac{1}{2}\right) - \frac{\Delta t\, c_0}{\Delta z}\left[ \tilde{E}_x^{n+\frac{1}{2}}(k+1) - \tilde{E}_x^{n+\frac{1}{2}}(k) \right]

   \tilde{H} = \frac{1}{\mu_r}\,\tilde{B}
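   To make the iterative scheme concrete, the following minimal one-dimensional sketch (in C++, the language used later for the actual implementation) shows how the update steps can be coded. It is not taken from the paper: the grid size, the number of time steps and the normalized Courant factor \Delta t\, c_0 / \Delta z = 0.5 are illustrative assumptions, and the constitutive relations are applied with \varepsilon_r = \mu_r = 1.

#include <vector>

// Minimal 1D FDTD leap-frog sketch of the update equations above.
// Grid size, step count and the Courant factor are illustrative assumptions.
int main() {
    const int KE = 200;          // number of spatial cells (assumed)
    const int nsteps = 500;      // number of time steps (assumed)
    const double courant = 0.5;  // dt * c0 / dz, a common normalized choice
    std::vector<double> dx(KE, 0.0), ex(KE, 0.0), hy(KE, 0.0);

    for (int n = 0; n < nsteps; ++n) {
        // Step 1: update the flux density D from the curl of H (adjacent cells).
        for (int k = 1; k < KE; ++k)
            dx[k] -= courant * (hy[k] - hy[k - 1]);
        // (a source term, e.g. the Gaussian pulse of Sec. II-A, would be injected here)

        // Step 2: constitutive relation E = D / eps_r, a purely local operation.
        for (int k = 0; k < KE; ++k)
            ex[k] = dx[k];       // eps_r = 1 assumed

        // Steps 3-4: update B from the curl of E, folded into H since mu_r = 1.
        for (int k = 0; k < KE - 1; ++k)
            hy[k] -= courant * (ex[k + 1] - ex[k]);
    }
    return 0;
}

   Steps 1 and 3 read values from adjacent cells, while step 2 is purely local; this distinction is what the distributed version of Section V exploits.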


Fig. 1. Discretization in one dimension with the dual-grid scheme. S =
δx × δz, S̃ = δy × δz, where δx and δy are the discretization steps.




                                                                                                 Fig. 3.     Incidence directions in 2D.




Fig. 2. Leap frog scheme. Time interlacing of the fields E, D, H and B in
the FDTD method.



   The above iterative equations are updated (time marching) as shown in the leap-frog scheme of Fig. 2. The steps marked in Fig. 2 with 2 and 4 are relative to the constitutive relations of the fields and are punctual (purely local) operations, while the steps marked with 1 and 3 are relative to Maxwell's curl equations (the data used for these updates are located in adjacent cells).
   Of course, an electromagnetic wave cannot propagate at a speed higher than the speed of light. Therefore, the time step must not exceed the time that the wave would take to propagate between two points of the grid. In 2D, the change of the travelled distance with the wave propagation direction must be taken into account (see Fig. 3 and 4).
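   For reference, this constraint is the standard Courant condition; written explicitly (the formulas below are not given in the paper, and uniform discretization steps are assumed):

   in 1D:   \Delta t \le \frac{\Delta z}{c_0}

   in 2D:   \Delta t \le \frac{1}{c_0 \sqrt{\frac{1}{\Delta x^{2}} + \frac{1}{\Delta z^{2}}}}

i.e. in 2D the admissible time step shrinks because the wave can travel along the cell diagonal.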
   In 3D, in order to implement the iterative scheme, we use the Yee cell shown in Fig. 4.

Fig. 4. The Yee cell.

A. The Gaussian Pulse

   A continuous Gaussian pulse is defined as:

   f_g(t) = e^{-\left(\frac{t - d_g}{w_g}\right)^{2}}                (1)

where the d_g component represents the time delay, while the w_g parameter is, generically, the width of the pulse. The corresponding peak value is obtained for t = d_g. If we express the d_g delay and the w_g pulse width depending on the time step \Delta t, we obtain the discrete version of the previous equation:

   f_g(n\Delta t) = f_g(n) = e^{-\left(\frac{n - m}{p}\right)^{2}}                (2)

   In the discrete version (2) of the Gaussian pulse equation, the time step \Delta t does not appear explicitly.
   The time-discrete version of the equation is suitable for being translated into an algorithm which can be computed on a given set of data. Finally, it is necessary to take into account other features of the physical problem, that is, the following three constraints.
   1) The electric boundary condition, since in the internal points of a perfect conductor the electric field E vanishes.
   2) The magnetic boundary condition, which can be easily handled if the computational domain stops at magnetic nodes.

   3) The radiation condition, for the ability to absorb waves without reflection.

                    III. SEQUENTIAL IMPLEMENTATION

   The study of the physical phenomenon and of its mathematical functions leads us to outline its implementation. The algorithm takes as input several fundamental values for the study of this phenomenon, including 5 matrices, initialized to the values needed for the computation and initialized in an optimized way (see the following). Once the values are received, the algorithm executes some checks and then reaches the 4 fundamental for loops and the pulse variable calculation.
   The inputs for the algorithm are:
   1) the initial matrices ga[][], dz[][], ez[][], hx[][] and hy[][], containing the values required to perform the calculation;
   2) the number of steps nsteps to be performed, expressing the number of times that the algorithm has to be executed;
   3) T, the absolute time from which the study of the physical phenomenon starts;
   4) the source parameter t0, essential for calculating the pulse variable (usually set to 20.0);
   5) the spread of the source, needed for calculating the pulse variable (set to 6.0).
   There are two control loops, an external while and an inner for. The external loop checks whether the condition that ends the execution of the algorithm (value 0 or a negative number), taken as an input, has been reached.
   The inner loop runs the nsteps iterations that the algorithm has to execute. It is also responsible for increasing the absolute time variable T, which measures the time from the zero instant at the beginning of the experiment until the instant n at which the experiment ends, and which is also used to compute the electromagnetic values. The time variable is a float, since time flows with steps of 0.5.
   The subsequent 4 for loops compute 4 of the 5 starting matrices (ga[][] is constant). The first loop computes the dz[][] matrix, using the hy[][] and hx[][] matrices multiplied by the constant 0.5. For the elements of dz[][] to be computed, the previous row of matrix hy[][] and the previous column of hx[][] have to be made available. The dependence on such data has great implications for the transformation of the algorithm into its distributed version.
   Then, element by element, some additions are performed to compute the dz[][] matrix, as in the following:

dz[i][j] = dz[i][j] + 0.5*(hy[i][j] -
hy[i-1][j] - hx[i][j] + hx[i][j-1]);

   Before the second for loop, a block of instructions sets the pulse variable according to the time T. After that, the resulting value is assigned to a given cell of dz[][], previously calculated. This is the so-called Gaussian impulse of the source. The corresponding code is as follows:

pulse = exp(-0.5*(pow((t0-T)/spread, 2.0)));
dz[IC][JC] = pulse;

   The second for loop calculates matrix ez[][] by means of the matrices ga[][] and dz[][] (the latter updated by the first for loop and by the Gaussian impulse). The code is as follows:

ez[i][j] = ga[i][j] * dz[i][j];

   The third for loop updates the matrix hx[][], adding to it the product of the constant value 0.5 and a difference of elements of the updated ez[][] matrix. The next column of ez[][] is required (with further implications, as in the previous cases). The relative code is:

hx[i][j] = hx[i][j] + 0.5 * (ez[i][j] -
ez[i][j+1]);

   Finally, the fourth for loop updates the matrix hy[][], as in the third loop, by using matrix hy[][], the constant value 0.5 and the updated matrix ez[][]. The corresponding code is:

hy[i][j] = hy[i][j] + 0.5 * (ez[i+1][j]
- ez[i][j]);

   The said algorithm has been developed in C++, and even the sequential version performs better than some corresponding commercial software systems: the simulation times of our tool were significantly lower than those required by such systems on the same hardware. This reduction is extremely advantageous when complex electromagnetic structures are studied.
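   For clarity, the overall loop structure described above can be sketched as follows. This is a simplified reconstruction rather than the authors' code: the matrix size IE×JE, the source position IC, JC at the centre, and the termination handling are assumptions, while the update expressions, t0, spread and the 0.5 coefficient are those given in the text.

#include <cmath>
#include <vector>

int main() {
    const int IE = 60, JE = 60;          // matrix size (assumed)
    const int IC = IE / 2, JC = JE / 2;  // source cell (assumed at the centre)
    const float t0 = 20.0f, spread = 6.0f;
    int nsteps = 100;                    // read as input in the real program
    float T = 0.0f;                      // absolute time

    using Matrix = std::vector<std::vector<float>>;
    Matrix ga(IE, std::vector<float>(JE, 1.0f));
    Matrix dz(IE, std::vector<float>(JE, 0.0f)), ez = dz, hx = dz, hy = dz;

    while (nsteps > 0) {                 // external control loop
        for (int n = 0; n < nsteps; ++n) {
            T += 0.5f;                   // time flows with steps of 0.5

            // First loop: dz from the previous row of hy and previous column of hx.
            for (int i = 1; i < IE; ++i)
                for (int j = 1; j < JE; ++j)
                    dz[i][j] += 0.5f * (hy[i][j] - hy[i-1][j] - hx[i][j] + hx[i][j-1]);

            // Gaussian impulse of the source.
            dz[IC][JC] = std::exp(-0.5f * std::pow((t0 - T) / spread, 2.0f));

            // Second loop: constitutive relation.
            for (int i = 0; i < IE; ++i)
                for (int j = 0; j < JE; ++j)
                    ez[i][j] = ga[i][j] * dz[i][j];

            // Third loop: hx from the next column of ez.
            for (int i = 0; i < IE; ++i)
                for (int j = 0; j < JE - 1; ++j)
                    hx[i][j] += 0.5f * (ez[i][j] - ez[i][j+1]);

            // Fourth loop: hy from the next row of ez.
            for (int i = 0; i < IE - 1; ++i)
                for (int j = 0; j < JE; ++j)
                    hy[i][j] += 0.5f * (ez[i+1][j] - ez[i][j]);
        }
        nsteps = 0;                      // the real program would read a new value here
    }
    return 0;
}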
                  IV. DYNAMIC VS STATIC INITIALIZATION

   Firstly, it is required to initialize the 5 matrices that represent the electromagnetic magnitudes under study. They are: ga[][], dz[][], ez[][], hx[][], hy[][]. Both the size of the matrices and the values they have to hold are given as input to the algorithm. In particular, the sizes of all the matrices must be identical, while the values can differ; this derives from the mathematical function used to study the physical phenomenon. Such values can be given in a configuration file, in order to make the program more versatile and suitable for investigating the phenomenon from different starting conditions.
   The dynamic initialization of the 5 starting matrices is provided through the initMatrix function, which takes as input the matrix to initialize, its size, and the initial value that the matrix elements must assume. These elements are of float type. The initMatrix function consists of 2 for loops: one scans the rows and one scans the columns. Within the 2 loops, an indexed assignment of the matrix elements to the desired value is performed. It can be expected that a small part of the array will be set to a starting value different from the rest (which is usually 0 or 1.0).
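   A minimal sketch of such a routine is given below; the exact signature is not reported in the paper, so the parameter names and the float** representation of the matrix are assumptions.

// Dynamic initialization with two nested for loops (sketch, assumed signature).
void initMatrix(float **matrix, int rows, int columns, float value) {
    for (int i = 0; i < rows; ++i)         // one for loop scans the rows
        for (int j = 0; j < columns; ++j)  // the other scans the columns
            matrix[i][j] = value;          // indexed assignment to the desired value
}

   With the values recommended in Table I, for instance, ga[][] would be filled with 1.0 and the other four matrices with 0.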

                            TABLE I
      INITIALIZATION VALUES RECOMMENDED FOR STUDYING THE PHYSICAL PHENOMENON.

                      Matrices     Initialization Values
                      ga[][]                1.0
                      dz[][]                0
                      ez[][]                0
                      hx[][]                0
                      hy[][]                0

                           TABLE II
   EXECUTION AND COMPILATION TIMES OF THE STATIC AND DYNAMIC VERSIONS.

          |           Static Initialization            | Dynamic Initialization
   Dim.   | Compiling  | Execution | Executable size   | Compiling  | Execution
    50    |   6.326 s  |  2.142 s  |     1.5 MB        |  0.712 s   |  2.102 s
    70    |  15.843 s  |  2.995 s  |     4.1 MB        |  0.495 s   |  3.106 s
    80    |  95.257 s  |  5.591 s  |     6.0 MB        |  0.453 s   |  5.649 s
Fig. 5. Example of the split into 4 sub-matrices, for the 5 initial matrices (ga[][], dz[][], ez[][], hx[][] and hy[][]).

   A performance comparison between the proposed dynamic initialization and a static initialization of the matrices is shown in Table II; average compilation and execution times are given for both versions.
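   As far as it can be inferred from the compilation times and executable sizes of Table II, the statically initialized version carries the matrix values directly in the source code, so they are compiled into the executable; a purely illustrative fragment (the array name and dimension are hypothetical) is:

// Static initialization (illustrative): all DIM*DIM values are written in the
// source and rebuilt into the executable whenever the input data change.
const int DIM = 50;
static const float ga_static[DIM][DIM] = {
    {1.0f, 1.0f, 1.0f /* ... the remaining values of the row ... */},
    /* ... the remaining DIM-1 rows, all listed explicitly ... */
};

   The dynamic version instead fills the same matrices at run time through initMatrix, so only the execution time depends on the data.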
   If the matrix size increases, a slight improvement in the performance of the statically initialized matrices has been observed. This result is apparently positive, considering the big size typical of the matrices; however, it only affects about 0.7% of the initialization time, and depends on the matrix size. In addition, this approach has several limits, some of which are really considerable. A brief list of such limits is reported below:
   1) The compilation time, which must actually be added to the program run time for each run, is in percentage really high compared to the run time (due to the huge amount of data needing pre-storage). This, combined with the fact that the data are not constant but taken as input, means that this cost has to be paid whenever the program is executed.
   2) The limit on the number of elements that can be compiled, which causes a virtual memory consumption error (depending on the host, and for a standard host being around 853 elements).
   3) The dependence on a C++ compiler (e.g. g++) that each host must have installed to let the program work.
   4) The inability to create an executable version of the software system, which in fact reduces the user-friendliness for less experienced users.
   To further improve the dynamic initialization performance, we have used pointers to access the matrix elements, since pointer arithmetic is faster than array indexing:

k = rows * columns;
for (i = 0; i

dimensional array representing the matrix; the integers i and j are used to scan the matrix; row and column are the int variables holding the total number of rows and columns; and, finally, value contains the float value to be written in the cells of the array, as given to the initMatrix function. Further tests indicated an improvement in performance, hence reducing the slight gap between the two approaches.

               V. PARALLEL AND DISTRIBUTED VERSION

   By distributing the 5 starting matrices over multiple hosts it is possible to parallelize and distribute the execution, to overcome the physical limits of a single host (large amount of data in volatile memory, etc.), and to speed up the processing [6], [17]. The conceived solution splits the matrices into x sub-matrices and distributes them to the x hosts available for the processing. The results computed on each host are then sent to a server that collects and shows them.
   The intrinsic structure of the implementation (see the above description of the sequential version) limits the distribution approach. In order to compute some matrices (i.e. dz[][], hx[][] and hy[][]), some rows or columns, previous or successive, are needed from other matrices (hy[][], hx[][] and ez[][]). The parallelism of the problem at hand can therefore be seen according to the producer-consumer model: once the row or column values needed by another host to calculate its own sub-matrices have been computed, they are sent to it by means of sockets. The underlying idea for the distribution is to have either a chain of n hosts, or a number of hosts organised as a grid. Moreover, each host running a part of the processing manages the communication with its nearest hosts.
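   The exchange of boundary rows or columns over sockets can be sketched with two small helpers; the paper does not give this code, so the function names, the use of blocking TCP sockets and the row-wise orientation are assumptions.

#include <cstddef>
#include <sys/socket.h>
#include <sys/types.h>
#include <vector>

// Send one boundary row of a sub-matrix over an already-connected socket fd,
// looping until every byte has been written (illustrative sketch).
static bool sendRow(int fd, const std::vector<float> &row) {
    const char *p = reinterpret_cast<const char *>(row.data());
    size_t left = row.size() * sizeof(float);
    while (left > 0) {
        ssize_t n = send(fd, p, left, 0);
        if (n <= 0) return false;
        p += n;
        left -= static_cast<size_t>(n);
    }
    return true;
}

// Receive one boundary row from a neighbouring host (illustrative sketch).
static bool recvRow(int fd, std::vector<float> &row) {
    char *p = reinterpret_cast<char *>(row.data());
    size_t left = row.size() * sizeof(float);
    while (left > 0) {
        ssize_t n = recv(fd, p, left, 0);
        if (n <= 0) return false;
        p += n;
        left -= static_cast<size_t>(n);
    }
    return true;
}

   In a row-wise split, for instance, once a host has updated the last row of its hy[][] sub-matrix it would sendRow() it to the next host, which needs it as the previous row of hy[][] for its own dz[][] update; symmetrically, the first row of ez[][] of the next host travels in the opposite direction, since it is needed as the next row of ez[][] for the hy[][] update. This is the producer-consumer pattern mentioned above.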
   The first part of the execution is handled by the server.cpp program, which chooses the number of hosts to be used (among the n available) according to the size of the 5 starting matrices: ga[][], dz[][], ez[][], hx[][] and hy[][]. Once the x hosts have been chosen, the above matrices are split into x sub-matrices and sent to the x hosts. The sub-matrices are spread depending on the position of the