=Paper=
{{Paper
|id=Vol-1852/p12
|storemode=property
|title=Distributed and parallel version of the FDTD simulation algorithm
|pdfUrl=https://ceur-ws.org/Vol-1852/p12.pdf
|volume=Vol-1852
|authors=Giuseppe Amenta,Grazia Lo Sciuto
}}
==Distributed and parallel version of the FDTD simulation algorithm==
Distributed and parallel version of the FDTD simulation algorithm

Giuseppe Amenta, Grazia Lo Sciuto
Department of Electrical, Electronic and Informatics Engineering, University of Catania, Catania, Italy
glosciuto@dii.unict.it

Copyright © 2017 held by the authors.

Abstract—This paper proposes an innovative algorithm for running the well-known Finite-Difference Time-Domain method in a distributed and parallel fashion. Given the dependencies among data, care has been taken in the way the proposed distributed version splits the involved data, in order to avoid a great deal of data exchange among different hosts and so minimize the communication overhead. Performance comparisons between the proposed solution and the sequential one have shown that improvements can be achieved when the proposed version runs on several hosts. Moreover, the distributed version makes it possible to overcome the problem of central memory exhaustion, typical of this method due to the large matrices to be handled.

Index Terms—FDTD, parallelization, distribution, algorithms, data dependence

I. INTRODUCTION

The FDTD (Finite-Difference Time-Domain) method [15], [19], [13] provides, in its initial formulation, the discretization of fields on two interlaced orthogonal Cartesian grids. This method is expressed as an integral formulation describing the electromagnetic field by means of state variables, which are defined as integral quantities associated to well-defined geometric elements of the dual grids (closed path integrals along the lines and flows through the corresponding surfaces) [8], [7].

The original formulation consists of a sequential algorithm executing a possibly huge number of iterations on a single core of a host. Distribution is highly desirable for solving the typical problems on the domain under investigation, i.e. electromagnetic fields, etc., due to the high amount of memory and computation necessary. However, a distributed version of the algorithm has to minimize the communication overhead due to data exchange among hosts. When the distributed version is available, an underlying platform can then be put to use for achieving real parallelism and distribution [1], [2], [12], [16].

A proper analysis of the data dependencies among the several parts of the initial sequential algorithm is therefore needed to achieve a distributed version minimising data exchanges. In this field, several automatic or semi-automatic techniques can assist in finding structural and data dependencies [5], [10], [11], [14], [9], [4]. In order to find our proposed solution, we have put such techniques to use, as well as manually devising our distribution approach.

This paper analyses the most important aspects of distributing the typical FDTD algorithm and shows the performance improvements obtained when the distributed version is used. The following sections of the paper describe, firstly, the mathematical background; secondly, the sequential version of the algorithm; thirdly, an initial optimization of the algorithm; and fourthly, the distributed and parallel version. Finally, conclusions are drawn.

II. BACKGROUND

The technique called FIT (Finite Integration Theory [3]) is a numerical method implemented by electromagnetic simulation software systems (such as CST Studio Suite [18]). Such a technique has been applied to arbitrary grids and corresponds to the FDTD on orthogonal hexagonal grids when the time derivative is discretized by the leap-frog algorithm. The resulting algorithm can be further developed to take advantage of parallelization. We have explored the equations of the FDTD algorithm with the integral variables. We take for simplicity the one-dimensional case (which can obviously be extended) of a linearly polarized electromagnetic wave in the x-direction, propagating along the z-direction. The equations relevant to the discussion here are Maxwell's curl equations:

$$-\partial_t \mathbf{B} = \nabla \times \mathbf{E}$$
$$\partial_t \mathbf{D} = \nabla \times \mathbf{H}$$

These equations can be integrated on dual meshes at interlaced time intervals, based on the leap-frog scheme as shown in Fig. 1.

Fig. 1. Discretization in one dimension with the dual-grid scheme: $S = \delta_x \times \delta_z$, $\tilde{S} = \delta_y \times \delta_z$, where $\delta_x$ and $\delta_y$ are the discretization steps.

This method leads to the following equations, representing an iterative scheme:

$$\tilde{D}_x^{n+\frac{1}{2}}(k) = \tilde{D}_x^{n-\frac{1}{2}}(k) - \frac{\Delta t\,c_0}{\Delta z}\left[H_y^{n}\!\left(k+\tfrac{1}{2}\right) - H_y^{n}\!\left(k-\tfrac{1}{2}\right)\right]$$

$$\tilde{E} = \frac{1}{\varepsilon_r}\,\tilde{D}$$

$$\tilde{B}_y^{n+1}\!\left(k+\tfrac{1}{2}\right) = \tilde{B}_y^{n}\!\left(k+\tfrac{1}{2}\right) - \frac{\Delta t\,c_0}{\Delta z}\left[\tilde{E}_x^{n+\frac{1}{2}}(k+1) - \tilde{E}_x^{n+\frac{1}{2}}(k)\right]$$

$$\tilde{H} = \frac{1}{\mu_r}\,\tilde{B}$$

Fig. 2. Leap-frog scheme: time interlacing of the fields E, D, H and B in the FDTD method.

The above iterative equations are updated (time marching) as shown in the leap-frog scheme of Fig. 2. The steps marked in Fig. 2 with 2 and 4 correspond to the constitutive relations of the fields and are punctual operations, while the steps marked with 1 and 3 correspond to Maxwell's curl equations (the data used for these updates are located in adjacent cells).
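To make the scheme concrete, the following minimal C++ sketch shows one possible mapping of the four relations above onto arrays in the one-dimensional case. It is an illustration only, not the authors' code: the function name, the array layout and the staggered-index convention (hy[k] holding the value at k+1/2) are assumptions made here.

 #include <cstddef>
 #include <vector>

 // Sketch of one leap-frog time step in 1D (not the paper's code).
 // dx[k] ~ D~_x(k), ex[k] ~ E~_x(k), by[k] ~ B~_y(k+1/2), hy[k] ~ H_y(k+1/2);
 // coef = dt*c0/dz; eps_r and mu_r are the relative material parameters.
 void leapfrogStep1D(std::vector<double>& dx, std::vector<double>& ex,
                     std::vector<double>& by, std::vector<double>& hy,
                     double coef, double eps_r, double mu_r) {
     const std::size_t K = dx.size();
     // Step 1: update D~_x from the curl of H (adjacent cells k-1/2, k+1/2).
     for (std::size_t k = 1; k < K; ++k)
         dx[k] -= coef * (hy[k] - hy[k - 1]);
     // Step 2: constitutive relation E~ = D~/eps_r (punctual operation).
     for (std::size_t k = 0; k < K; ++k)
         ex[k] = dx[k] / eps_r;
     // Step 3: update B~_y from the curl of E (adjacent cells k, k+1).
     for (std::size_t k = 0; k + 1 < K; ++k)
         by[k] -= coef * (ex[k + 1] - ex[k]);
     // Step 4: constitutive relation H = B~/mu_r (punctual operation).
     for (std::size_t k = 0; k + 1 < K; ++k)
         hy[k] = by[k] / mu_r;
 }

Steps 2 and 4 touch only the local cell, while steps 1 and 3 read neighbouring cells; this is the data dependence that the distributed version discussed in Section V has to take into account.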
Of course, an electromagnetic wave cannot propagate at a speed higher than the velocity of light. Therefore, the time step must not exceed the time that the wave would take to propagate between two points on the grid. In 2D, the way the traveling distance changes with the wave propagation direction must be taken into account (see Fig. 3 and Fig. 4).

Fig. 3. Incidence directions in 2D.

In 3D, in order to implement the iterative scheme, we use the Yee cell shown in Fig. 4.

Fig. 4. The Yee cell.

A. The Gaussian Pulse

A continuous Gaussian pulse is defined as:

$$f_g(t) = e^{-\left(\frac{t-d_g}{w_g}\right)^2} \qquad (1)$$

where the $d_g$ component represents the time delay, while the $w_g$ parameter generically is the width of the pulse. The corresponding peak value is obtained for $t = d_g$. If we express the $d_g$ delay and the $w_g$ pulse width depending on the time step $\Delta t$, we obtain the discrete version of the previous equation:

$$f_g(n\Delta t) = f_g(n) = e^{-\left(\frac{n-m}{p}\right)^2} \qquad (2)$$

In the discrete version (2) of the Gaussian pulse equation, the $\Delta t$ time step is not included explicitly.
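As a small illustration, equation (2) translates directly into code. The function below is a sketch (the name and the use of double precision are choices made here, not taken from the paper), with m and p being the delay and the pulse width expressed in time steps:

 #include <cmath>

 // Discrete Gaussian pulse of equation (2): f_g(n) = exp(-((n-m)/p)^2),
 // where m and p are d_g and w_g expressed in time steps.
 double gaussianPulse(int n, double m, double p) {
     const double arg = (n - m) / p;
     return std::exp(-arg * arg);
 }

Note that the source injection in the sequential implementation of Section III uses a slightly different convention, exp(-0.5*((t0-T)/spread)^2), i.e. a Gaussian whose width parameter plays the role of a standard deviation.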
The time-discrete version of the equation is appropriate to be translated into an algorithm which can be computed for a given set of data. Finally, it is necessary to take into account other features of the physical problem, that is, the following three constraints.

1) The electric boundary condition, since in the internal points of a perfect conductor the electric field E vanishes.
2) The magnetic boundary condition, which can be easily handled if the computational domain stops at magnetic nodes.
3) The radiation condition, for the ability to absorb waves without reflection.

III. SEQUENTIAL IMPLEMENTATION

The study of the physical phenomenon and its mathematical functions leads us to outline its implementation. The algorithm takes as input several fundamental values for the study of this phenomenon, including 5 matrices, initialized to the needed values for the computation and initialized in an optimized way (see the following). Once the values are received, the algorithm executes some checks and then reaches the 4 fundamental for loops and the pulse variable calculation.

The inputs for the algorithm are:
1) the initial matrices ga[][], dz[][], ez[][], hx[][] and hy[][], containing the necessary values required to perform the calculation;
2) the number of steps nsteps to be performed, expressing the number of times that the algorithm has to be executed;
3) T, the absolute time from which the study of the physical phenomenon starts;
4) the source parameter t0, essential for calculating the pulse variable (usually set to 20.0);
5) the spread of the source, needed for calculating the pulse variable (set to 6.0).

There are two control loops, an external while and an inner for. The first, external loop checks whether or not the condition that ends the execution of the algorithm (value 0 or a negative number) has been reached; this condition is taken as an input. The second loop counts the nsteps that have to be executed by the algorithm. It is also responsible for increasing the absolute time variable T, which measures the time from the zero instant at the beginning of the experiment until the instant n where the experiment ends, and which is also used to compute the electromagnetic values. The time variable is a float, since time flows with steps of 0.5.

The subsequent 4 for loops compute 4 of the 5 starting matrices (ga[][] is constant). The first loop computes the dz[][] matrix, using the hy[][] and hx[][] matrices multiplied by the constant 0.5. With respect to the elements of matrix hy[][] to be computed, the previous row of matrix hy[][] and the next column of hx[][] have to be made available. The dependence on such data has great implications for the transformation of the algorithm into its distributed version. Then, element by element, some additions are performed to compute the dz[][] matrix, as in the following:

 dz[i][j] = dz[i][j] + 0.5*(hy[i][j] - hy[i-1][j] - hx[i][j] + hx[i][j-1]);

Before the second for loop, a block of instructions sets the pulse variable according to the time T. After that, the resulting value is assigned to a given portion of dz[][], previously calculated; this is the so-called Gaussian impulse of the source. The corresponding code is as follows:

 pulse = exp(-0.5*(pow((t0-T)/spread, 2.0)));
 dz[IC][JC] = pulse;

The second for loop calculates the matrix ez[][] by means of the matrices ga[][] and dz[][] (the second matrix is updated by the first for loop and the Gaussian impulse). The code is as follows:

 ez[i][j] = ga[i][j] * dz[i][j];

The third for loop calculates the matrix hx[][], adding to it the multiplication between the constant value 0.5 and the updated ez[][] matrix. The next column of the matrix is required (with further implications, as in the previous cases). The relative code is:

 hx[i][j] = hx[i][j] + 0.5 * (ez[i][j] - ez[i][j+1]);

Finally, the fourth for loop calculates the matrix hy[][], like in the third loop, by using matrix hy[][], the constant value 0.5 and the updated matrix ez[][]. The corresponding code is:

 hy[i][j] = hy[i][j] + 0.5 * (ez[i+1][j] - ez[i][j]);

The said algorithm has been developed in C++, and even the sequential version has better performance than some corresponding commercial software systems. The simulation times for our developed tool were significantly lower than those required by such simulation systems, using the same hardware. This reduction is extremely advantageous when complex electromagnetic structures are studied.
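For clarity, the four loops and the source injection described above can be combined into a single time-marching routine, as in the following sketch. It does not reproduce the authors' exact code: the grid size NX×NY, the loop bounds at the domain edges, the parameter passing and the omission of the outer while loop are assumptions made here for illustration.

 #include <cmath>

 const int NX = 60, NY = 60;   // assumed grid size
 float ga[NX][NY], dz[NX][NY], ez[NX][NY], hx[NX][NY], hy[NX][NY];

 // Sketch of the sequential time-marching structure: four for-loop nests per
 // step plus the Gaussian source injection at cell (IC, JC). The enclosing
 // while loop, which keeps asking for nsteps until 0 or a negative number is
 // entered, is omitted here.
 void run(int nsteps, float T, float t0, float spread, int IC, int JC) {
     for (int n = 0; n < nsteps; ++n) {
         T += 0.5f;                            // time flows with steps of 0.5

         for (int i = 1; i < NX; ++i)          // 1st loop: dz from hy and hx
             for (int j = 1; j < NY; ++j)
                 dz[i][j] += 0.5f * (hy[i][j] - hy[i-1][j] - hx[i][j] + hx[i][j-1]);

         float pulse = std::exp(-0.5f * std::pow((t0 - T) / spread, 2.0f));
         dz[IC][JC] = pulse;                   // Gaussian impulse of the source

         for (int i = 0; i < NX; ++i)          // 2nd loop: ez from ga and dz
             for (int j = 0; j < NY; ++j)
                 ez[i][j] = ga[i][j] * dz[i][j];

         for (int i = 0; i < NX; ++i)          // 3rd loop: hx needs the next column of ez
             for (int j = 0; j < NY - 1; ++j)
                 hx[i][j] += 0.5f * (ez[i][j] - ez[i][j+1]);

         for (int i = 0; i < NX - 1; ++i)      // 4th loop: hy needs the next row of ez
             for (int j = 0; j < NY; ++j)
                 hy[i][j] += 0.5f * (ez[i+1][j] - ez[i][j]);
     }
 }

The accesses hy[i-1][j], hx[i][j-1], ez[i][j+1] and ez[i+1][j] are the data dependencies that constrain how the matrices can be split in the distributed version of Section V.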
IV. DYNAMIC VS STATIC INITIALIZATION

Firstly, it is required to initialize the 5 matrices that represent the electromagnetic magnitudes under study. They are: ga[][], dz[][], ez[][], hx[][], hy[][]. Both the size of the matrices and the values they have to hold are given as input to the algorithm. In particular, the size must be identical for all the matrices, while the values can be different. This derives from the mathematical function used to study the physical phenomenon. Such values can be given in a configuration file, in order to make the program more versatile and suitable for the investigation of the phenomenon from different starting conditions. The recommended initialization values are reported in Table I.

TABLE I. Initialization values recommended for studying the physical phenomenon.

 Matrix   Initialization value
 ga[][]   1.0
 dz[][]   0
 ez[][]   0
 hx[][]   0
 hy[][]   0

The dynamic initialization of the 5 starting matrices is provided through the initMatrix function, which has as input the matrix to initialize, its size, and the initial value that the matrix elements must take. These elements are float types. The initMatrix function consists of 2 for loops: one for loop scans the rows and one for loop scans the columns. Within the 2 for loops, an indexed assignment of the matrix elements to the desired value is performed. It can be expected that a small part of the array will be set to a starting value other than the rest (which is usually 0 or 1.0).
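Following the description above, a minimal sketch of such an initialization routine could look as follows; the exact signature (a raw float pointer plus row and column counts) is an assumption made here, since the paper does not give the prototype.

 // Sketch of the dynamic initialization: two nested for loops perform an
 // indexed assignment of every element to the desired starting value. The
 // matrix is assumed to be stored as contiguous float memory of rows*columns.
 void initMatrix(float* matrix, int rows, int columns, float value) {
     for (int i = 0; i < rows; ++i)            // scan the rows
         for (int j = 0; j < columns; ++j)     // scan the columns
             matrix[i * columns + j] = value;  // indexed assignment
 }

With the matrix stored contiguously, the pointer-based variant discussed later in this section reduces the two loops to a single scan over rows * columns elements.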
A performance comparison between the proposed dynamic initialization and a static initialization of the matrices is shown in Table II; average compilation and execution times are given for both versions.

TABLE II. Execution and compilation times of the static and dynamic versions.

         Static initialization                      Dynamic initialization
 Dim.    Compiling    Execution    Executable       Compiling    Execution
 50      6.326 s      2.142 s      1.5 Mb           0.712 s      2.102 s
 70      15.843 s     2.995 s      4.1 Mb           0.495 s      3.106 s
 80      95.257 s     5.591 s      6.0 Mb           0.453 s      5.649 s

If the matrix size increases, a slight improvement in the performance of the statically initialized matrices has been observed. This result is apparently positive, considering the big size typical of the matrices; however, it only affects about 0.7% of the initialization time, and it depends on the matrix size. In addition, the static approach has several limits, some of which are really considerable. A brief list of such limits is reported below.

1) The compilation time, which must actually be added to the program runtime for each run, is in percentage really high compared to the run time (due to the huge amount of data needed for pre-storage). This consideration, combined with the fact that the data are not constant but are taken as input, means that this time has to be taken into account whenever the program is executed.
2) The limit on the number of elements that can be compiled, which causes a virtual memory exhaustion error (depending on the host; for a standard host it is around 853 elements).
3) The dependence on a C++ compiler (such as g++) that each host must have installed to let the program work.
4) The inability to create an executable version of the software system, which in fact reduces the user-friendliness for less experienced users.

To further improve the dynamic initialization performance, we have used pointers to access the matrix elements, since pointer arithmetic is faster than array indexing:

 k = rows * columns;
 for (i = 0; i < k; i++)
     *(matrix + i) = value;

Here matrix is the pointer to the one-dimensional array representing the matrix; the integers i and j are used to scan the matrix; rows and columns are int variables holding the total number of rows and columns; and, finally, value contains the float value of the cells in the array, given to the initMatrix function. Further tests indicated an improvement in performance, hence reducing the slight gap between the two approaches.

V. PARALLEL AND DISTRIBUTED VERSION

By distributing the 5 starting matrices on multiple hosts it is possible to parallelize and distribute the execution, overcome the physical limits of a single host (large amount of data in volatile memory, etc.), and speed up the processing [6], [17]. The conceived solution splits the matrices into x sub-matrices and distributes them to the x hosts available for the processing. The results on each host should then be sent to a server collecting and showing them.

Fig. 5. Example of the split into 4 sub-matrices, for the 5 initial matrices (ga[][], dz[][], ez[][], hx[][] and hy[][]).

The intrinsic structure of the implementation (see the above description of the sequential version) limits the distribution approach. In order to compute some matrices (i.e. dz[][], hx[][] and hy[][]), some rows or columns are needed from other, previous or successive, matrices (hy[][], hx[][] and ez[][]). The parallelism of the problem at hand can be seen according to the producer-consumer model. Once the row or column values that are needed by another host to calculate other sub-matrices have been calculated, they are sent by means of sockets. The underlying idea for the distribution is to have either a chain of n hosts, or a number of hosts organised as a grid. Moreover, each host running a part of the processing manages the communication with the nearest hosts.

The first part of the execution is handled by the server.cpp program, which chooses the number of hosts that will be used (among the available n), according to the size of the 5 starting matrices: ga[][], dz[][], ez[][], hx[][] and hy[][]. Once the x hosts have been chosen, the above matrices are split into x sub-matrices and sent to the x hosts. The sub-matrices are spread depending on the position of the