=Paper=
{{Paper
|id=Vol-2344/paper3
|storemode=property
|title=Features of hardware implementation of Particle Swarm Optimization (PSO) on FPGA
|pdfUrl=https://ceur-ws.org/Vol-2344/paper3.pdf
|volume=Vol-2344
|authors=Danila Nikiforovskii,Ivan Deyneka,Daniil Smirnov
|dblpUrl=https://dblp.org/rec/conf/micsecs/NikiforovskiiDS18
}}
==Features of hardware implementation of Particle Swarm Optimization (PSO) on FPGA==
Danila Nikiforovskii, Ivan Deyneka, and Daniil Smirnov

Research Institute of Light-Guided Photonics, ITMO University, Saint-Petersburg, Russia
{danikiforovskii, igdeyneka, dsmirnov}@corp.ifmo.ru
http://sf.ifmo.ru

Abstract. The Swarm Intelligence algorithm family is a widespread tool for solving optimization problems occurring in various areas of computer science and data processing. Nevertheless, the majority of existing solutions are software-based. On the other hand, the very essence of independent agents predisposes the approach to hardware implementation employing parallel calculations. The Field Programmable Gate Array (FPGA) is a perfect platform for implementing such a system. A Particle Swarm Optimization hardware implementation using an FPGA made by Intel FPGA (formerly Altera) is described in this paper. Both genuine parallel and sequential schemes are considered, as well as different approaches regarding arithmetic operations.

Keywords: FPGA, PSO, swarm, Intel, optimization, hardware, robustness, parallel

===1 Introduction===

Modern computer science and IT face various optimization problems in different areas, such as curve fitting, data mining, artificial intelligence learning, etc. To solve problems where the analytical evaluation of gradients is impossible or has a high computational cost, several meta-heuristic algorithms have been proposed. The swarm algorithm family is one such group. It consists of population-based algorithms with group behavior emerging from the individual behavior of independent agents moving in a search space to find a solution. As with any other population-based algorithm, different movement and communication rules and topologies can change the overall behavior of the particles and of the algorithm as a whole. Even within a group of algorithms with an identical ruleset, convergence and the ability to find correct results depend strongly on the choice of algorithm parameters. Because of the probabilistic nature of agent behavior, there are no universally suitable parameters. The selection of appropriate problem-specific parameters is a problem in itself, which requires additional effort to solve by various methods, including meta-optimization.

One of the advantages of swarm intelligence algorithms is the relative ease of software implementation, which ensues from the basic principle of simple movement rules for individual agents. Therefore, most swarm algorithm implementations are software-based. This is often sufficient, as it provides a result in finite time, which is acceptable in a lab environment. On the other hand, optimization problems also occur in the area of device control and management. In this case, timing constraints are present, which may leave a software implementation unable to find the result within a suitable time without additional customization that generally damages precision.

The multi-agent nature of swarm algorithms, however, allows construction of a hardware system based on independent agent blocks. Because the agents are, in general, identical, only one block needs to be developed, which can then be reused or re-instantiated. There are different approaches and platforms to implement such systems, such as the development of ASICs with dedicated computation of the agent mechanics. It is possible to use GPUs instead.
However, the use of GPUs implies a conventional desktop computer or at least a CPU. The use of FPGAs, on the other hand, allows integration of a swarm intelligence implementation into another system implemented on FPGAs, which eliminates the need to construct more complex interfaces between the swarm intelligence and other devices. The ability to perform genuine parallel computations at an affordable cost makes FPGAs a perfect computational platform for mathematical tasks that involve repeating operations over independent entities, such as calculating the inverse Hessian matrix for every data point in curve fitting problems, or the cost function for every agent in swarm intelligence algorithms. An FPGA-based hardware implementation of the Particle Swarm Optimization algorithm is described in this paper.

===2 Particle Swarm Optimization===

Particle Swarm Optimization is one of the swarm intelligence algorithms. It solves the multi-dimensional global optimization problem, i.e., given an $m$-variable function $f(x_1, x_2, \ldots, x_m): \mathbb{R}^m \to \mathbb{R}$, it finds a vector $\mathbf{x}$ such that $f(\mathbf{x}) < f(\mathbf{x}')\ \forall \mathbf{x}' \neq \mathbf{x}$.

There are different ways to define the $m$-variable cost function $f$. The most obvious one is an explicit analytical expression. However, in that case it is possible to calculate the derivatives and the gradient, as well as the second derivatives to construct the Hessian matrix. It is therefore possible to use other methods to solve the optimization problem, such as the well-known gradient descent method or the Levenberg-Marquardt algorithm; although the Particle Swarm algorithm can find an appropriate solution in this case, it is less effective.

There are situations, however, where the calculation of derivatives is computationally difficult; in this scenario, swarm intelligence is much more suitable. If the analytical representation of the function is unknown, the cost function can be defined as a series of observations or measurements. In this case, the coordinate vector $\mathbf{x}$ corresponds to the parameters of such a measurement; these parameters can reflect physical values (e.g. the position of the measurement, voltage levels, etc.) as well as specific instructions for the external black box performing the measurement. There can then be no information about the gradient in the general sense, which renders traditional methods relying on it virtually useless.

The Particle Swarm Optimization algorithm performs optimization by moving agents, each representing a candidate solution, in the search space. Aside from the position vector, a velocity vector is assigned to each agent. In addition, each agent possesses knowledge of the best previously found personal position $p$ and the cost function value at that position. The only means of interaction between agents is the best previously found global position $g$, available to every agent. On each iteration of the algorithm, a new value of the velocity vector is calculated for every agent using the expression

$$v^{j+1} = \omega v^{j} + \phi_p r_p (p - x^{j}) + \phi_g r_g (g - x^{j}) \qquad (1)$$

where $\omega$, $\phi_p$ and $\phi_g$ are predefined constant values, and $r_p$ and $r_g$ are random values in the range $[0; 1]$ with uniform distribution [1]. After the update of the velocity, the agent moves to a new location according to the simple motion law:

$$x^{j+1} = v^{j+1} + x^{j} \qquad (2)$$

Each step concludes with an evaluation of the cost function at the new point. If $f(x^{j+1}) < f(p)$, which means that the current position is better than the previously found personal best, then the value of $p$ is updated, i.e. its old value is replaced with $x^{j+1}$. The final step of the iteration is the evaluation of $g$: if $f(p_{\min}) < f(g)$, where $p_{\min}$ is the overall best of the personal best values of the swarm, then the value of $g$ is replaced with $p_{\min}$.

There are many different variations of the algorithm. The degree of interaction can be changed, additional terms can be added to eq. (1), and so on. Still, as long as the basic principles of PSO remain unchanged, the conclusions of this article apply to any variation of PSO. The basic version of PSO, based strictly on equations (1) and (2), is described in this article.
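As a cross-check of equations (1) and (2) and the best-position updates, the following is a minimal software reference sketch of one PSO iteration in Python. It is not the authors' VHDL implementation, and the parameter values $\omega$, $\phi_p$, $\phi_g$ are illustrative assumptions, since the paper leaves them unspecified:

```python
import random

# Illustrative PSO constants; the paper does not fix omega, phi_p, phi_g.
OMEGA, PHI_P, PHI_G = 0.7, 1.5, 1.5

def pso_step(agents, f, g, f_g):
    """One iteration of basic PSO per equations (1) and (2).

    Each agent is a dict with position "x", velocity "v", personal best
    "p" and its cost "f_p". Returns the updated global best and its cost.
    """
    for a in agents:
        for d in range(len(a["x"])):
            rp, rg = random.random(), random.random()   # r_p, r_g ~ U[0; 1]
            # Eq. (1): inertia, cognitive and social terms.
            a["v"][d] = (OMEGA * a["v"][d]
                         + PHI_P * rp * (a["p"][d] - a["x"][d])
                         + PHI_G * rg * (g[d] - a["x"][d]))
            a["x"][d] += a["v"][d]                      # Eq. (2)
        fx = f(a["x"])
        if fx < a["f_p"]:                               # personal best update
            a["p"], a["f_p"] = list(a["x"]), fx
    # Final step: compare the overall best personal value p_min against g.
    best = min(agents, key=lambda a: a["f_p"])
    if best["f_p"] < f_g:
        g, f_g = list(best["p"]), best["f_p"]
    return g, f_g

# Example: 16 agents on the 2-D sphere function.
f = lambda x: sum(v * v for v in x)
agents = [{"x": [random.uniform(-1, 1) for _ in range(2)],
           "v": [0.0, 0.0]} for _ in range(16)]
for a in agents:
    a["p"], a["f_p"] = list(a["x"]), f(a["x"])
g, f_g = min(((a["p"], a["f_p"]) for a in agents), key=lambda t: t[1])
for _ in range(100):
    g, f_g = pso_step(agents, f, g, f_g)
```

The hardware schemes described in the next section map this same per-agent loop either onto replicated agent blocks (parallel scheme) or onto one time-shared block (sequential scheme).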
===3 FPGA-based hardware implementation===

There are several key points in constructing a hardware-based implementation of the Particle Swarm algorithm. The main advantage is, as mentioned above, the ability to use parallel computing [3]. Since each agent is independent, the algorithm can be implemented using both a genuine parallel and a sequential approach.

Fig. 1: Genuine parallel system scheme.

The genuine parallel system is depicted in Figure 1. Since the only means of interaction is the broadcast value of the $g$-vector, each of the agent blocks can run independently, which can vastly improve performance. In addition, since these blocks are identical, a single developed block can be re-instantiated. However, each block has to include the evaluation of the optimization function, which can be computationally costly, as well as a random number generator and an arithmetic system to perform multiplication and addition. Moreover, as all agents are independent, the values of $x$, $v$ and $p$ should be stored inside each block to allow simultaneous access. Therefore, memory blocks cannot be employed to store these values; tightly coupled memory should be used instead.

Fig. 2: Sequential system scheme.

The system built with the sequential approach is depicted in Figure 2. The parameters of an agent ($x$, $v$, $p$) are stored in memory. After processing in the agent block and evaluating the cost function, the updated parameters are written back, and the next agent can be processed. This is possible because the agents are independent; instead of implementing copies of the agent blocks, the system is reduced to one block. Both embedded FPGA memory blocks and any external memory device can be used, as there is no need for simultaneous access. An $m$-dimensional optimization problem employing $n$ agents requires storage of three values ($x$, $v$, $p$) per dimension per agent, one value of $f(p)$ per agent, one value of $g$ per dimension, and one value of $f(g)$. Therefore, the total number of bits needed to implement PSO is

$$\big((3m + 1)\,n + m + 1\big)\,b, \qquad (3)$$

where $b$ is the number of bits per number. For instance, 128 agents in a 6-dimensional optimization problem demand 76.2 Kbit in single precision. As with the other components, there is only one instance of the cost function evaluation block. If the evaluation of the cost function involves any external device, simultaneous evaluation of the cost function is impossible, so the sequential approach presents the only viable option.
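As a quick check of formula (3) and the 76.2 Kbit example above, a small sketch (the helper name is hypothetical):

```python
def pso_memory_bits(m: int, n: int, b: int) -> int:
    """Total storage per formula (3): three vectors (x, v, p) per agent,
    f(p) per agent, plus the shared g vector and f(g)."""
    return ((3 * m + 1) * n + m + 1) * b

# The paper's example: 128 agents, 6 dimensions, single precision (b = 32).
bits = pso_memory_bits(m=6, n=128, b=32)
print(bits, bits / 1024)  # 78048 bits, i.e. about 76.2 Kbit
```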
As shown in equations (1) and (2), the only operations employed are addition and multiplication. Therefore, there are two different strategies for the implementation of the arithmetic blocks.

The first one involves the use of fixed-point or integer arithmetic, which demands quantization of the search space. In this case, fixed-point coordinates correspond to the address of a single cell. However, if it is impossible to evaluate the cost function with integer arithmetic operations alone, the integer values of the candidate solution point coordinates have to be converted back to floating-point values every time the cost function is evaluated. A transformation of the cost function is also necessary to adjust to the changed scale of the variables; e.g. the function $f(x_1, x_2) = x_1^2 + 2x_2$, $x_1, x_2 \in [-1, 1]$, should be transformed into $f'(x_1', x_2') = \left(\frac{x_1'}{32767}\right)^2 + 2\,\frac{x_2'}{32767}$, $x_1', x_2' \in [-32767, 32767]$.

To perform the initial seeding as well as the calculations according to equation (1), several stochastic values have to be generated. In the fixed-point strategy, the initial seeding, i.e. generating random values in the search space, is still integer and can be implemented by any widespread method, such as linear feedback shift registers (LFSRs) [4]. The values of $r_p$ and $r_g$, however, are distributed in the range $[0; 1]$, which goes against the integer approach. On the other hand, the coordinates of the products $r_p(p - x)$ and $r_g(g - x)$ are distributed in the ranges $[0; (p - x)]$ and $[0; (g - x)]$ respectively, so they can be implemented using LFSRs as well. The parameters $\omega$, $\phi_p$ and $\phi_g$, which in general might be real values, are still constants, and their multiplication can be reduced to a set of integer additions, bit shifts (for division by a power of 2) and integer multiplications.

The fixed-point approach definitely leads to precision loss, as the solution cannot be improved after the cell containing the optimal value is found. Still, many optimization problems do not require a precise solution: a good enough solution acquired in reasonable time is considered better than a precise solution that could be acquired only in infinite time. If the precision is insufficient, a second round of optimization can be performed, with the new search space being the neighborhood of the preliminary solution point.

Although the fixed-point approach is more effective for the intrinsic PSO operations, there may be no overall advantage, because of the need to convert values. The cost function itself may contain computationally costly operations such as exponentiation, logarithms and so on, which cannot be reduced to fixed-point calculations without significant loss of precision.

The other way is to use floating-point arithmetic in the intrinsic mechanics of the particle swarm algorithm. In this scenario, there is no need to convert values. If a CPU is present in the system, the floating-point approach allows native interchange of data between the CPU and the PSO block. However, this strategy implies the use of specialized arithmetic blocks for addition and multiplication; therefore, the area demands increase.

In the context of random number generation, the only challenge the floating-point strategy poses is the generation of a uniformly distributed value. The LFSR bus output is uniformly distributed when treated as an integer, but the distribution of a floating-point number obtained by straightforward reinterpretation of the LFSR bus output is far from uniform. Several ways of generating floating-point random numbers are available, given the possibility of fixing the output range of the generator. For example, certain LFSR bits can be used as the significand, with the exponent value adjusted to keep the resulting value just below the upper threshold of the range; a sketch of this construction follows.
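A minimal behavioral sketch of one common realization of this idea (Python, illustrating the bit-level trick rather than the authors' exact generator): fixing the exponent field to that of 1.0 and filling the significand with LFSR bits yields a uniform value in $[1; 2)$, which is shifted to $[0; 1)$ by subtracting 1. The tap set is a commonly listed maximal-length choice, an assumption rather than the paper's.

```python
import struct

def lfsr_step(state: int) -> int:
    """32-bit Fibonacci LFSR step, taps (32, 22, 2, 1)."""
    bit = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    return ((state << 1) | bit) & 0xFFFFFFFF

def uniform_from_lfsr(state: int) -> float:
    """Pack 23 LFSR bits into the significand of an IEEE-754 single
    with the exponent of 1.0 (biased 127), giving a value in [1, 2);
    subtracting 1 shifts it to [0, 1)."""
    bits = (127 << 23) | (state & 0x7FFFFF)   # sign 0, exp 127, LFSR mantissa
    return struct.unpack("<f", struct.pack("<I", bits))[0] - 1.0

# Example: draw a few uniform samples from a seeded LFSR.
s = 0xACE1
for _ in range(3):
    s = lfsr_step(s)
    print(uniform_from_lfsr(s))
```

In hardware, the final subtraction can be avoided by adjusting the exponent field directly, which keeps the generator free of floating-point arithmetic blocks.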
To prevent precision-related issues linked to the representation of the values from affecting the behavior of the swarm, a normalized search space approach can be utilized. It is possible to scale the variables in the cost function in such a way that the search space is transformed to $[-1; 1]^m$; in this case, all agents move with velocities of the same order across all dimensions.

As PSO is a method for solving multi-dimensional problems, building the system in a genuinely parallel fashion requires an independent calculation subsystem not only for every agent, but for every dimension within each agent block as well. Equations (1) and (2) comprise 5 multiplications and 5 additions per agent per dimension, which vastly increases the area demands even with the fixed-point approach. As this seems impractical, finite state machine (FSM) based scheduling can reduce this to 1 multiplication and 1 addition per agent. In the case of the sequential strategy, i.e. when only one agent block is implemented, it is possible to construct a parallel scheme regarding dimensions only.

The cost function evaluation can be a very computationally costly task. For example, the evaluation of the sum of squared residuals

$$\sum_{i=0}^{n} \big(\Phi(x(i), \beta_0, \beta_1, \ldots, \beta_m) - y(i)\big)^2 \qquad (4)$$

where $x(i)$ and $y(i)$ are the coordinates of the data points and $\beta_i$ are the independent parameters of the target function, demands calculation of $\Phi(x)$ $n$ times for every candidate parameter set. $\Phi(x)$ itself can be quite complex and can contain multiple instances of resource-intensive functions such as exponentiation, trigonometric functions, etc. For example, it takes 21 clock cycles to perform a logarithm calculation, commonly used in statistics-related tasks, with the dedicated Intel FPGA IP core in single precision, versus 7 clock cycles to perform an addition. In the majority of applications, the evaluation of the cost function demands more time than the operations of PSO. Therefore, the performance increase achieved by constructing a parallel PSO system is relatively small, especially considering the FPGA area demands. Still, the absence of dependencies between candidate solutions allows implementation of multiple cost function evaluation blocks. The major setback is, again, the FPGA area requirements, as a genuinely parallel dataflow demands $n$ independent cost function evaluation blocks. It is possible, however, to construct a system with fewer cost function blocks by dividing the agents into subsets assigned to separate blocks and employing a scheduling system inside each subset.

===4 Validation Examples===

Using the considerations described above, a hardware version of Particle Swarm Optimization was implemented on the Intel FPGA EP4CE115F29C7N (Cyclone IV) platform, which is part of the Terasic DE2-115 development board. Consecutive FSM-based scheduling and genuine parallel schemes have been implemented, as well as a mixed scheme with partial parallelism.

====4.1 Parallel strategy, fixed-point approach====

The multi-dimensional sphere function

$$\sum_{i=0}^{m} (x_i - a_i)^2 \qquad (5)$$

was chosen as the cost function for the implementation of the parallel PSO system with the fixed-point approach. As no additional arithmetic blocks are needed either for PSO or for the cost function evaluation, the math operations were implemented using standard VHDL operators.
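For illustration, a behavioral sketch of a fixed-point sphere-function evaluation per eq. (5) (Python; the 16-bit coordinate format and the bit widths in the comments are assumptions, not taken from the paper):

```python
# Behavioral model of a fixed-point sphere-function block, eq. (5).
# Assumed format: coordinates are 16-bit signed integers in [-32767; 32767];
# each squared difference then fits in 33 bits, so a few guard bits on the
# accumulator cover the sum (e.g. a 40-bit accumulator suffices for m <= 64).

def sphere_fixed(x, a):
    acc = 0
    for xi, ai in zip(x, a):
        d = xi - ai        # up to 17-bit signed difference
        acc += d * d       # accumulate the squared term at full width
    return acc

print(sphere_fixed([1000, -2000, 3000], [0, 0, 0]))  # 14000000
```

In hardware, this reduces to one subtractor, one multiplier and one accumulator per dimension, which is why standard VHDL operators suffice here.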
The results of the experimental run are shown in Figure 3.

Fig. 3: Best achieved cost function value versus iteration in the genuine parallel fixed-point PSO system.

As can be seen in Figure 4, a large portion of the FPGA is used by the system, although the sphere function is relatively easy to implement. The least squares method relies on equations that can be considered multi-dimensional sphere equations. This allows the use of the genuine parallel fixed-point PSO system to solve curve fitting problems.

Fig. 4: FPGA resources utilized by the hardware implementation of the genuine parallel fixed-point PSO system. Different shades correspond to different instances of the agent block.

====4.2 Sequential strategy, floating-point approach====

The multi-dimensional Rosenbrock function

$$\sum_{i=1}^{m/2} \Big(100\,\big(x_{2i-1}^2 - x_{2i}\big)^2 + (x_{2i-1} - 1)^2\Big) \qquad (6)$$

was chosen as the cost function for the implementation of the sequential PSO system with the floating-point approach [2]. The arithmetic operations, both for PSO and for the evaluation of the cost function, were implemented using Intel FPGA floating-point arithmetic IP cores. The results of the hardware test are shown in Figure 5. The algorithm converges after 223 iterations. The behavior of the hardware system does not diverge from software modeling, and all of the launches ended with the optimal value found.

Fig. 5: Best achieved cost function value versus iteration in the sequential floating-point PSO system.

Fig. 6: FPGA resources utilized by the hardware implementation of the sequential floating-point PSO system.

As seen in Figure 6, the area of the FPGA employed by the sequential system is considerably smaller than the area employed by the parallel system, as expected, at the expense of calculation speed. To estimate the calculation time needed to evaluate the cost function, behavioral simulation was carried out in the ModelSim environment.

===5 Conclusion===

Different aspects of the FPGA hardware implementation of the particle swarm algorithm are described in this paper. Both genuine parallel and sequential schemes of PSO are viable and can be used to solve optimization problems. The Intel FPGA EP4CE115F29C7N (Cyclone IV) platform was used for the implementation.

Both simulation and hardware testing have verified the efficiency of the hardware implementation. The results achieved in testing correspond to those achieved in preliminary modelling. The achieved values are the global minima of the benchmark functions used in testing.

Acceleration of the algorithm may be achieved by employing parallel calculation of the vectors $v$ and $x$ as well as of the cost function value. There is an expected trade-off between FPGA area demands and the performance boost achieved by parallel computing. If the problem-specific cost function is computationally costly, as e.g. in curve fitting or data mining problems, evaluating the cost function takes the largest portion of the computation time. In that case there is no need to employ a parallel scheme to boost the performance of the agent computing blocks; instead, parallel cost function evaluation blocks should be implemented, if possible, to achieve a performance increase. It takes 964 clock cycles to perform one iteration of PSO in the sequential hardware scheme. The algorithm converges after 223 iterations (data averaged over 50 runs), or 2.15 ms.
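These figures are mutually consistent if one assumes a 100 MHz system clock, which is an inference from the numbers above rather than a value stated in the paper:

$$964\ \tfrac{\text{cycles}}{\text{iteration}} \times 223\ \text{iterations} = 214\,972\ \text{cycles}, \qquad \frac{214\,972\ \text{cycles}}{2.15\ \text{ms}} \approx 100\ \text{MHz}.$$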
Thus, even the sequential hardware implementation of the Particle Swarm algorithm is faster than a software implementation due to the use of dedicated arithmetic blocks. Hardware implementation of PSO on an FPGA allows the use of the system together with other devices, namely the CPU portions of SoCs (systems on chip) or soft processors, which in turn allows construction of complex control and management devices with a built-in PSO system.

Acknowledgments. This work was supported by the Ministry of Science and Higher Education of the Russian Federation (the unique identifier of the project: goszadanie No. 8.3134.2017/4.6).

===References===

1. Calazan, R.M., Nedjah, N., Mourelle, L.M.: Parallel coprocessor for PSO. International Journal of High Performance Systems Architecture (4), 233-240 (2011)
2. Tang, K., Yao, X., Suganthan, P.N., MacNish, C., Chen, Y.P., Chen, C.M., Yang, Z.: Benchmark functions for the CEC 2008 special session and competition on large scale global optimization. Tech. rep. (2007)
3. Calazan, R.M., Nedjah, N., Mourelle, L.M.: A hardware accelerator for Particle Swarm Optimization. Applied Soft Computing, Elsevier (2013)
4. Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes: The Art of Scientific Computing, Third Edition. Cambridge University Press (2007)