<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Features of hardware implementation of Particle Swarm Optimization (PSO) on FPGA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Danila Nikiforovskii</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Deyneka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniil Smirnov</string-name>
          <email>dsmirnovg@corp.ifmo.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Institute of Light-Guided Photonics, ITMO University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Swarm Intelligence algorithm family is a widespread tool to solve optimization problems occurring in various areas of computer science and data processing. Nevertheless, the majority of existing solutions are software-based. On the other hand, the very essence of independent agents predisposes to hardware implementation employing parallel calculations. Field Programmable Gate Array (FPGA) is the perfect platform to implement such system. Particle Swarm Optimization hardware implementation using FPGA made by Intel FPGA (formerly Altera) is described in this paper. Both genuine parallel and sequential schemes are considered, as well as di erent approaches regarding arithmetic operations.</p>
      </abstract>
      <kwd-group>
        <kwd>FPGA</kwd>
        <kwd>PSO</kwd>
        <kwd>swarm</kwd>
        <kwd>Intel</kwd>
        <kwd>optimization</kwd>
        <kwd>hardware</kwd>
        <kwd>robustness</kwd>
        <kwd>parallel</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Modern computer science and IT face various optimization problems in di erent
areas, such as curve tting, data mining, arti cial intelligence learning, etc. To
solve the problems, where the analytical evaluation of gradients is impossible
or have a high computational cost, several meta-heuristic algorithms have been
proposed.</p>
      <p>The swarm algorithm family is one of such groups. It consists of
populationbased algorithms with group behavior emerging from individual behavior of
independent agents moving in search space to nd a solution. As with any other
population-based algorithms, the di erent movement and communication rules
and topologies can change the overall behavior of the particles and algorithm as
a whole. Even within the group of algorithms with identical ruleset, convergence
and the ability to nd correct results depends strongly on the choice of algorithm
parameters. Because of the probabilistic nature of the agent behavior, there is
no universally suitable parameters. The selection of appropriate problem-speci c
parameters is a problem itself, which requireS additional e ort to solve by various
methods, including meta-optimization.</p>
      <p>One of the swarm intelligence algorithms advantages is the relative ease of
the software implementation, which ensues from the basic principle of simple
movement rules of individual agents. Therefore, most of the swarm algorithms
implementations are software based. It is often su cient, as it provides a result
in a nite time, which is acceptable in lab environment. On the other hand,
optimization problems can occur in the area of device control and management.
In this case, the timing constraints are presented, which leads to software
implementation being unable to nd the result in a suitable time and without
additional customization, it, generally, damaging the precision.</p>
      <p>On the other hand, the multi-agent nature of swarm algorithms allows
construction of the hardware system based on the independent agent blocks.
Because the agents are, in general, identical, the development of only one block
is required, which can be then reused or re-instantiated. There are di erent
approaches and platforms to implement such systems, such as development of
ASICs with dedicated computation of the agent mechanics. It is possible to use
GPUs instead. However,the use of GPUs implies the conventional desktop
computer or at least the CPU. The use of the FPGAs, on the other hand, allows
integration of swarm intelligence implementation into another system
implemented on FPGAs, which eliminates the need for construction of more complex
interfaces between swarm intelligence and other devices. The ability to perform
genuine parallel computations with the consideration of a ordable cost makes
them the perfect computational platform to perform mathematical tasks that
imply the repeating operations in independent entities, such as calculating the
invert Hessian matrix for every data point in curve tting problems or cost
function calculation for every agent in swarm intelligence algorithms. FPGA
based hardware implementation of Particle Swarm Optimization Algorithm is
described in this paper.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Particle Swarm Optimization</title>
      <p>Particle Swarm Optimization is one of the swarm intelligence algorithms. It
is the algorithm to solve the multi-dimensional global optimization problem
i.e. to nd such vector x : f (x) &lt; f (x0); 8x0 6= x, given m-variable function
f (x1; x2; : : : ; xm) : Rm ! R. There are di erent methods to de ne the
mvariable cost function f . The most obvious one is to de ne it by an explicit
analytical expression. However, in this case, it is possible to calculate the
derivatives and the gradient, as well as second derivatives to construct Hessian matrix.
Therefore, it is possible to use other methods to solve optimization problems,
such as well-known gradient descent method or LevenbergMarquardt algorithm;
although Particle Swarm Algorithm can nd the appropriate solution in this
case, it is less e ective.</p>
      <p>
        There are situations, however, where calculation of derivatives is very
computationally di cult; in this scenario, swarm intelligence is much more suitable.
If the analytical representation of the function is unknown, cost function can be
de ned as a series of observations or measurements. In this case the
coordinatevector x corresponds to the parameters of such measurement; these parameters
can re ect both the physical values (e.g. the position of the measurement, voltage
levels etc.) as well as speci c instructions for the external black box performing
the measurement. In this case, there can be no information about gradient in
general sense, which renders traditional methods relying on it virtually useless.
Particle Swarm Optimization algorithm performs optimization by moving the
agents, each representing the candidate solution in search space. Aside from the
position vector, the velocity vector is assigned to each agent. In addition, each
agent possesses the knowledge of the best previously found personal position and
cost function value at that position. The only way of interaction between agents
is the best previously found global position, available to every agent. On each
iteration of the algorithm, the new value of the velocity vector is calculated for
every agent using the expression
vj+1 = !
vj + p rp (p
xj ) + g
rg (g
xj )
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        )
where !, p and g are prede ned constant values. rp and rg are random
values in range of [0;1] with the uniform distribution[1].
      </p>
      <p>
        After the update of the velocity, the agent moves to a new location according
to the simple motion law:
xj+1 = vj+1 + xj
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
      </p>
      <p>Each step concludes with an evaluation of cost function at the new point. If
f (xj+1) &lt; f (p)) , which means that current position is better then previously
found personal best, then the value of p is updated, i.e. its old value is replaced
with xj+1. The nal step of the iteration is the evaluation of g . If f (pmin) &lt;
f (g), where pmin is the overall best of personal values of a swarm , then the
value of g is replaced with pmin.</p>
      <p>
        There are many di erent variations of the algorithm. The degree of
interaction can be changed, there are possibilities to add another terms to eq. (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and
so on. Still, if basic principles of PSO remain unchanged, the conclusions of this
article can be applied to any variation of PSO. The basic version of PSO, based
strictly on equations (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) is described in this article.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>FPGA-based hardware implementation</title>
      <p>There are several key points in constructing the hardware-based
implementation of Particle Swarm Algorithm. The main advantage is, as aforementioned,
the ability to use parallel computing[3]. Since each agent is independent, there
are possibilities to implement the algorithm using both genuine parallel and
sequential approach.</p>
      <p>The genuine parallel system is depicted on Figure 1. Since the only way of
interaction is broadcast value g-vector, each of the agent blocks can run
independently, which can vastly improve performance. In addition, since these blocks are
identical as well, single developed block can be re-instantiated. However, each
block has to include the evaluation of optimization function, which can be
computationally costly, as well as a random number generator and an arithmetic
system to perform multiplication and addition. Moreover, as all agents are
independent, the values of x; v and p should be stored inside each block to allow
simultaneous access. Therefore, memory blocks cannot be employed to store
these values. Tightly coupled memory should be used instead.</p>
      <p>The system built with the sequential approach is depicted on Figure 2.
Parameters of agent (x; v; p) are stored in memory. After the processing in the
agent block and evaluating the cost function, the updated parameters are
written back, and the next agent can be processed. This is possible because agents
are independent, but instead of implementing the copies of agent blocks, the
system is reduced to one block. Both embedded FPGA blocks and any external
memory device can be used, as there are no need for simultaneous access. In
mdimensional optimization problem that employ n agents, there is need to store
three values (x; v; p) per dimension per agent, one value of f (p) per agent, one
value of g per dimension, and one value of f (g).</p>
      <p>
        Therefore, total number of bits needed to implement PSO is
((3m + 1)n + m + 1)b;
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
where b is number of bits per number. For instance, 128 agents in a
6dimensional optimization problem would demand 76.2 Kbit in single precision.
As with other components, there are only one instance of cost function evaluation
block. If the evaluation of cost function involves any external device, there is no
possibility to implement simultaneous evaluation of cost function, so sequential
approach presents the only viable option.
      </p>
      <p>
        As shown in equations (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), the only operations employed are
addition and multiplication. Therefore, there are two di erent strategies
considering the implementations of arithmetic blocks. The rst one involves the use
of xed-point or integer arithmetic, which demands the quantization of search
space. In this case, xed-point coordinates correspond to address of a single
cell. However, if it is impossible to perform evaluation of the cost function
with just integer arithmetic operations, the integer values of candidate
solution point coordinates should be converted back to oating-point values
every time the cost function is evaluated. The transformation of the cost
function is also necessary to adjust to changed scale of variables, e.g. function
f (x1; x2) = x21 + 2x2; x1; x2 2 [ 1; 1] should be transformed into f 0(x01; x02) =
( 32x70167 )2 + 2 32x70267 ; x01; x02 2 [ 32767; 32767].
      </p>
      <p>
        To perform the initial seeding as well as the calculations according to equation
(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), several stochastic values should be generated. In xed-point strategy, the
initial seeding, i.e. generating the random values in search space is still integer
and can be implemented by any widespread method, e.g. such as linear feedback
shift registers (LFSRs)[4]. The values of rp and rg however, are distributed in
range of [0; 1], which goes against the integer approach. On the other hand,
coordinates of the products rp(p x) and rg(g x) are distributed in range
of [0; (p x)] and [0; (g x)] correspondingly, which can be implemented using
LFSRs as well. Parameters !; p and g which, in general, might be real values,
are still constants, and implementation of the multiplication could be reduced to
the set of integer addition, bit shifting (for division by the power of 2) and integer
multiplication. Fixed-point approach de nitely leads to precision loss, as solution
can not be improved after the cell containing optimal value is found. Still, many
optimization problems do not require precise solution, as good enough solution
acquired in reasonable time is considered better than precise solution that could
be acquired only in in nite time. If the precision is insu cient, the second round
of optimization can be performed, with new search space being the neighborhood
of the preliminary solution point.
      </p>
      <p>Despite xed-point approach is more e ective considering intrinsic PSO
operations, there can be no overall advantage, because there is need to convert
values. Cost function itself may contain computational-costly operations such as
exponentiation, logarithms and so on, which cannot be reduced to xed-point
calculations without the signi cant loss of precision.</p>
      <p>The other way is to use oating-point arithmetic in particle swarm algorithm
intrinsic mechanic. In this scenario, there is no need to convert values. If there
is any CPU present in system, the use of oating-point approach allows native
interchange of data between CPU and PSO block. However, this strategy implies
the use of specialized arithmetic blocks for addition and multiplication; therefore,
area demands increase.</p>
      <p>In context of random number generation, the only challenge oating-point
strategy poses is the generation of uniformly distributed value. The LFSR bus
output is uniformly distributed when treated as integer, but distribution of
oating-point number obtained by straightforward reading of LFSR bus
output is far from uniform. Several ways of generating the oating-point random
numbers are available, considering the possibility of xing the output range of
generator. For example, certain LSFR bits can be used as a signi cand, and the
exponent value being adjusted to make the resulting value just below the range
upper threshold.</p>
      <p>To prevent precision-related issues linked to representation of the values from
a ecting the behavior of the swarm, normalized search space approach can be
utilized. It is possible to scale the variables in the cost function in such way that
search space is transformed to [ 1; 1]m; in this case all agents are moving with
velocities of same order across all dimensions.</p>
      <p>
        As PSO is a method to solve multi-dimensional problems, to build the system
in genuine parallel fashion, there is need to implement independent calculation
subsystem not only for every agent, but for every dimension within agent block
as well. Equations (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) consist of 5 multiplications and 5 additions per
agent per dimension in it, which vastly increase the area demands even in
xedpoint approach. As it seems impractical, a nite state machine (FSM) based
scheduling can reduce this to 1 multiplication and 1 addition per agent. In case
of sequential strategy, i.e. when only one agent block is implemented, it is possible
to construct parallel scheme regarding dimensions only.
      </p>
      <p>The cost function evaluation can be a very computationally costly task. For
example, the evaluation of the sum of the squares of residuals</p>
      <p>
        Pn
i=0( (x(i); 0; 1; : : : ; m)
y(i))2
(
        <xref ref-type="bibr" rid="ref4">4</xref>
        )
where x(i) and y(i) are the coordinates of data points, i - independent
parameters of target function, demand calculations of (x) n times for every
candidate parameter set. (x) itself can be quite complex and can contain multiple
entries of resource intensive functions as exponentiation, trigonometric functions
etc. For example, it takes 21 clock cycles to perform logarithm function
calculation, commonly used in statistic-related tasks, in Intel FPGA dedicated IP-core
in single precision, versus 7 clock cycles to perform addition. In the majority of
applications, the evaluation of cost function demands more time than the
operations of PSO. Therefore, the performance increase achieved by construction
of parallel PSO system is relatively small especially considering the FPGA area
demands. Still, the absence of dependencies between candidate solutions allows
the implementation of multiple blocks of cost function evaluation. The major
setback is, again, the FPGA area requirements, as genuine parallel data ow
demands n independent cost function evaluation blocks. It is possible to construct
a system with fewer cost function blocks, however, dividing the agents into
subsets assigned to separate blocks and employing the scheduling system inside the
subset.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Validation Examples</title>
      <p>Using the considerations described above, the hardware version of Particle Swarm
Optimization was implemented on the Intel FPGA EP4CE115F29C7N (Cyclone
IV) platform, which is a part of Terasic DE2-115 development board.
Consecutive FSM-based scheduling and genuine parallel schemes have been implemented,
as well as mixed scheme with partial parallelism.
4.1</p>
      <p>Parallel strategy, xed-point approach
The multi-dimensional sphere function</p>
      <p>Pim=0(xi
ai)2
(5)
have been chosen as a cost function in the implementation of parallel PSO
system with xed-point approach. As there is no need for additional arithmetic
blocks neither for PSO nor for cost function evaluation, math operations were
implemented using standard VHDL operators. The results of the experimental
run are shown on Figure 3.</p>
      <p>As could be seen on the Figure 4 , the large portion of FPGA is used by the
system, although sphere function is relatively easy to implement. Least squares
method relies on equations that can be considered multidimensional sphere
equations. This allows the use of the genuine parallel xed-point PSO system to solve
curve tting problems.
have been chosen as a cost function in the implementation sequential PSO
system with oating-point approach[2]. The arithmetical operations, both for
PSO and for evaluation of cost function, were implemented using Intel FPGA
oating-point arithmetic IP-cores. Results of hardware test are shown on the
Figure 5. Algorithm converges after 223 iterations. Behavior of the hardware
system does not diverge from software modeling, and all of the launches ended
with an optimal value founded. As seen on the Figure 6, the area of the FPGA
employed by sequential system is considerably smaller than area employed by
parallel system, as expected, at expense of calculation speed. To estimate the
calculation time needed to evaluate cost function, behavioral simulation was
carried out in the ModelSim environment.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Di erent aspects of FPGA hardware implementation of particle swarm algorithm
are described in this paper. Both genuine parallel and sequential schemes of
PSO are viable and can be used to solve optimization problems. Intel FPGA
EP4CE115F29C7N (Cyclone IV) platform was used for implementation.
Fig. 5: Best achieved cost function value on selected iteration in sequential
oating-point PSO system.</p>
      <p>Fig. 6: FPGA resources utilized by hardware implementation of sequential
oating-point PSO system.</p>
      <p>Both simulation and hardware testing have veri ed the e ciency of the
hardware implementation. Results achieved by testing correspond to those achieved
by preliminary modelling. Achieved values are the global minimums of the
benchmark functions used in testing.</p>
      <p>The acceleration of the algorithm may be achieved by employing the
parallel calculations of vectors v and x as well as cost function value. There is
expected trade-o between FPGA area demands and performance boost achieved
by parallel computing. If the problem-speci c cost function is computationally
costly, e.g. curve tting problems or data mining problems, it takes the largest
portion of computation time to evaluate the value of cost function. Therefore,
there is no need to employ parallel scheme to boost the performance of agent
computing blocks. Instead, parallel blocks of cost function evaluation should be
implemented, if possible, to achieve performance increase. It takes 964 clock
cycles to perform one iteration of PSO in sequential hardware scheme. Algorithm
converges after 223 iterations (data averaged based on 50 runs), or 2.15 ms.
Thus, even the sequential hardware implementation of Particle Swarm
Algorithm is faster than software implementation due to the use of dedicated
arithmetic blocks. Hardware implementation of the PSO on FPGA allows the use
of the system with other devices, namely CPU portions of SoCs (systems on
chip) or soft-processors, which in turn allow construction of complex control and
management devices with built-in PSO system.</p>
      <p>Acknowledgments. This work was supported by the Ministry of Science and
Higher Education of the Russian Federation (The unique identi er of the project:
goszadanie No. 8.3134.2017/4.6).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Calazan</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nedjah</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mourelle</surname>
            ,
            <given-names>L.M.:</given-names>
          </string-name>
          <article-title>Parallel coprocessor for PSO</article-title>
          .
          <source>International Journal of High Performance Systems Architecture (4)</source>
          (
          <year>2011</year>
          )
          <fpage>233240</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suganthan</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacNish</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Benchmark functions for the CEC 2008 special session and competition on large scale global optimization</article-title>
          .
          <source>Tech. rep. (</source>
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Calazan</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nedjah</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mourelle</surname>
            ,
            <given-names>L.M.:</given-names>
          </string-name>
          <article-title>A hardware accelerator for Particle Swarm Optimization Applied Soft Computing</article-title>
          ,
          <string-name>
            <surname>Elsevier</surname>
          </string-name>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Press, W.,
          <string-name>
            <surname>Teukolsky</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vetterling</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Flannery</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Numerical Recipes: The Art of Scienti c Computing, Third Edition</article-title>
          . Cambridge University Press. (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>