FPGA Implementation Strategies for Efficient Machine Learning Systems

Cristian Randieri1, Valerio Francesco Puglisi2
1 eCampus University, Via Isimbardi, 10, Novedrate, 22060, Italy
2 University of Catania, Dept. of Mathematics and Computer Science, Catania, Italy

Abstract
The paper deals with the FPGA implementation of artificial neurons for Machine Learning systems. Machine Learning is nowadays used in many different application fields, and in many cases FPGA implementations represent the best compromise between performance and reduced power consumption. Modern FPGAs are equipped with specific circuits, called DSP blocks, suitable for the implementation of the multiply-and-accumulate operation; these circuits can be used to implement the synapses of artificial neurons. However, DSP blocks are not the only way to implement this operation. This paper compares artificial neuron implementations based on DSP blocks with CLB-based ones. Comparisons are performed in terms of hardware resources, timing, and power consumption. Results show that DSP-block-based neurons achieve the best performance in terms of power consumption and maximum frequency.

Keywords
Machine Learning, FPGA, Neural Networks

ICYRIME 2022: International Conference of Yearly Reports on Informatics, Mathematics, and Engineering. Catania, August 26-29, 2022
cristian.randieri@unicampus.it (C. Randieri); valeriopuglisi@unict.it (V. F. Puglisi)
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Machine Learning is a field of Artificial Intelligence based on statistical methods that improve the performance of algorithms in data pattern identification [1, 2, 3]. Machine Learning algorithms can be divided into three main categories: Supervised, Unsupervised, and Reinforcement Learning. The first two are characterized by separate training and inference phases; in Reinforcement Learning the two phases are not separated. In the last decades we have witnessed an incredible spread of Machine Learning both in research and in industry. The reasons are essentially two: the availability of data thanks to the diffusion of the internet, and the availability of new circuits and devices optimized for Machine Learning applications [4, 5, 6, 7, 8]. These two factors made possible the realization of Machine Learning systems that are increasingly used in different fields [9, 10, 11, 12, 13, 14]. Although Machine Learning systems were initially implemented mainly in remote data centers, in the last few years there has been an increasing diffusion of "Embedded Machine Learning". This paradigm involves the implementation of Machine Learning systems inside objects such as cars, wearable devices, and smartphones. In this scenario, FPGAs play a crucial role thanks to their reconfigurability and high computing power [15, 16, 17, 18, 19, 20]. Differently from what happens with microprocessors and GPUs, FPGA-based design requires RTL design capabilities: designers always have to consider aspects related to the hardware design, and this requires the ability to choose between different architectures and different hardware resources. This aspect is especially true in the design of Machine Learning systems, in which FPGA engineers must be able to identify the appropriate hardware resource for each operation [21]. This is the case of the FPGA implementation of artificial neurons, which represent the basic element of artificial Neural Networks. This paper compares artificial neuron implementations on FPGA considering DSP-block-based neurons and CLB-based ones. Comparisons are performed in terms of hardware resources, timing, and power consumption.

2. Background

An artificial neuron is composed of several synapses that perform the multiplication between the inputs and pre-calculated weights (obtained during the training phase), a multi-input adder, and an activation function. The block diagram of an artificial neuron is shown in Fig. 1.

Figure 1: Artificial Neuron block diagram composed of N synapses, a multi-input adder, and an activation function.

The critical elements in terms of hardware complexity are the synapses and the nonlinear function. However, the nonlinear function can be simplified by replacing the traditional Sigmoid function with the Satlin one. In terms of equations, the artificial neuron implements Eq. 1, where πœ‘ is the activation function, 𝑀ᡒ are the weights, and π‘₯α΅’ are the inputs:

    𝑦 = πœ‘(Ξ£α΅’ 𝑀ᡒ Β· π‘₯α΅’)    (1)

FPGA designers have essentially two possibilities for the implementation of the weights: the first consists of the use of DSP Blocks, while the second consists of the use of the FPGA Logic Blocks.

2.1. DSP Blocks

FPGAs are nowadays used for digital signal processing (DSP) applications thanks to their capability to implement custom, fully parallel data-paths [22]. DSP applications make use of multipliers and accumulators that are best implemented in dedicated DSP slices. Xilinx FPGAs, for example, have many dedicated, full-custom, low-power DSP slices optimized for high speed and small size.
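To make the discussion concrete, the multiply operation of a single synapse can be sketched in VHDL as a registered signed multiplication, a pattern that synthesis tools typically map onto a DSP slice. This is our own illustration, not code from the paper: the entity and port names are hypothetical, and the 16-bit width matches the datapath chosen in Section 2.3.

```vhdl
-- Illustrative sketch of one synapse: a registered 16x16 signed
-- multiplier. With default settings, synthesis typically infers a
-- DSP slice for this pattern. Names and widths are illustrative.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity synapse is
  port (
    clk : in  std_logic;
    x   : in  signed(15 downto 0);  -- neuron input
    w   : in  signed(15 downto 0);  -- pre-trained weight
    p   : out signed(31 downto 0)   -- weighted input w*x
  );
end entity;

architecture rtl of synapse is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p <= w * x;  -- candidate for DSP-slice mapping
    end if;
  end process;
end architecture;
```

Registering the product, as above, also helps the tool use the internal pipeline registers of the DSP slice.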
During the design phase, designers can set the synthesizer to implement multiplications on these specific hardware resources. The number of DSP slices depends on the FPGA family: in general, more expensive FPGAs are equipped with a higher number of DSP slices than cheaper ones. Fig. 2 shows the block diagram of a XILINX DSP block provided in 7 series FPGAs.

2.2. FPGA Logic Blocks

Another possibility for the implementation of the multipliers consists in the use of the logic blocks contained in the FPGA slices (with reference to the XILINX architecture). Logic blocks represent the basic elements of an FPGA used for the implementation of switching functions. In general, a multiplier, like any other digital circuit, can be expressed in terms of switching functions. During the mapping phase of the design flow, the IDE used for the FPGA implementation maps the switching functions onto the logic blocks. The logic blocks are usually based on LUTs (Lookup Tables). Fig. 3 shows a simplified Logic Block. It is composed of a LUT, a MUX, and a D Flip-Flop. The LUT is used for the implementation of the switching functions, while the MUX manages the Flip-Flop, which can be used either to make the switching function implemented in the LUT sequential, or as a single Flip-Flop for the realization of fully sequential circuits.

This paper investigates the implementation of artificial neurons using both DSP blocks and Logic Blocks.

2.3. Methods

In order to perform the comparison, we coded an artificial neuron in VHDL at the RTL level. The number of synapses and the number of bits are application dependent; consequently, we chose a specific setup for our experiment: 3 inputs (which implies three synapses) and a 16-bit datapath.

Concerning the activation function, we replaced the sigmoid function with the satlins one. This considerably simplifies the hardware complexity of the circuit: the implementation of the sigmoid implies the use of a LUT, while the satlin function can be easily implemented using a multiplexer and comparators.

Fig. 4 shows the synthesis options that give the designer the possibility to choose how to synthesize the multipliers through the -max_dsp setting. There are three main possibilities:

β€’ With -max_dsp = 0 the synthesizer implements the multipliers using the FPGA Logic Blocks;
β€’ With -max_dsp = -1 the synthesizer autonomously chooses the implementation strategy;
β€’ With -max_dsp = N, with N > 0, the designer chooses the maximum number of DSP blocks involved in the design.
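As noted above, the satlins function reduces to two comparators and a multiplexer. A minimal VHDL sketch follows (our own illustration, not the paper's code): the 18-bit width and the Q3.14 encoding of Β±1 are assumptions made purely for the example.

```vhdl
-- Illustrative sketch of the satlins activation: two comparators and
-- a multiplexer clip the adder output to [-1, +1].
-- Width (18 bits) and fixed-point scaling (Q3.14) are assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity satlins is
  port (
    s : in  signed(17 downto 0);  -- multi-input adder output
    y : out signed(17 downto 0)   -- clipped activation output
  );
end entity;

architecture rtl of satlins is
  constant ONE : signed(17 downto 0) := to_signed(16384, 18); -- +1.0 in Q3.14
begin
  y <= ONE  when s > ONE  else   -- saturate high
      -ONE  when s < -ONE else   -- saturate low
       s;                        -- linear region
end architecture;
```

Unlike a LUT-based sigmoid, no memory is needed: the whole function maps to a handful of LUTs in the logic fabric.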
If, for example, a project requires 8 multipliers and -max_dsp = 2, then 2 multipliers will be implemented through DSP blocks while the other 6 will be implemented using Logic Blocks.

An alternative method to set this synthesis option consists in writing a tcl command as follows:

    set_property STEPS.SYNTH_DESIGN.ARGS.MAX_DSP 0 [get_runs synth_1]

Figure 2: XILINX DSP block schematic.

Figure 3: XILINX simplified Logic Block composed of a LUT, a MUX, and a D Flip-Flop.

3. Experimental Results

The two architectures, the one using DSP blocks and the one implementing the multipliers with Logic Blocks, have been synthesized and implemented using the VIVADO toolchain on a Kintex-7 FPGA device. In addition, after the implementation, a post-implementation test bench has been run for two main reasons:

β€’ to check the correct behavior of the artificial neuron;
β€’ to obtain an accurate power consumption estimation using the SAIF file.

Figure 4: XILINX VIVADO synthesis options. The max_dsp setting allows the designer to choose the implementation strategy for the multipliers.

The post-implementation simulation has been performed using the VIVADO simulator tool. Simulations have been performed by injecting random signals at the input of the DUT (the neuron); the inputs have been provided through a txt file. This simulation has been used not only to prove the correct behavior of the implemented system but also to generate the so-called SAIF file for the power estimation. The Switching Activity Interchange Format (SAIF) is an ASCII file that captures the switching activity in the design, and it is needed by the Vivado power estimation tool. In fact, as known from theory, the power consumption of CMOS circuits is the composition of three main terms; among them, the switching power represents the most important one, and it depends on the switching activity of the circuit nodes [23]. It is defined in Eq. 2:

    𝑃 = 𝛼 𝐢 𝑓 𝑉𝑑𝑑²    (2)

where 𝛼 is the switching activity, 𝐢 is the switching capacitance, 𝑓 is the clock frequency, and 𝑉𝑑𝑑 the supply voltage. The post-implementation simulation allows the estimation of the 𝛼 parameter.

Power consumption results are shown in Tab. 1. DSP-based neurons are characterized by a reduced power consumption. This feature is very important for embedded systems that are not directly connected to the power grid and are powered by batteries or energy harvesting sources.

Table 1
Power consumption (measured at 100 MHz)

    DSP BLOCKS    LOGIC BLOCKS
    2.1 mW        3.5 mW

The maximum frequency reachable by the two architectures is shown in Tab. 2. Also in this case the best performance is obtained by implementing the synapses using DSP blocks instead of Logic Blocks.

Table 2
Maximum frequency

    DSP BLOCKS    LOGIC BLOCKS
    110 MHz       89 MHz
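The SAIF-based measurement flow described above can be sketched in tcl. This is our own reconstruction, not the paper's script: file names, the testbench hierarchy, and the run name are assumptions.

```tcl
# Illustrative sketch of the vector-based power flow (assumed names).

# In the Vivado simulator (xsim), during post-implementation simulation:
open_saif neuron.saif                       ;# start recording switching activity
log_saif [get_objects -r /tb_neuron/dut/*]  ;# log the signals of the DUT
run 10 us
close_saif                                  ;# write the SAIF file

# Back in Vivado, on the implemented design:
open_run impl_1
read_saif neuron.saif                       ;# annotate switching activity (alpha)
report_power -file power.rpt                ;# SAIF-driven power estimation
```

With the SAIF annotation, report_power uses measured switching activity instead of default toggle-rate assumptions, which is what makes the estimate in Tab. 1 accurate.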
4. Conclusion

In this paper, we investigated the implementation of artificial neurons on FPGA. Modern FPGAs offer the possibility to implement multiplications on specific DSP blocks, and this feature fits perfectly with the implementation of artificial neural networks, which are composed of artificial neurons requiring multiplications. Experiments have been performed to compare artificial neurons implemented with DSP blocks against artificial neurons implemented with Logic-Block-based multipliers. Results show that DSP-block-based neurons achieve the best performance in terms of power consumption and maximum frequency.

References

[1] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, in: CEUR Workshop Proceedings, volume 2768, 2020, pp. 41–45.
[2] G. Capizzi, G. Lo Sciuto, M. Woźniak, R. Damaševičius, A clustering based system for automated oil spill detection by satellite remote sensing, in: Artificial Intelligence and Soft Computing: 15th International Conference, ICAISC 2016, Zakopane, Poland, June 12-16, 2016, Proceedings, Part II 15, Springer, 2016, pp. 613–623.
[3] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 66–74.
[4] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, et al., Puma: A programmable ultra-efficient memristor-based accelerator for machine learning inference, in: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 715–731.
[5] R. Brociek, G. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 89–94.
[6] S. Acciarito, A. Cristini, L. Di Nunzio, G. M. Khanal, G. Susi, An aVLSI driving circuit for memristor-based stdp, in: 2016 12th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME), IEEE, 2016, pp. 1–4.
[7] G. Lo Sciuto, G. Susi, G. Cammarata, G. Capizzi, A spiking neural network-based model for anaerobic digestion process, in: 2016 International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), IEEE, 2016, pp. 996–1003.
[8] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, An advanced neural network based solution to enforce dispatch continuity in smart grids, Applied Soft Computing 62 (2018) 768–775.
[9] R. Giuliano, The next generation network in 2030: Applications, services, and enabling technologies, in: 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), IEEE, 2021, pp. 294–298.
[10] G. De Magistris, S. Russo, P. Roma, J. Starczewski, C. Napoli, An explainable fake news detector based on named entity recognition and stance classification applied to covid-19, Information (Switzerland) 13 (2022). doi:10.3390/info13030137.
[11] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessening stress and anxiety-related behaviors by means of ai-driven drones for aromatherapy, in: CEUR Workshop Proceedings, volume 2594, 2020, pp. 7–12.
[12] N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 51–63.
[13] V. Ponzi, S. Russo, V. Bianco, C. Napoli, A. Wajda, Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 26–33.
[14] R. Aureli, N. Brandizzi, G. Magistris, R. Brociek, A customized approach to anomalies detection by using autoencoders, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 53–59.
[15] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang, A high performance fpga-based accelerator for large-scale convolutional neural networks, in: 2016 26th International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2016, pp. 1–9.
[16] G. Magistris, C. Rametta, G. Capizzi, C. Napoli, Fpga implementation of a parallel dds for wide-band applications, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 12–16.
[17] L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. SantamarΓ­a, M. A. Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: Concepts, cnn architectures, challenges, applications, future directions, Journal of Big Data 8 (2021) 1–74.
[18] C. Ciancarelli, G. De Magistris, S. Cognetta, D. Appetito, C. Napoli, D. Nardi, A gan approach for anomaly detection in spacecraft telemetries, Lecture Notes in Networks and Systems 531 LNNS (2023) 393–402. doi:10.1007/978-3-031-18050-7_38.
[19] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, M. Panella, M. Re, A. Rosato, S. Spanò, A parallel hardware implementation for 2-d hierarchical clustering based on fuzzy logic, IEEE Transactions on Circuits and Systems II: Express Briefs 68 (2020) 1428–1432.
[20] C. Napoli, G. De Magistris, C. Ciancarelli, F. Corallo, F. Russo, D. Nardi, Exploiting wavelet recurrent neural networks for satellite telemetry data modeling, prediction and control, Expert Systems with Applications 206 (2022). doi:10.1016/j.eswa.2022.117831.
[21] F. Silvestri, S. Acciarito, G. C. Cardarilli, G. M. Khanal, L. Di Nunzio, R. Fazzolari, M. Re, Fpga implementation of a low-power qrs extractor, in: Applications in Electronics Pervading Industry, Environment and Society: APPLEPIES 2017 6, Springer, 2019, pp. 9–15.
[22] Xilinx, 7 series dsp48e1 slice (2018).
[23] N. H. Weste, D. Harris, CMOS VLSI design: a circuits and systems perspective, Pearson Education India, 2015.