FPGA Implementation Strategies for Efficient Machine Learning Systems

Cristian Randieri1, Valerio Francesco Puglisi2
1 eCampus University, Via Isimbardi, 10, Novedrate, 22060, Italy
2 University of Catania, Dept. of Mathematics and Computer Science, Catania, Italy

Abstract
The paper deals with the FPGA implementation of artificial neurons for Machine Learning systems. Machine Learning is nowadays used in many different application fields, and in many cases FPGA implementations represent the best compromise between performance and reduced power consumption. Modern FPGAs are equipped with specific circuits, called DSP blocks, suitable for the implementation of the multiply-and-accumulate operation; these circuits can be used to implement the synapses of artificial neurons. However, DSP blocks are not the only way to implement this operation. This paper compares artificial neuron implementations based on DSP blocks with CLB-based ones. Comparisons are performed in terms of hardware resources, timing, and power consumption. Results show that DSP-block-based neurons achieve the best performance in terms of power consumption and maximum frequency.

Keywords
Machine Learning, FPGA, Neural Networks

ICYRIME 2022: International Conference of Yearly Reports on Informatics, Mathematics, and Engineering. Catania, August 26-29, 2022
cristian.randieri@unicampus.it (C. Randieri); valeriopuglisi@unict.it (V. F. Puglisi)
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Machine Learning is a field of Artificial Intelligence based on statistical methods that improve the performance of algorithms in data pattern identification [1, 2, 3]. Machine Learning algorithms can be divided into three main categories: Supervised, Unsupervised, and Reinforcement Learning. The first two are characterized by separate training and inference phases; in Reinforcement Learning the two phases are not separated. In the last decades we have witnessed an incredible spread of Machine Learning both in research and in industry. The reasons are essentially two: the availability of data thanks to the diffusion of the internet, and the availability of new circuits and devices optimized for Machine Learning applications [4, 5, 6, 7, 8]. These two factors made possible the realization of Machine Learning systems that are increasingly used in different fields [9, 10, 11, 12, 13, 14]. Although Machine Learning systems were initially implemented mainly in remote data centers, in the last few years there has been an increasing diffusion of "Embedded Machine Learning". This paradigm involves the implementation of Machine Learning systems inside objects such as cars, wearable devices, and smartphones. In this scenario, FPGAs play a crucial role thanks to their reconfigurability and high computing power [15, 16, 17, 18, 19, 20]. Differently from what happens with microprocessors and GPUs, FPGA-based design requires RTL design capabilities: designers always have to consider aspects related to the hardware design, and this requires the ability to choose between different architectures and different hardware resources. This aspect is especially true in the design of Machine Learning systems, in which FPGA engineers must be able to identify the appropriate hardware resource for each operation [21]. This is the case of the FPGA implementation of artificial neurons, which represent the basic element of artificial Neural Networks. This paper compares artificial neuron implementations on FPGA considering DSP-block-based neurons and CLB-based ones. Comparisons are performed in terms of hardware resources, timing, and power consumption.

2. Background

An artificial neuron is composed of several synapses that perform the multiplication between the inputs and pre-calculated weights (obtained during the training phase), a multi-input adder, and an activation function. The block diagram of an artificial neuron is shown in Fig. 1.

Figure 1: Artificial Neuron block diagram composed of N synapses, a multi-input adder, and an activation function.

The critical elements in terms of hardware complexity are the synapses and the nonlinear function. However, the nonlinear function can be simplified by replacing the traditional Sigmoid function with the Satlin one. In terms of equations, the artificial neuron implements Eq. 1, where πœ‘ is the activation function, 𝑀ᡒ are the weights, and π‘₯α΅’ are the inputs:

    𝑦 = πœ‘(Ξ£α΅’ 𝑀ᡒ Β· π‘₯α΅’)    (1)

FPGA designers have essentially two possibilities for the implementation of the weights: the first consists of the use of DSP Blocks, while the second consists of the use of the FPGA Logic Blocks.

2.1. DSP Blocks

FPGAs are nowadays used for digital signal processing (DSP) applications thanks to their capability to implement custom, fully parallel data-paths [22]. DSP applications make use of multipliers and accumulators that are best implemented in dedicated DSP slices. Xilinx FPGAs, for example, have many dedicated, full-custom, low-power DSP slices optimized for high speed and small size.
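To make the discussion concrete, the multiply operation of a single synapse can be sketched in VHDL as a registered signed multiplication, a pattern that synthesis tools typically map onto a DSP slice. This is our own illustration, not code from the paper: the entity and port names are hypothetical, and the 16-bit width matches the datapath chosen in Section 2.3.

```vhdl
-- Illustrative sketch of one synapse: a registered 16x16 signed
-- multiplier. With default settings, synthesis typically infers a
-- DSP slice for this pattern. Names and widths are illustrative.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity synapse is
  port (
    clk : in  std_logic;
    x   : in  signed(15 downto 0);  -- neuron input
    w   : in  signed(15 downto 0);  -- pre-trained weight
    p   : out signed(31 downto 0)   -- weighted input w*x
  );
end entity;

architecture rtl of synapse is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      p <= w * x;  -- candidate for DSP-slice mapping
    end if;
  end process;
end architecture;
```

Registering the product, as above, also helps the tool use the internal pipeline registers of the DSP slice.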
During the design phase, designers can set the synthesizer to implement multiplications on these specific hardware resources. The number of DSP slices depends on the FPGA family: in general, more expensive FPGAs are equipped with a higher number of DSP slices than cheaper ones. Fig. 2 shows the block diagram of a XILINX DSP block provided in 7 series FPGAs.

2.2. FPGA Logic Blocks

Another possibility for the implementation of the multipliers consists in the use of the logic blocks contained in the FPGA slices (with reference to the XILINX architecture). Logic blocks represent the basic elements of an FPGA used for the implementation of switching functions. In general, a multiplier, like any other digital circuit, can be expressed in terms of switching functions. During the mapping phase of the design flow, the IDE used for the FPGA implementation maps the switching functions onto the logic blocks. The logic blocks are usually based on LUTs (Lookup Tables). Fig. 3 shows a simplified Logic Block. It is composed of a LUT, a MUX, and a D Flip-Flop. The LUT is used for the implementation of the switching functions, while the MUX manages the Flip-Flop, which can be used either to make the switching function implemented in the LUT sequential, or as a single Flip-Flop for the realization of fully sequential circuits.

This paper investigates the implementation of artificial neurons using both DSP blocks and Logic Blocks.

2.3. Methods

In order to perform the comparison, we coded an artificial neuron in VHDL at the RTL level. The number of synapses and the number of bits are application dependent; consequently, we chose a specific setup for our experiment: 3 inputs (which implies three synapses) and a 16-bit datapath.

Concerning the activation function, we replaced the sigmoid function with the satlins one. This considerably simplifies the hardware complexity of the circuit: the implementation of the sigmoid implies the use of a LUT, while the satlin function can be easily implemented using a multiplexer and comparators.

Fig. 4 shows the synthesis options that give the designer the possibility to choose how to synthesize the multipliers through the -max_dsp setting. There are three main possibilities:

β€’ With -max_dsp = 0 the synthesizer implements the multipliers using the FPGA Logic Blocks;
β€’ With -max_dsp = -1 the synthesizer autonomously chooses the implementation strategy;
β€’ With -max_dsp = N, with N > 0, the designer chooses the maximum number of DSP blocks involved in the design.
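As noted above, the satlins function reduces to two comparators and a multiplexer. A minimal VHDL sketch follows (our own illustration, not the paper's code): the 18-bit width and the Q3.14 encoding of Β±1 are assumptions made purely for the example.

```vhdl
-- Illustrative sketch of the satlins activation: two comparators and
-- a multiplexer clip the adder output to [-1, +1].
-- Width (18 bits) and fixed-point scaling (Q3.14) are assumptions.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity satlins is
  port (
    s : in  signed(17 downto 0);  -- multi-input adder output
    y : out signed(17 downto 0)   -- clipped activation output
  );
end entity;

architecture rtl of satlins is
  constant ONE : signed(17 downto 0) := to_signed(16384, 18); -- +1.0 in Q3.14
begin
  y <= ONE  when s > ONE  else   -- saturate high
      -ONE  when s < -ONE else   -- saturate low
       s;                        -- linear region
end architecture;
```

Unlike a LUT-based sigmoid, no memory is needed: the whole function maps to a handful of LUTs in the logic fabric.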
If, for example, a project requires 8 multipliers and -max_dsp = 2, then 2 multipliers will be implemented through DSP blocks while the other 6 will be implemented using Logic Blocks.

An alternative method to set this synthesis option consists in writing a tcl command as follows:

    set_property STEPS.SYNTH_DESIGN.ARGS.MAX_DSP 0 [get_runs synth_1]

Figure 2: XILINX DSP block schematic.

Figure 3: XILINX simplified Logic Block composed of a LUT, a MUX, and a D Flip-Flop.

3. Experimental Results

The two architectures, the one using DSP blocks and the one implementing the multipliers with Logic Blocks, have been synthesized and implemented using the VIVADO toolchain on a Kintex-7 FPGA device. In addition, after the implementation, a post-implementation test bench has been run for two main reasons:

β€’ to check the correct behavior of the artificial neuron;
β€’ to obtain an accurate power consumption estimation using the SAIF file.

Figure 4: XILINX VIVADO synthesis options. The max_dsp setting allows the designer to choose the implementation strategy for the multipliers.

The post-implementation simulation has been performed using the VIVADO simulator tool. Simulations have been performed by injecting random signals at the input of the DUT (the neuron); the inputs have been provided through a txt file. This simulation has been used not only to prove the correct behavior of the implemented system but also to generate the so-called SAIF file for the power estimation. The Switching Activity Interchange Format (SAIF) is an ASCII file that captures the switching activity in the design, and it is needed by the Vivado power estimation tool. In fact, as known from theory, the power consumption of CMOS circuits is the composition of three main terms; among them, the switching power represents the most important one, and it depends on the switching activity of the circuit nodes [23]. It is defined in Eq. 2:

    𝑃 = 𝛼 𝐢 𝑓 𝑉𝑑𝑑²    (2)

where 𝛼 is the switching activity, 𝐢 is the switching capacitance, 𝑓 is the clock frequency, and 𝑉𝑑𝑑 the supply voltage. The post-implementation simulation allows the estimation of the 𝛼 parameter.

Power consumption results are shown in Tab. 1. DSP-based neurons are characterized by a reduced power consumption. This feature is very important for embedded systems that are not directly connected to the power grid and are powered by batteries or energy harvesting sources.

Table 1
Power consumption (measured at 100 MHz)

    DSP BLOCKS    LOGIC BLOCKS
    2.1 mW        3.5 mW

The maximum frequency reachable by the two architectures is shown in Tab. 2. Also in this case the best performance is obtained by implementing the synapses using DSP blocks instead of Logic Blocks.

Table 2
Maximum frequency

    DSP BLOCKS    LOGIC BLOCKS
    110 MHz       89 MHz
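The SAIF-based measurement flow described above can be sketched in tcl. This is our own reconstruction, not the paper's script: file names, the testbench hierarchy, and the run name are assumptions.

```tcl
# Illustrative sketch of the vector-based power flow (assumed names).

# In the Vivado simulator (xsim), during post-implementation simulation:
open_saif neuron.saif                       ;# start recording switching activity
log_saif [get_objects -r /tb_neuron/dut/*]  ;# log the signals of the DUT
run 10 us
close_saif                                  ;# write the SAIF file

# Back in Vivado, on the implemented design:
open_run impl_1
read_saif neuron.saif                       ;# annotate switching activity (alpha)
report_power -file power.rpt                ;# SAIF-driven power estimation
```

With the SAIF annotation, report_power uses measured switching activity instead of default toggle-rate assumptions, which is what makes the estimate in Tab. 1 accurate.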
4. Conclusion

In this paper, we investigated the implementation of artificial neurons on FPGA. Modern FPGAs offer the possibility to implement multiplications on specific DSP blocks, and this feature fits perfectly with the implementation of artificial neural networks, which are composed of artificial neurons requiring multiplications. Experiments have been performed to compare artificial neurons implemented with DSP blocks against artificial neurons implemented with Logic-Block-based multipliers. Results show that DSP-block-based neurons achieve the best performance in terms of power consumption and maximum frequency.

References

[1] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, in: CEUR Workshop Proceedings, volume 2768, 2020, pp. 41–45.
[2] G. Capizzi, G. Lo Sciuto, M. Woźniak, R. Damaševičius, A clustering based system for automated oil spill detection by satellite remote sensing, in: Artificial Intelligence and Soft Computing: 15th International Conference, ICAISC 2016, Zakopane, Poland, June 12-16, 2016, Proceedings, Part II 15, Springer, 2016, pp. 613–623.
[3] N. Brandizzi, V. Bianco, G. Castro, S. Russo, A. Wajda, Automatic rgb inference based on facial emotion recognition, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 66–74.
[4] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, et al., Puma: A programmable ultra-efficient memristor-based accelerator for machine learning inference, in: Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 715–731.
[5] R. Brociek, G. Magistris, F. Cardia, F. Coppa, S. Russo, Contagion prevention of covid-19 by means of touch detection for retail stores, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 89–94.
[6] S. Acciarito, A. Cristini, L. Di Nunzio, G. M. Khanal, G. Susi, An aVLSI driving circuit for memristor-based stdp, in: 2016 12th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME), IEEE, 2016, pp. 1–4.
[7] G. Lo Sciuto, G. Susi, G. Cammarata, G. Capizzi, A spiking neural network-based model for anaerobic digestion process, in: 2016 International Symposium on Power Electronics, Electrical Drives, Automation and Motion (SPEEDAM), IEEE, 2016, pp. 996–1003.
[8] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, An advanced neural network based solution to enforce dispatch continuity in smart grids, Applied Soft Computing 62 (2018) 768–775.
[9] R. Giuliano, The next generation network in 2030: Applications, services, and enabling technologies, in: 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), IEEE, 2021, pp. 294–298.
[10] G. De Magistris, S. Russo, P. Roma, J. Starczewski, C. Napoli, An explainable fake news detector based on named entity recognition and stance classification applied to covid-19, Information (Switzerland) 13 (2022). doi:10.3390/info13030137.
[11] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessening stress and anxiety-related behaviors by means of ai-driven drones for aromatherapy, in: CEUR Workshop Proceedings, volume 2594, 2020, pp. 7–12.
[12] N. Dat, V. Ponzi, S. Russo, F. Vincelli, Supporting impaired people with a following robotic assistant by means of end-to-end visual target navigation and reinforcement learning approaches, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 51–63.
[13] V. Ponzi, S. Russo, V. Bianco, C. Napoli, A. Wajda, Psychoeducative social robots for an healthier lifestyle using artificial intelligence: a case-study, in: CEUR Workshop Proceedings, volume 3118, 2021, pp. 26–33.
[14] R. Aureli, N. Brandizzi, G. Magistris, R. Brociek, A customized approach to anomalies detection by using autoencoders, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 53–59.
[15] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, L. Wang, A high performance fpga-based accelerator for large-scale convolutional neural networks, in: 2016 26th International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2016, pp. 1–9.
[16] G. Magistris, C. Rametta, G. Capizzi, C. Napoli, Fpga implementation of a parallel dds for wide-band applications, in: CEUR Workshop Proceedings, volume 3092, 2021, pp. 12–16.
[17] L. Alzubaidi, J. Zhang, A. J. Humaidi, A. Al-Dujaili, Y. Duan, O. Al-Shamma, J. SantamarΓ­a, M. A. Fadhel, M. Al-Amidie, L. Farhan, Review of deep learning: Concepts, cnn architectures, challenges, applications, future directions, Journal of Big Data 8 (2021) 1–74.
[18] C. Ciancarelli, G. De Magistris, S. Cognetta, D. Appetito, C. Napoli, D. Nardi, A gan approach for anomaly detection in spacecraft telemetries, Lecture Notes in Networks and Systems 531 LNNS (2023) 393–402. doi:10.1007/978-3-031-18050-7_38.
[19] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, M. Panella, M. Re, A. Rosato, S. Spanò, A parallel hardware implementation for 2-d hierarchical clustering based on fuzzy logic, IEEE Transactions on Circuits and Systems II: Express Briefs 68 (2020) 1428–1432.
[20] C. Napoli, G. De Magistris, C. Ciancarelli, F. Corallo, F. Russo, D. Nardi, Exploiting wavelet recurrent neural networks for satellite telemetry data modeling, prediction and control, Expert Systems with Applications 206 (2022). doi:10.1016/j.eswa.2022.117831.
[21] F. Silvestri, S. Acciarito, G. C. Cardarilli, G. M. Khanal, L. Di Nunzio, R. Fazzolari, M. Re, Fpga implementation of a low-power qrs extractor, in: Applications in Electronics Pervading Industry, Environment and Society: APPLEPIES 2017 6, Springer, 2019, pp. 9–15.
[22] Xilinx, 7 series dsp48e1 slice (2018).
[23] N. H. Weste, D. Harris, CMOS VLSI design: a circuits and systems perspective, Pearson Education India, 2015.