A Survey on FPGA-based Deep Neural Network Accelerators

Mingyuan Li1, Hengyi Li2, and Lin Meng2

1 School of Computer Science and Information Engineering, Hefei University of Technology, Xuancheng, Anhui, China. limingyuan0827@163.com
2 Dept. of Electronic and Computer Engineering, Ritsumeikan University, Kusatsu, Shiga, Japan. {gr0468kx@ed,menglin@fc}.ritsumei.ac.jp

Abstract

Deep learning technologies have achieved great success in applying deep neural networks (DNNs) to multiple domains. However, their high computational and memory demands have become a heavy burden on the utilization of deep learning, especially on resource-constrained platforms. A potential solution is the FPGA, which provides effective means for optimizing and accelerating DNNs. The development of DNN applications with FPGA accelerators has therefore become an important field of research. In this paper, existing optimization techniques are evaluated to provide a comprehensive overview of FPGA-based DNN accelerators. The review addresses software- and hardware-level acceleration techniques, including, but not limited to, model compression, parameter quantization, and energy-efficient structural design.

1 Introduction

In recent years, deep neural networks (DNNs) have made substantial progress across a broad array of applications with excellent performance. Such applications include computer vision tasks, natural language processing, protection of cultural heritage [1, 2], and many others. The flexibility of DNNs thus brings great convenience to modern life. However, their demand for computation and memory, in both quantity and complexity, makes deployment a heavy burden on resource-constrained hardware platforms such as robots and mobile devices. While training can be performed on powerful devices such as GPUs, inference often has to run on these resource-limited platforms. At the same time, research has revealed that there is massive redundancy in a given DNN's operations [3]. Therefore, research on the optimization and acceleration of DNNs has grown increasingly prominent.

One promising avenue is the field programmable gate array (FPGA), which provides an effective solution for DNN acceleration. FPGAs offer superior energy efficiency compared with GPUs and CPUs. Meanwhile, although deeper networks achieve higher accuracy, they also greatly increase the number of parameters and the model size, and they demand more computation, bandwidth, and storage, so DNNs place a heavy strain on resource-constrained devices. Given the rapid evolution of DNNs, the reprogrammable and reconfigurable nature of FPGAs makes FPGA-based devices well suited to supporting them. FPGAs also feature high throughput, low power consumption, and highly parallel execution, which makes their performance excellent for DNNs. In particular, the latest Intel Agilex FPGAs feature an improved chip layout and an optimized architecture, resulting in considerably enhanced flexibility and stability. On this basis, FPGAs can accelerate DNNs with a high level of efficiency on edge devices, and research has proposed multiple techniques for further performance improvement. In this paper, we conduct a survey of FPGA-based optimization for DNNs.
The optimization techniques detailed in this survey can be divided into two categories according to where they are applied: the software level, based on algorithms, and the hardware level, based on the FPGA itself. The remainder of the paper introduces optimization methods at the software level, expands on acceleration techniques based on the FPGA architecture, and closes with a summary conclusion.

2 DNN optimization at the software level

Multiple software-level techniques have been proposed to improve the efficiency of DNNs. This section gives an overview of software-level optimizations for DNNs.

2.1 Pruning and quantization

Pruning and quantization are effective ways to compress neural networks. Network pruning removes redundant connections while ensuring that the remaining connections stay effective, thereby improving efficiency. Data quantization quantizes the DNN model parameters by replacing the floating-point representation with a fixed-point representation or by reducing the number of bits used for the representation. Experimental verification shows that the quantized data has little impact on the accuracy of the models. At the same time, because FPGAs are not well suited to floating-point operations, and in order to optimize the DNN parameters, data quantization is also essential for DNN models deployed on FPGAs.

Binarization networks are a particularly effective solution, and many researchers have studied them. Rastegari et al. [4] proposed two kinds of binarized networks that reduce model storage through binarization of the weights. The first, Binary-Weight-Networks, approximates the weights with binary values so that the convolution operation can be estimated using only additions and subtractions. The second, XNOR-Networks, binarizes both the weights and the inputs of the convolutional and fully connected layers, so the convolution can be estimated with XNOR and bit-counting operations. Binary-Weight-Networks achieve about 32× memory savings, and XNOR-Networks make convolutions about 58× faster with 32× memory savings.

Currently, 32-bit floating-point data does not perform well on DNN FPGA accelerators, so most advanced accelerators replace 32-bit floating-point data with lower-precision fixed-point representations. In [5], Podili et al. proposed replacing 32-bit floating-point data with 32-bit fixed-point data. Qiu et al. [6] proposed using 16-bit fixed-point data in place of 32-bit floating-point data, and Guo et al. [7] proposed a data quantization strategy that reduces the bit-width down to 8 bits with negligible accuracy loss. These reductions in bit-width greatly improve the efficiency of computation without decreasing accuracy.

By removing insignificant channels of the network and quantizing the weights and biases expressed as (high-precision) floating-point numbers into low-precision integers, the size of the models and the computation demand can be greatly reduced; both are effective means of DNN model compression.
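To make the fixed-point idea concrete, the following sketch performs a minimal symmetric post-training quantization of a weight tensor to signed 8-bit values in NumPy. It is an illustrative simplification, not the exact scheme of [5], [6], or [7]; the function names, the per-tensor scale, and the rounding policy are our own assumptions.

```python
import numpy as np

def quantize_fixed_point(weights, num_bits=8):
    """Symmetric per-tensor quantization of float weights to signed
    fixed-point integers (a minimal post-training sketch)."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8-bit
    scale = np.max(np.abs(weights)) / qmax   # float value of one integer step
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor for accuracy checks."""
    return q.astype(np.float32) * scale

# Example: quantize a random 3x3 convolution kernel and measure the error.
w = np.random.randn(16, 3, 3, 3).astype(np.float32)
q, s = quantize_fixed_point(w, num_bits=8)
err = np.max(np.abs(w - dequantize(q, s)))
print(f"scale={s:.6f}, max abs quantization error={err:.6f}")
```

On an FPGA, only the integer tensor and its scale need to be stored; multiplications then become integer MACs, and the scale is folded back into the result afterwards.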
2.2 Knowledge distillation

Knowledge distillation transfers "dark knowledge", what a deep model has actually learned, from a complex (teacher) model to a simple (student) model by minimizing a loss function. Generally speaking, the teacher model has strong capacity and performance, while the student model is more compact. Through knowledge distillation, the hope is that the student approaches or even surpasses the teacher, so as to obtain similar prediction accuracy with far less complexity.

Hinton et al. [8] adopted a strategy of feature matching at the softmax layer; in essence, the softmax output is used as supervision. To make the score vector softer, a distillation temperature T is added to the softmax layer, which improves the performance of distillation. The teacher is trained at T = 1; its softmax output probabilities at a higher temperature are then used as soft labels and fused with the hard labels to supervise the student, with the two losses weighted against each other. With 61 specialist models, they report a 4.4 percent relative improvement in overall test accuracy.

Zagoruyko et al. [9] argued that directly transferring feature maps as knowledge from teacher to student is too rigid and yields poor results; instead, the student should attend to the regions the teacher attends to. They therefore take the absolute values of the feature planes of the different channels in a feature map, raise them to a power, and sum them to form an attention map, then minimize the distance between the teacher's and the student's attention maps. This works better than directly transferring feature maps.

Yang et al. [10] argued that hard labels lead to overfitting, whereas soft labels contribute to the generalization ability of the model. They proposed a method that does not compute an additional loss over all classes but selects only the few classes with the highest confidence scores. When training the teacher, a constraint is added to the teacher's loss to enable this selection; when training the student, the teacher's previously obtained soft labels are combined with the hard labels. Experiments show that this method improves classification performance on the evaluated datasets by 3 to 8 percent.

2.3 Low-rank matrix factorization

While DNNs have achieved tremendous success in many tasks, training these networks is expensive in time and resources. One major reason is the very large number of parameters being trained, and low-rank factorization is an effective method for reducing it. Sainath et al. [11] proposed a low-rank matrix factorization of the final weight layer and applied this technique to DNNs for both acoustic modeling and language modeling, reducing the number of parameters of the network by 30 to 50 percent.

For simpler DNN models, several low-rank approximation and clustering schemes for the convolutional kernels were proposed in [12]. They exploited the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation, achieving a 2× speedup for a single convolutional layer with a 1 percent drop in classification accuracy. The work in [13] proposed different tensor decomposition schemes, exploiting cross-channel or filter redundancy to construct a low-rank basis of filters that are rank-1 in the spatial domain, and reported a 4.5× speedup with a 1 percent drop in accuracy on text recognition.
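As a concrete illustration of the low-rank idea, the sketch below factorizes a fully connected weight matrix with a truncated SVD, replacing one m × n layer by two layers of rank r. The rank, the matrix sizes, and the NumPy formulation are illustrative assumptions rather than the exact procedures of [11, 12, 13].

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate an (m x n) weight matrix W as A @ B, with
    A: (m x rank) and B: (rank x n), using a truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into A
    B = Vt[:rank, :]
    return A, B

# Example: a 512 x 2048 layer compressed to rank 64.
m, n, r = 512, 2048, 64
W = np.random.randn(m, n).astype(np.float32)
A, B = low_rank_factorize(W, r)

original_params = m * n            # 1,048,576
factorized_params = m * r + r * n  #   163,840  (~16% of the original)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(original_params, factorized_params, f"relative error {rel_err:.3f}")
```

In practice the factorization is applied to trained weights, whose effective rank is much lower than that of the random matrix used here, so a far smaller approximation error is achievable at the same compression ratio.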
2.4 Filter dimension reduction

Most advanced DNN models, such as GoogLeNet and ResNet [14], use a large convolutional filter size in the first convolution layer, giving the model a larger receptive field for better performance. However, larger filters tend to be computationally expensive. Karpathy [15] proposed that a 7 × 7 filter can be replaced by stacked 3 × 3 filters; the network then has lower cost, requiring only a little over half of the MACC operations of a 7 × 7 filter (three stacked 3 × 3 filters need 3 × 9 = 27 multiplies per output element versus 49, for the same numbers of input and output channels). Gschwend [16] likewise replaced the 7 × 7 filter with a 3 × 3 filter and showed that accuracy decreased by less than 1 percent after the replacement, demonstrating that smaller filters can be used without compromising accuracy.

3 Hardware-level acceleration on FPGA

There are multiple hardware platforms for DNNs, such as GPUs, CPUs, ASICs, and FPGAs, and it is hard to say which works best for all deep learning applications. FPGAs, however, offer some distinct advantages for DNNs. This section introduces FPGA-based accelerations of DNNs.

3.1 Acceleration based on sparsity

A high percentage of sparsity causes serious under-utilization of computation resources in sparse CNN accelerators, especially because of the irregularity of the sparsity. FPGAs, however, provide an effective hardware-level solution.

Yijin Guan et al. [17] proposed an accelerator named Crane. In this accelerator, the DMA fetches only non-zero activation data and weights and stores them on chip for convolution processing. The output RAM stores all generated results and transmits them to the output unit for convolution post-processing, including activation functions, pooling, and encoding. Experimental results show that Crane improves performance by 27 to 88 percent and reduces energy consumption by 16 to 48 percent compared with its counterparts.

Zhang et al. [18] proposed Cambricon-X, an accelerator built from multiple processing elements (PEs). Indexing modules efficiently select and transmit only the needed neurons to the connected PEs, reducing bandwidth requirements, while each PE stores the irregular, compressed synapses in an asynchronous manner for local computation. Experimental results over a number of representative sparse networks show that the accelerator achieves, on average, 7.23× speedup and 6.43× energy savings over a state-of-the-art NN accelerator.

Zhou et al. [19] proposed Cambricon-S, which combines a software-based coarse-grained pruning technique with local quantization to significantly reduce the irregularity of sparse synapses, shrink the index size, and improve the network compression ratio; the hardware then efficiently handles the remaining sparse synapses and neuron irregularity. Compared with a state-of-the-art sparse neural network accelerator, it is 1.71× and 1.37× better in terms of performance and energy efficiency, respectively.
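The sketch below is a simplified software analogue of what these sparse accelerators do in hardware: the pruned weight matrix is stored in a compressed sparse row (CSR) format so that only non-zero weights and their indices are fetched, and every multiply-accumulate on a zero operand is skipped. The magnitude-pruning step, the CSR layout, and the loop structure are illustrative assumptions, not the actual datapaths of Crane or Cambricon-X/S.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_fc_layer(W_dense, x, sparsity=0.9):
    """Emulate a sparse fully connected layer: prune, store only the
    non-zeros (values + column indices), and skip zero operands."""
    # Magnitude pruning: zero out the smallest |w| entries.
    threshold = np.quantile(np.abs(W_dense), sparsity)
    W_pruned = np.where(np.abs(W_dense) >= threshold, W_dense, 0.0)

    W_csr = csr_matrix(W_pruned)          # indptr / indices / data arrays
    y = np.zeros(W_csr.shape[0], dtype=np.float32)
    for row in range(W_csr.shape[0]):
        start, end = W_csr.indptr[row], W_csr.indptr[row + 1]
        for k in range(start, end):       # only non-zero weights are visited
            y[row] += W_csr.data[k] * x[W_csr.indices[k]]
    return y, W_csr.nnz / W_dense.size    # output and surviving density

W = np.random.randn(64, 256).astype(np.float32)
x = np.random.randn(256).astype(np.float32)
y, density = sparse_fc_layer(W, x)
print(f"kept {density:.1%} of the weights")
```

The inner loop visits only the non-zero entries of each row; this is exactly the work reduction that the dedicated indexing logic of these accelerators achieves, without the pointer-chasing overhead a general-purpose processor would pay.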
3.2 Structures specialized for DNNs

Because the hardware is reprogrammable and reconfigurable, DNNs can be accelerated on FPGAs through carefully designed implementations.

Sina Ghaffari et al. [20] designed two specialized hardware architectures for DNNs. The first is suitable for applications with small DNNs: specific hardware is designed for each individual layer. In the second architecture, the hardware designed for a layer is reused several times as different layers are needed, and a control loop decides when each hardware block is used. With this technique, the network can have as many layers as needed while using the same resources, so the architecture is scalable and can easily be used for large networks.

Although parallel computation with PEs improves computation speed, there may be timing delays inside the PEs while the FPGA carries out convolution calculations. Enyi Wang et al. [21] realized pipelined convolution by adding registers between pairs of data processing nodes: during data flow, each register stores the data computed by its node in a given clock cycle and passes the cached data on to the next computation node in the following cycle.

3.3 Resource utilization

One key issue for FPGA-based DNN accelerators is that the computational throughput may not match the memory bandwidth provided by the FPGA platform, and many designs fail to achieve optimal performance because they do not make full use of the memory bandwidth and logic resources. Making full use of FPGA resources has therefore become an important research direction.

In [22], Zhang et al. proposed an analytical design scheme based on the roofline model. For a candidate CNN design, they quantitatively analyze its computing throughput and required memory bandwidth under various optimization techniques, such as loop tiling and transformation. With the help of the roofline model, the solution with the best performance and the lowest FPGA resource requirement can then be identified.

In [5], Li et al. proposed an end-to-end FPGA-based CNN accelerator with all layers mapped onto one chip, so that different layers can work concurrently in a pipelined structure to increase throughput. They also proposed a methodology that finds the optimal parallelism strategy for each layer, achieving high throughput and high resource utilization.

3.4 Data flow optimization

If the convolution loop optimization is not studied thoroughly before the hardware design phase, the resulting FPGA-based accelerator can hardly exploit data reuse or manage data movement efficiently. The optimization of data flow is therefore very important.

Yufei Ma et al. [23] presented a quantitative analysis and optimization of the convolution loops based on multiple design variables; by searching the design variable configurations, they derived a CNN hardware accelerator with a well-defined data flow that minimizes memory access and data movement while maximizing resource utilization for high performance.

In [24], Ding et al. proposed an FPGA-based depthwise separable CNN accelerator with all layers working concurrently in a pipelined fashion to improve system throughput and performance. To implement the accelerator, the paper presents a custom computing engine architecture that handles the dataflow between adjacent layers using double-buffering-based memory channels. This method achieved up to 17.6× speedup and 29.4× lower power compared with CPU and GPU implementations, respectively.
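As a software analogue of the loop tiling and double buffering discussed above, the following sketch computes a convolution tile by tile: for each output tile, only the input patch and weight slice that would fit in on-chip buffers are "loaded", and the tile is computed from that buffered data alone. In hardware, the load of the next tile would overlap the compute of the current one through double buffering. The tile sizes and the plain nested-loop schedule are illustrative choices, not the exact schedules of [22], [23], or [24].

```python
import numpy as np

def tiled_conv2d(x, w, tile_oc=16, tile_h=8, tile_w=8):
    """Valid 2D convolution computed tile by tile over output channels
    and output rows/columns (a software sketch of FPGA loop tiling)."""
    C, H, W = x.shape
    OC, _, K, _ = w.shape
    OH, OW = H - K + 1, W - K + 1
    y = np.zeros((OC, OH, OW), dtype=np.float32)

    for oc0 in range(0, OC, tile_oc):          # tile over output channels
        for oh0 in range(0, OH, tile_h):       # tile over output rows
            for ow0 in range(0, OW, tile_w):   # tile over output columns
                oc1 = min(oc0 + tile_oc, OC)
                oh1 = min(oh0 + tile_h, OH)
                ow1 = min(ow0 + tile_w, OW)
                # On an FPGA these slices correspond to on-chip buffers filled
                # by DMA; double buffering would overlap this "load" with the
                # compute of the previous tile.
                x_buf = x[:, oh0:oh1 + K - 1, ow0:ow1 + K - 1]
                w_buf = w[oc0:oc1]
                # Compute the output tile from the buffered data only.
                for oc in range(oc1 - oc0):
                    for oh in range(oh1 - oh0):
                        for ow in range(ow1 - ow0):
                            patch = x_buf[:, oh:oh + K, ow:ow + K]
                            y[oc0 + oc, oh0 + oh, ow0 + ow] = np.sum(patch * w_buf[oc])
    return y

x = np.random.randn(3, 32, 32).astype(np.float32)
w = np.random.randn(32, 3, 3, 3).astype(np.float32)
print(tiled_conv2d(x, w).shape)   # (32, 30, 30)
```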
3.5 DNN implementation on FPGA

The complexity and development overhead of hardware description languages (HDLs) make it difficult to implement algorithms efficiently on FPGA-based platforms, especially for DNNs. Multiple tools have been developed to bridge the gap between DNNs and FPGAs, freeing researchers to concentrate on the study of DNN algorithms. For example, Vitis AI, the Xilinx development environment for AI inference on Xilinx hardware platforms, supports mainstream frameworks such as Caffe, PyTorch, and TensorFlow, as well as the latest models for diverse deep learning tasks. Xilinx also provides the Vitis HLS tool, which synthesizes a C or C++ function into RTL code for acceleration on programmable logic devices. TF2FPGA [25], a framework that extends the well-known TensorFlow system with FPGA acceleration capabilities, enables automatic and transparent generation of high-throughput DNN accelerators implemented on FPGA.

4 Conclusion

This survey of DNN acceleration technologies illustrates that an ideal FPGA accelerator embodies a high level of hardware-software cooperation. At the software level, the review summarizes existing techniques for DNN acceleration, which are prerequisites for applying DNNs on FPGAs. Concerning hardware, the featured approaches further optimize acceleration while focusing on different aspects. Since FPGA-based DNN accelerators are vital for implementing embedded applications, this study provides a comprehensive reference for future research.

References

[1] Lyu Bing, Hiroyuki Tomiyama, and Lin Meng. Frame detection and text line segmentation for early Japanese books understanding. In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods (ICPRAM), pages 600–606. INSTICC, SciTePress, 2020.
[2] Lin Meng, Bing Lyu, Zhiyu Zhang, C. V. Aravinda, Naoto Kamitoku, and Katsuhiro Yamazaki. Oracle bone inscription detector based on SSD. ICIAP 2019, pages 126–136, 2019.
[3] Hengyi Li, Zhichen Wang, Xuebin Yue, Wenwen Wang, Tomiyama Hiroyuki, and Lin Meng. A comprehensive analysis of low-impact computations in deep learning workloads. In Proceedings of the 2021 Great Lakes Symposium on VLSI, GLSVLSI '21, New York, NY, USA, 2021. Association for Computing Machinery.
[4] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks.
[5] Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), pages 1–9. IEEE.
[6] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 26–35. ACM.
[7] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, and Huazhong Yang. Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA. 37(1):35–47.
[8] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.
[9] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer.
[10] Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan Yuille. Knowledge distillation in generations: More tolerant teachers educate better students.
[11] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6655–6659. IEEE.
[12] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation.
[13] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE.
[15] A. Karpathy. CS231n neural networks part 3: Learning and evaluation, 2016.
[16] Seyyed Hossein Hasanpour, Mohammad Rouhani, Mohsen Fayyaz, Mohammad Sabokrou, and Ehsan Adeli. Towards principled design of deep convolutional networks: Introducing SimpNet.
[17] Yijin Guan, Guangyu Sun, Zhihang Yuan, Xingchen Li, Ningyi Xu, Shu Chen, Jason Cong, and Yuan Xie. Crane: Mitigating accelerator under-utilization caused by sparsity irregularities in CNNs. 69(7):931–943.
[18] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–12. IEEE.
[19] Xuda Zhou, Zidong Du, Qi Guo, Shaoli Liu, Chengsi Liu, Chao Wang, Xuehai Zhou, Ling Li, Tianshi Chen, and Yunji Chen. Cambricon-S: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 15–28. IEEE.
[20] Sina Ghaffari and Saeed Sharifian. FPGA-based convolutional neural network accelerator design using high level synthesize. In 2016 2nd International Conference of Signal Processing and Intelligent Systems (ICSPIS), pages 1–6. IEEE.
[21] Enyi Wang and Dehui Qiu. Acceleration and implementation of convolutional neural network based on FPGA. In 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), pages 321–325. IEEE.
[22] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 161–170. ACM.
[23] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 45–54. ACM.
[24] Wei Ding, Zeyu Huang, Zunkai Huang, Li Tian, Hui Wang, and Songlin Feng. Designing efficient accelerator of depthwise separable convolutional neural network on FPGA. 97:278–286.
[25] Spyridon Mouselinos, Vasileios Leon, Sotirios Xydis, Dimitrios Soudris, and Kiamal Pekmestzi.
TF2FPGA: A framework for projecting and accelerating TensorFlow CNNs on FPGA platforms. In 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), pages 1–4. IEEE, 2019.