<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Survey on FPGA-based Deep Neural Network Accelerators</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mingyuan Li</string-name>
          <email>limingyuan0827@163.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hengyi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lin Meng</string-name>
          <email>menglin@fc</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Electronic and Computer Engineering, Ritsumeikan University</institution>
          ,
          <addr-line>Kusatsu, Shiga</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Information Engineering, Hefei University of Technology</institution>
          ,
          <addr-line>Xuancheng, Anhui</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning technologies have achieved great success in applying deep neural networks (DNNs) to many domains. However, their intensive computational and memory demands have become a heavy burden in the deployment of deep learning, especially on resource-constrained platforms. A promising solution is the FPGA, which provides effective means for optimizing and accelerating DNNs. The development of DNN applications with FPGA accelerators has therefore become an important field of research. In this paper, existing optimization techniques are evaluated to provide a comprehensive overview of FPGA-based DNN accelerators. The review addresses software- and hardware-level acceleration techniques, including, but not limited to, model compression, parameter quantization, and energy-efficient structural design.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, deep neural networks (DNNs) have made substantial progress across a broad
array of applications, delivering excellent performance. Such applications include
computer vision, natural language processing, the protection of cultural heritage [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], and
        ], and
many others. The flexibility of DNNs brings great convenience to modern
life. However, the quantity and complexity of their computation and memory demands
make their deployment a heavy burden on resource-constrained hardware platforms
such as robots and mobile devices. While training can be carried out on powerful
devices such as GPUs, inference typically has to run on these resource-limited
platforms. At the same time, research has revealed that there is massive redundancy
in a given DNN’s operations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Therefore, research on the optimization and acceleration of
DNNs has grown increasingly prominent.
      </p>
      <p>One avenue of study is the field-programmable gate array (FPGA), which provides an
effective solution for DNN acceleration. FPGAs offer superior energy efficiency compared
with GPUs and CPUs. Although deeper networks achieve higher accuracy, they also greatly
increase the number of parameters and the model size, which in turn raises the requirements
for computation, bandwidth, and storage; DNNs therefore exert a heavy strain on
resource-constrained devices. Because FPGAs are reprogrammable and reconfigurable,
FPGA-based devices are well suited to keeping pace with the rapid evolution of DNNs.
FPGAs also feature high throughput, low power consumption, and highly parallel
workflows, which make their performance excellent for DNNs. In particular, Intel's latest
release, the Agilex FPGA family, features an improved chip layout and an optimized
architecture, resulting in considerably enhanced flexibility and stability. On this basis,
FPGAs can accelerate DNNs with a high level of efficiency on edge devices, and research
has proposed multiple techniques for further performance acceleration.</p>
      <p>In this paper, we conduct a survey of FPGA-based optimizations for DNNs. The
optimization techniques detailed herein can be divided into two categories: the software
level, concerning the algorithms; and the hardware level, based on the FPGA itself.
The remainder of this review introduces optimization methods at the software level,
expands on acceleration techniques based on the FPGA architecture, and closes with a
summary conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>DNN optimization at the software level</title>
      <p>Multiple techniques at the software level have been proposed to improve the efficiency of DNNs.
This section gives an overview of software-level optimizations for DNNs.</p>
      <sec id="sec-2-1">
        <title>Pruning and quantization</title>
        <p>Pruning and quantization are effective ways to compress neural networks. Network pruning
removes redundant connections while preserving the effectiveness of the remaining neural
network connections, thereby improving efficiency. Data quantization quantifies the DNN
model parameters by replacing the floating-point representation with a fixed-point
representation or by reducing the number of bits used for the representation. Experimental
verification shows that the quantized data has little impact on the accuracy of the models.
At the same time, because the FPGA is not well suited to floating-point operations, data
quantization is also essential for DNN models to be deployed on FPGAs.</p>
        <p>
          Binarization networks are a very effective solution, and many researchers have engaged in
this line of work. Rastegari et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] proposed two kinds of binarization networks that reduce model
storage through the binarization of weights. The first, called Binary-Weight-Networks,
approximates the weights with binary values, so that the convolution
operation can be estimated using only additions and subtractions. The second, called XNOR-Networks,
binarizes both the weights and the inputs of the convolutional and fully connected layers,
so that convolutions can be estimated with XNOR and bit-counting
operations. In Binary-Weight-Networks, approximating the filters with binary values results
in a 32× memory saving, and XNOR-Networks makes convolutions 58× faster with the same 32× memory
saving.
        </p>
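        <p>For illustration, a minimal NumPy sketch of the Binary-Weight-Networks approximation follows: a filter is replaced by a single scaling factor times its sign pattern, so that convolution needs only additions and subtractions. The helper name binarize_filter is ours, not from [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
        <preformat>
# Minimal sketch of the Binary-Weight-Networks idea: approximate a
# real-valued filter W by alpha * sign(W), where alpha is the mean
# absolute value of the filter's weights.
import numpy as np

def binarize_filter(W):
    alpha = np.mean(np.abs(W))   # per-filter scaling factor
    W_binary = np.sign(W)        # weights constrained to -1/+1
    return alpha, W_binary

# Example: one 3x3 convolution filter stored as 32-bit floats.
W = np.random.randn(3, 3).astype(np.float32)
alpha, W_b = binarize_filter(W)
W_approx = alpha * W_b           # reconstruction used at inference time
print("approximation error:", np.linalg.norm(W - W_approx))
        </preformat>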
        <p>Currently, 32-bit floating-point data does not perform well in FPGA-based DNN
accelerators, so most advanced accelerators replace 32-bit floating-point data with
lower-precision fixed-point representations. In [<xref ref-type="bibr" rid="ref5">5</xref>], Podili et al. proposed replacing the
32-bit floating-point data with 32-bit fixed-point data. In [<xref ref-type="bibr" rid="ref6">6</xref>], Qiu et al. proposed using 16-bit
fixed-point data in place of the 32-bit floating-point data. And in [<xref ref-type="bibr" rid="ref7">7</xref>], Guo et al. showed that
a data quantization strategy can reduce the bit-width down to 8 bits with negligible accuracy
loss. These reductions in bit-width greatly improve the efficiency of computation
without decreasing accuracy.</p>
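        <p>The following hedged sketch illustrates the kind of uniform fixed-point quantization these works rely on; the symmetric 8-bit mapping and all names are illustrative assumptions of ours, not taken from any particular accelerator.</p>
        <preformat>
# Quantize a 32-bit floating-point tensor to 8-bit integers with a
# per-tensor scale, then dequantize to measure the introduced error.
import numpy as np

def quantize_to_int8(x):
    scale = np.max(np.abs(x)) / 127.0   # map the dynamic range to [-127, 127]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, scale = quantize_to_int8(x)
x_hat = dequantize(q, scale)
print("mean absolute quantization error:", np.mean(np.abs(x - x_hat)))
        </preformat>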
        <p>By removing the insignificant channels of the network and quantizing the weights and biases
expressed as floating-point numbers (high precision) into low-precision integers, the size of the
models and the computation demand can be greatly reduced; these are effective means of DNN
model compression.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Knowledge distillation</title>
        <p>Knowledge distillation transfers dark knowledge, what deep learning methods actually learn,
from a complex (teacher) model to a simple (student) model by minimizing a loss function.
Generally speaking, the teacher model has strong capability and performance, while the student model
is more compact. Through knowledge distillation, it is hoped that the student model can
approach or even surpass the teacher model, so as to obtain similar prediction
accuracy with less complexity.</p>
        <p>
          Hinton [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] adopted a strategy of feature matching at the softmax layer.
In essence, the softmax output is used as supervision. To make the score vector
softer, a distillation temperature T is added to the softmax layer to improve the performance
of distillation. The teacher's softened output probabilities serve as soft labels at high
temperature and are fused with the hard labels (at T = 1) to supervise the student,
with the two losses weighted against each other. With 61 specialist models, this yields
a 4.4 percent relative improvement in overall test accuracy.
        </p>
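        <p>A minimal sketch of this temperature-scaled distillation loss follows; the temperature T = 4 and weight w are illustrative choices of ours, not values prescribed in [<xref ref-type="bibr" rid="ref8">8</xref>].</p>
        <preformat>
# Hinton-style distillation: mix a soft-label cross-entropy (teacher and
# student logits softened at temperature T) with the usual hard-label term.
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))
    return e / np.sum(e)

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, w=0.7):
    p_teacher = softmax(teacher_logits, T)   # soft targets at high temperature
    p_student = softmax(student_logits, T)
    soft_loss = -np.sum(p_teacher * np.log(p_student))
    hard_loss = -np.log(softmax(student_logits, 1.0)[hard_label])  # T = 1 term
    # The T*T factor keeps the soft-gradient magnitudes comparable, as in [8].
    return w * (T * T) * soft_loss + (1.0 - w) * hard_loss
        </preformat>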
        <p>
          Zagoruyko [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] argued that directly transferring feature maps as knowledge from teacher to
student is too rigid and works poorly. Instead, the student should attend to
the regions that the teacher attends to. He therefore took the absolute values of the
feature planes of the different channels in a feature map, raised them to a power, and
summed them, then narrowed the distance between the teacher's and the student's attention
maps. This has better effects than directly transferring feature maps.
        </p>
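        <p>The sketch below shows an activation-based attention map and matching loss in this spirit; the exponent p = 2 is one possible choice, and the helper names are ours rather than from [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
        <preformat>
# Attention transfer: sum |F_c|^p over channels, normalize, and penalize
# the distance between teacher and student attention maps.
import numpy as np

def attention_map(feature_map, p=2):
    # feature_map has shape (channels, height, width)
    A = np.sum(np.abs(feature_map) ** p, axis=0)
    return A / np.linalg.norm(A)

def attention_transfer_loss(teacher_fm, student_fm, p=2):
    Qt = attention_map(teacher_fm, p)
    Qs = attention_map(student_fm, p)
    return np.linalg.norm(Qt - Qs)
        </preformat>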
        <p>
          Yang [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] argued that hard labels lead to overfitting of the model, whereas soft labels
contribute to its generalization ability. He therefore proposed a
method that does not compute an additional loss over all classes, but only over the few
classes with the highest confidence scores. When training the teacher, a constraint
was added to the teacher's loss for this selection; when training the student, the teacher's
previously obtained soft labels were combined with the hard labels. Experiments show that this
method improves classification accuracy on the datasets by 3 to 8 percent.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Low-rank matrix factorization</title>
        <p>While DNNs have achieved tremendous success on many tasks, the training of these
networks is expensive in both time and resources. One major reason is that DNNs are trained
over a very large number of parameters. Low-rank factorization is a very effective method
for reducing the number of parameters.</p>
        <p>
          Sainath et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] proposed a low-rank matrix factorization of the final weight layer, and
applied this low-rank technique to DNNs for both acoustic modeling and language modeling.
This method reduced the number of parameters of the network by 30-50 percent.
        </p>
        <p>
          For some simple DNN models, several low-rank approximation and clustering schemes for
the convolutional kernels were proposed in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. They exploited the redundancy present within
the convolutional filters to derive approximations that significantly reduce the required
computation; their method achieved a 2× speedup for a single convolutional layer with a 1 percent
drop in classification accuracy.
        </p>
        <p>
          The work in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed using different tensor decomposition schemes, achieved
by exploiting cross-channel or filter redundancy to construct a low-rank basis of filters that are
rank-1 in the spatial domain, and reported a 4.5× speedup with a 1 percent drop in accuracy on
text recognition.
        </p>
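        <p>As a small sketch of this family of methods, the factorization of a fully connected layer via truncated SVD is shown below; the matrix shapes and the rank are illustrative.</p>
        <preformat>
# Replace a weight matrix W (m x n) by two thin factors U_r (m x r) and
# V_r (r x n), cutting parameters from m*n down to r*(m + n).
import numpy as np

def low_rank_factorize(W, r):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :r] * s[:r]   # absorb singular values into the left factor
    V_r = Vt[:r, :]
    return U_r, V_r

W = np.random.randn(512, 1024)
U_r, V_r = low_rank_factorize(W, r=64)
# One large matmul becomes two cheaper ones: x @ V_r.T followed by @ U_r.T.
print("parameters before:", W.size, "after:", U_r.size + V_r.size)
        </preformat>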
      </sec>
      <sec id="sec-2-3">
        <title>Filter dimension reduction</title>
        <p>
          Most advanced DNN models, such as GoogLeNet and ResNet [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], use a large
convolutional filter size in the first convolution layer, thus giving the DNN model a larger receptive
field for better performance. However, larger filter sizes tend to be computationally expensive.
        </p>
        <p>
          Karpathy [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] noted that a 7×7 filter can be replaced by stacked 3×3 filters. In this
way the network has a lower cost, requiring only about 50 percent of the MACC operations
required by a 7×7 filter.
        </p>
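        <p>A quick back-of-the-envelope check of this claim, counting multiply-accumulate (MACC) operations per output position for a single channel pair:</p>
        <preformat>
# One 7x7 filter versus three stacked 3x3 filters covering the same
# receptive field (nonlinearities between layers are ignored here).
macc_7x7 = 7 * 7              # 49 MACCs for the 7x7 filter
macc_3x3_stack = 3 * (3 * 3)  # 27 MACCs for three stacked 3x3 filters
print(macc_3x3_stack / macc_7x7)  # about 0.55, i.e. roughly half
        </preformat>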
        <p>
          Gschwend [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], another researcher working on this problem, replaced the 7×7 filter with a 3×3
filter and showed that accuracy decreased by less than 1 percent after the replacement. This
proved that a smaller filter can be applied without compromising accuracy.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Hardware-level acceleration on FPGA</title>
      <p>There are multiple hardware platforms for DNNs, such as GPUs, CPUs, ASICs, FPGAs, etc. It is
hard to say which works best for all deep learning applications, but the FPGA offers some distinct
advantages for DNNs. In this section, FPGA-based accelerations of DNNs are
introduced.</p>
      <sec id="sec-3-1">
        <title>Acceleration based on sparsity</title>
        <p>A high percentage of sparsity causes serious underutilization of computation resources
in sparse CNN accelerators, especially given the irregularity of the sparsity. The FPGA,
however, provides an effective solution at the hardware level.</p>
        <p>
          Yijin Guan et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] proposed an accelerator named Crane. In this accelerator, the DMA
fetches only non-zero activation data and weights and stores them on chip for convolution
processing. The output RAM stores all generated results and transmits them to the output
unit for convolution post-processing, including activation functions, pooling, and encoding.
Experimental results show that Crane improves performance by 27 to 88 percent and reduces
energy consumption by 16 to 48 percent, respectively, compared with its counterparts.
        </p>
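        <p>The following simplified sketch illustrates the principle of fetching and computing on non-zero operands only; it is a software analogy of this hardware behavior, not Crane's actual design.</p>
        <preformat>
# Work scales with the number of non-zeros rather than the tensor size,
# mimicking a DMA engine that transfers only non-zero activations.
import numpy as np

def sparse_dot(activations, weights):
    nz_idx = np.nonzero(activations)[0]   # indices of non-zero activations
    return sum(activations[i] * weights[i] for i in nz_idx)

a = np.array([0.0, 1.5, 0.0, 0.0, 2.0])  # highly sparse activation vector
w = np.array([0.3, 0.7, 0.1, 0.9, 0.2])
print(sparse_dot(a, w))                   # 1.5*0.7 + 2.0*0.2 = 1.45
        </preformat>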
        <p>
          Zhang et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] proposed a software-based coarse-grained pruning technique to
significantly reduce the irregularity of sparse synapses. They combined the coarse-grained pruning
technique with local quantization, which can significantly reduce the index size
and improve the network compression ratio. They further designed a hardware accelerator,
Cambricon-X, to efficiently address the remaining irregularity of sparse synapses and neurons.
Experimental results over a number of representative sparse networks show that the accelerator
achieves, on average, a 7.23× speedup and 6.43× energy saving against a state-of-the-art NN
accelerator.
        </p>
        <p>
          Zhou et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposed an accelerator with a processing element (PE)-based
architecture consisting of multiple PEs. Indexing modules efficiently select and transmit the required
neurons to the connected PEs, reducing bandwidth requirements, while each PE stores irregular
and compressed synapses in an asynchronous manner for local computation. Compared with a
state-of-the-art sparse neural network accelerator, this accelerator is 1.71× and 1.37× better in
terms of performance and energy efficiency, respectively.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Structure specialized for DNNs</title>
        <p>Because the hardware is reprogrammable and reconfigurable, DNNs can be accelerated on
FPGA by carefully designing the implementation.</p>
        <p>
          Sina Ghaffari et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] designed two kinds of specialized hardware architectures for DNNs.
The first architecture is suitable for applications with small DNNs: the researchers designed
specific hardware for each individual layer. In the second architecture, a single hardware block
is designed for each layer and reused as many times as different layers require, with a control
loop deciding when to use each block. With this technique, the network can have as many layers
as needed with the same resources. This architecture is extensible and can easily be used for
large networks.
        </p>
        <p>
          Although PE-level parallel computation improves computation speed, there may be time
delays inside the PEs while the FPGA carries out convolution calculations. Enyi Wang et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]
realized pipeline optimization of the convolution operations by adding registers between two
data processing nodes. As data flows through, each register stores the data calculated by its
node in each clock cycle and passes the cached data on to the next computing node in the
following cycle.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Resources utilization</title>
        <p>One of the key issues for FPGA-based DNN accelerators is that the computational throughput
may not match the memory bandwidth provided by the FPGA platform, and many
methods fail to achieve optimal performance because they do not make full use of the memory
bandwidth and logic resources. As a result, making full use of FPGA resources has been a very
important research direction.</p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], Zhang et al. proposed an analytical design scheme using the roofline model. For a
candidate CNN design, the approach quantitatively analyzes its computing throughput and
required memory bandwidth under various optimization techniques, such as loop tiling and
loop transformation. Then, with the help of the roofline model, the solution with the best
performance and the lowest FPGA resource requirement can be identified.
        </p>
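        <p>A minimal sketch of the roofline bound follows; the peak-performance and bandwidth figures are hypothetical and serve only to show how raising the computation-to-communication ratio moves a design toward the compute roof.</p>
        <preformat>
# Attainable throughput is bounded by the compute roof or by memory
# bandwidth times the operational intensity of the design.
def roofline(peak_gflops, bandwidth_gb_s, ops_per_byte):
    return min(peak_gflops, bandwidth_gb_s * ops_per_byte)

# Loop tiling raises ops_per_byte (the computation-to-communication ratio).
for intensity in [1, 4, 16, 64]:
    print(intensity, roofline(peak_gflops=100.0, bandwidth_gb_s=4.0,
                              ops_per_byte=intensity))
        </preformat>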
        <p>
          In [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], Huimin Li et al. proposed an end-to-end FPGA-based CNN accelerator with all the layers
mapped onto one chip, so that different layers can work concurrently in a pipelined structure to
increase throughput, together with a methodology that finds the optimal parallelism strategy
for each layer to achieve high throughput and high resource utilization.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>Data flow optimization</title>
        <p>If convolution loop optimization is not fully studied before the hardware design phase of
an FPGA-based DNN accelerator, the resulting accelerator can hardly exploit data reuse or
manage data movement efficiently. Therefore, the optimization of the data flow is, in our
opinion, very important.</p>
        <p>
          Yufei Ma et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] put forward a quantitative analysis and optimization method
over many design variables to optimize the convolution loops. By searching the design-variable
configurations, they derived a CNN hardware accelerator with a clear data flow that minimizes
memory access and data movement while maximizing
resource utilization in order to obtain high performance.
        </p>
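        <p>A simplified software analogy of such loop tiling is sketched below, with a tiled matrix multiply standing in for the tiled convolution loops; the tile size and names are illustrative.</p>
        <preformat>
# Process the output in tiles so that each tile's operands fit in on-chip
# buffers (BRAM) and are reused before being evicted.
import numpy as np

def tiled_matmul(A, B, tile=64):
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i0 in range(0, m, tile):          # tile over output rows
        for j0 in range(0, n, tile):      # tile over output columns
            for k0 in range(0, k, tile):  # tile over the reduction dimension
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C
        </preformat>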
        <p>
          In [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], Ding et al. proposed an FPGA-based depthwise-separable CNN accelerator with all the
layers working concurrently in a pipelined fashion to improve system throughput and
performance. To implement the accelerator, the paper presented a custom computing-engine
architecture that handles the dataflow between adjacent layers using double-buffering-based
memory channels. This method achieved up to a 17.6× speedup and 29.4× lower power
consumption compared with CPU and GPU implementations, respectively.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>DNNs implementation on FPGA</title>
        <p>The complexity and development overhead of HDLs (hardware description languages) make
it difficult to implement algorithms efficiently on FPGA-based platforms, especially for
DNNs. Multiple tools have emerged to bridge the gap between DNNs and FPGAs, freeing
researchers to concentrate on the study of DNN algorithms. For example:</p>
        <p>Vitis AI, the Xilinx AI development environment for AI inference on Xilinx hardware
platforms, supports mainstream frameworks such as Caffe, PyTorch, and TensorFlow, as well
as the latest models for diverse deep learning tasks. Xilinx also provides the Vitis HLS tool,
which synthesizes a C or C++ function into RTL code for acceleration in programmable logic devices.</p>
        <p>
          TF2FPGA [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], a framework that extends the well-known TensorFlow system with
automatic FPGA acceleration capabilities, enables the automatic and transparent generation
of high-throughput DNN accelerators implemented on FPGA.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>As presented here, this survey of DNN acceleration technologies illustrates that an
ideal FPGA accelerator embodies a high level of hardware and software
cooperation. At the software level, this review summarizes the existing techniques for DNN
acceleration, which are prerequisites for applying DNNs on FPGAs. Concerning the hardware,
the featured approaches further optimize acceleration while focusing on different aspects.
Because FPGA-based DNN accelerators are vital for embedded application
implementation, this study forms a comprehensive reference for future research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Bing</given-names>
            <surname>Lyu</surname>
          </string-name>
          , Hiroyuki Tomiyama, and
          <string-name>
            <given-names>Lin</given-names>
            <surname>Meng</surname>
          </string-name>
          .
          <article-title>Frame detection and text line segmentation for early Japanese books understanding</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods - ICPRAM</source>
          , pages
          <fpage>600</fpage>
          -
          <lpage>606</lpage>
          . INSTICC, SciTePress,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Lin</given-names>
            <surname>Meng</surname>
          </string-name>
          , Bing Lyu, Zhiyu Zhang, C.V. Aravinda, Naoto Kamitoku, and
          <string-name>
            <given-names>Katsuhiro</given-names>
            <surname>Yamazaki</surname>
          </string-name>
          .
          <article-title>Oracle bone inscription detector based on SSD</article-title>
          .
          <source>ICIAP2019</source>
          , pages
          <fpage>126</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Hengyi</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhichen</given-names>
            <surname>Wang</surname>
          </string-name>
          , Xuebin Yue, Wenwen Wang, Hiroyuki Tomiyama, and
          <string-name>
            <given-names>Lin</given-names>
            <surname>Meng</surname>
          </string-name>
          .
          <article-title>A comprehensive analysis of low-impact computations in deep learning workloads</article-title>
          .
          <source>In Proceedings of the 2021 on Great Lakes Symposium on VLSI, GLSVLSI '21</source>
          , New York, NY, USA,
          <year>2021</year>
          .
          Association for Computing Machinery
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Rastegari</surname>
          </string-name>
          , Vicente Ordonez, Joseph Redmon, and
          <string-name>
            <given-names>Ali</given-names>
            <surname>Farhadi</surname>
          </string-name>
          .
          <article-title>XNOR-net: ImageNet classification using binary convolutional neural networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Huimin</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xitian</given-names>
            <surname>Fan</surname>
          </string-name>
          , Li Jiao, Wei Cao,
          <string-name>
            <surname>Xuegong Zhou</surname>
            , and
            <given-names>Lingli</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A high performance FPGA-based accelerator for large-scale convolutional neural networks</article-title>
          .
          <source>In 2016 26th International Conference on Field Programmable Logic and Applications (FPL)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jiantao</given-names>
            <surname>Qiu</surname>
          </string-name>
          , Jie Wang, Song Yao, Kaiyuan Guo,
          <string-name>
            <given-names>Boxun</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Erjin</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Huazhong</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Going deeper with embedded FPGA platform for convolutional neural network</article-title>
          .
          <source>In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>35</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kaiyuan</given-names>
            <surname>Guo</surname>
          </string-name>
          , Lingzhi Sui, Jiantao Qiu, Jincheng Yu,
          <string-name>
            <given-names>Junbin</given-names>
            <surname>Wang</surname>
          </string-name>
          , Song Yao, Song Han,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Huazhong</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Angel-eye: A complete design flow for mapping CNN onto embedded FPGA</article-title>
          .
          <volume>37</volume>
          (
          <issue>1</issue>
          ):
          <fpage>35</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          , Oriol Vinyals, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distilling the knowledge in a neural network</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Zagoruyko</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nikos</given-names>
            <surname>Komodakis</surname>
          </string-name>
          .
          <article-title>Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Chenglin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Lingxi Xie, Siyuan Qiao, and
          <string-name>
            <given-names>Alan</given-names>
            <surname>Yuille</surname>
          </string-name>
          .
          <article-title>Knowledge distillation in generations: More tolerant teachers educate better students</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Tara N.</given-names>
            <surname>Sainath</surname>
          </string-name>
          , Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and
          <string-name>
            <given-names>Bhuvana</given-names>
            <surname>Ramabhadran</surname>
          </string-name>
          .
          <article-title>Low-rank matrix factorization for deep neural network training with high-dimensional output targets</article-title>
          .
          <source>In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          , pages
          <fpage>6655</fpage>
          -
          <lpage>6659</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Emily</given-names>
            <surname>Denton</surname>
          </string-name>
          , Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus.
          <article-title>Exploiting linear structure within convolutional networks for efficient evaluation</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Max</given-names>
            <surname>Jaderberg</surname>
          </string-name>
          , Andrea Vedaldi, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Speeding up convolutional neural networks with low rank expansions</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A</given-names>
            <surname>Karpathy</surname>
          </string-name>
          .
          <article-title>CS231n neural networks part 3: learning and evaluation</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Seyyed Hossein</given-names>
            <surname>Hasanpour</surname>
          </string-name>
          , Mohammad Rouhani, Mohsen Fayyaz, Mohammad Sabokrou, and
          <string-name>
            <given-names>Ehsan</given-names>
            <surname>Adeli</surname>
          </string-name>
          .
          <article-title>Towards principled design of deep convolutional networks: Introducing SimpNet</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Yijin</given-names>
            <surname>Guan</surname>
          </string-name>
          , Guangyu Sun, Zhihang Yuan,
          <string-name>
            <given-names>Xingchen</given-names>
            <surname>Li</surname>
          </string-name>
          , Ningyi Xu, Shu Chen, Jason Cong, and
          <string-name>
            <given-names>Yuan</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>Crane: Mitigating accelerator under-utilization caused by sparsity irregularities in CNNs</article-title>
          .
          <volume>69</volume>
          (
          <issue>7</issue>
          ):
          <fpage>931</fpage>
          -
          <lpage>943</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Shijin</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu,
          <string-name>
            <given-names>Ling</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qi</given-names>
            <surname>Guo</surname>
          </string-name>
          , Tianshi Chen, and
          <string-name>
            <given-names>Yunji</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Cambricon-x: An accelerator for sparse neural networks</article-title>
          .
          <source>In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Xuda</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Zidong Du, Qi Guo, Shaoli Liu, Chengsi Liu, Chao Wang, Xuehai Zhou, Ling Li, Tianshi Chen, and
          <string-name>
            <given-names>Yunji</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Cambricon-s: Addressing irregularity in sparse neural networks through a cooperative software/hardware approach</article-title>
          .
          <source>In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</source>
          , pages
          <fpage>15</fpage>
          -
          <lpage>28</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Sina</given-names>
            <surname>Ghaffari</surname>
          </string-name>
          and
          <string-name>
            <given-names>Saeed</given-names>
            <surname>Sharifian</surname>
          </string-name>
          .
          <article-title>FPGA-based convolutional neural network accelerator design using high level synthesize</article-title>
          .
          <source>In 2016 2nd International Conference of Signal Processing and Intelligent Systems (ICSPIS)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Enyi</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dehui</given-names>
            <surname>Qiu</surname>
          </string-name>
          .
          <article-title>Acceleration and implementation of convolutional neural network based on FPGA</article-title>
          .
          <source>In 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT)</source>
          , pages
          <fpage>321</fpage>
          -
          <lpage>325</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Chen</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Peng Li,
          <string-name>
            <given-names>Guangyu</given-names>
            <surname>Sun</surname>
          </string-name>
          , Yijin Guan, Bingjun Xiao, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Cong</surname>
          </string-name>
          .
          <article-title>Optimizing FPGA-based accelerator design for deep convolutional neural networks</article-title>
          .
          <source>In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</source>
          , pages
          <fpage>161</fpage>
          -
          <lpage>170</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Yufei</given-names>
            <surname>Ma</surname>
          </string-name>
          , Yu Cao, Sarma Vrudhula, and Jae-sun Seo.
          <article-title>Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks</article-title>
          .
          <source>In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>54</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Ding</surname>
          </string-name>
          , Zeyu Huang, Zunkai Huang, Li Tian,
          <string-name>
            <given-names>Hui</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Songlin</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Designing efficient accelerator of depthwise separable convolutional neural network on FPGA</article-title>
          .
          <volume>97</volume>
          :
          <fpage>278</fpage>
          -
          <lpage>286</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Spyridon</given-names>
            <surname>Mouselinos</surname>
          </string-name>
          , Vasileios Leon, Sotirios Xydis, Dimitrios Soudris, and Kiamal Pekmestzi.
          <article-title>TF2FPGA: A framework for projecting and accelerating TensorFlow CNNs on FPGA platforms</article-title>
          .
          <source>In 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>