<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Design of Precision Configurable Multiply Accumulate Unit for Neural Network Accelerator</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jian Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinru Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinhe Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lijie Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shengli Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electronic Science and Engineering, Southeast University</institution>
          ,
          <addr-line>Nanjing, Jiangsu</addr-line>
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>29</fpage>
      <lpage>39</lpage>
      <abstract>
<p>With the increasing size of Convolutional Neural Network (CNN) models, the demands on memory size, bandwidth, and computational resources have become a central issue. Quantization plays a pivotal role in dramatically reducing the computation and bandwidth of CNN models. However, it is difficult for quantization alone to improve the throughput and power efficiency of fixed-precision accelerators. Different applications place different requirements on accelerators, and fixed-precision accelerators lack the flexibility to meet them. In this paper, a precision-configurable processing element (PE) is proposed that simplifies both the computing unit and the external configurable logic, and that also introduces approximate computation while preserving a certain CNN accuracy. For the first time, approximate computation is introduced into a configurable computational unit, which allows the architecture to further reduce power consumption on top of bit-level flexibility and to accommodate parameters from different network quantization methods. The design is implemented in the SMIC 40nm process library. Compared with Bit Fusion [1], it maintains a worst-case accuracy of 98.49% on LeNet while reducing area and power consumption by 53.2% and 19.8%, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Precision Scaling</kwd>
        <kwd>Quantization</kwd>
        <kwd>Approximate Calculation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1  Introduction</title>
      <p>DVAFS allows both activations and weights to be scaled in proportion. Compared with the traditional data-gating approach, DVAFS shortens the critical path and dynamically adjusts the clock frequency; it exploits the sparsity of convolution in a dedicated processor architecture during chip implementation and scales voltage and frequency with the required accuracy.</p>
<p>
        However, as performance increases, the lower-precision components require complex configurable logic. Lower precision also requires more activations and weights to feed precision-configurable units such as Bit-Fusion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and BitBlade [16]. The increased demand for activations and weights in turn raises the bandwidth and logic-resource requirements, which additionally increases power consumption and reduces hardware utilization, eroding the benefits of quantization. This paper designs a precision-configurable module and introduces an approximation method to reduce power consumption and improve hardware utilization.
      </p>
<p>
        The main contributions of this paper are as follows: (1) a precision-configurable computational unit is proposed that simplifies both the computational unit and the complex external configurable logic while preserving the accuracy of the neural network; (2) for the first time, approximation is introduced into the configurable unit, enabling the architecture to further reduce power consumption on top of bit-level flexibility and to adapt to parameters from multiple network quantization methods; (3) the design is implemented in the SMIC 40nm process library. Compared with Bit Fusion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], it maintains a worst-case accuracy of 98.49% on LeNet while reducing area and power consumption by 53.2% and 19.8%, respectively.
      </p>
<p>The remainder of this paper is organized as follows. The second section reviews the background on quantization compression and precision scalability. The third section describes the hardware design and its main innovations, and the fourth section evaluates the performance of the whole design. Finally, the fifth section concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2  Related Work </title>
<p>
        When CNNs are deployed in embedded devices, it is necessary to consider not only the demands on memory size, bandwidth, and computational resources imposed by the huge computational volume, but also the problem of limited energy supply. Quantization [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5-8</xref>
        ] is a method for reducing the storage and computation of CNNs. Although quantization introduces some accuracy loss, its impact is usually negligible.
      </p>
<p>
        MAC operations account for 99% of the total operations in a CNN; 97.3% of MAC operations can be performed at fewer than 4 bits without affecting accuracy, and most can even be done at 1 bit [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since the cost of a multiplication is proportional to the product of the operand bit widths, a quantized network can be sped up significantly. As the bit widths of activations and weights are reduced, fewer bits are read from memory, and memory-access power drops accordingly. However, quantization techniques are difficult to exploit on DNN accelerators with fixed bit widths to improve their throughput and energy efficiency, so it is especially important to design MACs that can dynamically adapt to the operand bit widths.
      </p>
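      <p>As a rough illustration of both effects (an expository sketch, not a measurement from this paper), the Python snippet below counts the partial products of an array multiplier, which scale with the product of the operand bit widths, and the bits fetched per operand pair, which scale with their sum:</p>
      <preformat>
# Illustration: multiplier cost ~ product of bit widths; memory traffic ~ sum.
for wa, wb in [(8, 8), (8, 4), (4, 4), (2, 2)]:
    partial_products = wa * wb  # partial-product bits in an array multiplier
    fetched_bits = wa + wb      # bits read from memory per operand pair
    print(f"{wa}x{wb}: {partial_products:2d} partial products, "
          f"{fetched_bits:2d} bits fetched")
      </preformat>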
<p>
        Precision-scaling MACs can adapt to the input parameters of differently quantized networks, which makes the hardware much more flexible, and they can be efficiently parallelized or serialized. Data gating was first proposed for configurable arithmetic circuits, after which Subword Parallelism, Divide-and-Conquer, and bit-serial architectures were proposed. Bit-Fusion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a 2D precision-scaling method based on Divide-and-Conquer. It computes and communicates at as fine a granularity as possible without loss of precision, and it reduces memory-access power while increasing the effective on-chip storage capacity by reducing the total number of bits held in on-chip and off-chip memory.
      </p>
<p>Precision-scaling units are generally composed of adders, multipliers, and external configuration units. Research on adders and multipliers is very mature, so we fuse the external configuration unit with the MAC unit to reduce the overall area and power consumption. The basic principle for computing a multiplication in this design is the same as in Bit-Fusion. However, this design exploits the fact that the partial products of the high bits and the low bits do not affect each other and can be combined by bit splicing. Bit splicing reduces the number of accumulators and thus the area overhead. In addition, this design does not need to append a 1-bit sign bit after splitting operands into low-bit slices. The smallest unit of this design implements a 2-bit multiplication, which further saves area compared with Bit-Fusion, whose smallest unit implements a 3-bit multiplication.</p>
<p>In addition to the above optimizations, this paper introduces approximation into configurable computing units for the first time. We use the LOA adder to approximately optimize the configurable computing unit, further reducing area and power consumption while maintaining accuracy.</p>
    </sec>
    <sec id="sec-3">
      <title>3  Proposed Design </title>
<p>The main purpose of this design is to give the architecture bit-level flexibility while reducing power consumption, so that it can ultimately accept parameters from various network quantization methods. The core of the design is dynamic adjustment of the operand bit width, with a multiplexer selecting the bit-width mode. The configurable MAC architecture dynamically implements three cases (8×8, 4×4, and 2×2), which is sufficient for neural-network applications while avoiding overly fine-grained computation.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1  Throughput Analysis </title>
<p>As the variety of computations supported by the MAC increases, so does the complexity of the hardware design. Because fine-grained computation leads to complex architectures, the supported granularity must be chosen carefully. If two computation cases achieve similar accuracy, the one with better throughput can replace the other, which reduces the variety of computations and simplifies the hardware design.</p>
<p>In our design, the configurable MAC is applied to the LeNet-5 network with the activations and weights quantized to 8, 4, and 2 bits. Since quantizing the activations affects accuracy more than quantizing the weights, only the six computation cases (8×8, 8×4, 8×2, 4×4, 4×2, and 2×2) shown in Table II are considered. As shown in Table I, there is little difference in accuracy among these six cases.</p>
<p>[Tables I and II, flattened in the source: accuracy of quantized networks under the candidate computation cases. The recoverable entries show ResNet-34 at 73.6, 73.1, and 71.5 for decreasing precisions, with the remaining entries N/A.]</p>
      <sec id="sec-4-1">
        <title>Throughput </title>
<p>The effect of simultaneously quantizing weights and activations on the training results was investigated on the PyTorch platform to verify the feasibility of the simplified computation cases. Fig. 1 shows that quantizing weights and activations has little effect on the output accuracy of the trained model. The activations are more sensitive to the number of quantization bits because their quantization errors span a larger range. Therefore, the precision-configurable MAC unit supports 8×8, 4×4, and 2×2 computations, which greatly reduces the complexity of the hardware within the allowed accuracy loss.</p>
        <sec id="sec-4-1-1">
<title>Figure 1. The relationship between quantization of weights and activations and accuracy in LeNet-5</title>
        </sec>
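        <p>To reproduce the trend of Fig. 1, a fake-quantization step of the following form can be swept over the weights and activations of a trained model. This is a sketch of the experiment (uniform symmetric quantize-dequantize), not the authors' exact PyTorch code:</p>
        <preformat>
import torch

def fake_quant(x: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform symmetric quantize-dequantize: snap x to a 2^bits-level
    # integer grid and map it back, so accuracy can be measured versus
    # bit width on an otherwise unchanged model.
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

# Sweep weight/activation bit widths (8, 4, 2) over a trained LeNet-5
# and re-measure test accuracy to obtain a Fig. 1-style curve.
        </preformat>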
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3.2  Configurable unit </title>
<p>1) Using a multiplexer-based 2-bit multiplier: The smallest computation unit of this design is the bit-level processing element, which performs a 2-bit multiplication. When performing the 2-bit multiplication, a multiplexer determines whether each operand is signed or unsigned, avoiding the extra sign bits required in Bit-Fusion.</p>
<p>A and B are two signed/unsigned numbers, with the inputs and outputs of the multipliers represented in two's complement for signed numbers. Cond(1) denotes the case "unsigned A × unsigned B"; cond(2) denotes "signed A × unsigned B"; cond(3) denotes "signed A × signed B".</p>
<p>[Figure 2: multiplexer-based 2-bit multiplier. The select signal sel routes A, B, and A&lt;&lt;2 according to the sign configurations (1)-(3).]</p>
      <p>cond(1): A × B = (2A[1] + A[0]) × (2B[1] + B[0]) = 4A[1]B[1] + 2A[1]B[0] + 2A[0]B[1] + A[0]B[0]  (1)</p>
      <p>cond(2): A × B = (-2A[1] + A[0]) × (2B[1] + B[0]) = -4A[1]B[1] - 2A[1]B[0] + 2A[0]B[1] + A[0]B[0]  (2)</p>
      <p>cond(3): A × B = (-2A[1] + A[0]) × (-2B[1] + B[0]) = 4A[1]B[1] - 2A[1]B[0] - 2A[0]B[1] + A[0]B[0]  (3)</p>
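      <p>A software model of this multiplexer-based 2-bit multiplier (our illustration; the function names and the exhaustive self-check are not from the paper) makes the role of the select signal explicit:</p>
      <preformat>
def mul_2bit(a: int, b: int, sel: int) -> int:
    """Model of the 2BM. a, b are raw 2-bit patterns (0..3);
    sel = 1: unsigned x unsigned, 2: signed x unsigned, 3: signed x signed."""
    a1, a0 = (a >> 1) &amp; 1, a &amp; 1
    b1, b0 = (b >> 1) &amp; 1, b &amp; 1
    # In two's complement the MSB weighs -2 instead of +2, as in cond(2)/(3).
    va = (-2 if sel in (2, 3) else 2) * a1 + a0
    vb = (-2 if sel == 3 else 2) * b1 + b0
    return va * vb

def to_val(x: int, signed: bool) -> int:
    # Direct interpretation of a 2-bit pattern, used only for the self-check.
    return x - 4 if signed and (x &amp; 2) else x

for sel, (sa, sb) in {1: (False, False), 2: (True, False), 3: (True, True)}.items():
    for a in range(4):
        for b in range(4):
            assert mul_2bit(a, b, sel) == to_val(a, sa) * to_val(b, sb)
      </preformat>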
<p>2) Adopting bit splicing: The bit-splicing method directly merges partial products that do not interfere with each other. Taking the 4×4 case shown in Fig. 3 as an example, the 4-bit operands A and B are first split into the 2-bit slices A[3:2], A[1:0], B[3:2], and B[1:0]. The product A[1:0]×B[1:0] of the low slices and the product A[3:2]×B[3:2] of the high slices occupy disjoint bit positions, so they can be bit-spliced directly without accumulation. In the 4-bit multiplication, a bit splicer is used in place of an adder and a shifter, avoiding the shift and accumulation of the upper four bits of the partial sum.</p>
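      <p>The arithmetic behind the splice can be checked in a few lines (our sketch, assuming unsigned operands for brevity): the low×low and high×high products fill disjoint bit ranges and are merely concatenated, while only the two cross products pass through a shift-and-add:</p>
      <preformat>
def mul4x4_splice(a: int, b: int) -> int:
    # Split the 4-bit operands into 2-bit slices.
    ah, al = a >> 2, a &amp; 0b11
    bh, bl = b >> 2, b &amp; 0b11
    # al*bl fills bits [3:0], ah*bh fills bits [7:4]: disjoint ranges,
    # so the OR below is pure bit splicing (wiring), not an adder.
    spliced = (ah * bh &lt;&lt; 4) | (al * bl)
    # Only the cross terms need a genuine shift-and-add.
    cross = ((ah * bl) + (al * bh)) &lt;&lt; 2
    return spliced + cross

assert all(mul4x4_splice(a, b) == a * b for a in range(16) for b in range(16))
      </preformat>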
<p>As shown in Table III, the number of multipliers, shifters, and adders required by the bit-splicing-based approach proposed in this paper is significantly smaller than in Bit-Fusion, and this advantage grows as the bit width of the computed multiplications increases. Consequently, this design has a smaller area and lower power consumption, so the overall area and power consumption of the configurable MAC are smaller.</p>
      <sec id="sec-5-1">
        <title>TABLE III. COMPARISON BETWEEN BIT‐FUSION AND BIT‐SPLICING  FOR MULTIPLICATION </title>
        <sec id="sec-5-1-1">
          <title>Bit‐Fusion  This design </title>
          <p>  2×2  4×4  8×8  2×2  4×4 </p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>3‐bit signed multiplier </title>
      </sec>
      <sec id="sec-5-3">
        <title>2‐bit multiplier based on multiplexer </title>
      </sec>
      <sec id="sec-5-4">
        <title>Minimum </title>
        <p>multiplier </p>
      </sec>
      <sec id="sec-5-5">
        <title>Number of </title>
        <p>multipliers </p>
      </sec>
      <sec id="sec-5-6">
        <title>Number of shifters </title>
      </sec>
      <sec id="sec-5-7">
        <title>Number of adders </title>
        <p>3)Building configurable multipliers: The 2BM which perform 2-bit multiplication are arranged in
the space. As shown in Fig. 4, a complete configurable MAC is composed of 16 2BMs and is capable
of accommodating MAC operations of 2bit, 4bit, and 8bit DNN layers.</p>
        <p>M
16 
15 
15 </p>
        <p>&lt;&lt;2
A0,1
B2,3
A2,3
B0,1
A0,1
B0,1
A2,3
B2,3
2BM
2BM
2BM
2BM
&lt;&lt;2
BA00,,11 2BM
BA40,,51 2BM
BA20,,31 2BM
A0,1
B6,7 2BM
BA24,,35 2BM
BA04,,15 2BM
BA04,,15 2BM
A4,5
B2,3 2BM</p>
        <p>M
M
M
M
4 
3 
3 
&lt;&lt;2
&lt;&lt;2</p>
        <p>M
M
1 
0 
0 
2BM BA02,,13</p>
        <p>A2,3
2BM B4,5
2BM BA22,,33</p>
        <p>A2,3
2BM B6,7
2BM BA06,,17
&lt;&lt;2 2BM BA46,,57
2BM BA26,,37</p>
        <p>A6,7
&lt;&lt;2 2BM B6,7
4 
1 
2 </p>
      </sec>
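      <p>How the sixteen 2BMs compose an 8×8 product can be modeled as follows (our illustration for the unsigned case; a signed 8-bit operand would route its top slices through cond(2)/cond(3) 2BMs):</p>
      <preformat>
def mul8x8_from_2bms(a: int, b: int) -> int:
    # Four 2-bit slices per operand; one 2BM per (i, j) slice pair.
    a_sl = [(a >> (2 * i)) &amp; 0b11 for i in range(4)]
    b_sl = [(b >> (2 * j)) &amp; 0b11 for j in range(4)]
    total = 0
    for i in range(4):
        for j in range(4):
            total += (a_sl[i] * b_sl[j]) &lt;&lt; (2 * (i + j))  # one 2BM
    return total

assert all(mul8x8_from_2bms(a, b) == a * b
           for a in range(256) for b in range(256))
      </preformat>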
    </sec>
    <sec id="sec-6">
      <title>3.3  Approximate calculation unit </title>
<p>In this experiment, we implemented the proposed design in hardware and performed a simple analysis of the whole implementation in Design Compiler. Table V reports the area of the whole design and the percentages occupied by the adder module, the 2BMs, and the external configuration module; the adders account for the largest share of the area. Given the fault tolerance of CNNs, the adders can therefore be approximated while maintaining a certain accuracy.</p>
      <sec id="sec-6-1">
        <title>Adder </title>
        <p>796.41 
54.1 
2BM 
511.9 
34.8 </p>
      </sec>
      <sec id="sec-6-2">
        <title>External configuration  module </title>
        <p>163.52 
11.1 </p>
<p>Approximate adders are mainly classified as the Accuracy Configurable Adder (ACA), the Speculative Carry Select Adder (SCSA), the Carry-Skip Adder (CSA), the Error-Tolerant Adder (ETA), and the Lower-part OR Adder (LOA). Table VI compares these approximate adders. LOA has the smallest area and power consumption because its low bits are computed entirely with OR gates, but it also has the highest error rate because accuracy is not considered there. Since adders of different bit widths impose different accuracy requirements, which can be met by tuning the number of approximated low bits, we finally choose the LOA.</p>
        <sec id="sec-6-2-1">
          <title>TABLE VI. COMPARISON OF APPROXIMATE ADDERS [17] </title>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>Types of  Area  Delay  Power </title>
        <p>adders  (um2)  (ns)  (uW) </p>
        <p>LOA  53.2  0.39  65.9 
ETAII  71.6  0.55  80.6 
ACA  73.8  0.25  118.4 
SCSA  109.2  0.32  134.5 
CSA  142.5  0.39  97.8 </p>
        <p>Error rate </p>
        <p>(%) 
89.99 
5.85/16.94 
16.66/16.34 </p>
        <p>5.85 
0.18/0.91 </p>
        <p>Mean Relative Error 
(um2) 
1.0 
2.6 
18.9 
2.6 
0.15 </p>
<p>The approximate bit widths required differ for adders of different bit widths. We modeled each adder in Matlab, selected candidates at equal intervals, and measured the MRED using the Monte Carlo method. Because the MRED of a computational unit is usually required to be below 5% [18], we finally set the approximate low-bit widths to 2/3/4/6 for adder bit widths of 6/8/10/16 bits. Table VII lists three cases for the different bit widths for comparison.</p>
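        <p>A Python reconstruction of this Monte Carlo experiment is sketched below (the original modeling was done in Matlab; the 8-bit adder with 3 approximate low bits is one of the configurations named above):</p>
        <preformat>
import random

def loa_add(a: int, b: int, k: int) -> int:
    # Lower-part OR Adder (LOA): the low k bits are approximated with a
    # bitwise OR; the upper bits use an exact adder whose carry-in is the
    # AND of the most significant lower-part bits.
    mask = (1 &lt;&lt; k) - 1
    low = (a | b) &amp; mask
    carry = ((a >> (k - 1)) &amp; 1) &amp; ((b >> (k - 1)) &amp; 1)
    return (((a >> k) + (b >> k) + carry) &lt;&lt; k) | low

# Monte Carlo estimate of the MRED of an 8-bit LOA with k = 3.
random.seed(0)
n, k, trials = 8, 3, 100_000
mred = 0.0
for _ in range(trials):
    a = random.randrange(1 &lt;&lt; n)
    b = random.randrange(1 &lt;&lt; n)
    mred += abs(loa_add(a, b, k) - (a + b)) / max(a + b, 1)
print(f"estimated MRED = {mred / trials:.4%}")
        </preformat>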
<p>After modeling, the selected low approximate bit widths for each adder bit width were implemented in hardware and evaluated in Design Compiler; the test results are shown in Table VII. We finally adopt case 2 as the bit-width selection.</p>
        <sec id="sec-6-3-1">
          <title>TABLE VIII. COMPARISON OF NUMBER OF ADDERS AND AREA </title>
          <p>wBiditt‐h  Number  (Aurmea2)   To(tuaml a2r)e a  Area of A(pupmr2o)x imation 
6  6  30.88  185.28  23.46 
8  4  40.93  163.72  29.45 
10  4  50.99  203.96  35.43 </p>
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>Total area of </title>
        <p>Approximation (um2) 
140.76 
117.80 
141.72 
wBiditt‐h  Number  (Aurmea2)   To(tuaml a2r)e a  Area of A(pupmr2o)x imation  ApprTooxtiaml aatrieoan o (fu m2) 
16  3  81.15  243.45  57.45  172.35 
As shown in Figure VIII, we count the number of adders, the area of a single accurate and
approximate adder, and the total area of a fixed bit width adder. The sum of the statistical exact adder
area and the sum of the approximate post adder area are compared to achieve a gain of 39.41% in area,
which is a huge gain for the entire computational unit.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4  Evaluation </title>
<p>
        Bit-Fusion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], presented at ISCA in 2018, is selected for comparison. The MAC design of Bit-Fusion adds a sign bit to the original 2×2 multiplication unit, yielding a minimum computation unit of 3×3. For the 4×4 and 8×8 cases, a shifter replaces the multiplication carry, and the partial results are combined and added. This paper optimizes the external configuration module and the multiplication of Bit-Fusion to reduce the area and power consumption of the computing unit. Bit-Fusion was originally realized in a 45nm process; here, both Bit-Fusion and this design are implemented and compared under the same experimental conditions, tested in Design Compiler using the SMIC 40nm process library.
      </p>
      <sec id="sec-7-0">
        <title>4.1  Performance Analysis</title>
        <p>1) Comparison of the 2-bit minimum multiplier: Bit-Fusion uses four full adders (FAs) and three half adders (HAs), while this design uses only one FA, four HAs, and two data selectors. The benefit is considerable: this design reduces the area by 35.5% and the power consumption by 47.5% compared with Bit-Fusion.</p>
        <p>2) Comparison of the 8-bit multiplication: The 8-bit multiplier is compatible with sixteen 2-bit-input multipliers or four 4-bit-input multipliers. Owing to the bit-selective design and the use of bit splicing, this design reduces the area by 53.1% and the power consumption by 40% compared with Bit-Fusion.</p>
      </sec>
      <sec id="sec-7-1">
        <title>TABLE IX. AREA AND POWER CONSUMPTION IN PRECISE AND APPROXIMATE CASE  </title>
<p>8-bit MAC in the SMIC 40nm process: Area (um2): 1471.83 in the precise case vs. 1247.99 in the approximate case, a comparison of 17.9%; Power (mW): 1.59 vs. 1.46, a comparison of 8.9%.</p>
        <p>The final approximate design is compared with the reference design Bit-Fusion, and the results are shown in Table X. The precision-configurable unit achieves a gain of more than 53.2% in area and more than 19.8% in power consumption.</p>
      </sec>
      <sec id="sec-7-2">
        <title>TABLE X. AREA AND POWER CONSUMPTION OF OUR DESIGN AND BIT‐FUSION IN APPROXIMATE CASE  </title>
<p>8-bit MAC in the SMIC 40nm process: Area (um2): 1912.5 for Bit-Fusion vs. 1247.99 for this design, a comparison of 53.2%; Power (mW): 1.75 vs. 1.46, a comparison of 19.8%.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4.2  Accuracy Analysis </title>
      <sec id="sec-8-1">
        <title>Bit‐width </title>
      </sec>
      <sec id="sec-8-2">
        <title>Number </title>
      </sec>
      <sec id="sec-8-3">
        <title>Accuracy of each adder  MRED of MAC </title>
        <p>2 
3 
4 
6 </p>
      </sec>
      <sec id="sec-8-4">
        <title>Input Mode </title>
        <p>4 
8 </p>
      </sec>
      <sec id="sec-8-5">
        <title>MRED of MAC </title>
        <p>0.019 </p>
<p>
          Based on the fault tolerance of CNNs, the precision-configurable unit enables the circuit to accept a variety of network parameters by adding configuration logic. In this paper, from the perspective of improving accelerator flexibility, a precision-scaling MAC is designed that adapts to multiple network structures while ensuring low power consumption. The hardware performance of the accelerator is improved, and the worst-case accuracy on LeNet remains above 98.49%. Implemented in the SMIC 40nm process, the precision-configurable cell achieves an area gain of more than 53.2% and a power gain of more than 19.8% compared with Bit-Fusion [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>6  References </title>
<p>[14] T. Luo, S. Liu, L. Li, et al. DaDianNao: A Neural Network Supercomputer [J]. IEEE Transactions on Computers, 2017, 66(1): 73-88.</p>
      <p>[15] Y.-H. Chen, T. Krishna, J. S. Emer, et al. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks [J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127-138.</p>
      <p>[16] S. Ryu, H. Kim, W. Yi, et al. BitBlade: Area and Energy-Efficient Precision-Scalable Neural Network Accelerator with Bitwise Summation [C]. Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC), June 2-6, 2019.</p>
      <p>[17] H. Jiang, C. Liu, L. Liu, et al. A Review, Classification, and Comparative Evaluation of Approximate Arithmetic Circuits [J]. ACM Journal on Emerging Technologies in Computing Systems (JETC), 2017, 13: 1-34.</p>
      <p>[18] Q. Li, X. Fan, J. Chen, H. Li and H. Liu. A Hardware Efficient Approximate Shift Multiplier with High Accuracy [C]. 2021 IEEE 14th International Conference on ASIC (ASICON), 2021, pp. 1-5, doi: 10.1109/ASICON52560.2021.9620363.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N</given-names>
            <surname>Suda</surname>
          </string-name>
          , et al.
          <source>Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks [M]</source>
          .
          <source>2018 ACM/IEEE 45TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA)</source>
          .
          <year>2018</year>
          :
          <fpage>764</fpage>
          -
          <lpage>775</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , “
          <article-title>Learning structured sparsity in deep neural networks,”</article-title>
          <source>in Proc. 30th Int. Conf. Neural Inf. Process. Syst. (NIPS)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2082</fpage>
          -
          <lpage>2090</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
et al., “
          <article-title>DNN dataflow choice is overrated</article-title>
          ,” Sep.
          <year>2018</year>
          , arXiv:1809.04070. [Online]. Available: https://arxiv.org/abs/1809.04070.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Camus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cacciotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlachter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Enz</surname>
          </string-name>
          , “
          <article-title>Design of approximate circuits by fabrication of false timing paths: The carry cutback adder</article-title>
          ,
          <source>” IEEE J. Emerg. Sel. Topics Circuits Syst.</source>
          , vol.
          <volume>8</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>757</lpage>
          , Dec.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D D</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S S</given-names>
            <surname>Talathi</surname>
          </string-name>
          ,
<string-name>
            <given-names>V S</given-names>
            <surname>Annapureddy</surname>
          </string-name>
          .
          <article-title>Fixed Point Quantization of Deep Convolutional Networks [M]</article-title>
          .
          <source>INTERNATIONAL CONFERENCE ON MACHINE LEARNING</source>
          , VOL
          <volume>48</volume>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W J</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
<string-name>
            <given-names>Z F</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A Precision-Scalable Energy-Efficient Convolutional Neural Network Accelerator [J]</article-title>
          .
          <source>IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS</source>
          ,
          <year>2020</year>
          ,
          <volume>67</volume>
          (
          <issue>10</issue>
          ):
          <fpage>3484</fpage>
          -
          <lpage>3497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S</given-names>
            <surname>Kligys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.
<article-title>Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference</article-title>
          [M].
          <source>2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)</source>
          .
          <year>2018</year>
          :
          <fpage>2704</fpage>
          -
          <lpage>2713</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C</given-names>
            <surname>Leng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.
          <source>Quantized Convolutional Neural Networks for Mobile Devices [M]</source>
          .
          <source>2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)</source>
          .
          <year>2016</year>
          :
          <fpage>4820</fpage>
          -
          <lpage>4828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Baugh</surname>
            ,
            <given-names>C. R.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Wooley</surname>
          </string-name>
          , “
<article-title>A Two's Complement Parallel Array Multiplication Algorithm,”</article-title>
          <source>IEEE Trans. Computers</source>
          , Vol.
          <volume>22</volume>
          , pp.
          <fpage>1045</fpage>
          -
          <lpage>1047</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J</given-names>
            <surname>Albericio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P</given-names>
            <surname>Judd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T</given-names>
            <surname>Hetherington</surname>
          </string-name>
, et al.
          <article-title>Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing</article-title>
          [M].
          <source>2016 ACM/IEEE 43RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA)</source>
          .
          <year>2016</year>
          :
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A</given-names>
            <surname>Parashar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Rhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Mukkara</surname>
          </string-name>
          , et al.
<source>SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks [M]. 44TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA 2017)</source>
          .
          <year>2017</year>
          :
          <fpage>27</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q</given-names>
            <surname>Guo</surname>
          </string-name>
          , et al.
<article-title>Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach</article-title>
          [M].
          <source>2018 51ST ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO)</source>
          .
          <year>2018</year>
          :
          <fpage>15</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.
          <article-title>Cambricon-X: An accelerator for sparse neural networks</article-title>
          ;
          <source>proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), F 15-19 Oct</source>
          .
          <year>2016</year>
          ,
          <year>2016</year>
          [C].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>