<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Design of Precision Configurable Multiply Accumulate Unit for Neural Network Accelerator</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Jian</forename><surname>Chen</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Xinru</forename><surname>Zhou</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Xinhe</forename><surname>Li</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Lijie</forename><surname>Wang</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Shengli</forename><surname>Lu</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Hao</forename><surname>Liu</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">School of Electronic Science and Engineering</orgName>
								<orgName type="institution">Southeast University</orgName>
								<address>
									<settlement>Nanjing</settlement>
									<region>Jiangsu</region>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Internet of Things and Cloud Computing Technology</orgName>
								<orgName type="laboratory">AIoTC2022@International Conference on Artificial Intelligence</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Design of Precision Configurable Multiply Accumulate Unit for Neural Network Accelerator</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">8209B70934A8C2AE3EF32BC9D9B22741</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Convolutional Neural Networks</term>
					<term>Precision Scaling</term>
					<term>Quantization</term>
					<term>Approximate Calculation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the recent growth in the model size of Convolutional Neural Networks (CNNs), the demand for memory capacity, bandwidth and computational resources has become a central issue. Quantization plays a pivotal role in dramatically reducing the computation and bandwidth requirements of CNN models. However, it is difficult for quantization alone to improve the throughput and power efficiency of fixed-precision accelerators. Different applications place different requirements on accelerators, and fixed-precision accelerators lack the flexibility to meet them. In this paper, a precision-configurable processing element (PE) is proposed, which simplifies both the computing unit and the external configurable logic, and also introduces approximate calculation while preserving a certain level of CNN accuracy. For the first time, approximate computation is introduced into a configurable computational unit, which allows the architecture to further reduce power consumption on top of bit-level flexibility and to accommodate parameters from different quantization methods. The design is implemented in the SMIC 40 nm process library. Compared with Bit Fusion [1], this method maintains a worst-case accuracy of 98.49% on LeNet while reducing area and power consumption by 53.2% and 19.8%, respectively.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>CNNs have achieved great success in many computer vision tasks such as image recognition <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref> and target recognition <ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b7">[7]</ref><ref type="bibr" target="#b8">[8]</ref>. In the recent development of CNNs, growing model sizes have placed significant demands on memory capacity, bandwidth and computational resources <ref type="bibr" target="#b7">[7,</ref><ref type="bibr" target="#b8">8]</ref>.</p><p>To address these issues, many model compression methods such as pruning <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b9">9]</ref> and quantization <ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b7">[7]</ref><ref type="bibr" target="#b8">[8]</ref> have been proposed to reduce the storage and computational requirements of CNNs. Quantization can significantly reduce the size of CNN models and alleviate the memory-intensive problem, which helps reduce bandwidth requirements <ref type="bibr" target="#b5">[6]</ref>. However, most current accelerators fail to exploit quantized models to solve the computation-intensive problem <ref type="bibr" target="#b5">[6]</ref>. Most accelerators <ref type="bibr" target="#b14">[14,</ref><ref type="bibr" target="#b15">15]</ref> perform multiply-accumulate (MAC) operations at a fixed high precision, but after quantization many MAC operations do not require such high precision <ref type="bibr" target="#b5">[6]</ref>. It is therefore difficult for quantization techniques alone to improve the throughput and power efficiency of precision-fixed accelerators. 
Different applications place different requirements on accelerators, and precision-fixed accelerators lack the flexibility to meet these requirements.</p><p>Therefore, many precision-configurable CNN accelerators have recently been proposed <ref type="bibr" target="#b10">[10]</ref><ref type="bibr" target="#b11">[11]</ref><ref type="bibr" target="#b12">[12]</ref><ref type="bibr" target="#b13">[13]</ref>, in which activations and weights can be partially or fully scaled. For example, Dynamic Voltage, Accuracy, and Frequency Scaling (DVAFS) <ref type="bibr" target="#b10">[10]</ref>, first proposed by Bert Moons et al., builds on the data-gating approach and reuses full-adder units that are idle at scaled accuracy, which allows both activations and weights to be scaled in proportion. Compared with traditional data gating, DVAFS shortens the critical path, dynamically adjusts the clock frequency, exploits the sparsity of convolution in a dedicated processor architecture, and achieves voltage and frequency scaling along with accuracy.</p><p>However, as performance increases, the lower-precision components require complex configurable logic. Reduced precision also requires more activations and weights to feed precision-configurable units such as Bit-Fusion <ref type="bibr" target="#b13">[13]</ref> and BitBlade <ref type="bibr" target="#b16">[16]</ref>. This increased demand for activations and weights in turn increases the demand for bandwidth and logic resources, which raises power consumption and lowers hardware utilization, eroding the benefits of quantization. 
This paper designs a precision-configurable module and introduces an approximation method to reduce power consumption and improve hardware utilization.</p><p>The main contributions of this paper are as follows: (1) A precision-configurable computational unit is proposed that simplifies both the computational unit and the complex external configurable logic while preserving a certain level of neural-network accuracy; (2) For the first time, approximation is introduced into the configurable unit, which enables the architecture to further reduce power consumption on top of bit-level flexibility and to adapt to parameters from multiple quantization methods; (3) The design is implemented in the SMIC 40 nm process library. Compared with Bit Fusion <ref type="bibr" target="#b0">[1]</ref>, this method maintains a worst-case accuracy of 98.49% on LeNet while reducing area and power consumption by 53.2% and 19.8%, respectively.</p><p>The remainder of this paper is organized as follows. The second section reviews quantization compression and precision scalability. The third section presents the hardware design and its main innovations, and the fourth section evaluates the performance of the whole design. Finally, the fifth section concludes the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>When CNNs are deployed on embedded devices, one must consider not only the demands on memory capacity, bandwidth and computational resources imposed by the huge computational volume, but also the limited energy supply. Quantization <ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b7">[7]</ref><ref type="bibr" target="#b8">[8]</ref> is a method used to reduce the storage and computation of CNNs. Although quantization introduces some accuracy loss, the impact is typically negligible.</p><p>MAC operations account for 99% of the total operations in a CNN, and 97.3% of MAC operations can be performed at fewer than 4 bits without affecting accuracy; most can even be done at 1 bit <ref type="bibr" target="#b0">[1]</ref>. Since the cost of a multiplication is proportional to the product of the operand bit widths, a quantized network can be sped up significantly. As the bit widths of activations and weights shrink, fewer bits are transferred to and from memory, so memory-access power is also reduced. However, quantization techniques cannot easily improve the throughput and energy efficiency of DNN accelerators with fixed bit widths, so it is especially important to design MACs that dynamically adapt to the bit widths of the operands.</p><p>Precision-scaling MACs can adapt to the input parameters of different quantized networks, which makes the hardware far more flexible, and they can be implemented efficiently in parallel or serial form. Data gating was first proposed for configurable arithmetic circuits, after which Subword Parallelism, Divide and Conquer, and bit-serial architectures were proposed in turn. Bit-Fusion <ref type="bibr" target="#b0">[1]</ref> is a 2D precision-scaling method based on Divide-and-Conquer. 
This method computes and communicates at as fine a granularity as possible without loss of precision, and it reduces memory-access power while increasing effective on-chip storage capacity by reducing the total number of bits held on- and off-chip.</p><p>Precision-scaling units are generally composed of adders, multipliers and external configuration units. Adders and multipliers are already well studied, so we instead fuse the external configuration unit with the MAC unit to reduce the overall area and power consumption. The basic principle of the multiplication in this design is the same as in Bit-Fusion. However, this design exploits the fact that the partial sums of the high bits and low bits do not affect each other and can be combined by bit-splicing. The use of bit-splicing reduces the number of accumulators and thus the area overhead. In addition, in this design there is no need to append a 1-bit sign bit after splitting into lower-bit pieces. The smallest unit of this design implements a 2-bit multiplication, which further saves area compared with Bit-Fusion, whose smallest unit implements a 3-bit multiplication.</p><p>Beyond these optimizations, this paper also introduces approximation into configurable computing units for the first time. We use the LOA (lower-part OR adder) to approximately optimize the configurable computing unit, further reducing area and power consumption while maintaining accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Proposed Design</head><p>The main purpose of this design is to give the architecture bit-level flexibility while reducing power consumption, so that it can ultimately adapt to parameters from various quantization methods. The core of this design is the dynamic adjustment of operand bit width, with a multiplexer selecting the bit-width mode. The configurable MAC architecture dynamically implements calculations for three cases (8×8, 4×4, and 2×2), which is sufficient for neural-network applications and avoids overly fine-grained computation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Throughput Analysis</head><p>As the variety of computations supported by the MAC increases, so does the complexity of the hardware design. Because fine-grained computation leads to complex architectures, the required granularity must be chosen carefully. If two computation cases have similar accuracy, the one with better throughput can replace the other, which reduces the variety of computations and simplifies the hardware design.</p><p>In our design, the configurable MAC is applied to the LeNet-5 network with activations and weights quantized to 8, 4 and 2 bits, respectively. Since quantizing activations affects accuracy more than quantizing weights, only the six computation cases (8×8, 8×4, 8×2, 4×4, 4×2, and 2×2) shown in Table <ref type="table" target="#tab_1">II</ref> are considered. As shown in Table <ref type="table" target="#tab_0">I</ref>, there is little difference in accuracy between these six cases. The effect of simultaneously quantizing weights and activations on the training results was investigated on the PyTorch platform to verify the feasibility of the simplified computational cases. It is evident from Fig. <ref type="figure">1</ref> that quantizing weights and activations has little effect on the output accuracy of the trained model. Activations are more sensitive to changes in the number of quantization bits because their range of quantization errors is larger. Therefore, the precision-configurable MAC unit supports 8×8, 4×4, and 2×2 computations, which greatly reduces the complexity of the hardware within the allowed accuracy loss.</p><p>Figure 1. The relationship between quantization of weights and activations and accuracy in LeNet-5</p></div>
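As a rough illustration of the quantization sweep described above, the following NumPy sketch applies symmetric uniform fake quantization at 8, 4, and 2 bits and reports the resulting error. This is a minimal sketch under our own assumptions (the function name `fake_quantize` and the random data are illustrative); the paper's actual experiments used PyTorch and LeNet-5.

```python
import numpy as np

def fake_quantize(x, bits):
    """Symmetric uniform quantization: quantize x to `bits` bits, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits, 1 for 2 bits
    peak = np.max(np.abs(x))
    scale = peak / qmax if peak > 0 else 1.0        # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)                           # stand-in for a weight tensor
for bits in (8, 4, 2):
    err = np.mean(np.abs(fake_quantize(w, bits) - w))
    print(f"{bits}-bit mean abs. quantization error: {err:.4f}")
```

The sketch makes the paper's observation concrete: the error grows as the bit width shrinks, but remains bounded, which is why the three supported modes suffice.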
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Configurable unit 1)Using a 2-bit multiplier based on multiplexer:</head><p>The smallest computation unit of this design is the bit-level processing element, which is capable of 2-bit multiplication. When this design performs the 2-bit multiplication, the multiplexer determines whether each operand is signed or unsigned, avoiding the extra sign bits required in Bit-Fusion.</p><p>A and B are two signed/unsigned numbers. The inputs and outputs of the multipliers are in two's-complement form for signed numbers. Cond(1) represents the case "unsigned A × unsigned B"; cond(2) represents "signed A × unsigned B"; cond(3) represents "signed A × signed B". Figure 2 shows the 2-bit multiplier design of this paper; in the rest of this paper we call it 2BM.</p><formula xml:id="formula_0">cond(1): A × B = (2A[1] + A[0]) × (2B[1] + B[0]) = 4A[1]B[1] + 2A[1]B[0] + 2A[0]B[1] + A[0]B[0] <label>(1)</label></formula><formula xml:id="formula_1">cond(2): A × B = (−2A[1] + A[0]) × (2B[1] + B[0]) = −4A[1]B[1] − 2A[1]B[0] + 2A[0]B[1] + A[0]B[0] <label>(2)</label></formula><formula xml:id="formula_3">cond(3): A × B = (−2A[1] + A[0]) × (−2B[1] + B[0]) = 4A[1]B[1] − 2A[1]B[0] − 2A[0]B[1] + A[0]B[0] <label>(3)</label></formula><p>2)Adopting bit-splicing: The bit-splicing method directly merges partial products that do not interfere with each other. Taking the 4×4 multiplication shown in Fig. <ref type="figure">3</ref> as an example:  </p></div>
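The three cases above can be checked with a small Python model that expands the product into the four 1-bit products exactly as in Eqs. (1)-(3). The helper names (`bits2`, `mul2`, `value`) are our own illustration, and the sketch verifies the expansions against ordinary integer multiplication:

```python
def bits2(x):
    """Split a 2-bit field into (x[1], x[0])."""
    return (x >> 1) & 1, x & 1

def mul2(a, b, a_signed=False, b_signed=False):
    """2-bit multiply built from the four 1-bit products A[i]*B[j].
    The weight of the top bit is +2 for unsigned operands and -2 for
    signed (two's complement) operands, which flips the sign of the
    weight-4 and weight-2 terms as in cond(1)-(3)."""
    a1, a0 = bits2(a)
    b1, b0 = bits2(b)
    sa = -1 if a_signed else 1
    sb = -1 if b_signed else 1
    return (sa * sb * 4 * a1 * b1 + sa * 2 * a1 * b0
            + sb * 2 * a0 * b1 + a0 * b0)

def value(x, signed):
    """Interpret a 2-bit field as unsigned (0..3) or two's complement (-2..1)."""
    return x - 4 if (signed and x >= 2) else x

# exhaustive check of cond(1)-(3) against plain integer multiplication
for a in range(4):
    for b in range(4):
        assert mul2(a, b) == a * b                                        # cond(1)
        assert mul2(a, b, True) == value(a, True) * b                     # cond(2)
        assert mul2(a, b, True, True) == value(a, True) * value(b, True)  # cond(3)
```

The hardware 2BM selects between these three expansions with the multiplexer; the model simply confirms that the expansions are correct for every 2-bit input pair.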
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the 4-bit case, A and B are first split into the 2-bit pieces A[3:2], A[1:0], B[3:2] and B[1:0]. The product A[1:0]×B[1:0] of the lower 2 bits and the product A[3:2]×B[3:2] of the higher 2 bits do not interfere with each other and can be directly bit-spliced without going through accumulation. In the 4-bit multiplication, a bit-splicer therefore replaces an adder and a shifter, avoiding the shifting and accumulation of the higher four bits of the partial sum.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure3. 4×4 multiplier of the bit-splicing method</head><p>As shown in Table <ref type="table" target="#tab_2">III</ref>, the number of multipliers, shifters, and adders required by the bit-splicing-based approach proposed in this paper is significantly smaller than that required by Bit-Fusion, and this advantage becomes more pronounced as the bit width of the multipliers grows. Accordingly, this design has a smaller area and lower power consumption, so the overall area and power consumption of the configurable MAC are smaller.</p><p>3)Building configurable multipliers: The 2BMs, each performing a 2-bit multiplication, are arranged spatially. As shown in Fig. <ref type="figure">4</ref>, a complete configurable MAC is composed of 16 2BMs and can accommodate MAC operations of 2-bit, 4-bit, and 8-bit DNN layers. </p></div>
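To see why the splice works, note that each unsigned 2-bit × 2-bit product is at most 3 × 3 = 9 and therefore fits in 4 bits, so the low and high sub-products occupy disjoint nibbles of the 8-bit result and can be concatenated with no carry between them. A minimal unsigned Python sketch (the function name `mul4_spliced` is ours):

```python
def mul4_spliced(a, b):
    """4x4 unsigned multiply via 2-bit pieces. The low product A[1:0]*B[1:0]
    and the high product A[3:2]*B[3:2] each fit in 4 bits (max 3*3 = 9), so
    they can be concatenated (bit-spliced) instead of shifted and added;
    only the two cross products still need an addition."""
    a_hi, a_lo = a >> 2, a & 0b11
    b_hi, b_lo = b >> 2, b & 0b11
    spliced = (a_hi * b_hi << 4) | (a_lo * b_lo)    # OR, not +: the nibbles never overlap
    cross   = (a_hi * b_lo + a_lo * b_hi) << 2
    return spliced + cross

# exhaustive check over all 4-bit unsigned operands
assert all(mul4_spliced(a, b) == a * b for a in range(16) for b in range(16))
```

The OR in `spliced` is the software analogue of the hardware bit-splicer: it replaces one adder-and-shifter pair, which is exactly the saving claimed in Table III.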
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure4. The architecture of the configurable MAC</head><p>The steps to implement the multiplication of two 8-bit signed numbers A and B in a configurable MAC are as follows:</p><p>Step1: Split the two operands A and B into four 2-bit numbers each: A[1:0] (denoted as A 0,1 ), A[3:2] (denoted as A 2,3 ), A[5:4] (denoted as A 4,5 ), A[7:6] (denoted as A 6,7 ), B[1:0] (denoted as B 0,1 ), B[3:2] (denoted as B 2,3 ), B[5:4] (denoted as B 4,5 ), and B[7:6] (denoted as B 6,7 ).</p><p>Step2: A 0,1 , A 2,3 , A 4,5 , and A 6,7 are broadcast to each 2BM of the four cells, respectively.</p><p>Step3: Each cell receives the complete bits of B and assigns them to its 2BMs accordingly.</p><p>Step4: The partial products obtained from the four cells are shifted accordingly and then added up to obtain the product of A and B.</p><p>4)Configuring the output bandwidth mode: For all configuration modes the input bandwidth is 8 bits, but the output bandwidth varies greatly between modes, as shown in Table <ref type="table" target="#tab_3">IV</ref>. To relieve the pressure on the output bandwidth, the output buses of the different modes are reused: the 8×8 and 4×4 modes reuse the bus of the 2×2 mode, so the overall output bandwidth is 64 bits, which reduces the area and power cost of the bandwidth.</p></div>
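The four steps above can be sketched in Python for the unsigned case (the hardware additionally handles signs through the 2BM's multiplexer-selected modes; `split2` and `mac8` are our illustrative names):

```python
def split2(x):
    """Step 1: split an 8-bit value into four 2-bit fields, LSB field first."""
    return [(x >> (2 * i)) & 0b11 for i in range(4)]

def mac8(a, b):
    """Steps 2-4, unsigned sketch: every 2-bit piece of A meets every 2-bit
    piece of B in one 2BM; each partial product is shifted by the sum of the
    two piece positions (2*(i+j) bits) and all sixteen are accumulated."""
    total = 0
    for i, ai in enumerate(split2(a)):
        for j, bj in enumerate(split2(b)):
            total += (ai * bj) << (2 * (i + j))     # one 2BM result, shifted
    return total
```

Running all sixteen 2-bit products and summing their shifted results reproduces the full 8×8 product, which is what the four-cell arrangement in Fig. 4 computes in parallel.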
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Approximate calculation unit</head><p>In this experiment, we implemented the model proposed above in hardware and analyzed the implementation in Design Compiler. The area of the whole design and the shares of the adder module, the 2BMs and the external configuration module are listed in Table <ref type="table" target="#tab_4">V</ref>; the adders account for the largest share of the area. Owing to the fault tolerance of CNNs, the adder can be approximated while retaining a certain accuracy. Approximate adders are mainly classified as the Accuracy Configurable Adder (ACA), Speculative Carry Select Adder (SCSA), Carry-Skip Adder (CSA), Error-Tolerant Adder (ETA) and Lower-part OR Adder (LOA). A comparison of these approximate adders is given in Table <ref type="table" target="#tab_5">VI</ref>. The LOA has the smallest area and power consumption because it computes the low bits entirely with OR gates, but it also has the highest error rate because accuracy is not considered there. Since the adders at different bit widths have different accuracy requirements, and the LOA's approximate width can be tuned per adder, we finally choose the LOA. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure5. LOA schematic</head><p>Adders of different bit widths require different approximate bit widths. We modeled each adder in Matlab, sampled at equal intervals, and tested its MRED using the Monte Carlo method. Because the MRED of a computational unit is usually required to be less than 5% <ref type="bibr" target="#b18">[18]</ref>, we finally set the approximate bit widths to 2/3/4/6 for adder bit widths of 6/8/10/16 bits. Three cases for different bit widths are listed in Table VII for comparison. After modeling the selected low approximation bits for the different bit-width adders, each adder was implemented in hardware and tested in Design Compiler; the final results are shown in Table <ref type="table" target="#tab_6">VII</ref>. We use case 2 as the final bit-width selection. As shown in Table <ref type="table" target="#tab_7">VIII</ref>, we count the number of adders, the area of a single accurate and approximate adder, and the total area for each fixed bit width. Comparing the total exact-adder area with the total approximate-adder area yields a gain of 39.41% in area, which is a large gain for the entire computational unit.</p></div>
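The LOA and the Monte Carlo MRED test can be modeled in a few lines of Python. This is a simplified LOA variant that drops the carry from the approximate lower part entirely (some LOA formulations additionally AND the top bits of the lower part to generate a carry-in); the function names are our own:

```python
import random

def loa_add(a, b, width, approx_bits):
    """Lower-part OR adder: the low `approx_bits` bits are simply OR-ed,
    and no carry is generated or propagated into the exact upper part."""
    mask = (1 << approx_bits) - 1
    low = (a | b) & mask                                    # approximate lower part
    high = ((a >> approx_bits) + (b >> approx_bits)) << approx_bits
    return (high | low) & ((1 << (width + 1)) - 1)          # keep the carry-out bit

def mred(width, approx_bits, trials=100_000, seed=0):
    """Monte Carlo estimate of the Mean Relative Error Distance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = rng.randrange(1 << width)
        b = rng.randrange(1 << width)
        exact = a + b
        if exact:
            total += abs(loa_add(a, b, width, approx_bits) - exact) / exact
    return total / trials
```

Sweeping `approx_bits` with `mred(...)` and keeping the largest value that stays under the 5% MRED budget mirrors the Matlab/Monte Carlo selection procedure described above.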
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation</head><p>Bit-Fusion <ref type="bibr" target="#b0">[1]</ref>, presented at ISCA in 2018, was selected for comparison. The MAC design of Bit-Fusion adds a sign bit to the original 2×2 multiplication unit, yielding a minimum calculation unit of 3×3. For the 4×4 and 8×8 cases, a shifter implements the positional shift of the partial products, whose results are then combined and added. The design of this paper optimizes the external configuration module and the multiplication in Bit-Fusion to reduce the area and power consumption of the computing unit. Bit-Fusion was originally realized in a 45 nm process; here, both the Bit-Fusion design and this design are implemented and compared under the same experimental conditions. The designs are tested in Design Compiler using the SMIC 40 nm process library.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Performance Analysis 1)Comparison of 2-bit minimum multiplication:</head><p>Bit-Fusion uses four full adders (FAs) and three half adders (HAs), while this design uses only one FA, four HAs and two data selectors. The benefits are considerable: this design reduces the area by 35.5% and the power consumption by 47.5% compared with the Bit-Fusion design.</p><p>2)Comparison of 8-bit multiplication: The 8-bit multiplier is compatible with sixteen multipliers with 2-bit inputs or four multipliers with 4-bit inputs. Thanks to the bit-selective design and the use of bit-splicing, this design reduces the area by 53.1% and the power consumption by 40% compared with the Bit-Fusion design.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure7. Comparison of area</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure8. Comparison of power consumption</head><p>3)Introduction of approximate adders: In this paper, we introduce approximate adders into precision-scaling schemes for the first time, choosing the LOA as the approximate adder. After the hardware implementation, we test and compare against the exact design in Design Compiler; the results are shown in Table <ref type="table">Ⅸ</ref>, where our design achieves gains in both area and power consumption.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Accuracy Analysis</head><p>Table <ref type="table" target="#tab_0">ⅩI</ref> shows the approximate bit width corresponding to each adder. After modeling the selected low approximation bits for the different bit-width input modes in MATLAB, the corresponding MRED is tested; the final results are shown in Table <ref type="table" target="#tab_1">ⅩII</ref>. Finally, we apply this configuration to LeNet and test the final accuracy. In each mode, 10000 images are used to test the final accuracy, as shown in Table ⅩII. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>Based on the fault tolerance of CNNs, an accuracy-configurable unit enables the circuit to accept a variety of network parameters by adding additional configuration units. In this paper, from the perspective of improving accelerator flexibility, a precision-scaling MAC is designed that adapts to multiple network structures while ensuring low power consumption. The hardware performance of the accelerator is improved, and the worst-case accuracy on LeNet remains above 98.49%. The precision-configurable cell implemented in the SMIC 40 nm process achieves an area reduction of more than 53.2% and a power reduction of more than 19.8% compared with Bit-Fusion <ref type="bibr" target="#b0">[1]</ref>.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure6.</head><label></label><figDesc>Figure6. Different bits of LOA and the corresponding MRED</figDesc><graphic coords="7,185.04,314.52,222.36,117.54" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>TABLE I .</head><label>I</label><figDesc>TOP-1 ACCURACY FOR VARIOUS COMPUTATIONAL CASES IN IMAGENET<ref type="bibr" target="#b5">[6]</ref> </figDesc><table><row><cell>Computational cases</cell><cell>Alex-Net</cell><cell>VGG16</cell><cell>Res18</cell><cell>Res34</cell><cell>Res50</cell></row><row><cell>8×8</cell><cell>54.5</cell><cell>71.1</cell><cell>69.6</cell><cell>73.6</cell><cell>76.2</cell></row><row><cell>8×4</cell><cell>54.2</cell><cell>70.1</cell><cell>70.1</cell><cell>73.1</cell><cell>74.7</cell></row><row><cell>8×2</cell><cell>50.2</cell><cell>N/A</cell><cell>67.6</cell><cell>71.5</cell><cell>72.8</cell></row><row><cell>4×4</cell><cell>54.4</cell><cell>70.5</cell><cell>67.0</cell><cell>N/A</cell><cell>73.8</cell></row><row><cell>4×2</cell><cell>50.5</cell><cell>N/A</cell><cell>N/A</cell><cell>N/A</cell><cell>N/A</cell></row><row><cell>2×2</cell><cell>51.3</cell><cell>69.1</cell><cell>67.0</cell><cell>N/A</cell><cell>74.2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>TABLE II .</head><label>II</label><figDesc>THROUGHPUT OF BITBLADE IN DIFFERENT COMPUTING SITUATIONS<ref type="bibr" target="#b5">[6]</ref> </figDesc><table><row><cell>Throughput</cell><cell>2×2</cell><cell>4×2</cell><cell>4×4</cell><cell>8×2</cell><cell>8×4</cell><cell>8×8</cell></row><row><cell>VGG16</cell><cell>4.33</cell><cell>2.54</cell><cell>1.38</cell><cell>1.38</cell><cell>0.71</cell><cell>0.36</cell></row><row><cell>ResNe-t152</cell><cell>3.67</cell><cell>2.31</cell><cell>1.30</cell><cell>1.30</cell><cell>0.69</cell><cell>0.35</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>TABLE III .</head><label>III</label><figDesc>COMPARISON BETWEEN BIT-FUSION AND BIT-SPLICING FOR MULTIPLICATION</figDesc><table><row><cell></cell><cell></cell><cell>Bit-Fusion</cell><cell></cell><cell></cell><cell>This design</cell><cell></cell></row><row><cell></cell><cell>2×2</cell><cell>4×4</cell><cell>8×8</cell><cell>2×2</cell><cell>4×4</cell><cell>8×8</cell></row><row><cell>Minimum multiplier</cell><cell cols="3">3-bit signed multiplier</cell><cell cols="3">2-bit multiplier based on multiplexer</cell></row><row><cell>Number of multipliers</cell><cell>1</cell><cell>4</cell><cell>16</cell><cell>1</cell><cell>4</cell><cell>16</cell></row><row><cell>Number of shifters</cell><cell>0</cell><cell>3</cell><cell>15</cell><cell>0</cell><cell>1</cell><cell>9</cell></row><row><cell>Number of adders</cell><cell>0</cell><cell>3</cell><cell>15</cell><cell>0</cell><cell>2</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>TABLE IV .</head><label>IV</label><figDesc>INPUT AND OUTPUT BANDWIDTH FOR DIFFERENT MODES</figDesc><table><row><cell>Input Mode</cell><cell>Input</cell><cell>Output</cell></row><row><cell>(bit×bit)</cell><cell>bandwidth(bit)</cell><cell>bandwidth(bit)</cell></row><row><cell>2×2</cell><cell></cell><cell>64</cell></row><row><cell>4×4</cell><cell>8</cell><cell>32</cell></row><row><cell>8×8</cell><cell></cell><cell>16</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>TABLE V .</head><label>V</label><figDesc>PERCENTAGE OF AREA OF EACH PART IN THE HARDWARE IMPLEMENTATION</figDesc><table><row><cell></cell><cell>this design</cell><cell>Adder</cell><cell>2BM</cell><cell>External configuration module</cell></row><row><cell>Area(um 2 )</cell><cell>1471.83</cell><cell>796.41</cell><cell>511.9</cell><cell>163.52</cell></row><row><cell>Percentage (%)</cell><cell>100</cell><cell>54.1</cell><cell>34.8</cell><cell>11.1</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>TABLE VI .</head><label>VI</label><figDesc>COMPARISON OF APPROXIMATE ADDERS<ref type="bibr" target="#b17">[17]</ref> </figDesc><table><row><cell>Types of adders</cell><cell>Area (um 2 )</cell><cell>Delay (ns)</cell><cell>Power (uW)</cell><cell>Error rate (%)</cell><cell>Mean Relative Error (%)</cell></row><row><cell>LOA</cell><cell>53.2</cell><cell>0.39</cell><cell>65.9</cell><cell>89.99</cell><cell>1.0</cell></row><row><cell>ETAII</cell><cell>71.6</cell><cell>0.55</cell><cell>80.6</cell><cell>5.85/16.94</cell><cell>2.6</cell></row><row><cell>ACA</cell><cell>73.8</cell><cell>0.25</cell><cell>118.4</cell><cell>16.66/16.34</cell><cell>18.9</cell></row><row><cell>SCSA</cell><cell>109.2</cell><cell>0.32</cell><cell>134.5</cell><cell>5.85</cell><cell>2.6</cell></row><row><cell>CSA</cell><cell>142.5</cell><cell>0.39</cell><cell>97.8</cell><cell>0.18/0.91</cell><cell>0.15</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>TABLE VII .</head><label>VII</label><figDesc>BIT-WIDTH SELECTION OF ADDERS AND OVERALL ACCURACY TESTING</figDesc><table><row><cell>Case</cell><cell>Bit-width</cell><cell>Accuracy of each adder</cell><cell>MRED of MAC</cell></row><row><cell></cell><cell>6</cell><cell>2</cell><cell></cell></row><row><cell>Case1</cell><cell>10</cell><cell>2</cell><cell>0.0025</cell></row><row><cell></cell><cell>16</cell><cell>4</cell><cell></cell></row><row><cell></cell><cell>6</cell><cell>2</cell><cell></cell></row><row><cell>Case2</cell><cell>10</cell><cell>4</cell><cell>0.0300</cell></row><row><cell></cell><cell>16</cell><cell>6</cell><cell></cell></row><row><cell></cell><cell>6</cell><cell>3</cell><cell></cell></row><row><cell>Case3</cell><cell>10</cell><cell>6</cell><cell>1.8396</cell></row><row><cell></cell><cell>16</cell><cell>8</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>TABLE VIII .</head><label>VIII</label><figDesc>COMPARISON OF NUMBER OF ADDERS AND AREA</figDesc><table><row><cell>Bit-width</cell><cell>Number</cell><cell>Area (um 2 )</cell><cell>Total area (um 2 )</cell><cell>Area of Approximation (um 2 )</cell><cell>Total area of Approximation (um 2 )</cell></row><row><cell>16</cell><cell>3</cell><cell>81.15</cell><cell>243.45</cell><cell>57.45</cell><cell>172.35</cell></row><row><cell>Bit-width</cell><cell>Number</cell><cell>Area (um 2 )</cell><cell>Total area (um 2 )</cell><cell>Area of Approximation (um 2 )</cell><cell>Total area of Approximation (um 2 )</cell></row><row><cell>6</cell><cell>6</cell><cell>30.88</cell><cell>185.28</cell><cell>23.46</cell><cell>140.76</cell></row><row><cell>8</cell><cell>4</cell><cell>40.93</cell><cell>163.72</cell><cell>29.45</cell><cell>117.80</cell></row><row><cell>10</cell><cell>4</cell><cell>50.99</cell><cell>203.96</cell><cell>35.43</cell><cell>141.72</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>TABLE IX .</head><label>IX</label><figDesc>AREA AND POWER CONSUMPTION IN PRECISE AND APPROXIMATE CASE 8</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>8-Bit (SMIC 40)</head><label></label><figDesc>The final approximate design is compared with the reference design Bit-Fusion, and the results are shown in Table X: the precision-configurable unit outperforms Bit-Fusion by 53.2% in area and 19.8% in power consumption.</figDesc><table><row><cell>8-Bit (SMIC 40)</cell><cell>Precise case</cell><cell>Approximate case</cell><cell>Comparison (%)</cell></row><row><cell>Area(um 2 )</cell><cell>1471.83</cell><cell>1247.99</cell><cell>17.9</cell></row><row><cell>Power(mW)</cell><cell>1.59</cell><cell>1.46</cell><cell>8.9</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>TABLE X .</head><label>X</label><figDesc>AREA AND POWER CONSUMPTION OF OUR DESIGN AND BIT-FUSION IN APPROXIMATE CASE</figDesc><table><row><cell>8-Bit (SMIC 40)</cell><cell>Bit-Fusion</cell><cell>this design</cell><cell>Comparison (%)</cell></row><row><cell>Area(um 2 )</cell><cell>1912.5</cell><cell>1247.99</cell><cell>53.2</cell></row><row><cell>Power(mW)</cell><cell>1.75</cell><cell>1.46</cell><cell>19.8</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_11"><head>TABLE XI .</head><label>XI</label><figDesc>BIT-WIDTH SELECTION OF ADDERS AND OVERALL ACCURACY TESTING</figDesc><table><row><cell>Bit-width</cell><cell>Number</cell><cell>Accuracy of each adder</cell><cell>MRED of MAC</cell></row><row><cell>6</cell><cell>6</cell><cell>2</cell><cell></cell></row><row><cell>8</cell><cell>4</cell><cell>3</cell><cell>0.030</cell></row><row><cell>10</cell><cell>4</cell><cell>4</cell><cell></cell></row><row><cell>16</cell><cell>3</cell><cell>6</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_12"><head>TABLE XII .</head><label>XII</label><figDesc>MRED AND RECOGNITION ACCURACY UNDER DIFFERENT INPUT MODES</figDesc><table><row><cell>Input Mode</cell><cell>MRED of MAC</cell><cell>Recognition Accuracy</cell></row><row><cell>2</cell><cell>0</cell><cell>98.49%</cell></row><row><cell>4</cell><cell>0.019</cell><cell>99.12%</cell></row><row><cell>8</cell><cell>0.030</cell><cell>99.20%</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks</title>
		<author>
			<persName><surname>Sharma</surname></persName>
		</author>
		<author>
			<persName><surname>Park</surname></persName>
		</author>
		<author>
			<persName><surname>Suda</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="764" to="775" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Learning structured sparsity in deep neural networks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. 30th Int. Conf. Neural Inf. Process. Syst. (NIPS)</title>
				<meeting>30th Int. Conf. Neural Inf. Process. Syst. (NIPS)</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="2082" to="2090" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">DNN dataflow choice is overrated</title>
		<author>
			<persName><forename type="first">X</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1809.04070</idno>
		<ptr target="https://arxiv.org/abs/1809.04070" />
		<imprint>
			<date type="published" when="2018-09">Sep. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Design of approximate circuits by fabrication of false timing paths: The carry cutback adder</title>
		<author>
			<persName><forename type="first">V</forename><surname>Camus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cacciotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schlachter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Enz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE J. Emerg. Sel. Topics Circuits Syst</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="746" to="757" />
			<date type="published" when="2018-12">Dec. 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Fixed Point Quantization of Deep Convolutional Networks</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Talathi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">S</forename><surname>Annapureddy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Conference on Machine Learning</title>
		<imprint>
			<biblScope unit="volume">48</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A Precision-Scalable Energy-Efficient Convolutional Neural Network Accelerator</title>
		<author>
			<persName><forename type="first">W J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z F</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Circuits and Systems I: Regular Papers</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="3484" to="3497" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference</title>
		<author>
			<persName><surname>Jacob</surname></persName>
		</author>
		<author>
			<persName><surname>Kligys</surname></persName>
		</author>
		<author>
			<persName><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2704" to="2713" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Quantized Convolutional Neural Networks for Mobile Devices</title>
		<author>
			<persName><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><surname>Leng</surname></persName>
		</author>
		<author>
			<persName><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4820" to="4828" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A Two&apos;s Complement Parallel Array Multiplication Algorithm</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Baugh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">A</forename><surname>Wooley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Computers</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="1045" to="1047" />
			<date type="published" when="1973">1973</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing</title>
		<author>
			<persName><surname>Albericio</surname></persName>
		</author>
		<author>
			<persName><surname>Judd</surname></persName>
		</author>
		<author>
			<persName><surname>Hetherington</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1" to="13" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks</title>
		<author>
			<persName><surname>Parashar</surname></persName>
		</author>
		<author>
			<persName><surname>Rhu</surname></persName>
		</author>
		<author>
			<persName><surname>Mukkara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">44th Annual International Symposium on Computer Architecture</title>
				<meeting><address><addrLine>ISCA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="27" to="40" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach</title>
		<author>
			<persName><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><surname>Du</surname></persName>
		</author>
		<author>
			<persName><surname>Guo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="15" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Cambricon-X: An accelerator for sparse neural networks</title>
		<author>
			<persName><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><surname>Du</surname></persName>
		</author>
		<author>
			<persName><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</title>
				<meeting>the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)</meeting>
		<imprint>
			<date type="published" when="2016-10-19">15-19 Oct. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">DaDianNao: A Neural Network Supercomputer</title>
		<author>
			<persName><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Computers</title>
		<imprint>
			<biblScope unit="volume">66</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="73" to="88" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks</title>
		<author>
			<persName><forename type="first">Y-H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Krishna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J S</forename><surname>Emer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Solid-State Circuits</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="127" to="138" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">BitBlade: Area and Energy-Efficient Precision-Scalable Neural Network Accelerator with Bitwise Summation</title>
		<author>
			<persName><surname>Ryu</surname></persName>
		</author>
		<author>
			<persName><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><surname>Yi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 56th ACM/IEEE Design Automation Conference (DAC)</title>
				<meeting>the 2019 56th ACM/IEEE Design Automation Conference (DAC)</meeting>
		<imprint>
			<date type="published" when="2019-06-06">2-6 June 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">A Review, Classification, and Comparative Evaluation of Approximate Arithmetic Circuits</title>
		<author>
			<persName><forename type="first">H</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Journal on Emerging Technologies in Computing Systems (JETC)</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="1" to="34" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A Hardware Efficient Approximate Shift Multiplier with High Accuracy</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<idno type="DOI">10.1109/ASICON52560.2021.9620363</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 14th International Conference on ASIC (ASICON)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
