Post-training quantization of neural network through correlation maximization

Maria Pushkareva
Center of Optical Neural Technologies, Scientific Research Institute of System Analysis RAS, Moscow, Russia
pushkareva.mariia@yandex.ru

Iakov Karandashev
Center of Optical Neural Technologies, Scientific Research Institute of System Analysis RAS, Moscow, Russia
karandashev@niisi.ras.ru

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract—In this paper, we propose a method for quantizing the weights of neural networks by maximizing the correlation between the initial and quantized weights, taking into account the distribution of the weight density in each layer. Quantization is performed after the neural network training, without further post-training. We tested the algorithm on the ImageNet dataset for the VGG-16, ResNet-50, and Xception neural networks [2]. For the ResNet-50 and Xception networks, 4-5 bits per weight of a layer are required to obtain acceptable Top-5 accuracy; for VGG-16, 3-4 bits per weight are sufficient.

Keywords—weights quantization, post-learning, linear quantization, exponential quantization

I. INTRODUCTION

The majority of the neural networks used for image recognition have many parameters that have to be stored. Consequently, a substantial memory capacity is necessary, and this requirement limits the applicability of such networks. For example, the storage requirements of the VGG-16 and ResNet152V2 networks are 528 MB [3] and 232 MB [4], respectively. Quantization and reduction of the number of weights are the basic approaches that allow us to decrease the memory needed to store the neural network weights.

Quantization is a reduction in the variety of distinct weight values in a layer. The most popular quantization methods are the use of fixed-point formats in place of floating-point formats [9], binarization [10], ternarization [11], the use of a logarithmic scale [12], and so on. The number of weights is usually reduced with such methods as pruning [13], weight sharing (including the convolution operation itself) [14], tensor decompositions [15], and so on. In the present paper, we explore the quantization of trained neural networks. The number B of bits per weight defines the number of distinct weight values, which is consequently equal to 2^B. We perform the quantization independently for each layer: in a given layer we split the whole range of weights, from the minimal to the maximal value, into 2^B intervals and then replace all the weights belonging to one interval by a single value. In what follows we examine the optimal choice of the interval boundaries as well as of the values with which the weights are replaced.

The authors of paper [7] discussed the optimal quantization problem for the Hopfield neural network. They showed that by maximizing the correlation between the initial and quantized values of the weights it is possible to minimize the errors of the quantized neural network. We believe that this result is correct, and in what follows we choose the interval boundaries and the quantized values of the weights inside the intervals proceeding from the maximal correlation principle.

Frequently, to obtain sufficient accuracy of the quantized neural network, one has to combine the quantization with post-training of the network, a procedure that requires substantial resources [5, 6]. In the present paper, we perform the quantization after the neural network training without any subsequent post-training, which allows us to reduce the quantization costs substantially.

II. ESTIMATE OF CORRELATION AND ITS GRADIENT

For each layer, let us quantize the weights inside the interval [w_min, w_max], where w_min and w_max are the minimal and maximal weight values in the given layer, respectively. (We normalize all the weights, so that w → (w − w̄)/σ_w.) Let B be the number of bits used to store the weights of one layer. Then n = 2^B is the number of quantized weight values, and we denote by x_i the boundaries of the intervals on which the quantized weights are constant. Consequently,

    w_{\min} = x_0 < x_1 < \ldots < x_{n-1} < x_n = w_{\max}.    (1)

Let w be the input weights and y_i the quantized value inside the interval (x_i, x_{i+1}). It is not evident how we should choose the interval boundaries x_i or the quantized values y_i inside each interval.
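For illustration, here is a minimal Python sketch of this setup: it normalizes the weights of one layer and builds the simplest admissible partition of Eq. (1), a uniform grid of n + 1 boundaries. The function name linear_partition and the toy Gaussian "layer" are assumptions made for the example; the actual choice of the boundaries is derived in the following sections.

import numpy as np

def linear_partition(weights, bits):
    """Normalize a layer's weights and build the uniform boundary set of Eq. (1)."""
    w = weights.flatten()
    w = (w - w.mean()) / w.std()        # per-layer normalization assumed in Section II
    n = 2 ** bits                       # number of quantization intervals
    # n + 1 boundaries: w_min = x_0 < x_1 < ... < x_n = w_max
    x = np.linspace(w.min(), w.max(), n + 1)
    return w, x

# toy "layer" of Gaussian weights quantized with B = 3 bits
rng = np.random.default_rng(0)
w, x = linear_partition(rng.normal(size=10_000), bits=3)
print(len(x))                           # 9 boundaries, i.e. 8 intervals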
In the present paper, we suppose that the stronger the correlation between the input and the quantized values, the smaller the error of the quantized neural network compared with the initial one:

    \rho(w, y) = \frac{\overline{wy}}{\sigma_w \sigma_y} \to \max,    (2)

where \overline{wy} is the covariance between the initial and the quantized values, and \sigma_w and \sigma_y are the standard deviations of the input weights and of their quantized values, respectively. For simplicity, we suppose that inside a layer the weight distribution is symmetric and the average weight value is equal to zero. As we show below, this assumption almost always holds for large deep neural networks.

We can estimate the covariance \overline{wy} between the input and the quantized values as

    \overline{wy} = \int_{-\infty}^{+\infty} w\, y(w)\, p(w)\, dw,    (3)

where p(w) is the density of the weight distribution inside the layer. Since y is constant inside each interval (x_i, x_{i+1}), the last equation takes the form

    \overline{wy} = \sum_{i=0}^{n-1} y_i \int_{x_i}^{x_{i+1}} w\, p(w)\, dw.    (4)

Similarly, the variance of the quantized values is

    \sigma_y^2 = \int_{-\infty}^{+\infty} y^2(w)\, p(w)\, dw = \sum_{i=0}^{n-1} y_i^2 \int_{x_i}^{x_{i+1}} p(w)\, dw.    (5)

If we introduce the additional notations

    c_i = \int_{x_i}^{x_{i+1}} w\, p(w)\, dw \quad \text{and} \quad p_i = \int_{x_i}^{x_{i+1}} p(w)\, dw,    (6)

we can simplify Eqs. (4)-(5) significantly:

    \overline{wy} = \sum_{i=0}^{n-1} y_i c_i, \qquad \sigma_y^2 = \sum_{i=0}^{n-1} y_i^2 p_i.    (7)

To maximize the correlation (2) we have two sets of parameters: the boundaries of the intervals x_i and the quantized values y_i inside the intervals. Let us for the moment suppose that the boundaries x_i are known. Then the optimization with respect to y_i gives

    y_i = c_i / p_i.    (8)
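The quantities of Eqs. (6)-(8) and the correlation (2) are easy to evaluate numerically for a fixed boundary set by replacing the integrals with averages over the layer weights. A minimal sketch follows; the helper name correlation_for_boundaries, the toy Gaussian weights, and the 3-bit uniform test partition are assumptions made for the example.

import numpy as np

def correlation_for_boundaries(w, x):
    """Empirical c_i, p_i (Eq. 6), optimal y_i = c_i / p_i (Eq. 8) and rho (Eq. 2)."""
    # index of the interval (x_i, x_{i+1}] each weight falls into
    idx = np.clip(np.searchsorted(x, w) - 1, 0, len(x) - 2)
    n = len(x) - 1
    p = np.array([np.mean(idx == i) for i in range(n)])                    # p_i
    c = np.array([np.mean(np.where(idx == i, w, 0.0)) for i in range(n)])  # c_i
    levels = np.where(p > 0, c / np.maximum(p, 1e-12), 0.0)                # y_i, Eq. (8)
    y = levels[idx]                            # quantized copy of the weights
    rho = np.corrcoef(w, y)[0, 1]              # correlation of Eq. (2)
    return c, p, levels, rho

# toy layer: 10,000 Gaussian weights, 3-bit uniform partition
rng = np.random.default_rng(0)
w = rng.normal(size=10_000)
x = np.linspace(w.min(), w.max(), 2 ** 3 + 1)
_, _, levels, rho = correlation_for_boundaries(w, x)
print(round(rho, 4))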
When we substitute the optimal value (8) into Eq. (2), take into account formulas (7), and recall that \sigma_w = const (it depends on neither x_i nor y_i), we obtain the following optimization problem:

    \rho(w, y) = \frac{1}{\sigma_w} \sqrt{\sum_{i=0}^{n-1} \frac{c_i^2}{p_i}} \to \max.    (9)

We differentiate this expression with respect to x_i and obtain an expression for the gradient \nabla_i \rho:

    \nabla_i \rho = \frac{\partial \rho}{\partial x_i} = \frac{p(x_i)\,(y_i - y_{i-1})\,(y_i + y_{i-1} - 2 x_i)}{2\, \sigma_w \sigma_y}.    (10)

III. DESCRIPTION OF QUANTIZATION PROCEDURE

The obtained expression (10) for the gradient \nabla_i \rho allowed us to implement a fast correlation-maximization algorithm based on gradient ascent. In the course of its run, the algorithm adjusts the boundaries of the intervals x_i, which we use as the optimization parameters. For the given values of x_i, the quantized weights y_i are defined by Eq. (8).

As the density function p(w) we use its kernel estimate calculated on 10,000 weights drawn at random from the layer; when the layer contains fewer than 10,000 weights, all of them are taken into account. We replace the integral formulas of Eq. (6) by numerical estimates of c_i and p_i calculated over the real weights of the layer:

    c_i = \frac{1}{N_w} \sum_{w} w\, I(x_i < w \le x_{i+1}), \qquad p_i = \frac{1}{N_w} \sum_{w} I(x_i < w \le x_{i+1}).    (11)

Here N_w is the number of weights in the layer, and I(x_i < w \le x_{i+1}) is an indicator function that selects only the weights belonging to the interval (x_i, x_{i+1}].

To initialize the gradient ascent, it is necessary to choose an initial partition, that is, an initial set [w_min, x_1, x_2, ..., x_{n-1}, w_max]. In our simulations, we used the linear and exponential partitions from [8] as the initial sets and examined quantization of the pre-trained ResNet-50, Xception, and VGG-16 neural networks. We employed the Python programming language and the Keras framework [2].

As a result of the optimization, we obtained the optimal interval boundaries [w_min, x_1, x_2, ..., x_{n-1}, w_max] as well as the corresponding set of quantized weights [y_0, y_1, ..., y_{n-1}]. Then we used the quantized weights in place of the input weights without post-training of the neural networks. The Python code is given in Appendix B.
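A single boundary update of this gradient ascent can be sketched as follows. The histogram-based estimates play the role of Eq. (11), and the line marked below implements the gradient of Eq. (10) up to the constant factor 1/(σ_w σ_y). The step size lr, the re-sorting of the boundaries, and the number of iterations are assumptions of this sketch; Appendix B lists the routine actually used (it delegates the iteration to scipy.optimize.minimize).

import numpy as np
from scipy.stats import gaussian_kde

def ascent_step(w, x, kde, lr=1.0):
    """One gradient-ascent update of the interior boundaries x_1, ..., x_{n-1}.

    w   -- flat array of the layer's weights
    x   -- full boundary set [w_min, x_1, ..., x_{n-1}, w_max] of Eq. (1)
    kde -- gaussian_kde fitted on a sample of w (the estimate of p(w))
    lr  -- step size of the ascent (an assumed value)
    """
    counts, _ = np.histogram(w, bins=x)
    sums, _ = np.histogram(w, bins=x, weights=w)
    p = counts / len(w)                                            # p_i of Eq. (11)
    c = sums / len(w)                                              # c_i of Eq. (11)
    y = np.where(p > 0, c / np.maximum(p, 1e-12), 0.0)             # y_i of Eq. (8)
    px = kde.evaluate(x[1:-1])                                     # density p(x_i) at the boundaries
    # gradient of Eq. (10) up to the constant factor 1 / (sigma_w * sigma_y)
    g = px * (y[1:] - y[:-1]) * (y[1:] + y[:-1] - 2.0 * x[1:-1]) / 2.0
    x_new = x.copy()
    x_new[1:-1] = np.sort(x[1:-1] + lr * g)                        # keep the ordering of Eq. (1)
    return x_new

# illustrative run on a toy layer with a 4-bit linear initial partition
rng = np.random.default_rng(0)
w = rng.normal(size=20_000)
x = np.linspace(w.min(), w.max(), 2 ** 4 + 1)
kde = gaussian_kde(rng.choice(w, size=10_000, replace=False))
for _ in range(100):
    x = ascent_step(w, x, kde)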
IV. RESULTS

In Fig. 1, we show weight histograms for convolution and fully connected layers of the VGG-16 and ResNet-50 neural networks. As we mentioned before, the weight distributions in the layers of deep neural networks are nearly symmetric with respect to zero, and their average values are close to zero.

Fig. 1. Examples of weight histograms in convolution and fully connected layers for VGG-16 and ResNet-50.

In Tables I-III, we present the correlations between the prior weights and the weights after quantization, averaged over all layers, for the ResNet-50, Xception, and VGG-16 neural networks. We applied the linear and the exponential quantization, ran the correlation-maximization algorithm on top of each, and then calculated the average correlations.

TABLE I. CORRELATIONS AVERAGED OVER ALL LAYERS FOR THE RESNET-50 NEURAL NETWORK AFTER LINEAR (LIN) AND EXPONENTIAL (EXP) QUANTIZATION AND SUBSEQUENT CORRELATION MAXIMIZATION (MAX)

Average correlation for ResNet-50
         Lin      Lin+Max   Exp      Exp+Max
3 bit    0.8169   0.9376    0.9461   0.9659
4 bit    0.8473   0.9726    0.9822   0.9898
5 bit    0.9233   0.9890    0.9945   0.9972
6 bit    0.9739   0.9957    0.9983   0.9990
7 bit    0.9925   0.9980    0.9995   0.9993

TABLE II. CORRELATIONS AVERAGED OVER ALL LAYERS FOR THE XCEPTION NEURAL NETWORK AFTER LINEAR (LIN) AND EXPONENTIAL (EXP) QUANTIZATION AND SUBSEQUENT CORRELATION MAXIMIZATION (MAX)

Average correlation for Xception
         Lin      Lin+Max   Exp      Exp+Max
3 bit    0.8630   0.9552    0.9566   0.9735
4 bit    0.9072   0.9814    0.9855   0.9919
5 bit    0.9611   0.9926    0.9955   0.9972
6 bit    0.9880   0.9960    0.9986   0.9984
7 bit    0.9967   0.9972    0.9996   0.9986

TABLE III. CORRELATIONS AVERAGED OVER ALL LAYERS FOR THE VGG-16 NEURAL NETWORK AFTER LINEAR (LIN) AND EXPONENTIAL (EXP) QUANTIZATION AND SUBSEQUENT CORRELATION MAXIMIZATION (MAX)

Average correlation for VGG-16
         Lin      Lin+Max   Exp      Exp+Max
3 bit    0.8325   0.9342    0.9464   0.9669
4 bit    0.8469   0.9659    0.9816   0.9899
5 bit    0.8968   0.9862    0.9943   0.9973
6 bit    0.9579   0.9943    0.9983   0.9992
7 bit    0.9872   0.9980    0.9995   0.9995

Tables I-III show that when the number of bits is small (up to 5 bits, i.e. when the number of intervals is at most 32), the maximization algorithm does increase the correlation averaged over the layers. Linear quantization with subsequent maximization leads to growth of the average correlation in all the examined cases. Exponential quantization with subsequent maximization increases the average correlation only when the number of intervals is equal to 8, 16, or 32. When the number of bits is larger (that is, when the number of intervals is 64 or 128), the maximization algorithm sometimes fails: for some layers its results are worse than those of the exponential quantization.

It may be that the maximization algorithm does not always run correctly because the gradient-ascent step size lr has to be chosen more carefully. For this reason, for each layer we also used an algorithm that selects the quantization with the maximal correlation: when the correlation decreased after our optimization, we kept the initial exponential quantization. We call this way of quantizing a neural network the "best" algorithm.
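The per-layer selection rule of this "best" algorithm reduces to a few lines. In the sketch below, the callables optimize and correlation are placeholders standing for the correlation-maximization routine of Appendix B and for Eq. (2) evaluated on one layer; the function name is an assumption made for the example.

def best_quantization(w, x_exp, optimize, correlation):
    """Keep the optimized partition only if it really increases the correlation
    of Eq. (2); otherwise fall back to the initial exponential partition.

    w           -- flat array of one layer's weights
    x_exp       -- initial (exponential) boundary set for this layer
    optimize    -- callable x -> x_opt, the correlation-maximization routine
    correlation -- callable (w, x) -> rho, Eq. (2) for a given partition
    """
    x_opt = optimize(x_exp)
    failed = correlation(w, x_opt) < correlation(w, x_exp)
    # layers with failed == True contribute to the %fail column of Tables IVa-c
    return (x_exp if failed else x_opt), failed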
In Tables IVa-c, we give the Top-5 accuracies of the ResNet-50, VGG-16, and Xception neural networks quantized using the exponential scale (exp), by the correlation-maximization algorithm with the prior exponential splitting (opt), and with the aid of the algorithm that keeps the exponential quantization in the layers where the correlation maximization failed (max). The column %fail shows the percentage of layers for which ρ_max corr < ρ_exp.

TABLE IVa. TOP-5 ACCURACY FOR THE RESNET-50 NEURAL NETWORK; EXP DENOTES EXPONENTIAL QUANTIZATION, OPT STANDS FOR OPTIMAL QUANTIZATION, MAX DENOTES THE BEST OF THE TWO WAYS OF QUANTIZATION, AND %FAIL IS THE FRACTION OF LAYERS WHERE OPTIMIZATION LEADS TO A WORSE CORRELATION THAN EXPONENTIAL QUANTIZATION

ResNet-50
         %fail    exp     opt     max
3 bit    0%       0.03    0.02    0
4 bit    0%       0.71    0.82    0.79
5 bit    0%       0.91    0.93    0.93
6 bit    7%       0.94    0.94    0.94
7 bit    26%      0.94    0.94    0.94
32 bit (full precision)           0.94

TABLE IVb. TOP-5 ACCURACY FOR THE VGG-16 NEURAL NETWORK; NOTATION AS IN TABLE IVa

VGG-16
         %fail    exp     opt     max
3 bit    0%       0.69    0.73    0.76
4 bit    0%       0.87    0.91    0.91
5 bit    0%       0.89    0.9     0.93
6 bit    0%       0.9     0.93    0.93
7 bit    6%       0.9     0.94    0.94
32 bit (full precision)           0.94

TABLE IVc. TOP-5 ACCURACY FOR THE XCEPTION NEURAL NETWORK; NOTATION AS IN TABLE IVa

Xception
         %fail    exp     opt     max
3 bit    0%       0       0.02    0.01
4 bit    0%       0.43    0.65    0.61
5 bit    10%      0.89    0.86    0.89
6 bit    15%      0.89    0.9     0.9
7 bit    34%      0.92    0.86    0.92
32 bit (full precision)           0.92

The examined neural networks confirmed our hypothesis that the larger the correlation between the input and the quantized values, the better the accuracy of the quantized neural network. This holds, however, only if the correlation increases on every layer of the network: an increase of the correlation averaged over all layers does not by itself guarantee an increase of the accuracy.

In Tables IVa-c, we also compare the Top-5 accuracies of the networks quantized with the aid of our best algorithm (max) with the Top-5 accuracies of the initial networks (32 bit). For ResNet-50 and Xception, the Top-5 accuracy drop is 20-30% when 4 bits per weight are reserved and less than 3.5% when 5 bits are reserved. In the case of VGG-16, the Top-5 accuracy drop is about 20% with 3 bits per weight and less than 3% when more bits are reserved.

In Figs. 2 and 3, we show, for the Xception, ResNet-50, and VGG-16 neural networks, the dependence of the Top-5 accuracy on the number of bits reserved for the storage of each weight, for the linear and the exponential initial splitting, respectively. The model accuracy increases when we choose the quantization with the maximal correlation between the quantized and the input weights. The corresponding Top-1 accuracies are given in Appendix A.

Fig. 2. Top-5 accuracies for ResNet-50, Xception, and VGG-16: linear quantization and quantization with correlation maximization with prior linear splitting (max corr).

Fig. 3. Top-5 accuracies for ResNet-50, Xception, and VGG-16: exponential quantization and quantization with correlation maximization with prior exponential splitting (max corr).

V. CONCLUSIONS

We developed an algorithm for neural network quantization based on the maximization of the correlation between the quantized and the input weights. When this algorithm is used, no post-training of the neural network is necessary. Reserving 5 bits per weight in each layer, we quantized the VGG-16 neural network with only a 1% drop of the Top-5 accuracy. Under such compression, the memory required to store the weights is approximately 6 times smaller than with full-precision floats (32 bits). For comparison, in paper [5] the VGG-16 neural network was compressed about 2.5 times by quantization (the full compression of their algorithm is around 49 times); however, their neural network required post-training, for which substantial computing power was necessary. Comparing with the results of paper [8], we see that for the ResNet-50, VGG-16, and Xception neural networks our Top-1 and Top-5 accuracies are better at the same compression. Our approach requires only 3-4 bits per weight to achieve a Top-5 accuracy above 0.6 for different architectures without re-training.
APPENDIX A: TOP-1 ACCURACIES

Fig. 4. Top-1 accuracies for ResNet-50, Xception, and VGG-16: linear/exponential (lin/exp) quantization and quantization with correlation maximization with prior linear/exponential splitting (max corr). [Six panels: ResNet-50, Xception, and VGG-16 under linear and exponential initial splitting; y-axis: Top-1 accuracy; x-axis: 3-7 bits.]

ACKNOWLEDGMENT

The work was financially supported by the State Program of SRISA RAS No. 0065-2019-0003 (AAA-A19-119011590090-2).

REFERENCES

[1] ImageNet – huge image dataset [Online]. URL: http://www.image-net.org.
[2] Models for image classification with weights trained on ImageNet [Online]. URL: https://keras.io/applications/.
[3] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv preprint arXiv:1409.1556.
[4] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," arXiv preprint arXiv:1512.03385.
[5] S. Han, H. Mao and W.J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv preprint arXiv:1510.00149, 2015.
[6] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160.
[7] B.V. Kryzhanovsky, M.V. Kryzhanovsky and M.Yu. Malsagov, "Discretization of a matrix in quadratic functional binary optimization," Doklady Mathematics, vol. 83, pp. 413-417, 2011. DOI: 10.1134/S1064562411030197.
[8] M.Yu. Malsagov, E.M. Khayrov, M.M. Pushkareva and I.M. Karandashev, "Exponential discretization of weights of neural network connections in pre-trained neural networks," preprint, 2020.
[9] M. Courbariaux, Y. Bengio and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv preprint arXiv:1412.7024.
[10] M. Courbariaux, Y. Bengio and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," Conference on Neural Information Processing Systems, arXiv preprint arXiv:1511.00363.
[11] Z. Lin, M. Courbariaux, R. Memisevic and Y. Bengio, "Neural networks with few multiplications," Proceedings of the International Conference on Learning Representations, arXiv preprint arXiv:1510.03009.
[12] E.H. Lee, D. Miyashita, E. Chai, B. Murmann and S.S. Wong, "LogNet: Energy-efficient neural networks using logarithmic computation," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.
[13] S. Han, J. Pool, J. Tran and W. Dally, "Learning both Weights and Connections for Efficient Neural Networks," arXiv preprint arXiv:1506.02626, 2015.
[14] W. Chen, J. Wilson, S. Tyree and K. Weinberger, "Compressing Neural Networks with the Hashing Trick," arXiv preprint arXiv:1504.04788, 2015.
[15] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets and V. Lempitsky, "Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition," 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, Conference Track Proceedings, 2015.
APPENDIX B: ALGORITHM

# accessory functions
import numpy as np
from scipy.optimize import minimize


def f(x, func, kde, X, x_min, x_max):
    # objective: the (scaled) correlation for the current boundary set x
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    return cov


def grad(x, func, kde, X, x_min, x_max, alpha=10):
    # gradient of the objective with respect to the interior boundaries x, Eq. (10)
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    step = alpha * px * (y[1:] - y[:-1]) * (y[1:] + y[:-1] - 2 * x) / 2
    return step


def cov_kde(x0, kde, X, x_min, x_max):
    '''
    Calculate the density values, the quantized values and the objective on the set x0.
    X            -- weights in this layer
    x_min, x_max -- minimal and maximal weight values in the layer
    x0           -- current set of interior boundaries (variable values only)
    '''
    p = np.zeros(len(x0) + 1)   # number of weights in each interval (p_i * N_w)
    C = np.zeros(len(x0) + 1)   # sum of the weights in each interval (c_i * N_w)
    x_ext = sorted(np.append(x0, [x_min, x_max]))
    for i in range(len(x_ext) - 1):
        mask = np.logical_and(x_ext[i] < X, X <= x_ext[i + 1])
        p[i] = len(X[mask])
        C[i] = np.sum(X[mask])
        if p[i] == 0:           # empty interval: avoid division by zero
            C[i] = 0
            p[i] = 1
    y = C / p                                  # quantized values y_i = c_i / p_i, Eq. (8)
    px = kde.evaluate(x0)                      # kernel density estimate p(x_i) at the boundaries
    cov = np.linalg.norm(C / np.sqrt(p))       # sqrt(sum_i C_i^2 / p_i), proportional to rho of Eq. (9)
    return y, px, cov, p


def results(kde, w, x0, x_min, x_max, func, bits, kde_std, ans_case='CG'):
    '''
    Correlation-maximization procedure for the initial set x0 (variable values only).
    w            -- layer weights
    kde          -- kernel density estimation on a random sample of the weights
    x_min, x_max -- minimal and maximal weight values
    (kde_std is kept in the signature for compatibility and is not used here)
    '''
    n_d = 2 ** bits
    alpha = 10                  # scaling of the gradient
    tol_curr = 1e-4
    fx = lambda x: -f(x, func, kde, w, x_min, x_max)
    gradx = lambda x: -grad(x, func, kde, w, x_min, x_max, alpha)
    ans = minimize(fun=fx, x0=x0, jac=gradx, method=ans_case, tol=tol_curr)
    solutions = ans['x']
    correlations = -ans['fun']
    gradients = np.linalg.norm(gradx(ans['x'])) / alpha / n_d
    return solutions, correlations, gradients
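A possible per-layer driver for these routines is sketched below. Such driver code is not reproduced in the paper, so the sample size, the uniform initial partition, and the final mapping of the weights to their quantized values are assumptions of this sketch (in practice the weight array would come from a pre-trained Keras layer, e.g. via model.get_weights()); it assumes the definitions of Appendix B are in scope.

import numpy as np
from scipy.stats import gaussian_kde

# toy "layer"; a real run would take w from a pre-trained Keras model
rng = np.random.default_rng(0)
w = rng.normal(size=50_000)

bits = 4
w_min, w_max = w.min(), w.max()
sample = rng.choice(w, size=min(len(w), 10_000), replace=False)
kde = gaussian_kde(sample)                                  # estimate of p(w)

# initial partition: here a simple uniform grid of interior boundaries
x0 = np.linspace(w_min, w_max, 2 ** bits + 1)[1:-1]

boundaries, score, grad_norm = results(kde, w, x0, w_min, w_max,
                                       cov_kde, bits, kde_std=w.std())
# score is the objective sqrt(sum_i c_i^2 / p_i), proportional to rho of Eq. (9)

# quantize: replace every weight by the value y_i of its interval, Eq. (8)
y, _, _, _ = cov_kde(boundaries, kde, w, w_min, w_max)
edges = np.sort(np.concatenate(([w_min], boundaries, [w_max])))
idx = np.clip(np.searchsorted(edges, w) - 1, 0, len(y) - 1)
w_quantized = y[idx]
print(round(float(score), 4), round(float(np.corrcoef(w, w_quantized)[0, 1]), 4))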