<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Post-training quantization of neural network through correlation maximization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Pushkareva</string-name>
          <email>pushkareva.mariia@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iakov Karandashev</string-name>
          <email>karandashev@niisi.ras.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center of Optical Neural Technologies, Scientific Research Institute of System Analysis RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>115</fpage>
      <lpage>120</lpage>
      <abstract>
        <p>In this paper, we propose a method for quantizing the weights of neural networks by maximizing the correlations between the initial and quantized weights, taking into account the distribution of the weight density in each layer. Quantization is performed after the neural network training without further post-training. We tested the algorithm using the ImageNet dataset for the VGG-16, ResNet-50, and Xception neural networks [2]. In the case of the ResNet-50 and Xception neural networks, 4-5 bits of memory are required for the weights of a single layer to obtain acceptable Top-5 accuracy; for VGG-16, 3-4 bits are sufficient to store the weights of a single layer.</p>
      </abstract>
      <kwd-group>
        <kwd>weights quantization</kwd>
        <kwd>post-learning</kwd>
        <kwd>linear quantization</kwd>
        <kwd>exponential quantization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The majority of the neural networks used for solving image recognition problems have many parameters that have to be stored. Consequently, a substantial memory capacity is necessary, and this requirement limits the applicability of such neural networks. For example, the storage sizes of the VGG-16 and ResNet152V2 neural networks are 528 [3] and 232 [4] MB, respectively. Quantization and reduction of the number of weights are the basic approaches that allow us to decrease the memory size necessary to store the neural network weights.</p>
      <p>Quantization is a reduction in the number of distinct weight values in a layer. The most popular quantization methods are the use of fixed-point formats in place of floating-point formats [9], binarization [10], ternarization [11], the use of a logarithmic scale [12], and so on. The number of weights is usually reduced with the aid of such methods as pruning algorithms [13], weight sharing (including the application of the convolution operation) [14], tensor expansions [15], and so on.</p>
      <p>In the present paper, we explore the quantization of trained neural networks. The number $B$ of bits per weight defines the number of different weight values, which is consequently equal to $2^B$. We perform the quantization independently for each layer. In a given layer, we split the whole range of weights from the minimal to the maximal value into $2^B$ intervals, and then replace the weights belonging to one interval by a single value. In what follows, we examine the question of the optimal choice of the interval boundaries as well as of the values with which we replace the weights.</p>
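      <p>To make this mapping concrete, the following minimal sketch (our own illustration, not the code used in the experiments; the weight, boundary, and value arrays are hypothetical) replaces every weight of a layer by the representative value of the interval it falls into.</p>
      <preformat>
import numpy as np


def quantize_layer(w, boundaries, values):
    """Replace each weight by the representative value of its interval.

    boundaries: x_0 &lt; x_1 &lt; ... &lt; x_n (n = 2**B intervals),
    values:     y_0, ..., y_{n-1}, one value per interval.
    """
    # np.digitize with the inner boundaries returns the interval index of each weight
    idx = np.digitize(w, boundaries[1:-1])
    return values[idx]


# toy example with B = 2 bits (4 intervals)
w = np.array([-0.9, -0.2, 0.05, 0.4, 0.8])
boundaries = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
values = np.array([-0.75, -0.25, 0.25, 0.75])
print(quantize_layer(w, boundaries, values))  # [-0.75 -0.25  0.25  0.25  0.75]
      </preformat>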
      <p>The authors of paper [7] discussed the optimal quantization problem for the Hopfield neural network. They showed that by maximizing the correlation between the initial and quantized values of the weights it is possible to minimize the errors of the quantized neural network. We believe that this result holds in our case as well, and in what follows we choose the interval boundaries and the quantized values of the weights inside the intervals according to the maximal correlation principle.</p>
      <p>Frequently, to obtain sufficient accuracy of the quantized neural network, one has to combine quantization with post-training of the network, and this procedure requires substantial resources [5, 6]. In the present paper, we perform the quantization after the neural network training without any subsequent post-training, which allows us to reduce the quantization costs substantially.</p>
    </sec>
    <sec id="sec-4">
      <title>ESTIMATE OF CORRELATION AND ITS GRADIENT</title>
      <p>For each layer, let us quantize the weights inside the interval $[w_{\min}, w_{\max}]$, where $w_{\min}$ and $w_{\max}$ are the minimal and maximal values of the weights in the given layer, respectively. (We normalize all the weights so that $w \to (w - \bar{w})/\sigma_w$.) Let $B$ be the number of bits necessary to store the weights of one layer. Then $n = 2^B$ is the number of quantized weight values, and by $x_i$ we denote the boundaries of the intervals inside which the quantized value is constant. Consequently,</p>
      <p>$$w_{\min} = x_0 \le x_1 \le \dots \le x_{n-1} \le x_n = w_{\max}. \qquad (1)$$</p>
      <p>Let $w$ be the input weights, and $y_i$ the quantized value inside the interval $(x_i, x_{i+1})$. It is not evident how we have to choose the interval boundaries $x_i$ as well as the values $y_i$ inside each interval. In the present paper, we suppose that the stronger the correlation between the input and the quantized values, the smaller the error of the quantized neural network compared with the initial neural network:</p>
      <p>$$\rho(w, y) = \frac{\langle w y \rangle}{\sigma_w \sigma_y} \to \max, \qquad (2)$$</p>
      <p>where $\langle w y \rangle$ is the covariance between the initial and the quantized values, and $\sigma_w$ and $\sigma_y$ are the standard deviations of the input weights and their quantized values, respectively. For simplicity, we suppose that inside a layer the distribution of the weights is symmetric and the average value of the weights is equal to zero. As we show in what follows, this assumption nearly always holds in the case of large deep neural networks.</p>
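      <p>As a simple illustration of the criterion in Eq. (2), the sketch below (our own example; the zero-mean assumption of the text is taken for granted, so the covariance reduces to the mean product) computes the sample correlation between the original and quantized weights of a layer.</p>
      <preformat>
import numpy as np


def correlation(w, y):
    """Sample correlation rho(w, y) from Eq. (2).

    The weights are assumed to be centered (zero mean), as discussed in
    the text, so cov(w, y) is estimated by the mean of the product w*y.
    """
    cov = np.mean(w * y)
    return cov / (np.std(w) * np.std(y))
      </preformat>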
      <p>We can estimate the covariance $\langle w y \rangle$ between the input and the quantized values as</p>
      <p>$$\langle w y \rangle = \int_{-\infty}^{\infty} w\, y(w)\, p(w)\, dw, \qquad (3)$$</p>
      <p>where $p(w)$ is the density of the weight distribution inside the layer. Since $y$ is constant and equal to $y_i$ inside the interval $(x_i, x_{i+1})$, the last equation takes the form</p>
      <p>$$\langle w y \rangle = \sum_{i=0}^{n-1} y_i \int_{x_i}^{x_{i+1}} w\, p(w)\, dw. \qquad (4)$$</p>
      <p>Similarly, the variance of the quantized values is</p>
      <p>$$\sigma_y^2 = \int_{-\infty}^{\infty} y^2(w)\, p(w)\, dw = \sum_{i=0}^{n-1} y_i^2 \int_{x_i}^{x_{i+1}} p(w)\, dw. \qquad (5)$$</p>
      <p>If we introduce the additional notations</p>
      <p>$$c_i = \int_{x_i}^{x_{i+1}} w\, p(w)\, dw \quad \text{and} \quad p_i = \int_{x_i}^{x_{i+1}} p(w)\, dw, \qquad (6)$$</p>
      <p>we can simplify Eqs. (4)-(5) significantly:</p>
      <p>$$\langle w y \rangle = \sum_{i=0}^{n-1} y_i c_i, \qquad \sigma_y^2 = \sum_{i=0}^{n-1} y_i^2 p_i. \qquad (7)$$</p>
      <p>To maximize the correlation (see Eq. (2)) we have two sets of parameters: the boundaries of the intervals $x_i$ and the quantized values $y_i$ inside the intervals. For the moment, let us forget about the splitting into intervals and suppose that the boundaries $x_i$ are known. Then the optimization with respect to $y_i$ yields:</p>
      <p>$$y_i = c_i / p_i. \qquad (8)$$</p>
      <p>When we substitute this value into Eq. (2), take into account formulas (4)-(5), and keep in mind that $\sigma_w = \mathrm{const}$ (it depends on neither $x_i$ nor $y_i$), we obtain the following optimization problem:</p>
      <p>$$\rho(w, y) = \frac{1}{\sigma_w} \sqrt{\sum_{i=0}^{n-1} c_i^2 / p_i} \to \max. \qquad (9)$$</p>
      <p>Differentiating this expression with respect to $x_i$, we obtain an expression for the gradient $\nabla \rho_i$:</p>
      <p>$$\nabla \rho_i = \frac{\partial \rho}{\partial x_i} = \frac{p(x_i)\,(y_i - y_{i-1})\,(y_i + y_{i-1} - 2 x_i)}{2 \sigma_w}. \qquad (10)$$</p>
    </sec>
    <sec id="sec-5">
      <title>DESCRIPTION OF QUANTIZATION PROCEDURE</title>
      <p>The obtained expression (10) for the gradient $\nabla \rho_i$ allowed us to implement a fast correlation-maximization algorithm based on gradient ascent (implemented in practice as gradient descent on the negative correlation). In the course of its run, the algorithm adjusts the boundaries of the intervals $x_i$, which we use as the optimization parameters. For the given values of $x_i$, we define the quantized weights $y_i$ with the aid of Eq. (8).</p>
      <p>As the density function $p(w)$ we use its kernel estimate calculated on 10,000 random weights from the layer; when the number of weights in the layer is less than 10,000, all the weights are taken into account. We replaced the integral formulas of Eq. (6) by numerical estimates for $c_i$ and $p_i$ calculated using the real weights of the layer:</p>
      <p>$$c_i = \sum_{w} w\, I(x_i &lt; w \le x_{i+1}) / N_w, \qquad p_i = \sum_{w} I(x_i &lt; w \le x_{i+1}) / N_w. \qquad (11)$$</p>
      <p>Here $N_w$ is the number of weights in the layer, and $I(x_i &lt; w \le x_{i+1})$ is an indicator function that selects only the weights belonging to the interval $(x_i, x_{i+1})$.</p>
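      <p>A kernel density estimate of this kind can be built, for example, with scipy (a minimal sketch under our assumptions; the sample-size handling and the seed are illustrative and not taken from the paper):</p>
      <preformat>
import numpy as np
from scipy.stats import gaussian_kde


def density_estimate(w, sample_size=10000, seed=0):
    """Kernel density estimate p(w) built from at most 10,000 random weights."""
    rng = np.random.default_rng(seed)
    if w.size &lt;= sample_size:
        sample = w
    else:
        sample = rng.choice(w, size=sample_size, replace=False)
    return gaussian_kde(sample)


# usage: kde = density_estimate(layer_weights); kde.evaluate(x) gives p(x)
      </preformat>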
      <p>To initialize the gradient ascent, it is necessary to choose an initial partition, that is, an initial set $[w_{\min}, x_1, x_2, \dots, x_{n-1}, w_{\max}]$. In our simulations, we used the linear and exponential partitions from [8] as the initial sets and examined the quantization of the pre-trained ResNet-50, Xception, and VGG-16 neural networks. We employed the Python programming language and the Keras framework [2].</p>
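      <p>The exact construction of the partitions from [8] is not reproduced here; the sketch below shows the linear partition and one plausible variant of an exponential-scale partition that densifies towards zero (an assumption for illustration only).</p>
      <preformat>
import numpy as np


def linear_partition(w_min, w_max, bits):
    """Equal-width initial boundaries x_0, ..., x_n with n = 2**bits intervals."""
    return np.linspace(w_min, w_max, 2 ** bits + 1)


def exponential_partition(w_min, w_max, bits):
    """A possible exponential-scale initial partition (the scheme of [8] may differ):
    boundaries are geometrically spaced and densify symmetrically towards zero."""
    n_half = 2 ** (bits - 1)
    pos = w_max * np.geomspace(1.0 / n_half, 1.0, n_half)
    neg = -np.abs(w_min) * np.geomspace(1.0, 1.0 / n_half, n_half)
    return np.concatenate((neg, [0.0], pos))
      </preformat>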
      <p>As a result of the optimization, we obtained the optimal boundaries of the intervals $[w_{\min}, x_1, x_2, \dots, x_{n-1}, w_{\max}]$ as well as the corresponding set of quantized weights $[y_0, y_1, \dots, y_{n-1}]$. We then used the quantized weights in place of the input weights without post-training of the neural networks. The Python code is given in Appendix B.</p>
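      <p>A minimal sketch of this replacement step for a Keras model is shown below (assuming the tensorflow.keras API; quantize_fn stands for any per-layer quantizer, e.g. the correlation-maximizing one, and is hypothetical here). Only the weight arrays are replaced; no further training is performed.</p>
      <preformat>
from tensorflow.keras.applications import VGG16


def quantize_model(model, quantize_fn):
    """Replace the kernel of every weighted layer by its quantized copy."""
    for layer in model.layers:
        weights = layer.get_weights()
        if weights:
            weights[0] = quantize_fn(weights[0])  # quantize the kernel only
            layer.set_weights(weights)
    return model


# usage sketch: model = VGG16(weights='imagenet'); quantize_model(model, my_quantizer)
      </preformat>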
    </sec>
    <sec id="sec-2">
      <title>RESULTS</title>
      <p>In Fig. 1, we show weight histograms for convolution and fully connected layers of the VGG-16 and ResNet-50 neural networks. As we mentioned before, the weight distributions in the layers of deep neural networks are nearly symmetric with respect to zero and their average values are close to zero.</p>
      <p>In Tables 1, 2, and 3, we present the values of the average correlations between the initial weights and the weights after quantization for the ResNet-50, Xception, and VGG-16 neural networks. We used the linear and exponential quantization, applied the correlation maximization algorithm, and after that calculated the values of the average correlations.</p>
      <p>Tables 1-3 show that when the number of bits is small (up to 5 bits, i.e. when the number of intervals is less than or equal to 32), the maximization algorithm does increase the correlation averaged over the layers. The linear quantization with the subsequent maximization leads to average correlation growth in all the examined cases. The exponential quantization with the subsequent maximization provides growth of the average correlation only when the number of intervals is equal to 8, 16, or 32. When the number of bits is larger (that is, when the number of intervals is 64 or 128), the maximization algorithm sometimes fails: for some layers its results are worse than the results of the exponential quantization.</p>
      <p>TABLE IVb. TOP-1 ACCURACY FOR THE VGG-16 NEURAL NETWORK; EXP DENOTES EXPONENTIAL QUANTIZATION, OPT STANDS FOR OPTIMAL QUANTIZATION, MAX DENOTES THE BEST OF THE TWO WAYS OF QUANTIZATION, AND FAIL OPT IS THE FRACTION OF NEURAL NETWORK LAYERS WHERE OPTIMIZATION LEADS TO WORSE CORRELATION THAN EXPONENTIAL QUANTIZATION.</p>
      <p>TABLE IVc. TOP-1 ACCURACY FOR THE XCEPTION NEURAL NETWORK; EXP DENOTES EXPONENTIAL QUANTIZATION, OPT STANDS FOR OPTIMAL QUANTIZATION, MAX DENOTES THE BEST OF THE TWO WAYS OF QUANTIZATION, AND FAIL OPT IS THE FRACTION OF NEURAL NETWORK LAYERS WHERE OPTIMIZATION LEADS TO WORSE CORRELATION THAN EXPONENTIAL QUANTIZATION.</p>
      <p>Perhaps the maximization algorithm does not always run correctly because the gradient ascent learning rate lr has to be chosen more carefully. For this reason, for each layer we also used an algorithm that selects the quantization with the maximal correlation: when the correlation decreased after our optimization, we kept the initial exponential quantization. We call this algorithm for quantizing a neural network the “best” one.</p>
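      <p>A sketch of this per-layer selection (our own illustration; the dictionary layout is hypothetical) is given below: the optimized partition is kept only when its correlation is not lower than that of the initial exponential partition.</p>
      <preformat>
def best_quantization(layers):
    """Per-layer selection for the "best" algorithm.

    layers is assumed to be a list of dicts with keys
    'corr_exp', 'corr_opt', 'q_exp', 'q_opt'.
    """
    chosen = []
    for layer in layers:
        if layer['corr_opt'] >= layer['corr_exp']:
            chosen.append(layer['q_opt'])   # keep the optimized partition
        else:
            chosen.append(layer['q_exp'])   # fall back to exponential
    return chosen
      </preformat>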
      <p>Tables 4a-c give the Top-5 accuracies for the ResNet-50, VGG-16, and Xception neural networks quantized using the exponential scale (exp), by the algorithm maximizing the correlation starting from the exponential splitting (opt), and with the aid of the algorithm that keeps the exponential partition in the layers where the correlation maximization failed (max). The column %fail max shows the percentage of layers for which $\rho_{\mathrm{max\,corr}} &lt; \rho_{\mathrm{exp}}$. The examined neural networks confirmed our hypothesis that the larger the correlation between the input and quantized values, the better the accuracy of the quantized neural network. This holds only if the above-mentioned correlation is larger on each layer of the neural network, while an increase of the correlation averaged over all the layers does not guarantee an increase of the accuracy.</p>
      <p>In Tables 4a-c, we also compare the Top-5 accuracies of the neural networks quantized with the aid of our best algorithm (max) with the Top-5 accuracies of the initial neural networks (32 bit). For the ResNet-50 and Xception neural networks, the Top-5 accuracy drop was 20-30% when we reserved 4 bits for storing the weights and less than 3.5% when we reserved 5 bits.</p>
      <p>[Figure: Top-5 accuracy versus the number of bits per weight (3-7 bits) for the ResNet-50 and Xception networks; panels "ResNet-50 linear", "ResNet exp", "Xception linear"; curves: linear, exponential, and max corr quantization.]</p>
      <p>In the case of the VGG-16 neural network, the Top-5
accuracy drop was about 20% when we reserved 3 bits per
layer and it was less than 3% when the number of the
reserved bits was larger.</p>
      <p>In Fig. 3a-b, for the Xception, ResNet-50, and VGG-16 neural networks, we show the dependence of the Top-5 accuracy on the number of bits reserved for the storage of each weight, obtained when we initialized the algorithm by the linear and exponential splittings. The model accuracy increases when we choose the quantization with the maximal correlation between the quantized and input weights. The figures for the Top-1 accuracies are given in Appendix A.</p>
      <p>[Figure: Top-5 accuracy versus the number of bits per weight (3-5 bits) for the exponential initial partition and after correlation maximization (max corr); panels "ResNet exp", "Xception exp", "VGG-16 exp".]</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>We developed an algorithm for neural network quantization based on maximization of the correlation between the quantized and the input weights. When using this algorithm, no post-training of the neural network is necessary. Reserving 5 bits per weight, we succeeded in quantizing the VGG-16 neural network with only a 1% drop of the Top-5 accuracy. Under such compression, the memory necessary to store the weights is approximately 6 times smaller than in the case of full-precision floats (32 bits). For comparison, in paper [5] the VGG-16 neural network was compressed about 2.5 times by quantization (and about 49 times by the full compression algorithm); however, their neural network required post-training, for which substantial computing power was necessary. Comparing with the results of paper [8], we see that for the ResNet-50, VGG-16, and Xception neural networks at the same compression our Top-1 and Top-5 accuracies are better. Our compression allows using only 3-4 bits to achieve more than 0.6 Top-5 accuracy for different architectures without re-training.</p>
    </sec>
    <sec id="sec-3">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was financially supported by the State Program of SRISA RAS No. 0065-2019-0003 (AAAA-A19-119011590090-2).</p>
    </sec>
    <sec id="sec-7">
      <title>APPENDIX A: TOP-1 ACCURACIES</title>
      <p>[Figure: Top-1 accuracy versus the number of bits per weight for the ResNet-50, Xception, and VGG-16 networks; panels for the linear partition (3-7 bits) and for the exponential partition with correlation maximization (3-5 bits, max corr).]</p>
    </sec>
    <sec id="sec-8">
      <title>APPENDIX B: ALGORITHM</title>
      <preformat>
# accessory functions
import numpy as np
from scipy.optimize import minimize


def f(x, func, kde, X, x_min, x_max):
    # objective: the (unnormalized) correlation for the boundary set x
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    return cov


def grad(x, func, kde, X, x_min, x_max, alpha=10):
    # gradient of the correlation with respect to the inner boundaries x, Eq. (10)
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    step = alpha * px * (y[1:] - y[:-1]) * (y[1:] + y[:-1] - 2 * x) / 2
    return step


def cov_kde(x0, kde, X, x_min, x_max):
    '''
    Calculate the distribution function, the quantized values
    and the covariance on the boundary set x0.
    X -- weights in this layer,
    x_min and x_max -- minimal and maximal weight values in the layer,
    x0 -- current set of boundaries (variable values only).
    '''
    p = np.zeros(len(x0) + 1)
    C = np.zeros(len(x0) + 1)
    x_ext = sorted(np.append(x0, [x_min, x_max]))
    for i in range(len(x_ext) - 1):
        mask = np.logical_and(x_ext[i] &lt; X, X &lt;= x_ext[i + 1])
        p[i] = len(X[mask])
        C[i] = np.sum(X[mask])
        if p[i] == 0:
            C[i] = 0
            p[i] = 1
    y = C / p                              # quantized values, Eq. (8)
    px = kde.evaluate(x0)                  # density p(x) at the boundaries
    cov = np.linalg.norm(C / np.sqrt(p))   # ~ sqrt(sum c_i^2 / p_i), Eq. (9)
    return y, px, cov, p


def results(kde, w, x0, x_min, x_max, func, bits, kde_std, ans_case='CG'):
    '''
    Correlation maximization procedure for the initial set x0
    (only variable values).
    w -- layer weights,
    kde -- kernel density estimation on a random sample of the weights,
    x_min and x_max -- minimal and maximal weight values.
    '''
    n_d = 2 ** bits
    alpha = 10
    tol_curr = 1e-4
    fx = lambda x: -f(x, func, kde, w, x_min, x_max)
    gradx = lambda x: -grad(x, func, kde, w, x_min, x_max, alpha)
    ans = minimize(fun=fx, x0=x0, jac=gradx, method='CG', tol=tol_curr)
    solutions = ans['x']
    correlations = -ans['fun']
    gradients = np.linalg.norm(gradx(ans['x'])) / alpha / n_d
    return solutions, correlations, gradients
      </preformat>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>[1]</label><mixed-citation>ImageNet – huge image dataset [Online]. URL: http://www.imagenet.org.</mixed-citation></ref>
      <ref id="ref2"><label>[2]</label><mixed-citation>Models for image classification with weights trained on ImageNet [Online]. URL: https://keras.io/applications/.</mixed-citation></ref>
      <ref id="ref3"><label>[3]</label><mixed-citation>K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint: 1409.1556.</mixed-citation></ref>
      <ref id="ref4"><label>[4]</label><mixed-citation>K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv preprint: 1512.03385.</mixed-citation></ref>
      <ref id="ref5"><label>[5]</label><mixed-citation>S. Han, H. Mao and W.J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint: 1510.00149, 2015.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>Sh. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint: 1606.06160.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>B.V. Kryzhanovsky, M.V. Kryzhanovsky and M.Yu. Malsagov, “Discretization of a matrix in quadratic functional binary optimization,” Doklady Mathematics, vol. 83, pp. 413-417, 2011. DOI: 10.1134/S1064562411030197.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>M.Yu. Malsagov, E.M. Khayrov, M.M. Pushkareva and I.M. Karandashev, “Exponential discretization of weights of neural network connections in pre-trained neural networks,” preprint, 2020.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>M. Courbariaux, Y. Bengio and J.-P. David, “Training deep neural networks with low precision multiplications,” arXiv preprint: 1412.7024.</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>M. Courbariaux, Y. Bengio and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” Conference on Neural Information Processing Systems, arXiv:1511.00363.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>Zh. Lin, M. Courbariaux, R. Memisevic and Y. Bengio, “Neural networks with few multiplications,” Proceedings of the International Conference on Learning Representations, arXiv:1510.03009.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>E.H. Lee, D. Miyashita, E. Chai, B. Murmann and S.S. Wong, “LogNet: Energy-efficient neural networks using logarithmic computation,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>S. Han, J. Pool, J. Tran and W. Dally, “Learning both Weights and Connections for Efficient Neural Networks,” arXiv preprint: 1506.02626, 2015.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>W. Chen, J. Wilson, S. Tyree and K. Weinberger, “Compressing Neural Networks with the Hashing Trick,” arXiv preprint: 1504.04788, 2015.</mixed-citation></ref>
    </ref-list>
  </back>
</article>