<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Post-training quantization of neural network through correlation maximization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Pushkareva</string-name>
          <email>pushkareva.mariia@yandex.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iakov Karandashev</string-name>
          <email>karandashev@niisi.ras.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center of Optical Neural Technologies, Scientific Research Institute of System Analysis RAS</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>115</fpage>
      <lpage>120</lpage>
      <abstract>
        <p>In this paper, we propose a method for quantizing the weights of neural networks by maximizing the correlations between the initial and quantized weights, taking into account the distribution of the weight density in each layer. Quantization is performed after the neural network training without further post-training. We tested the algorithm using the ImageNet dataset for the VGG-16, ResNet-50, and Xception neural networks [2]. In the case of the ResNet-50 and Xception neural networks, 4-5 bits of memory are required for the weights of a single layer to obtain acceptable Top-5 accuracy; for VGG-16, 3-4 bits are sufficient to store the weights of a single layer.</p>
      </abstract>
      <kwd-group>
        <kwd>weights quantization</kwd>
        <kwd>post-learning</kwd>
        <kwd>linear quantization</kwd>
        <kwd>exponential quantization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The majority of the neural networks used for solving image recognition problems have many parameters that have to be stored. Consequently, a substantial memory capacity is necessary, and this requirement limits the applicability of such neural networks. For example, the storage sizes of the VGG-16 and ResNet152V2 neural networks are 528 [3] and 232 [4] MB, respectively. Quantization and reduction of the number of weights are the basic approaches that allow us to decrease the memory size necessary to store the neural network weights.</p>
      <p>Quantization is a reduction in the number of distinct weight values in a layer. The most popular quantization methods are the use of fixed-point formats in place of floating-point formats [9], binarization [10], ternarization [11], the use of a logarithmic scale [12], and so on. The number of weights is usually reduced with the aid of such methods as pruning algorithms [13], weight sharing (including the application of the convolution operation) [14], tensor expansions [15], and so on.</p>
      <p>In the present paper, we explore the quantization of trained neural networks. The number $B$ of bits per weight defines the number of different weight values, which is consequently equal to $2^B$. We perform the quantization independently for each layer. In a given layer, we split the whole range of weights from the minimal to the maximal value into $2^B$ intervals, and then replace the weights belonging to one interval by a single value. In what follows, we examine the question of the optimal choice of the interval boundaries as well as of the values with which we replace the weights.</p>
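      <p>To make this mapping concrete, the following minimal sketch (our own illustration, not the code used in the experiments; the weight, boundary, and value arrays are hypothetical) replaces every weight of a layer by the representative value of the interval it falls into.</p>
      <preformat>
import numpy as np


def quantize_layer(w, boundaries, values):
    """Replace each weight by the representative value of its interval.

    boundaries: x_0 &lt; x_1 &lt; ... &lt; x_n (n = 2**B intervals),
    values:     y_0, ..., y_{n-1}, one value per interval.
    """
    # np.digitize with the inner boundaries returns the interval index of each weight
    idx = np.digitize(w, boundaries[1:-1])
    return values[idx]


# toy example with B = 2 bits (4 intervals)
w = np.array([-0.9, -0.2, 0.05, 0.4, 0.8])
boundaries = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
values = np.array([-0.75, -0.25, 0.25, 0.75])
print(quantize_layer(w, boundaries, values))  # [-0.75 -0.25  0.25  0.25  0.75]
      </preformat>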
      <p>The authors of paper [7] discussed the optimal quantization problem for the Hopfield neural network. They showed that by maximizing the correlation between the initial and quantized values of the weights it is possible to minimize the errors of the quantized neural network. We believe that this result holds in our case as well, and in what follows we choose the interval boundaries and the quantized values of the weights inside the intervals according to the maximal correlation principle.</p>
      <p>Frequently, to obtain sufficient accuracy of the quantized neural network, one has to combine quantization with post-training of the network, and this procedure requires substantial resources [5, 6]. In the present paper, we perform the quantization after the neural network training without any subsequent post-training, which allows us to reduce the quantization costs substantially.</p>
    </sec>
    <sec id="sec-4">
      <title>ESTIMATE OF CORRELATION AND ITS GRADIENT</title>
      <p>For each layer, let us quantize the weights inside the interval $[w_{\min}, w_{\max}]$, where $w_{\min}$ and $w_{\max}$ are the minimal and maximal values of the weights in the given layer, respectively. (We normalize all the weights so that $w \to (w - \bar{w})/\sigma_w$.) Let $B$ be the number of bits necessary to store the weights of one layer. Then $n = 2^B$ is the number of quantized weight values, and by $x_i$ we denote the boundaries of the intervals inside which the quantized value is constant. Consequently,</p>
      <p>$$w_{\min} = x_0 \le x_1 \le \dots \le x_{n-1} \le x_n = w_{\max}. \qquad (1)$$</p>
      <p>Let $w$ be the input weights, and $y_i$ the quantized value inside the interval $(x_i, x_{i+1})$. It is not evident how we have to choose the interval boundaries $x_i$ as well as the values $y_i$ inside each interval. In the present paper, we suppose that the stronger the correlation between the input and the quantized values, the smaller the error of the quantized neural network compared with the initial neural network:</p>
      <p>$$\rho(w, y) = \frac{\langle w y \rangle}{\sigma_w \sigma_y} \to \max, \qquad (2)$$</p>
      <p>where $\langle w y \rangle$ is the covariance between the initial and the quantized values, and $\sigma_w$ and $\sigma_y$ are the standard deviations of the input weights and their quantized values, respectively. For simplicity, we suppose that inside a layer the distribution of the weights is symmetric and the average value of the weights is equal to zero. As we show in what follows, this assumption nearly always holds in the case of large deep neural networks.</p>
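      <p>As a simple illustration of the criterion in Eq. (2), the sketch below (our own example; the zero-mean assumption of the text is taken for granted, so the covariance reduces to the mean product) computes the sample correlation between the original and quantized weights of a layer.</p>
      <preformat>
import numpy as np


def correlation(w, y):
    """Sample correlation rho(w, y) from Eq. (2).

    The weights are assumed to be centered (zero mean), as discussed in
    the text, so cov(w, y) is estimated by the mean of the product w*y.
    """
    cov = np.mean(w * y)
    return cov / (np.std(w) * np.std(y))
      </preformat>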
      <p>We can estimate the covariance $\langle w y \rangle$ between the input and the quantized values as</p>
      <p>$$\langle w y \rangle = \int_{-\infty}^{\infty} w\, y(w)\, p(w)\, dw, \qquad (3)$$</p>
      <p>where $p(w)$ is the density of the weight distribution inside the layer. Since $y$ is constant and equal to $y_i$ inside the interval $(x_i, x_{i+1})$, the last equation takes the form</p>
      <p>$$\langle w y \rangle = \sum_{i=0}^{n-1} y_i \int_{x_i}^{x_{i+1}} w\, p(w)\, dw. \qquad (4)$$</p>
      <p>Similarly, the variance of the quantized values is</p>
      <p>$$\sigma_y^2 = \int_{-\infty}^{\infty} y^2(w)\, p(w)\, dw = \sum_{i=0}^{n-1} y_i^2 \int_{x_i}^{x_{i+1}} p(w)\, dw. \qquad (5)$$</p>
      <p>If we introduce the additional notations</p>
      <p>$$c_i = \int_{x_i}^{x_{i+1}} w\, p(w)\, dw \quad \text{and} \quad p_i = \int_{x_i}^{x_{i+1}} p(w)\, dw, \qquad (6)$$</p>
      <p>we can simplify Eqs. (4)-(5) significantly:</p>
      <p>$$\langle w y \rangle = \sum_{i=0}^{n-1} y_i c_i, \qquad \sigma_y^2 = \sum_{i=0}^{n-1} y_i^2 p_i. \qquad (7)$$</p>
      <p>To maximize the correlation (see Eq. (2)) we have two sets of parameters: the boundaries of the intervals $x_i$ and the quantized values $y_i$ inside the intervals. For the moment, let us forget about the splitting into intervals and suppose that the boundaries $x_i$ are known. Then the optimization with respect to $y_i$ yields:</p>
      <p>$$y_i = c_i / p_i. \qquad (8)$$</p>
      <p>When we substitute this value into Eq. (2), take into account formulas (4)-(5), and keep in mind that $\sigma_w = \mathrm{const}$ (it depends on neither $x_i$ nor $y_i$), we obtain the following optimization problem:</p>
      <p>$$\rho(w, y) = \frac{1}{\sigma_w} \sqrt{\sum_{i=0}^{n-1} c_i^2 / p_i} \to \max. \qquad (9)$$</p>
      <p>Differentiating this expression with respect to $x_i$, we obtain an expression for the gradient $\nabla \rho_i$:</p>
      <p>$$\nabla \rho_i = \frac{\partial \rho}{\partial x_i} = \frac{p(x_i)\,(y_i - y_{i-1})\,(y_i + y_{i-1} - 2 x_i)}{2 \sigma_w}. \qquad (10)$$</p>
    </sec>
    <sec id="sec-5">
      <title>DESCRIPTION OF QUANTIZATION PROCEDURE</title>
      <p>The obtained expression (10) for the gradient $\nabla \rho_i$ allowed us to implement a fast correlation-maximization algorithm based on gradient ascent (implemented in practice as gradient descent on the negative correlation). In the course of its run, the algorithm adjusts the boundaries of the intervals $x_i$, which we use as the optimization parameters. For the given values of $x_i$, we define the quantized weights $y_i$ with the aid of Eq. (8).</p>
      <p>As the density function $p(w)$ we use its kernel estimate calculated on 10,000 random weights from the layer; when the number of weights in the layer is less than 10,000, all the weights are taken into account. We replaced the integral formulas of Eq. (6) by numerical estimates for $c_i$ and $p_i$ calculated using the real weights of the layer:</p>
      <p>$$c_i = \sum_{w} w\, I(x_i &lt; w \le x_{i+1}) / N_w, \qquad p_i = \sum_{w} I(x_i &lt; w \le x_{i+1}) / N_w. \qquad (11)$$</p>
      <p>Here $N_w$ is the number of weights in the layer, and $I(x_i &lt; w \le x_{i+1})$ is an indicator function that selects only the weights belonging to the interval $(x_i, x_{i+1})$.</p>
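      <p>A kernel density estimate of this kind can be built, for example, with scipy (a minimal sketch under our assumptions; the sample-size handling and the seed are illustrative and not taken from the paper):</p>
      <preformat>
import numpy as np
from scipy.stats import gaussian_kde


def density_estimate(w, sample_size=10000, seed=0):
    """Kernel density estimate p(w) built from at most 10,000 random weights."""
    rng = np.random.default_rng(seed)
    if w.size &lt;= sample_size:
        sample = w
    else:
        sample = rng.choice(w, size=sample_size, replace=False)
    return gaussian_kde(sample)


# usage: kde = density_estimate(layer_weights); kde.evaluate(x) gives p(x)
      </preformat>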
      <p>To initialize the gradient ascent, it is necessary to choose an initial partition, that is, an initial set $[w_{\min}, x_1, x_2, \dots, x_{n-1}, w_{\max}]$. In our simulations, we used the linear and exponential partitions from [8] as the initial sets and examined the quantization of the pre-trained ResNet-50, Xception, and VGG-16 neural networks. We employed the Python programming language and the Keras framework [2].</p>
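      <p>The exact construction of the partitions from [8] is not reproduced here; the sketch below shows the linear partition and one plausible variant of an exponential-scale partition that densifies towards zero (an assumption for illustration only).</p>
      <preformat>
import numpy as np


def linear_partition(w_min, w_max, bits):
    """Equal-width initial boundaries x_0, ..., x_n with n = 2**bits intervals."""
    return np.linspace(w_min, w_max, 2 ** bits + 1)


def exponential_partition(w_min, w_max, bits):
    """A possible exponential-scale initial partition (the scheme of [8] may differ):
    boundaries are geometrically spaced and densify symmetrically towards zero."""
    n_half = 2 ** (bits - 1)
    pos = w_max * np.geomspace(1.0 / n_half, 1.0, n_half)
    neg = -np.abs(w_min) * np.geomspace(1.0, 1.0 / n_half, n_half)
    return np.concatenate((neg, [0.0], pos))
      </preformat>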
      <p>As a result of the optimization, we obtained the optimal boundaries of the intervals $[w_{\min}, x_1, x_2, \dots, x_{n-1}, w_{\max}]$ as well as the corresponding set of quantized weights $[y_0, y_1, \dots, y_{n-1}]$. We then used the quantized weights in place of the input weights without post-training of the neural networks. The Python code is given in Appendix B.</p>
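      <p>A minimal sketch of this replacement step for a Keras model is shown below (assuming the tensorflow.keras API; quantize_fn stands for any per-layer quantizer, e.g. the correlation-maximizing one, and is hypothetical here). Only the weight arrays are replaced; no further training is performed.</p>
      <preformat>
from tensorflow.keras.applications import VGG16


def quantize_model(model, quantize_fn):
    """Replace the kernel of every weighted layer by its quantized copy."""
    for layer in model.layers:
        weights = layer.get_weights()
        if weights:
            weights[0] = quantize_fn(weights[0])  # quantize the kernel only
            layer.set_weights(weights)
    return model


# usage sketch: model = VGG16(weights='imagenet'); quantize_model(model, my_quantizer)
      </preformat>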
    </sec>
    <sec id="sec-2">
      <title>RESULTS</title>
      <p>In Fig. 1, we show weight histograms for convolution and fully connected layers of the VGG-16 and ResNet-50 neural networks. As we mentioned before, the weight distributions in the layers of deep neural networks are nearly symmetric with respect to zero and their average values are close to zero.</p>
      <p>In Tables 1, 2, and 3, we present the values of the average correlations between the initial weights and the weights after quantization for the ResNet-50, Xception, and VGG-16 neural networks. We used the linear and exponential quantization, applied the correlation maximization algorithm, and after that calculated the values of the average correlations.</p>
      <p>Tables 1-3 show that when the number of bits is small (up to 5 bits, i.e. when the number of intervals is less than or equal to 32), the maximization algorithm does increase the correlation averaged over the layers. The linear quantization with the subsequent maximization leads to average correlation growth in all the examined cases. The exponential quantization with the subsequent maximization provides growth of the average correlation only when the number of intervals is equal to 8, 16, or 32. When the number of bits is larger (that is, when the number of intervals is 64 or 128), the maximization algorithm sometimes fails: for some layers its results are worse than the results of the exponential quantization.</p>
      <p>TABLE IVb. TOP-1 ACCURACY FOR THE VGG-16 NEURAL NETWORK; EXP DENOTES EXPONENTIAL QUANTIZATION, OPT STANDS FOR OPTIMAL QUANTIZATION, MAX DENOTES THE BEST OF THE TWO WAYS OF QUANTIZATION, AND FAIL OPT IS THE FRACTION OF NEURAL NETWORK LAYERS WHERE OPTIMIZATION LEADS TO WORSE CORRELATION THAN EXPONENTIAL QUANTIZATION.</p>
      <p>TABLE IVc. TOP-1 ACCURACY FOR THE XCEPTION NEURAL NETWORK; EXP DENOTES EXPONENTIAL QUANTIZATION, OPT STANDS FOR OPTIMAL QUANTIZATION, MAX DENOTES THE BEST OF THE TWO WAYS OF QUANTIZATION, AND FAIL OPT IS THE FRACTION OF NEURAL NETWORK LAYERS WHERE OPTIMIZATION LEADS TO WORSE CORRELATION THAN EXPONENTIAL QUANTIZATION.</p>
      <p>Perhaps the maximization algorithm does not always run correctly because the gradient ascent learning rate lr has to be chosen more carefully. For this reason, for each layer we also used an algorithm that selects the quantization with the maximal correlation: when the correlation decreased after our optimization, we kept the initial exponential quantization. We call this algorithm for quantizing a neural network the “best” one.</p>
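      <p>A sketch of this per-layer selection (our own illustration; the dictionary layout is hypothetical) is given below: the optimized partition is kept only when its correlation is not lower than that of the initial exponential partition.</p>
      <preformat>
def best_quantization(layers):
    """Per-layer selection for the "best" algorithm.

    layers is assumed to be a list of dicts with keys
    'corr_exp', 'corr_opt', 'q_exp', 'q_opt'.
    """
    chosen = []
    for layer in layers:
        if layer['corr_opt'] >= layer['corr_exp']:
            chosen.append(layer['q_opt'])   # keep the optimized partition
        else:
            chosen.append(layer['q_exp'])   # fall back to exponential
    return chosen
      </preformat>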
      <p>Tables 4a-c give the Top-5 accuracies for the ResNet-50, VGG-16, and Xception neural networks quantized using the exponential scale (exp), by the algorithm maximizing the correlation starting from the exponential splitting (opt), and with the aid of the algorithm that keeps the exponential partition in the layers where the correlation maximization failed (max). The column %fail max shows the percentage of layers for which $\rho_{\mathrm{max\,corr}} &lt; \rho_{\mathrm{exp}}$. The examined neural networks confirmed our hypothesis that the larger the correlation between the input and quantized values, the better the accuracy of the quantized neural network. This holds only if the above-mentioned correlation is larger on each layer of the neural network, while an increase of the correlation averaged over all the layers does not guarantee an increase of the accuracy.</p>
      <p>In Tables 4a-c, we also compare the Top-5 accuracies of the neural networks quantized with the aid of our best algorithm (max) with the Top-5 accuracies of the initial neural networks (32 bit). For the ResNet-50 and Xception neural networks, the Top-5 accuracy drop was 20-30% when we reserved 4 bits for storing the weights and less than 3.5% when we reserved 5 bits.</p>
      <p>[Figure: Top-5 accuracy versus the number of bits per weight (3-7 bits) for the ResNet-50 and Xception networks; panels "ResNet-50 linear", "ResNet exp", "Xception linear"; curves: linear, exponential, and max corr quantization.]</p>
      <p>In the case of the VGG-16 neural network, the Top-5
accuracy drop was about 20% when we reserved 3 bits per
layer and it was less than 3% when the number of the
reserved bits was larger.</p>
      <p>In Fig. 3a-b, for the Xception, ResNet-50, and VGG-16 neural networks, we show the dependence of the Top-5 accuracy on the number of bits reserved for the storage of each weight, obtained when we initialized the algorithm by the linear and exponential splittings. The model accuracy increases when we choose the quantization with the maximal correlation between the quantized and input weights. The figures for the Top-1 accuracies are given in Appendix A.</p>
      <p>[Figure: Top-5 accuracy versus the number of bits per weight (3-5 bits) for the exponential initial partition and after correlation maximization (max corr); panels "ResNet exp", "Xception exp", "VGG-16 exp".]</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>We developed an algorithm for neural network quantization based on maximization of the correlation between the quantized and the input weights. When using this algorithm, no post-training of the neural network is necessary. Reserving 5 bits per weight, we succeeded in quantizing the VGG-16 neural network with only a 1% drop of the Top-5 accuracy. Under such compression, the memory necessary to store the weights is approximately 6 times smaller than in the case of full-precision floats (32 bits). For comparison, in paper [5] the VGG-16 neural network was compressed about 2.5 times by quantization (and about 49 times by the full compression algorithm); however, their neural network required post-training, for which substantial computing power was necessary. Comparing with the results of paper [8], we see that for the ResNet-50, VGG-16, and Xception neural networks at the same compression our Top-1 and Top-5 accuracies are better. Our compression allows using only 3-4 bits to achieve more than 0.6 Top-5 accuracy for different architectures without re-training.</p>
    </sec>
    <sec id="sec-3">
      <title>ACKNOWLEDGMENT</title>
      <p>This work was financially supported by the State Program of SRISA RAS No. 0065-2019-0003 (AAAA-A19-119011590090-2).</p>
    </sec>
    <sec id="sec-7">
      <title>APPENDIX A: TOP-1 ACCURACIES</title>
      <p>[Figure: Top-1 accuracy versus the number of bits per weight for the ResNet-50, Xception, and VGG-16 networks; panels for the linear partition (3-7 bits) and for the exponential partition with correlation maximization (3-5 bits, max corr).]</p>
    </sec>
    <sec id="sec-8">
      <title>APPENDIX B: ALGORITHM</title>
      <preformat>
# accessory functions
import numpy as np
from scipy.optimize import minimize


def f(x, func, kde, X, x_min, x_max):
    # objective: the (unnormalized) correlation for the boundary set x
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    return cov


def grad(x, func, kde, X, x_min, x_max, alpha=10):
    # gradient of the correlation with respect to the inner boundaries x, Eq. (10)
    y, px, cov, p = func(x, kde, X, x_min, x_max)
    step = alpha * px * (y[1:] - y[:-1]) * (y[1:] + y[:-1] - 2 * x) / 2
    return step


def cov_kde(x0, kde, X, x_min, x_max):
    '''
    Calculate the distribution function, the quantized values
    and the covariance on the boundary set x0.
    X -- weights in this layer,
    x_min and x_max -- minimal and maximal weight values in the layer,
    x0 -- current set of boundaries (variable values only).
    '''
    p = np.zeros(len(x0) + 1)
    C = np.zeros(len(x0) + 1)
    x_ext = sorted(np.append(x0, [x_min, x_max]))
    for i in range(len(x_ext) - 1):
        mask = np.logical_and(x_ext[i] &lt; X, X &lt;= x_ext[i + 1])
        p[i] = len(X[mask])
        C[i] = np.sum(X[mask])
        if p[i] == 0:
            C[i] = 0
            p[i] = 1
    y = C / p                              # quantized values, Eq. (8)
    px = kde.evaluate(x0)                  # density p(x) at the boundaries
    cov = np.linalg.norm(C / np.sqrt(p))   # ~ sqrt(sum c_i^2 / p_i), Eq. (9)
    return y, px, cov, p


def results(kde, w, x0, x_min, x_max, func, bits, kde_std, ans_case='CG'):
    '''
    Correlation maximization procedure for the initial set x0
    (only variable values).
    w -- layer weights,
    kde -- kernel density estimation on a random sample of the weights,
    x_min and x_max -- minimal and maximal weight values.
    '''
    n_d = 2 ** bits
    alpha = 10
    tol_curr = 1e-4
    fx = lambda x: -f(x, func, kde, w, x_min, x_max)
    gradx = lambda x: -grad(x, func, kde, w, x_min, x_max, alpha)
    ans = minimize(fun=fx, x0=x0, jac=gradx, method='CG', tol=tol_curr)
    solutions = ans['x']
    correlations = -ans['fun']
    gradients = np.linalg.norm(gradx(ans['x'])) / alpha / n_d
    return solutions, correlations, gradients
      </preformat>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1"><label>[1]</label><mixed-citation>ImageNet – huge image dataset [Online]. URL: http://www.imagenet.org.</mixed-citation></ref>
      <ref id="ref2"><label>[2]</label><mixed-citation>Models for image classification with weights trained on ImageNet [Online]. URL: https://keras.io/applications/.</mixed-citation></ref>
      <ref id="ref3"><label>[3]</label><mixed-citation>K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv preprint: 1409.1556.</mixed-citation></ref>
      <ref id="ref4"><label>[4]</label><mixed-citation>K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv preprint: 1512.03385.</mixed-citation></ref>
      <ref id="ref5"><label>[5]</label><mixed-citation>S. Han, H. Mao and W.J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint: 1510.00149, 2015.</mixed-citation></ref>
      <ref id="ref6"><label>[6]</label><mixed-citation>Sh. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” arXiv preprint: 1606.06160.</mixed-citation></ref>
      <ref id="ref7"><label>[7]</label><mixed-citation>B.V. Kryzhanovsky, M.V. Kryzhanovsky and M.Yu. Malsagov, “Discretization of a matrix in quadratic functional binary optimization,” Doklady Mathematics, vol. 83, pp. 413-417, 2011. DOI: 10.1134/S1064562411030197.</mixed-citation></ref>
      <ref id="ref8"><label>[8]</label><mixed-citation>M.Yu. Malsagov, E.M. Khayrov, M.M. Pushkareva and I.M. Karandashev, “Exponential discretization of weights of neural network connections in pre-trained neural networks,” preprint, 2020.</mixed-citation></ref>
      <ref id="ref9"><label>[9]</label><mixed-citation>M. Courbariaux, Y. Bengio and J.-P. David, “Training deep neural networks with low precision multiplications,” arXiv preprint: 1412.7024.</mixed-citation></ref>
      <ref id="ref10"><label>[10]</label><mixed-citation>M. Courbariaux, Y. Bengio and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” Conference on Neural Information Processing Systems, arXiv:1511.00363.</mixed-citation></ref>
      <ref id="ref11"><label>[11]</label><mixed-citation>Zh. Lin, M. Courbariaux, R. Memisevic and Y. Bengio, “Neural networks with few multiplications,” Proceedings of the International Conference on Learning Representations, arXiv:1510.03009.</mixed-citation></ref>
      <ref id="ref12"><label>[12]</label><mixed-citation>E.H. Lee, D. Miyashita, E. Chai, B. Murmann and S.S. Wong, “LogNet: Energy-efficient neural networks using logarithmic computation,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2017.</mixed-citation></ref>
      <ref id="ref13"><label>[13]</label><mixed-citation>S. Han, J. Pool, J. Tran and W. Dally, “Learning both Weights and Connections for Efficient Neural Networks,” arXiv preprint: 1506.02626, 2015.</mixed-citation></ref>
      <ref id="ref14"><label>[14]</label><mixed-citation>W. Chen, J. Wilson, S. Tyree and K. Weinberger, “Compressing Neural Networks with the Hashing Trick,” arXiv preprint: 1504.04788, 2015.</mixed-citation></ref>
    </ref-list>
  </back>
</article>